AHV/IV pipeline
Indexes the website https://www.ahv-iv.ch/ to provide data for retrieval-augmented generation (RAG).
- class indexing.pipelines.ahv.AHVParser
A class used to parse and clean documents.
- pdf_converter
An instance of PyPDFToDocument to convert PDF content to Document objects.
- Type:
PyPDFToDocument
- cleaner
An instance of DocumentCleaner to clean documents.
- Type:
DocumentCleaner
- splitter
An instance of DocumentSplitter to split documents into chunks.
- Type:
DocumentSplitter
- parse_html(html: bytes) → List[str]
Extracts URLs from the given HTML content.
- convert_pdf_to_documents(content: List[ByteStream]) → List[Document]
Converts PDF content to Document objects.
- split_documents(documents: List[Document]) → List[Document]
Splits the given documents into chunks.
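As a rough illustration of what chunking amounts to, here is a minimal, dependency-free sketch. It does not use DocumentSplitter or reproduce its behaviour; the helper name `split_into_chunks` and the `chunk_size` and `overlap` parameters are assumptions made for this example only.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 5, overlap: int = 1) -> List[str]:
    """Split text into word-based chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words. Hypothetical helper, not the
    pipeline's actual splitter.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

The overlap keeps a little shared context between neighbouring chunks, which can help retrieval when a relevant sentence straddles a chunk boundary.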
- contains_tag(tag)
Checks if a tag contains a memento URL.
- Parameters:
tag (bs4.element.Tag) – The tag to check.
- Returns:
True if the tag contains a memento URL, False otherwise.
- Return type:
bool
- get_pdf_paths(soup)
Extracts the paths of PDF documents from a BeautifulSoup object.
- Parameters:
soup (BeautifulSoup) – The BeautifulSoup object to extract PDF paths from.
- Returns:
The list of PDF paths.
- Return type:
list of str
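The idea behind PDF path extraction can be sketched without third-party dependencies. The example below uses the standard library's `html.parser` as a stand-in for BeautifulSoup, so it takes an HTML string rather than a soup object. The names `PdfLinkCollector` and `extract_pdf_paths`, and the rule that a PDF link is any `href` ending in `.pdf`, are assumptions for illustration; the real method may select links differently.

```python
from html.parser import HTMLParser
from typing import List


class PdfLinkCollector(HTMLParser):
    """Collects href values of <a> tags that point to PDF files."""

    def __init__(self) -> None:
        super().__init__()
        self.paths: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: a PDF link is any href ending in ".pdf" (case-insensitive).
        if href and href.lower().endswith(".pdf"):
            self.paths.append(href)


def extract_pdf_paths(html: str) -> List[str]:
    """Return the PDF paths found in the HTML, in document order."""
    collector = PdfLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.paths
```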
- parse_urls(content: str) → List[str]
Extracts URLs from the given content.
- Parameters:
content (str) – The content to extract URLs from.
- Returns:
A list of URLs extracted from the content.
- Return type:
List[str]
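One plausible reading, given that indexing starts from a sitemap URL, is that the content is sitemap XML. Under that assumption (which the documentation above does not confirm), URL extraction could look like the following sketch. The function name `urls_from_sitemap` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol for <urlset>, <url>, and <loc>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(content: str) -> List[str]:
    """Return the text of every <loc> element in a sitemap XML string."""
    root = ET.fromstring(content)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
```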
- class indexing.pipelines.ahv.AHVIndexer(scraper, parser)
A class used to index PDF Merkblätter (information leaflets) into a VectorDB.
- parser
An instance of Parser to parse and clean documents.
- Type:
Parser
- index(sitemap_url: str) → dict
Scrapes, parses, and indexes PDF content from the given sitemap URL into the VectorDB.
- async from_pages_to_content(pages: List[ByteStream]) → List[Any]
Abstract method to convert HTML pages to content.
- Parameters:
pages (List[ByteStream]) – The HTML pages to convert to content.
- Returns:
The content extracted from the pages.
- Return type:
List[Any]
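Since the method is asynchronous and maps a list of pages to a list of content items, one way an implementation might be structured is shown below. This is only a sketch under stated assumptions: `Page` is a hypothetical stand-in for ByteStream, `pages_to_content` and its `convert` callback are invented for the example, and the actual AHVIndexer logic is not documented here.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Page:
    """Hypothetical stand-in for a ByteStream: the raw bytes of one HTML page."""
    data: bytes


async def pages_to_content(pages: List[Page], convert: Callable[[bytes], Any]) -> List[Any]:
    """Convert every page concurrently, preserving the input order."""
    # asyncio.to_thread (Python 3.9+) runs the blocking converter in a worker thread.
    tasks = [asyncio.to_thread(convert, page.data) for page in pages]
    return list(await asyncio.gather(*tasks))
```

A caller would run it with, for example, `asyncio.run(pages_to_content(pages, some_converter))`, where `some_converter` is whatever synchronous parsing function applies to a single page.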