AHV/IV pipeline

Indexes the website https://www.ahv-iv.ch/ as a data source for retrieval-augmented generation (RAG).

class indexing.pipelines.ahv.AHVParser

A class used to parse and clean documents.

pdf_converter

An instance of PyPDFToDocument to convert PDF content to Document objects.

cleaner

An instance of DocumentCleaner to clean documents.

splitter

An instance of DocumentSplitter to split documents into chunks.

parse_html(html: bytes) → List[str]

Extracts URLs from the given HTML content.

convert_pdf_to_documents(content: List[ByteStream]) → List[Document]

Converts PDF content to Document objects.

clean_documents(documents: List[Document]) → List[Document]

Cleans the given documents.

split_documents(documents: List[Document]) → List[Document]

Splits the given documents into chunks.
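
The clean and split steps can be illustrated with a minimal, standard-library-only sketch. It assumes that cleaning means whitespace normalization and that splitting uses fixed-size word windows with overlap. The `Doc` dataclass, the parameter names `length` and `overlap`, and both function bodies are hypothetical stand-ins; they do not reproduce the actual behaviour of DocumentCleaner or DocumentSplitter.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a Document object."""
    content: str
    meta: Dict[str, str] = field(default_factory=dict)


def clean_documents(documents: List[Doc]) -> List[Doc]:
    # Collapse runs of whitespace and drop documents that end up empty.
    cleaned = []
    for doc in documents:
        text = " ".join(doc.content.split())
        if text:
            cleaned.append(Doc(content=text, meta=dict(doc.meta)))
    return cleaned


def split_documents(documents: List[Doc], length: int = 5, overlap: int = 1) -> List[Doc]:
    # Slide a window of `length` words, advancing by `length - overlap` words.
    if not 0 <= overlap < length:
        raise ValueError("overlap must satisfy 0 <= overlap < length")
    step = length - overlap
    chunks = []
    for doc in documents:
        words = doc.content.split()
        for start in range(0, len(words), step):
            chunks.append(Doc(content=" ".join(words[start:start + length]), meta=dict(doc.meta)))
            # Stop once the current window has reached the end of the text.
            if start + length >= len(words):
                break
    return chunks
```

The overlap keeps some shared context between neighbouring chunks, which tends to help retrieval when a relevant sentence straddles a chunk boundary.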

contains_tag(tag)

Checks if a tag contains a memento URL.

get_pdf_paths(soup)

Extracts the paths of PDF documents from a BeautifulSoup object.

Parameters: soup (BeautifulSoup) – The BeautifulSoup object to extract PDF paths from.
Returns: The list of PDF paths.
Return type: list of str
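
As a rough illustration of PDF path extraction, the sketch below collects the `href` of every `<a>` tag whose target ends in `.pdf`. To stay self-contained it swaps BeautifulSoup for the standard library's `html.parser` and therefore takes an HTML string rather than a soup object. The `PdfLinkCollector` class and the extension-based filter are assumptions, not the pipeline's actual logic.

```python
from html.parser import HTMLParser
from typing import List


class PdfLinkCollector(HTMLParser):
    """Collects href values of <a> tags that point to PDF files."""

    def __init__(self) -> None:
        super().__init__()
        self.paths: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore query strings and fragments when checking the extension.
        if href and href.split("?")[0].split("#")[0].lower().endswith(".pdf"):
            self.paths.append(href)


def get_pdf_paths(html: str) -> List[str]:
    # Hypothetical variant: accepts raw HTML instead of a BeautifulSoup object.
    collector = PdfLinkCollector()
    collector.feed(html)
    return collector.paths
```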

parse_urls(content: str) → List[str]

Extracts URLs from the given content.
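
The format of `content` is not specified here. Since the indexer starts from a sitemap URL, one plausible reading is that it holds sitemap XML. Under that assumption, a minimal sketch could read the `<loc>` entries as follows; the namespace constant and the function body are illustrative, not the actual implementation.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_urls(content: str) -> List[str]:
    # Assumes `content` is a sitemap XML document; returns every <loc> value.
    root = ET.fromstring(content)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]
```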

convert_to_documents(content: List[Any]) → List[Any]

Abstract method to convert some content to documents.

class indexing.pipelines.ahv.AHVIndexer(scraper, parser)

A class used to index PDF Merkblätter (information leaflets) into a VectorDB.

scraper

An instance of Scraper to scrape URLs and extract content from them.

parser

An instance of Parser to parse and clean documents.

index(sitemap_url: str) → dict

Scrapes, parses and indexes PDF content from the given sitemap URL into the VectorDB.
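
The scrape, parse and index sequence can be sketched with injected collaborators, mirroring the `(scraper, parser)` constructor. Everything below is hypothetical: the `fetch` method, the stub classes, the in-memory list standing in for the VectorDB, and the keys of the returned dict are assumptions made for illustration only.

```python
from typing import Dict, List


class FakeScraper:
    """Hypothetical scraper: maps a sitemap URL to raw PDF payloads."""

    def fetch(self, sitemap_url: str) -> List[bytes]:
        return [b"first leaflet text", b"second leaflet text"]


class FakeParser:
    """Hypothetical parser: turns raw payloads into text documents."""

    def convert_to_documents(self, content: List[bytes]) -> List[str]:
        return [raw.decode("utf-8") for raw in content]


class SketchIndexer:
    def __init__(self, scraper, parser) -> None:
        self.scraper = scraper
        self.parser = parser
        self.store: List[str] = []  # stands in for the VectorDB

    def index(self, sitemap_url: str) -> Dict[str, object]:
        content = self.scraper.fetch(sitemap_url)               # scrape
        documents = self.parser.convert_to_documents(content)   # parse
        self.store.extend(documents)                            # index
        # The shape of this summary dict is an assumption.
        return {"sitemap_url": sitemap_url, "indexed": len(documents)}
```

Passing the scraper and parser in through the constructor makes each stage replaceable, which is convenient for testing the indexer without network access.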

async from_pages_to_content(pages: List[ByteStream]) → List[Any]

Abstract method to convert URLs to content.

Parameters: pages (List[ByteStream]) – The HTML pages to convert to content.
Returns: The content extracted from the URLs.
Return type: List[Any]
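
Because the method is both abstract and asynchronous, a concrete indexer has to override it with a coroutine. A minimal sketch of that pattern, assuming an `abc`-based base class and using plain `bytes` in place of ByteStream; the class names `BaseIndexer` and `DecodingIndexer` are hypothetical:

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, List


class BaseIndexer(ABC):
    @abstractmethod
    async def from_pages_to_content(self, pages: List[bytes]) -> List[Any]:
        """Convert fetched HTML pages to content."""


class DecodingIndexer(BaseIndexer):
    async def from_pages_to_content(self, pages: List[bytes]) -> List[Any]:
        # Yield control once, as a real implementation would while awaiting I/O.
        await asyncio.sleep(0)
        return [page.decode("utf-8") for page in pages]
```

A caller would run the coroutine with `asyncio.run(DecodingIndexer().from_pages_to_content(pages))`, while instantiating `BaseIndexer` directly raises `TypeError`.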