AHV/IV pipeline

Indexes the website https://www.ahv-iv.ch/ as a data source for retrieval-augmented generation (RAG).

class indexing.pipelines.ahv.AHVParser

A class used to parse and clean documents.

pdf_converter

An instance of PyPDFToDocument to convert PDF content to Document objects.

Type:

PyPDFToDocument

cleaner

An instance of DocumentCleaner to clean documents.

Type:

DocumentCleaner

splitter

An instance of DocumentSplitter to split documents into chunks.

Type:

DocumentSplitter

parse_html(html: bytes) → List[str]

Extracts URLs from the given HTML content.

convert_pdf_to_documents(content: List[ByteStream]) → List[Document]

Converts PDF content to Document objects.

clean_documents(documents: List[Document]) → List[Document]

Cleans the given documents.

split_documents(documents: List[Document]) → List[Document]

Splits the given documents into chunks.
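The cleaning and splitting steps above can be illustrated without any dependencies. The following is a minimal sketch, not the code of DocumentCleaner or DocumentSplitter: the function names `clean_text` and `split_words`, the word-based splitting strategy, and the `split_length` and `split_overlap` values are all assumptions chosen for illustration.

```python
import re
from typing import List


def clean_text(text: str) -> str:
    """Collapse runs of whitespace inside each line and drop empty lines."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def split_words(text: str, split_length: int = 5, split_overlap: int = 1) -> List[str]:
    """Split text into chunks of at most `split_length` words.

    Consecutive chunks share `split_overlap` words (hypothetical parameters).
    """
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and < split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current chunk already reaches the end of the text.
        if start + split_length >= len(words):
            break
    return chunks
```

Overlapping chunks are a common choice for RAG indexing because a sentence cut at a chunk boundary still appears complete in one of the two neighbouring chunks.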

contains_tag(tag)

Checks whether a tag contains a memento URL.

Parameters:

tag (bs4.element.Tag) – The tag to check.

Returns:

True if the tag contains a memento URL, False otherwise.

Return type:

bool

get_pdf_paths(soup)

Extracts the paths of PDF documents from a BeautifulSoup object.

Parameters:

soup (BeautifulSoup) – The BeautifulSoup object to extract PDF paths from.

Returns:

The list of PDF paths.

Return type:

list of str
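The interplay of a tag predicate and PDF path extraction can be sketched as follows. This is an illustration under stated assumptions, not the implementation of contains_tag or get_pdf_paths: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without dependencies, and the rule "the href contains `memento` and ends in `.pdf`" is a guess at what a memento URL looks like. The names `PdfLinkCollector` and `collect_pdf_paths` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class PdfLinkCollector(HTMLParser):
    """Collects href values of <a> tags that look like memento PDF links."""

    def __init__(self, marker: str = "memento") -> None:
        super().__init__()
        self.marker = marker  # assumed URL substring, not confirmed by the source
        self.paths: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # An attribute without a value is reported as None, hence the fallback.
        href = dict(attrs).get("href") or ""
        lowered = href.lower()
        if self.marker in lowered and lowered.endswith(".pdf"):
            self.paths.append(href)


def collect_pdf_paths(html: str) -> List[str]:
    """Return the matching PDF paths in document order."""
    collector = PdfLinkCollector()
    collector.feed(html)
    return collector.paths
```

With BeautifulSoup, the same predicate would typically be passed to `soup.find_all(...)`, which accepts a function that takes a tag and returns a bool.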

parse_urls(content: str) → List[str]

Extracts URLs from the given content.

Parameters:

content (str) – The content to extract URLs from.

Returns:

A list of URLs extracted from the content.

Return type:

List[str]
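Since the indexer starts from a sitemap URL, one plausible input for this method is sitemap XML. The sketch below assumes exactly that, which the documentation does not confirm: it reads the `<loc>` entries of a document in the standard sitemaps.org namespace. The function name `urls_from_sitemap` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(content: str) -> List[str]:
    """Return the text of every <loc> element, in document order."""
    root = ET.fromstring(content)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
```

A short usage example with made-up sample entries:

```python
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://www.ahv-iv.ch/de/</loc></url>"
    "<url><loc> https://www.ahv-iv.ch/fr/ </loc></url>"
    "</urlset>"
)
print(urls_from_sitemap(sample))
```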

convert_to_documents(content: List[Any]) → List[Any]

Abstract method to convert some content to documents.

Parameters:

content (list of Any) – The content to convert.

Returns:

The converted documents.

Return type:

list of Document

class indexing.pipelines.ahv.AHVIndexer(scraper, parser)

A class used to index PDF Merkblätter (information leaflet) documents into a VectorDB.

scraper

An instance of Scraper to scrape URLs and extract content from them.

Type:

Scraper

parser

An instance of Parser to parse and clean documents.

Type:

Parser

index(sitemap_url: str) → dict

Scrapes, parses, and indexes PDF content from the given sitemap URL into the VectorDB.

async from_pages_to_content(pages: List[ByteStream]) → List[Any]

Abstract method to convert scraped pages to content.

Parameters:

pages (List[ByteStream]) – The HTML pages to convert to content.

Returns:

The content extracted from the URLs.

Return type:

List[Any]
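How a subclass might fill in an abstract asynchronous method like this one can be sketched as follows. Everything here is hypothetical: `BaseIndexerSketch` and `DecodingIndexer` are stand-in names, plain `bytes` replaces ByteStream so the example has no dependencies, and decoding the page is only a placeholder for whatever extraction the real pipeline performs.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, List


class BaseIndexerSketch(ABC):
    """Stand-in base class declaring the abstract coroutine."""

    @abstractmethod
    async def from_pages_to_content(self, pages: List[bytes]) -> List[Any]:
        ...


class DecodingIndexer(BaseIndexerSketch):
    """Example subclass that turns each page into decoded text."""

    async def _convert_one(self, page: bytes) -> str:
        await asyncio.sleep(0)  # placeholder for real asynchronous I/O
        return page.decode("utf-8", errors="replace")

    async def from_pages_to_content(self, pages: List[bytes]) -> List[Any]:
        # gather() runs the conversions concurrently and keeps input order.
        results = await asyncio.gather(*(self._convert_one(p) for p in pages))
        return list(results)
```

A synchronous caller can drive the coroutine with `asyncio.run(DecodingIndexer().from_pages_to_content(pages))`. Instantiating the base class directly raises `TypeError`, because the abstract method has no implementation.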