EAK pipeline

Indexes the website https://www.eak.admin.ch/ to provide data for retrieval-augmented generation (RAG).

class indexing.pipelines.admin.AdminParser

A class used to parse, clean and split documents from a *.admin.ch website.

html_converter

An instance of HTMLToDocument to convert HTML content to Document objects.

cleaner

An instance of DocumentCleaner to clean documents.

splitter

An instance of DocumentSplitter to split documents into chunks.

parse_xml(sitemap: bytes) → List[str]

Extracts URLs from the given XML sitemap.

convert_html_to_documents(content: List[ByteStream]) → List[Document]

Converts HTML content to Document objects.

clean_documents(documents: List[Document]) → List[Document]

Cleans the given documents.

split_documents(documents: List[Document]) → List[Document]

Splits the given documents into chunks.
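
As an illustration of what parse_xml might do, here is a minimal sketch that pulls every <loc> entry out of a sitemap using only the standard library. It is not the AdminParser implementation, which this page does not show. The function name parse_xml_sketch and the second URL in the sample sitemap are hypothetical, and the sketch assumes the sitemap follows the sitemaps.org schema.

```python
import xml.etree.ElementTree as ET
from typing import List

# XML namespace defined by the sitemaps.org protocol (assumed to be the one in use).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_xml_sketch(sitemap: bytes) -> List[str]:
    """Hypothetical helper: return the text of every <loc> element."""
    root = ET.fromstring(sitemap)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


# Sample sitemap for illustration only; the second URL is made up.
example_sitemap = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.eak.admin.ch/</loc></url>
  <url><loc>https://www.eak.admin.ch/example-page</loc></url>
</urlset>"""

urls = parse_xml_sketch(example_sitemap)
```

Passing the sitemap as bytes, as the documented signature does, lets the XML parser honor the encoding declared in the document itself.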

parse_urls(content: str) → List[str]

Extracts URLs from the given content.
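
The page does not say what format the content string has. Assuming it is HTML, one way to extract URLs is to collect the href attribute of every anchor tag, as in the sketch below. The names _LinkCollector and parse_urls_sketch are hypothetical and the actual method may work differently.

```python
from html.parser import HTMLParser
from typing import List


class _LinkCollector(HTMLParser):
    """Hypothetical helper that records the href of each <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href (or with an empty one).
            if name == "href" and value:
                self.links.append(value)


def parse_urls_sketch(content: str) -> List[str]:
    """Assumption: content is HTML; return hrefs in document order."""
    collector = _LinkCollector()
    collector.feed(content)
    return collector.links


sample_html = '<p><a href="https://www.eak.admin.ch/">EAK</a> <a name="top">no link</a></p>'
links = parse_urls_sketch(sample_html)
```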

convert_to_documents(content: List[Any]) → List[Any]

Abstract method to convert some content to documents.

class indexing.pipelines.admin.AdminIndexer(scraper, parser)

A class used to index documents from *.admin.ch into a VectorDB.

scraper

An instance of Scraper to scrape URLs and extract content from them.

parser

An instance of Parser to parse and clean documents.

index(sitemap_url: str) → dict

Scrapes, parses and indexes HTML webpage content from the given sitemap URL into the VectorDB.
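
The scrape, parse, index order described for index can be sketched with plain callables standing in for the scraper, the parser and the VectorDB. Everything in this block is hypothetical: index_sketch, its parameters, the in-memory fake_web dictionary, and the shape of the returned dict are assumptions made for illustration, since this page does not document what the real method returns.

```python
from typing import Callable, Dict, List


def index_sketch(
    sitemap_url: str,
    fetch: Callable[[str], bytes],
    extract_urls: Callable[[bytes], List[str]],
    to_chunks: Callable[[bytes], List[str]],
    store: List[str],
) -> Dict[str, int]:
    """Hypothetical outline of a scrape -> parse -> index flow."""
    # 1. Scrape: download the sitemap, then every page it lists.
    urls = extract_urls(fetch(sitemap_url))
    pages = [fetch(url) for url in urls]
    # 2. Parse: turn each page into chunks.
    chunks = [chunk for page in pages for chunk in to_chunks(page)]
    # 3. Index: persist the chunks (a list stands in for the VectorDB).
    store.extend(chunks)
    return {"urls": len(urls), "chunks": len(chunks)}


# In-memory stand-ins so the sketch runs without network access.
fake_web = {
    "sitemap": b"page-a page-b",
    "page-a": b"first second",
    "page-b": b"third",
}
vector_store: List[str] = []
summary = index_sketch(
    "sitemap",
    fetch=lambda url: fake_web[url],
    extract_urls=lambda raw: raw.decode().split(),
    to_chunks=lambda raw: raw.decode().split(),
    store=vector_store,
)
```

Injecting the three stages as callables keeps the orchestration testable without a live website or database.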

async from_pages_to_content(pages: List[ByteStream]) → List[Any]

Abstract method to convert pages to content.

Parameters: pages (List[ByteStream]) – The HTML pages to convert to content.
Returns: The content extracted from the pages.
Return type: List[Any]
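
To show how an abstract async method like this is typically overridden, the sketch below defines a small base class and one concrete subclass. BaseIndexerSketch and TextIndexerSketch are hypothetical names, and plain bytes stand in for ByteStream so the example needs only the standard library. The real base class is not shown on this page.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, List


class BaseIndexerSketch(ABC):
    """Hypothetical base class with an abstract coroutine."""

    @abstractmethod
    async def from_pages_to_content(self, pages: List[bytes]) -> List[Any]:
        """Subclasses decide how raw pages become content."""


class TextIndexerSketch(BaseIndexerSketch):
    """Hypothetical subclass: decodes each page to a string."""

    async def from_pages_to_content(self, pages: List[bytes]) -> List[Any]:
        return [page.decode("utf-8") for page in pages]


# The coroutine must be awaited; asyncio.run drives it from synchronous code.
content = asyncio.run(
    TextIndexerSketch().from_pages_to_content([b"<p>Hello</p>"])
)
```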