EAK pipeline

Index the website https://www.eak.admin.ch/ for RAG data.

class indexing.pipelines.admin.AdminParser

A class used to parse, clean and split documents from a *.admin.ch website.

html_converter

An instance of HTMLToDocument to convert HTML content to Document objects.

Type:

HTMLToDocument

cleaner

An instance of DocumentCleaner to clean documents.

Type:

DocumentCleaner

splitter

An instance of DocumentSplitter to split documents into chunks.

Type:

DocumentSplitter

parse_xml(sitemap: bytes) -> List[str]

Extracts URLs from the given XML sitemap.
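
As an illustration of what extracting URLs from a sitemap involves, here is a minimal sketch using only the standard library. It is not the actual implementation of parse_xml: the function name parse_xml_sketch, the sample sitemap, and the page paths in it are made up for this example, and it assumes the sitemap follows the sitemaps.org 0.9 schema.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org 0.9 protocol (assumed to apply here).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_xml_sketch(sitemap: bytes) -> List[str]:
    """Return the text of every <loc> element found in the sitemap."""
    root = ET.fromstring(sitemap)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content; the page paths are placeholders.
example = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.eak.admin.ch/page-a</loc></url>
  <url><loc>https://www.eak.admin.ch/page-b</loc></url>
</urlset>"""

urls = parse_xml_sketch(example)
print(urls)
```

Searching with `.//sm:loc` also picks up the `<loc>` entries of a sitemap index file, which uses the same namespace.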

convert_html_to_documents(content: List[ByteStream]) -> List[Document]

Converts HTML content to Document objects.

clean_documents(documents: List[Document]) -> List[Document]

Cleans the given documents.

split_documents(documents: List[Document]) -> List[Document]

Splits the given documents into chunks.
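
The splitting strategy used by DocumentSplitter is not specified on this page. To make the idea of chunking concrete, the following sketch shows one common approach, a sliding window over words with overlap. It is a stand-in written with the standard library, not the splitter's own code; the function name split_words and the parameters split_length and split_overlap are assumptions chosen for this example.

```python
from typing import List


def split_words(text: str, split_length: int = 200, split_overlap: int = 20) -> List[str]:
    """Split text into chunks of at most split_length words.

    Consecutive chunks share split_overlap words so that a sentence cut
    at a chunk boundary still appears intact in one of the two chunks.
    """
    if split_length <= 0 or not 0 <= split_overlap < split_length:
        raise ValueError("require split_length > 0 and 0 <= split_overlap < split_length")

    words = text.split()
    step = split_length - split_overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window already reaches the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


chunks = split_words("a b c d e f g h", split_length=4, split_overlap=1)
print(chunks)
```

Smaller chunks tend to give more precise retrieval hits, while the overlap reduces the chance that relevant context is lost at a boundary.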

parse_urls(content: str) -> List[str]

Extracts URLs from the given content.

Parameters:

content (str) – The content to extract URLs from.

Returns:

A list of URLs extracted from the content.

Return type:

List[str]
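
How parse_urls locates URLs is not described here. One plausible approach, shown below purely as a sketch, is a regular expression over the raw text followed by de-duplication. The name parse_urls_sketch, the pattern, and the sample text are assumptions for illustration.

```python
import re
from typing import List

# Matches http(s) URLs up to the next whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def parse_urls_sketch(content: str) -> List[str]:
    """Return the unique URLs found in content, in order of first appearance."""
    seen = set()
    urls: List[str] = []
    for match in URL_PATTERN.findall(content):
        # Drop punctuation that commonly trails a URL in running text.
        url = match.rstrip(".,;)")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample = (
    'See <a href="https://www.eak.admin.ch/a">A</a> and '
    "https://www.eak.admin.ch/b. Again: https://www.eak.admin.ch/a"
)
found = parse_urls_sketch(sample)
print(found)
```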

convert_to_documents(content: List[Any]) -> List[Any]

Abstract method to convert some content to documents.

Parameters:

content (list of Any) – The content to convert.

Returns:

The converted documents.

Return type:

list of Document

class indexing.pipelines.admin.AdminIndexer(scraper, parser)

A class used to index documents from *.admin.ch into a VectorDB.

scraper

An instance of Scraper to scrape URLs and extract content from them.

Type:

Scraper

parser

An instance of Parser to parse and clean documents.

Type:

Parser

index(sitemap_url: str) -> dict

Scrapes, parses and indexes HTML webpage content from the given sitemap URL into the VectorDB.
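
The order of operations implied by the description (fetch the sitemap, extract URLs, fetch pages, convert, clean, split, store) can be sketched as a plain function. This is a hedged illustration, not the code of AdminIndexer.index: every callable passed in is a hypothetical stand-in, the list named store replaces the VectorDB, and the keys of the returned dict are invented because the real return value is not documented here.

```python
from typing import Callable, Dict, List


def index_sketch(
    sitemap_url: str,
    fetch: Callable[[str], bytes],                 # hypothetical downloader
    parse_sitemap: Callable[[bytes], List[str]],   # stand-in for parse_xml
    convert: Callable[[List[bytes]], List[str]],   # stand-in for convert_html_to_documents
    clean: Callable[[List[str]], List[str]],       # stand-in for clean_documents
    split: Callable[[List[str]], List[str]],       # stand-in for split_documents
    store: List[str],                              # stand-in for the VectorDB
) -> Dict[str, int]:
    """Run scrape -> parse -> clean -> split -> store and report counts."""
    urls = parse_sitemap(fetch(sitemap_url))
    pages = [fetch(url) for url in urls]
    chunks = split(clean(convert(pages)))
    store.extend(chunks)
    return {"urls": len(urls), "chunks": len(chunks)}


# In-memory fake of the web so the sketch runs without network access.
FAKE_WEB = {
    "https://example.org/sitemap.xml": b"https://example.org/a\nhttps://example.org/b",
    "https://example.org/a": b"<p>Alpha  page</p>",
    "https://example.org/b": b"<p>Beta page</p>",
}

store: List[str] = []
summary = index_sketch(
    "https://example.org/sitemap.xml",
    fetch=FAKE_WEB.__getitem__,
    parse_sitemap=lambda raw: raw.decode().split("\n"),
    convert=lambda pages: [p.decode() for p in pages],
    clean=lambda docs: [
        " ".join(d.replace("<p>", "").replace("</p>", "").split()) for d in docs
    ],
    split=lambda docs: [word for d in docs for word in d.split()],
    store=store,
)
print(summary, store)
```

Passing each stage in as a callable keeps the sketch testable without network access or a running vector database.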

async from_pages_to_content(pages: List[ByteStream]) -> List[Any]

Abstract method to convert scraped HTML pages to content.

Parameters:

pages (List[ByteStream]) – The HTML pages to convert to content.

Returns:

The content extracted from the pages.

Return type:

List[Any]