EAK pipeline
Index the website https://www.eak.admin.ch/ for RAG data.
- class indexing.pipelines.admin.AdminParser
A class used to parse, clean and split documents from a *.admin.ch website.
- html_converter
An instance of HTMLToDocument to convert HTML content to Document objects.
- Type:
HTMLToDocument
- cleaner
An instance of DocumentCleaner to clean documents.
- Type:
DocumentCleaner
- splitter
An instance of DocumentSplitter to split documents into chunks.
- Type:
DocumentSplitter
- parse_xml(sitemap: bytes) → List[str]
Extracts URLs from the given XML sitemap.
- convert_html_to_documents(content: List[ByteStream]) → List[Document]
Converts HTML content to Document objects.
- split_documents(documents: List[Document]) → List[Document]
Splits the given documents into chunks.
- parse_urls(content: str) → List[str]
Extracts URLs from the given content.
- Parameters:
content (str) – The content to extract URLs from.
- Returns:
A list of URLs extracted from the content.
- Return type:
List[str]
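As an illustration of the sitemap step described for parse_xml, here is a minimal sketch that pulls every <loc> entry out of a sitemap using only the standard library. The function name extract_sitemap_urls and the sample URLs are hypothetical, and the sketch assumes the sitemap follows the sitemaps.org namespace; the actual AdminParser implementation may differ.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol (assumed to be used by the site).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap: bytes) -> List[str]:
    """Hypothetical stand-in for parse_xml: return every <loc> URL in a sitemap."""
    root = ET.fromstring(sitemap)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]


# Made-up sample sitemap, for illustration only.
EXAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.eak.admin.ch/page-a</loc></url>
  <url><loc> https://www.eak.admin.ch/page-b </loc></url>
  <url><loc/></url>
</urlset>
"""

urls = extract_sitemap_urls(EXAMPLE_SITEMAP)
```

Taking bytes rather than str mirrors the documented parse_xml signature and lets the XML parser honour the encoding declared in the sitemap itself.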
- class indexing.pipelines.admin.AdminIndexer(scraper, parser)
A class used to index documents from *.admin.ch into a VectorDB.
- parser
An instance of Parser to parse and clean documents.
- Type:
Parser
- index(sitemap_url: str) → dict
Scrapes, parses and indexes HTML webpage content from the given sitemap URL into the VectorDB.
- async from_pages_to_content(pages: List[ByteStream]) → List[Any]
Abstract method to convert HTML pages to content.
- Parameters:
pages (List[ByteStream]) – The HTML pages to convert to content.
- Returns:
The content extracted from the pages.
- Return type:
List[Any]
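Because from_pages_to_content is documented as an abstract coroutine, concrete indexers are expected to override it. The sketch below shows that pattern in isolation. All class names here (PageStub, BaseIndexerSketch, TextIndexerSketch) are hypothetical: PageStub only stands in for ByteStream so the example runs without extra dependencies, and the decoding logic is an assumption rather than the project's actual behaviour.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, List


@dataclass
class PageStub:
    """Hypothetical stand-in for ByteStream: holds the raw bytes of one page."""
    data: bytes


class BaseIndexerSketch(ABC):
    """Hypothetical base class that declares the abstract coroutine."""

    @abstractmethod
    async def from_pages_to_content(self, pages: List[PageStub]) -> List[Any]:
        """Convert fetched HTML pages to content; subclasses must override."""


class TextIndexerSketch(BaseIndexerSketch):
    """Hypothetical subclass: decodes each page to a UTF-8 string."""

    async def from_pages_to_content(self, pages: List[PageStub]) -> List[Any]:
        return [page.data.decode("utf-8") for page in pages]


pages = [PageStub(b"<p>Hello</p>"), PageStub(b"<p>Bonjour</p>")]
content = asyncio.run(TextIndexerSketch().from_pages_to_content(pages))
```

Declaring the method with @abstractmethod means the base class cannot be instantiated directly, so a subclass that forgets to implement the conversion fails at construction time rather than in the middle of an indexing run.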