BSV pipeline
Index the website https://www.bsv.admin.ch/ for Autocomplete data.
- class indexing.pipelines.bsv.BSVIndexer(base_url: str, proxy: str = None)
Indexer that can extract FAQ questions from a specified website.
- Parameters:
base_url (str) – Sitemap URL of the website to scrape
proxy (str, optional) – Proxy URL if necessary
- async run(k: int = 0, test: bool = False, embed: Tuple[bool, bool] | bool = False, db: Session = None)
Retrieves and processes FAQ data from base_url to insert into the database.
Each extracted FAQ entry is then upserted (inserted or updated if already exists) into the database, with detailed logging to track the operation’s progress and identify any errors.
If test is True, only k articles are extracted, and they are logged instead of being upserted.
A confirmation message is logged upon successful completion of the process.
Todo
Consider implementing error handling at a more granular level to retry failed insertions or updates, enhancing the robustness of the data ingestion process.
Explore optimization opportunities in text extraction and processing to improve efficiency and reduce runtime, especially for large sitemaps.
- Parameters:
k (int, default 0) – Number of articles to scrape and log to test the method
test (bool, default False) – Flag to indicate whether to test the method by logging the extracted articles instead of upserting them
embed (Union[Tuple[bool, bool], bool], default False) – Flag to indicate whether to embed the source and/or answer documents in the question object
db (Session, optional) – Database session to use for upserting the extracted FAQ entries
- Returns:
List of the URLs that were extracted
- Return type:
list of str
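The interplay of k, test and embed described above can be sketched as follows. This is a minimal illustration, not the BSVIndexer implementation: run_sketch, normalize_embed and the in-memory store dictionary are hypothetical stand-ins, and the meaning of each position in the embed tuple is an assumption.

```python
import asyncio
import logging
from typing import Dict, List, Tuple, Union

logger = logging.getLogger("bsv_sketch")


def normalize_embed(embed: Union[Tuple[bool, bool], bool]) -> Tuple[bool, bool]:
    """Expand a single bool into a pair of flags (hypothetical helper)."""
    if isinstance(embed, bool):
        return (embed, embed)
    return (bool(embed[0]), bool(embed[1]))


async def run_sketch(
    urls: List[str],
    store: Dict[str, str],
    k: int = 0,
    test: bool = False,
) -> List[str]:
    """Hypothetical stand-in for run(): log k items in test mode, else upsert all."""
    selected = urls[:k] if test else urls
    for url in selected:
        if test:
            # Test mode: report what was extracted, leave the store untouched.
            logger.info("extracted (not stored): %s", url)
        else:
            # Dict assignment inserts a new key or overwrites an existing one,
            # which mimics upsert semantics.
            store[url] = "article text placeholder"
    return selected
```

Calling `asyncio.run(run_sketch(urls, store, k=2, test=True))` would return the first two URLs and leave store empty, while the default call processes every URL.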
- get_sitemap_urls() → List[str]
Extract URLs from the sitemap. The endpoint ‘/sitemap.xml’ is used to discover all relevant FAQ URLs.
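As an illustration of the sitemap step, the sketch below reads the `<loc>` entries of a sitemap document with the standard library. It assumes the sitemap follows the sitemaps.org 0.9 schema; parse_sitemap_urls and the sample document are hypothetical, and any FAQ-specific filtering the real method applies is not shown.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample; a real sitemap would be fetched from '/sitemap.xml'.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.org/en/faq-1.html</loc></url>
  <url><loc>https://www.example.org/fr/faq-2.html</loc></url>
</urlset>"""


def parse_sitemap_urls(sitemap_xml: str) -> List[str]:
    """Return the text of every <loc> element in document order."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
```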
- extract_article(url: str)
Given a URL, extracts the primary question (denoted by the ‘h1’ tag) and its corresponding answer (within an ‘article’ tag).
Unnecessary boilerplate text is removed for clarity and conciseness.
- Parameters:
url (str) – URL of the page from which the article is extracted.
- Returns:
The article language, its question and its answer.
- Return type:
str, str, str
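The extraction described for extract_article can be sketched with the standard library's HTML parser. This is a minimal illustration that works on an HTML string rather than a URL; ArticleParser and extract_article_sketch are hypothetical names, reading the language from the `lang` attribute of the `<html>` tag is an assumption, and the only clean-up shown is whitespace collapsing.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class ArticleParser(HTMLParser):
    """Collects the page language, the <h1> text and the <article> text."""

    def __init__(self) -> None:
        super().__init__()
        self.language = ""
        self._in_h1 = False
        self._in_article = False
        self._question: List[str] = []
        self._answer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            # Assumption: the language is declared as <html lang="...">.
            self.language = dict(attrs).get("lang") or ""
        elif tag == "h1":
            self._in_h1 = True
        elif tag == "article":
            self._in_article = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False
        elif tag == "article":
            self._in_article = False

    def handle_data(self, data):
        # Text inside <h1> counts as the question, even within <article>.
        if self._in_h1:
            self._question.append(data)
        elif self._in_article:
            self._answer.append(data)

    def result(self) -> Tuple[str, str, str]:
        # Collapse runs of whitespace left over from the markup layout.
        question = " ".join(" ".join(self._question).split())
        answer = " ".join(" ".join(self._answer).split())
        return self.language, question, answer


def extract_article_sketch(html: str) -> Tuple[str, str, str]:
    """Return (language, question, answer) for one HTML document."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.result()
```

The return order mirrors the documented one: language first, then question, then answer.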