BSV pipeline

Index the website https://www.bsv.admin.ch/ for Autocomplete data.

class indexing.pipelines.bsv.BSVIndexer(base_url: str, proxy: str = None)

Indexer that can extract FAQ questions from a specified website.

Parameters:
  • base_url (str) – sitemap URL of the website to scrape

  • proxy (str, optional) – Proxy URL if necessary

async run(k: int = 0, test: bool = False, embed: Tuple[bool, bool] | bool = False, db: Session = None)

Retrieves and processes FAQ data from base_url to insert into the database.

Each extracted FAQ entry is then upserted (inserted or updated if already exists) into the database, with detailed logging to track the operation’s progress and identify any errors.

If test is True, only k articles are extracted, and they are logged instead of being upserted.

Log a confirmation message upon successful completion of the process.
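The upsert step described above (insert, or update if the entry already exists) can be illustrated with a minimal sketch. It uses the standard-library sqlite3 module rather than the Session object the indexer receives, and the faq table, its columns, and the upsert_faq helper are hypothetical names chosen for illustration, not part of the pipeline.

```python
import sqlite3


def upsert_faq(conn, url, language, question, answer):
    """Insert a FAQ row, or update it when the URL is already present.

    Hypothetical helper: the table layout is an assumption, and the real
    pipeline works through a database Session instead of sqlite3.
    """
    conn.execute(
        """
        INSERT INTO faq (url, language, question, answer)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            language = excluded.language,
            question = excluded.question,
            answer = excluded.answer
        """,
        (url, language, question, answer),
    )
    conn.commit()


# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE faq ("
    "url TEXT PRIMARY KEY, language TEXT, question TEXT, answer TEXT)"
)

# The second call hits the same primary key and updates the row in place.
upsert_faq(conn, "https://example.org/faq/1", "en", "What is X?", "First answer.")
upsert_faq(conn, "https://example.org/faq/1", "en", "What is X?", "Updated answer.")

rows = conn.execute("SELECT url, answer FROM faq").fetchall()
```

Keying the upsert on the URL is a design assumption here; it keeps one row per page, so re-running the indexer refreshes existing entries rather than duplicating them.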

Todo

  • Consider implementing error handling at a more granular level to retry failed insertions or updates, enhancing the robustness of the data ingestion process.

  • Explore optimization opportunities in text extraction and processing to improve efficiency and reduce runtime, especially for large sitemaps.

Parameters:
  • k (int, default 0) – Number of articles to scrape and log to test the method

  • test (bool, default False) – Flag to indicate whether to test the method by logging the extracted articles instead of upserting them

  • embed (Union[Tuple[bool, bool], bool], default False) – Flag to indicate whether to embed the source and/or answer documents in the question object

  • db (Session, optional) – Database session to use for upserting the extracted FAQ entries

Returns:

List of URLs that were extracted

Return type:

list of str
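Because embed accepts either a single bool or a pair of bools, a caller-side normalization might look like the sketch below. The normalize_embed helper is hypothetical, and two points are assumptions not confirmed by the documentation above: that the tuple order is (source, answer), and that a single bool applies to both.

```python
from typing import Tuple, Union


def normalize_embed(embed: Union[Tuple[bool, bool], bool]) -> Tuple[bool, bool]:
    """Return an (embed_source, embed_answer) pair.

    Assumption: a plain bool applies to both documents, and a tuple is
    ordered (source, answer).
    """
    if isinstance(embed, bool):
        return embed, embed
    embed_source, embed_answer = embed
    return bool(embed_source), bool(embed_answer)


both = normalize_embed(True)
source_only = normalize_embed((True, False))
```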

get_sitemap_urls() → List[str]

Extract URLs from the sitemap. The endpoint ‘/sitemap.xml’ is used to discover all relevant FAQ URLs.
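As a rough illustration of sitemap URL discovery, the sketch below parses a sitemap document with the standard library. It is not the indexer's actual implementation: parse_sitemap is a hypothetical helper, the sample document is made up, and fetching '/sitemap.xml' over HTTP (optionally through the proxy) is left out.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol for <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> List[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Made-up sitemap content, used instead of a live HTTP request.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/faq/1</loc></url>
  <url><loc> https://example.org/faq/2 </loc></url>
</urlset>"""

urls = parse_sitemap(sample)
```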

extract_article(url: str)

Given a URL, extracts the primary question (denoted by the ‘h1’ tag) and its corresponding answer (within an ‘article’ tag).

Unnecessary boilerplate text will be removed for clarity and conciseness.

Parameters:

url (str) – URL of the page from which the article is extracted.

Returns:

The article language, its question and its answer.

Return type:

str, str, str
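The extraction described above can be sketched with the standard-library HTML parser. The names ArticleParser and extract_from_html are hypothetical, the sketch works on an HTML string instead of fetching the URL, and it skips boilerplate removal. Reading the language from the lang attribute of the html tag is an assumption; the documentation does not say how the language is determined.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class ArticleParser(HTMLParser):
    """Collect the <h1> text as the question and <article> text as the answer."""

    def __init__(self) -> None:
        super().__init__()
        self.language = ""
        self.question_parts: List[str] = []
        self.answer_parts: List[str] = []
        self._in_h1 = False
        self._article_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            # Assumption: the page language is declared on the <html> tag.
            self.language = dict(attrs).get("lang") or ""
        elif tag == "h1":
            self._in_h1 = True
        elif tag == "article":
            self._article_depth += 1

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False
        elif tag == "article" and self._article_depth > 0:
            self._article_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_h1:
            self.question_parts.append(text)
        elif self._article_depth > 0:
            self.answer_parts.append(text)


def extract_from_html(html: str) -> Tuple[str, str, str]:
    """Return (language, question, answer) for one HTML page."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    question = " ".join(parser.question_parts)
    answer = " ".join(parser.answer_parts)
    return parser.language, question, answer


# Made-up page content standing in for a downloaded FAQ page.
sample = (
    '<html lang="de"><body><h1>Was ist die AHV?</h1>'
    "<article><p>Die AHV ist</p><p>eine Versicherung.</p></article>"
    "</body></html>"
)
language, question, answer = extract_from_html(sample)
```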