Scraper¶

class indexing.scraper.Scraper(base_url: str, proxy: str = '')¶

Scraper class that extracts FAQ questions and their answers from a specified website.

Parameters:
  • base_url (str) – Base URL of the website to scrape

  • proxy (str, optional) – Proxy URL if necessary
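
A minimal instantiation sketch; the URL below is a placeholder, not a real site:

   from indexing.scraper import Scraper

   # Placeholder base URL; pass a proxy URL as the second argument if required.
   scraper = Scraper(base_url="https://faq.example.com")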

extract_article(url: str)¶

Given a URL, extracts the primary question (denoted by the ‘h1’ tag) and its corresponding answer (within an ‘article’ tag).

Unnecessary boilerplate text is removed for clarity and conciseness.

Parameters:

url (str) – URL of the page from which the article is extracted.

Returns:

The article’s language, question, and answer.

Return type:

str, str, str
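
A hedged usage sketch reusing the scraper instance from the example above; the article URL is a placeholder, and the unpacking order assumes the return order listed above:

   # Placeholder URL; assumes the return unpacks as (language, question, answer).
   lang, question, answer = scraper.extract_article(
       "https://faq.example.com/articles/reset-password"
   )
   print(f"[{lang}] {question}\n{answer}")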

get_sitemap_urls() → List[str]¶

Extracts URLs from the sitemap. The endpoint ‘/sitemap.xml’ is used to discover all relevant FAQ URLs.
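
For reference, a standalone sketch of the general sitemap-discovery technique, not this class’s actual implementation; the helper name and the use of requests are assumptions:

   import requests
   from typing import List
   from xml.etree import ElementTree

   def discover_sitemap_urls(base_url: str) -> List[str]:
       # Hypothetical helper: fetch <base_url>/sitemap.xml and collect every <loc> entry.
       response = requests.get(f"{base_url.rstrip('/')}/sitemap.xml", timeout=10)
       response.raise_for_status()
       root = ElementTree.fromstring(response.content)
       # Sitemap entries live in the standard sitemaps.org namespace.
       ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
       return [loc.text for loc in root.findall(".//sm:loc", ns) if loc.text]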

async run(test: int = 0)¶

Retrieves and processes FAQ data from base_url to insert into the database.

Each extracted FAQ entry is then upserted (inserted, or updated if it already exists) into the database, with detailed logging to track the operation’s progress and identify any errors.

If test > 0, only the specified number of articles is extracted and logged rather than upserted.

Logs a confirmation message upon successful completion of the process.

Todo

  • Consider implementing error handling at a more granular level to retry failed insertions or updates, enhancing the robustness of the data ingestion process.

  • Explore optimization opportunities in text extraction and processing to improve efficiency and reduce runtime, especially for large sitemaps.

Parameters:

test (int, default 0) – Number of articles to extract as a test

Returns:

List of URLs that were extracted.

Return type:

list of str
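
Because run is a coroutine, it must be driven by an event loop. A hedged usage sketch; the URL is a placeholder:

   import asyncio

   from indexing.scraper import Scraper

   scraper = Scraper(base_url="https://faq.example.com")  # placeholder URL

   # Dry run: extract and log 5 articles without writing to the database.
   asyncio.run(scraper.run(test=5))

   # Full run: upsert all extracted FAQ entries; returns the processed URLs.
   urls = asyncio.run(scraper.run())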