Scraper

class indexing.base.BaseScraper

Abstract base class for scraping models.

scrap_urls(urls: List[str]) → List[Any]: Abstract method to scrape HTML content from a list of URLs.

html_content_from_sitemap(sitemap_url: str) → List[Any]: Abstract method to scrape HTML content from the pages listed in a sitemap.xml URL.

pdf_content_from_sitemap(sitemap_url: str) → List[Any]: Abstract method to scrape PDF content from the documents listed in a sitemap URL.

async fetch(url: str) → str

Fetches the content from a given URL.

abstract scrap_urls(urls: List[str]) → List[Any]

Abstract method to scrape HTML content from a list of URLs.
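
The interface above can be illustrated with a minimal sketch of a concrete subclass. The real body of BaseScraper is not shown on this page, so the code below defines a stand-in class that only mirrors the documented signatures. InMemoryScraper is a hypothetical subclass that serves canned pages instead of making HTTP requests, and driving the async fetch from the synchronous scrap_urls with asyncio.run is an assumption, not necessarily how the library does it.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseScraper(ABC):
    """Stand-in that mirrors the documented signatures (not the real class)."""

    async def fetch(self, url: str) -> str:
        # The real method fetches over the network; its body is not documented here.
        raise NotImplementedError

    @abstractmethod
    def scrap_urls(self, urls: List[str]) -> List[Any]:
        raise NotImplementedError

    @abstractmethod
    def html_content_from_sitemap(self, sitemap_url: str) -> List[Any]:
        raise NotImplementedError

    @abstractmethod
    def pdf_content_from_sitemap(self, sitemap_url: str) -> List[Any]:
        raise NotImplementedError


class InMemoryScraper(BaseScraper):
    """Hypothetical subclass that returns canned pages, so no network is needed."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self.pages = pages

    async def fetch(self, url: str) -> str:
        # Look the page up locally instead of issuing an HTTP request.
        return self.pages[url]

    def scrap_urls(self, urls: List[str]) -> List[Any]:
        async def fetch_all() -> List[str]:
            # gather() runs the fetches concurrently and preserves input order.
            return list(await asyncio.gather(*(self.fetch(u) for u in urls)))

        return asyncio.run(fetch_all())

    def html_content_from_sitemap(self, sitemap_url: str) -> List[Any]:
        # Sitemap parsing is out of scope for this sketch.
        return []

    def pdf_content_from_sitemap(self, sitemap_url: str) -> List[Any]:
        # Sitemap parsing is out of scope for this sketch.
        return []


scraper = InMemoryScraper({"https://www.admin.ch/a": "<p>a</p>"})
print(scraper.scrap_urls(["https://www.admin.ch/a"]))
```

Because all three methods are declared abstract, instantiating the base class directly raises TypeError; a subclass becomes usable only once it implements each of them.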

class indexing.scraper.Scraper

A class used to scrape URLs from *.admin.ch websites.

fetcher

An instance of LinkContentFetcher to fetch the content of URLs.

scrap_urls(urls: List[str]) → List[ByteStream]: Scrapes the given URLs and returns the content as a list of ByteStreams.

scrap_urls(urls: List[str]) → List[ByteStream]

Scrapes the given URLs and returns the content as a list of ByteStreams.
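
As a rough sketch of how the returned list might be consumed, the example below decodes each stream to text. This page does not describe ByteStream, so the stand-in dataclass here assumes two fields, data (raw bytes) and meta (a dict that may carry the source URL under a "url" key). Both fields and the helper streams_to_text are hypothetical and used for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ByteStream:
    """Stand-in with assumed `data` and `meta` fields (not the real class)."""

    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)


def streams_to_text(streams: List[ByteStream], encoding: str = "utf-8") -> Dict[str, str]:
    """Hypothetical helper: map each stream's source URL to its decoded text."""
    decoded: Dict[str, str] = {}
    for index, stream in enumerate(streams):
        # Fall back to a positional key when no URL is recorded in the metadata.
        key = stream.meta.get("url", f"stream-{index}")
        # Replace undecodable bytes rather than failing on a single bad page.
        decoded[key] = stream.data.decode(encoding, errors="replace")
    return decoded


# In real use, this list would be the return value of Scraper.scrap_urls(...).
streams = [ByteStream(b"<html>Hello</html>", {"url": "https://www.admin.ch/"})]
print(streams_to_text(streams))
```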