Scraper

class indexing.base.BaseScraper

Abstract base class for scraping models.

fetch(url: str) -> bytes

Fetches the content from a given URL.

scrap_urls(urls: List[str]) -> List[Any]

Abstract method to scrape HTML content from a list of URLs.

html_content_from_sitemap(sitemap_url: str) -> List[Any]

Abstract method to scrape HTML content from a sitemap.xml URL.

pdf_content_from_sitemap(sitemap_url: str) -> List[Any]

Abstract method to scrape PDF content from a specific sitemap URL.

async fetch(url: str) -> str

Fetches the content from a given URL.

Parameters:

url (str) – The URL to fetch content from.

Returns:

The content of the URL.

Return type:

bytes

Raises:

aiohttp.ClientError – If the fetch operation fails.

abstract scrap_urls(urls: List[str]) -> List[Any]

Abstract method to scrape HTML content from a list of URLs.

Parameters:

urls (list of str) – The URLs to scrape content from.

Returns:

The scraped content.

Return type:

list of Any
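As a rough illustration of how a concrete scraper might build on this interface, the sketch below defines a local stand-in for the base class rather than importing the real `indexing.base.BaseScraper`. The names `BaseScraperSketch` and `InMemoryScraper` are hypothetical, the sitemap methods are omitted for brevity, and `fetch` is replaced by an offline lookup that returns bytes (following the documented return type) so the example runs without network access.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseScraperSketch(ABC):
    """Hypothetical stand-in mirroring the documented interface.

    This is NOT the real indexing.base.BaseScraper; the sitemap
    methods are left out to keep the sketch short.
    """

    async def fetch(self, url: str) -> bytes:
        # The documented method fetches over HTTP; this stand-in
        # expects subclasses to supply their own implementation.
        raise NotImplementedError

    @abstractmethod
    def scrap_urls(self, urls: List[str]) -> List[Any]:
        """Scrape content from a list of URLs."""


class InMemoryScraper(BaseScraperSketch):
    """Hypothetical subclass that serves pages from a dict."""

    def __init__(self, pages: Dict[str, bytes]) -> None:
        self.pages = pages

    async def fetch(self, url: str) -> bytes:
        # Offline replacement for a real HTTP request.
        return self.pages[url]

    def scrap_urls(self, urls: List[str]) -> List[Any]:
        async def gather_all() -> List[bytes]:
            # Run all fetches concurrently; results keep input order.
            return list(await asyncio.gather(*(self.fetch(u) for u in urls)))

        return asyncio.run(gather_all())
```

Because `scrap_urls` is abstract, the stand-in base class cannot be instantiated directly; only a subclass that implements it can. Calling `asyncio.run` inside `scrap_urls` assumes no event loop is already running in the calling thread.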

class indexing.scraper.Scraper

A class used to scrape URLs from *.admin.ch websites.

fetcher

An instance of LinkContentFetcher to fetch the content of URLs.

Type:

LinkContentFetcher

scrap_urls(urls: List[str]) -> List[ByteStream]

Scrapes the given URLs and returns the content as a list of ByteStreams.

scrap_urls(urls: List[str]) -> List[ByteStream]

Scrapes the given URLs and returns the content as a list of ByteStreams.

Parameters:

urls (List[str]) – A list of URLs to scrape.

Returns:

A list of ByteStreams containing the content of the scraped URLs.

Return type:

List[ByteStream]
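The delegation from `scrap_urls` to the `fetcher` attribute could look roughly like the sketch below. Everything here is a stand-in: `ByteStreamStandIn`, `FakeFetcher`, and `ScraperSketch` are hypothetical names, and the sketch assumes the fetcher exposes a `run(urls=...)` method returning a dict with a `"streams"` key. The actual `Scraper` implementation may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ByteStreamStandIn:
    """Hypothetical stand-in for ByteStream (fields are assumptions)."""

    data: bytes
    meta: Dict[str, str] = field(default_factory=dict)


class FakeFetcher:
    """Offline stand-in for LinkContentFetcher.

    Assumes a run(urls=...) method that returns {"streams": [...]}.
    """

    def run(self, urls: List[str]) -> Dict[str, List[ByteStreamStandIn]]:
        # Fabricate one stream per URL instead of doing real HTTP.
        streams = [
            ByteStreamStandIn(data=f"<html>{u}</html>".encode(), meta={"url": u})
            for u in urls
        ]
        return {"streams": streams}


class ScraperSketch:
    """Hypothetical sketch of a scraper that delegates to its fetcher."""

    def __init__(self, fetcher: FakeFetcher) -> None:
        self.fetcher = fetcher

    def scrap_urls(self, urls: List[str]) -> List[ByteStreamStandIn]:
        # Hand the URL list to the fetcher and unwrap the streams.
        return self.fetcher.run(urls=urls)["streams"]
```

A caller would then iterate over the returned list and read each stream's bytes, for example `ScraperSketch(FakeFetcher()).scrap_urls(["https://www.admin.ch/"])[0].data`.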