Scraper
- class indexing.base.BaseScraper
Abstract base class for scraping models.
- fetch(url: str) → bytes
Fetches the content from a given URL.
- scrap_urls(urls: List[str]) → List[Any]
Abstract method to scrape HTML content from a list of URLs.
- html_content_from_sitemap(sitemap_url: str) → List[Any]
Abstract method to scrape HTML content from the pages listed in a sitemap.xml URL.
- pdf_content_from_sitemap(sitemap_url: str) → List[Any]
Abstract method to scrape PDF content from the documents listed in a sitemap.xml URL.
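The two sitemap methods above both start from a sitemap.xml document. As a rough illustration only, the sketch below shows one way the page URLs could be pulled out of a sitemap and split into HTML and PDF targets. The helper names `extract_locs` and `split_by_type`, the `example.admin.ch` host, and the `.pdf` suffix heuristic are assumptions made for this example, not part of the documented API.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Namespace defined by the sitemaps.org protocol for <urlset>/<url>/<loc>.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(sitemap_xml: bytes) -> List[str]:
    """Return every <loc> value in document order (hypothetical helper)."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def split_by_type(urls: List[str]) -> Tuple[List[str], List[str]]:
    """Split URLs into (html, pdf) using a simple suffix check (assumed heuristic)."""
    pdf_urls = [u for u in urls if u.lower().endswith(".pdf")]
    html_urls = [u for u in urls if not u.lower().endswith(".pdf")]
    return html_urls, pdf_urls


# Inline sample so the snippet runs without network access.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.admin.ch/en/page.html</loc></url>
  <url><loc>https://www.example.admin.ch/docs/form.pdf</loc></url>
</urlset>"""

html_urls, pdf_urls = split_by_type(extract_locs(SAMPLE))
```

A concrete scraper would presumably download the sitemap first and then pass the resulting URL lists on to its own scraping routine.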
- async fetch(url: str) → bytes
Fetches the content from a given URL.
- Parameters:
url (str) – The URL to fetch content from.
- Returns:
The content of the URL.
- Return type:
bytes
- Raises:
aiohttp.ClientError – If the fetch operation fails.
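Because fetch is a coroutine, it has to be awaited, and several calls can run concurrently. The sketch below shows that calling pattern with `asyncio.gather`. `StubScraper` is a hypothetical stand-in that returns canned bytes instead of performing HTTP requests, and `fetch_all` is an illustrative helper. Neither is part of the documented module.

```python
import asyncio
from typing import List


class StubScraper:
    """Hypothetical stand-in with the same async fetch(url) shape as documented."""

    async def fetch(self, url: str) -> bytes:
        # Yield control once to mimic an I/O wait, then return canned content.
        await asyncio.sleep(0)
        return f"<html>{url}</html>".encode("utf-8")


async def fetch_all(scraper: StubScraper, urls: List[str]) -> List[bytes]:
    """Await all fetches concurrently; results keep the order of `urls`."""
    return await asyncio.gather(*(scraper.fetch(url) for url in urls))


pages = asyncio.run(fetch_all(StubScraper(), ["a", "b"]))
```

With a real scraper, the caller would likely wrap the awaited call in a `try`/`except` for the documented `aiohttp.ClientError`.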
- abstract scrap_urls(urls: List[str]) → List[Any]
Abstract method to scrape HTML content from a list of URLs.
- Parameters:
urls (list of str) – The URLs to scrape content from.
- Returns:
The scraped content.
- Return type:
list of Any
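Since scrap_urls is abstract, a subclass must override it before it can be instantiated. The sketch below illustrates that contract with Python's `abc` module. `BaseScraperSketch` is a stand-in that mirrors only the documented `scrap_urls` signature, and `EchoScraper` is a hypothetical subclass. The real base class may declare further abstract methods.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class BaseScraperSketch(ABC):
    """Stand-in mirroring only the documented abstract scrap_urls method."""

    @abstractmethod
    def scrap_urls(self, urls: List[str]) -> List[Any]:
        ...


class EchoScraper(BaseScraperSketch):
    """Hypothetical subclass; returns one placeholder record per URL."""

    def scrap_urls(self, urls: List[str]) -> List[Any]:
        return [{"url": url, "content": None} for url in urls]


# Instantiating the abstract base raises TypeError until scrap_urls is overridden.
try:
    BaseScraperSketch()
    base_instantiable = True
except TypeError:
    base_instantiable = False

results = EchoScraper().scrap_urls(["https://www.admin.ch/"])
```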
- class indexing.scraper.Scraper
A class used to scrape content from URLs on *.admin.ch websites.
- fetcher
An instance of LinkContentFetcher to fetch the content of URLs.
- Type:
LinkContentFetcher
- scrap_urls(urls: List[str]) → List[ByteStream]
Scrapes the given URLs and returns the content as a list of ByteStreams.
- scrap_urls(urls: List[str]) → List[ByteStream]
Scrapes the given URLs and returns the content as a list of ByteStreams.
- Parameters:
urls (List[str]) – A list of URLs to scrape.
- Returns:
A list of ByteStreams containing the content of the scraped URLs.
- Return type:
List[ByteStream]
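The following sketch shows how a class matching this description might delegate to its fetcher and how a caller could consume the result. It assumes the fetcher follows the Haystack `LinkContentFetcher` convention of `run(urls=...)` returning a dictionary with a `streams` list. `ByteStream`, `OfflineFetcher`, and `SketchScraper` below are simplified stand-ins so that the snippet runs offline, not the real classes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ByteStream:
    """Stand-in for a ByteStream: raw bytes plus metadata (assumed shape)."""

    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)


class OfflineFetcher:
    """Stand-in fetcher; returns canned bytes instead of doing HTTP requests."""

    def run(self, urls: List[str]) -> Dict[str, List[ByteStream]]:
        streams = [
            ByteStream(data=f"<html>{url}</html>".encode("utf-8"), meta={"url": url})
            for url in urls
        ]
        return {"streams": streams}


class SketchScraper:
    """Hypothetical scraper exposing the documented scrap_urls signature."""

    def __init__(self, fetcher: Any) -> None:
        self.fetcher = fetcher

    def scrap_urls(self, urls: List[str]) -> List[ByteStream]:
        # Delegate fetching to the fetcher, then unwrap the list of streams.
        return self.fetcher.run(urls=urls)["streams"]


scraper = SketchScraper(OfflineFetcher())
streams = scraper.scrap_urls(["https://www.admin.ch/a", "https://www.admin.ch/b"])
texts = [stream.data.decode("utf-8") for stream in streams]
```

Keeping the fetcher as an attribute, as the documented `fetcher` field suggests, makes it straightforward to swap in an offline stand-in like this one for tests.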