Parser
- class indexing.base.BaseParser
Abstract base class for parsing models.
- parse_xml(xml: bytes) → List[str]
Abstract method to parse XML content.
- parse_html(html: bytes) → List[str]
Abstract method to parse HTML content.
- convert_html_to_documents(content: List[Any]) → List[Document]
Abstract method to convert HTML content to documents.
- convert_pdf_to_documents(content: List[Any]) → List[Document]
Abstract method to convert PDF content to documents.
- split_documents(documents: List[Document]) → List[Document]
Abstract method to split documents into chunks.
- remove_empty_documents(documents: List[Any]) → List[Any]
Remove documents from the list that have their content attribute set to None.
- Parameters:
documents (list) – Document objects to be filtered.
- Returns:
Document objects where the content attribute is not None.
- Return type:
list
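As an illustration only, the filtering described above might look like the following sketch. The `Document` dataclass is a hypothetical stand-in for the library's own type, and the attribute name `content` is an assumption taken from the Returns description.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Document:
    """Hypothetical stand-in for the library's Document type."""
    content: Optional[str]


def remove_empty_documents(documents: List[Any]) -> List[Any]:
    # Keep only documents whose content is set; input order is preserved.
    return [doc for doc in documents if doc.content is not None]


docs = [Document("intro"), Document(None), Document("")]
kept = remove_empty_documents(docs)
print([doc.content for doc in kept])  # ['intro', '']
```

Note that in this sketch an empty string survives the filter, because only `None` is treated as missing content.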
- remove_duplicate_links(links)
Removes duplicate links from a list of tags.
- Parameters:
links (list of bs4.element.Tag) – The list of tags to remove duplicates from.
- Returns:
The list of tags without duplicates.
- Return type:
list of bs4.element.Tag
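The documentation does not say what makes two tags duplicates. The sketch below assumes that two links are duplicates when they share the same `href` value and that the first occurrence is kept. It relies only on `.get("href")`, which `bs4.element.Tag` provides, so plain dicts are used as stand-ins to keep the example free of dependencies.

```python
from typing import Any, List, Set


def remove_duplicate_links(links: List[Any]) -> List[Any]:
    seen: Set[Any] = set()
    unique: List[Any] = []
    for link in links:
        # Assumption: the href attribute identifies a link.
        href = link.get("href")
        if href in seen:
            continue
        seen.add(href)
        unique.append(link)
    return unique


# Plain dicts stand in for bs4 Tag objects (both support .get()).
links = [{"href": "/a"}, {"href": "/b"}, {"href": "/a"}]
unique_links = remove_duplicate_links(links)
print(unique_links)  # [{'href': '/a'}, {'href': '/b'}]
```

Tracking seen values in a set keeps the pass linear in the number of links while preserving the original order.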
- contains_tag(tag)
Check if a tag contains a specific string.
- Parameters:
tag (bs4.element.Tag) – The tag to check.
- Returns:
True if the tag contains the string, False otherwise.
- Return type:
bool
- get_pdf_paths(soup)
Get the paths to PDF files from a BeautifulSoup object.
- Parameters:
soup (BeautifulSoup) – The BeautifulSoup object to extract PDF paths from.
- Returns:
The list of PDF paths.
- Return type:
list of str
- abstract parse_urls(content: str) → List[str]
Extracts URLs from the given content.
- Parameters:
content (str) – The content to extract URLs from.
- Returns:
A list of URLs extracted from the content.
- Return type:
List[str]
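Because `parse_urls` is abstract, each concrete parser decides what counts as a URL. One possible implementation is sketched below, under the assumptions that the content is HTML and that only the `href` values of `<a>` tags matter. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages; the helper class `_HrefCollector` is hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class _HrefCollector(HTMLParser):
    """Hypothetical helper that records href values of <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors with a missing or empty href.
            if name == "href" and value:
                self.urls.append(value)


def parse_urls(content: str) -> List[str]:
    collector = _HrefCollector()
    collector.feed(content)
    return collector.urls


html = '<p><a href="https://example.com/a">A</a> <a name="x">B</a> <a href="/c.pdf">C</a></p>'
print(parse_urls(html))  # ['https://example.com/a', '/c.pdf']
```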
- abstract convert_to_documents(content: List[Any]) → List[Any]
Abstract method to convert some content to documents.
- Parameters:
content (list of Any) – The content to convert.
- Returns:
The converted documents.
- Return type:
list of Document
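The two methods marked abstract must be overridden before a parser can be instantiated. The following sketch shows that mechanism with Python's `abc` module. The `BaseParser` defined here is a simplified stand-in for `indexing.base.BaseParser`, and `PlainTextParser` with its line-based logic is purely hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class BaseParser(ABC):
    """Simplified stand-in for indexing.base.BaseParser (assumption)."""

    @abstractmethod
    def parse_urls(self, content: str) -> List[str]:
        """Extract URLs from the given content."""

    @abstractmethod
    def convert_to_documents(self, content: List[Any]) -> List[Any]:
        """Convert some content to documents."""


class PlainTextParser(BaseParser):
    """Hypothetical subclass: one URL per line, one document per item."""

    def parse_urls(self, content: str) -> List[str]:
        # Keep whitespace-stripped lines that look like HTTP(S) URLs.
        lines = (line.strip() for line in content.splitlines())
        return [line for line in lines if line.startswith(("http://", "https://"))]

    def convert_to_documents(self, content: List[Any]) -> List[Any]:
        # Wrap each item in a dict as a placeholder document.
        return [{"content": item} for item in content]


parser = PlainTextParser()
urls = parser.parse_urls("https://example.com/a\nnot a url\nhttp://example.com/b")
documents = parser.convert_to_documents(["first", "second"])
print(urls)       # ['https://example.com/a', 'http://example.com/b']
print(documents)  # [{'content': 'first'}, {'content': 'second'}]
```

In this sketch, calling `BaseParser()` directly raises `TypeError`, because the abstract methods have no implementation.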
- clean_documents(documents: List[Document]) → List[Document]
Removes documents whose content is None and cleans the remaining documents.
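What "cleaning" involves is not specified here. As a rough sketch, the two stages could be combined as follows, assuming that cleaning means collapsing runs of whitespace and trimming the ends. The `Document` dataclass is again a hypothetical stand-in for the library's type.

```python
import re
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for the library's Document type."""
    content: Optional[str]


def clean_documents(documents: List[Document]) -> List[Document]:
    # Stage 1: drop documents with no content at all.
    non_empty = [doc for doc in documents if doc.content is not None]
    # Stage 2 (assumed meaning of "clean"): normalize whitespace.
    return [
        replace(doc, content=re.sub(r"\s+", " ", doc.content).strip())
        for doc in non_empty
    ]


docs = [Document("  Hello\n\nworld  "), Document(None)]
cleaned = clean_documents(docs)
print(cleaned)  # [Document(content='Hello world')]
```

Returning new `Document` objects instead of mutating the inputs leaves the caller's list untouched, which is a design choice of this sketch rather than documented behavior.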