Parser

class indexing.base.BaseParser

Abstract base class for parsing models.

parse_xml(xml: bytes) → List[str]: Abstract method to parse XML content.

parse_html(html: bytes) → List[str]: Abstract method to parse HTML content.

convert_html_to_documents(content: List[Any]) → List[Document]: Abstract method to convert HTML content to documents.

convert_pdf_to_documents(content: List[Any]) → List[Document]: Abstract method to convert PDF content to documents.

clean_documents(documents: List[Document]) → List[Document]: Abstract method to clean documents.

split_documents(documents: List[Document]) → List[Document]: Abstract method to split documents into chunks.

remove_empty_documents(documents: List[Any]) → List[Any]

Remove documents from the list that have their data attribute set to None.
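
The filtering described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `Doc` is a hypothetical stand-in for whatever document type is in use, and only its `data` attribute is taken from the description.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a document; only `data` matters here."""
    data: Optional[str]


def remove_empty_documents(documents: List[Any]) -> List[Any]:
    # Keep only documents whose `data` attribute is not None.
    return [doc for doc in documents if doc.data is not None]


docs = [Doc("first"), Doc(None), Doc("")]
kept = remove_empty_documents(docs)
```

Note that in this sketch a document with an empty string is kept, because the description only mentions `None`.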

remove_duplicate_links(links)

Removes duplicate links from a list of tags.

Parameters: links (list of bs4.element.Tag) – The list of tags to remove duplicates from.
Returns: The list of tags without duplicates.
Return type: list of bs4.element.Tag
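
One plausible way to deduplicate is sketched below. The reference does not say what makes two tags duplicates, so comparing by `href` and keeping the first occurrence are assumptions. Plain dicts stand in for `bs4.element.Tag` objects here, since both support `.get("href")`.

```python
from typing import Any, Iterable, List


def remove_duplicate_links(links: Iterable[Any]) -> List[Any]:
    # Assumption: two links are duplicates when their href values match.
    seen = set()
    unique = []
    for link in links:
        href = link.get("href")  # works for bs4 Tag objects and for dicts
        if href in seen:
            continue
        seen.add(href)
        unique.append(link)  # the first occurrence wins, order is preserved
    return unique


links = [{"href": "/a.pdf"}, {"href": "/b.pdf"}, {"href": "/a.pdf"}]
unique_links = remove_duplicate_links(links)
```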

contains_tag(tag)

Check if a tag contains a specific string.

get_pdf_paths(soup)

Get the paths to PDF files from a BeautifulSoup object.

Parameters: soup (BeautifulSoup) – The BeautifulSoup object to extract PDF paths from.
Returns: The list of PDF paths.
Return type: list of str
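
The idea of collecting PDF paths from a parsed page can be illustrated with a dependency-free sketch. The documented method takes a BeautifulSoup object; the version below uses the standard library's `html.parser` instead and takes raw HTML, so `PdfLinkCollector` and `get_pdf_paths_from_html` are hypothetical names. Treating any `<a>` whose `href` ends in `.pdf` as a PDF path is also an assumption.

```python
from html.parser import HTMLParser
from typing import List


class PdfLinkCollector(HTMLParser):
    """Collects href values of <a> tags that point at .pdf files."""

    def __init__(self) -> None:
        super().__init__()
        self.paths: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: a PDF link is any href ending in ".pdf" (case-insensitive).
        if href and href.lower().endswith(".pdf"):
            self.paths.append(href)


def get_pdf_paths_from_html(html: str) -> List[str]:
    collector = PdfLinkCollector()
    collector.feed(html)
    return collector.paths


html = (
    '<a href="/docs/report.pdf">Report</a>'
    '<a href="/about">About</a>'
    '<a href="/files/Guide.PDF">Guide</a>'
)
pdf_paths = get_pdf_paths_from_html(html)
```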

abstract parse_urls(content: str) → List[str]

Extracts URLs from the given content.

abstract convert_to_documents(content: List[Any]) → List[Any]

Abstract method to convert some content to documents.
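
A subclass has to implement both abstract methods before it can be instantiated. The sketch below shows the pattern under stated assumptions: `ParserSketch` is a stand-in that mirrors only the two abstract signatures (it is not the real base class), `PlainTextParser` is a hypothetical subclass, and the URL regex and the dict-shaped documents are illustrative choices.

```python
import re
from abc import ABC, abstractmethod
from typing import Any, List


class ParserSketch(ABC):
    """Stand-in base class mirroring the two abstract method signatures."""

    @abstractmethod
    def parse_urls(self, content: str) -> List[str]: ...

    @abstractmethod
    def convert_to_documents(self, content: List[Any]) -> List[Any]: ...


class PlainTextParser(ParserSketch):
    def parse_urls(self, content: str) -> List[str]:
        # Simple illustrative pattern: http(s) URLs up to whitespace or quotes.
        return re.findall(r"https?://[^\s<>\"']+", content)

    def convert_to_documents(self, content: List[Any]) -> List[Any]:
        # Assumption: a "document" is represented as a dict with a content key.
        return [{"content": item} for item in content]


parser = PlainTextParser()
urls = parser.parse_urls("See https://example.com/a and http://example.org/b.pdf for details.")
documents = parser.convert_to_documents(["some text"])
```

Instantiating `ParserSketch` directly raises `TypeError`, which is how Python's `abc` module enforces that subclasses provide the abstract methods.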

clean_documents(documents: List[Document]) → List[Document]

Removes documents whose content is None and cleans the remaining documents.

split_documents(documents: List[Document]) → List[Document]

Removes documents whose content is None and splits the remaining documents into chunks.

Parameters: documents (List[Document]) – The documents to split into chunks.
Returns: A list of documents split into chunks.
Return type: List[Document]
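
The two steps (drop documents with None content, then chunk) can be sketched as below. The reference does not specify the splitting strategy, so fixed-size character chunks and the `chunk_size` parameter are assumptions, and `Doc` is a hypothetical stand-in for `Document`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for Document; only `content` matters here."""
    content: Optional[str]


def split_documents(documents: List[Doc], chunk_size: int = 10) -> List[Doc]:
    chunks: List[Doc] = []
    for doc in documents:
        if doc.content is None:
            continue  # step 1: skip documents with no content
        # Step 2: cut the content into consecutive chunk_size-character pieces.
        for start in range(0, len(doc.content), chunk_size):
            chunks.append(Doc(doc.content[start:start + chunk_size]))
    return chunks


chunks = split_documents([Doc("abcdefghijklmno"), Doc(None), Doc("xyz")])
```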