Parser

class indexing.base.BaseParser

Abstract base class for parsing models.

parse_xml(xml: bytes) → List[str]

Abstract method to parse XML content.

parse_html(html: bytes) → List[str]

Abstract method to parse HTML content.

convert_html_to_documents(content: List[Any]) → List[Document]

Abstract method to convert HTML content to documents.

convert_pdf_to_documents(content: List[Any]) → List[Document]

Abstract method to convert PDF content to documents.

clean_documents(documents: List[Document]) → List[Document]

Abstract method to clean documents.

split_documents(documents: List[Document]) → List[Document]

Abstract method to split documents into chunks.
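The abstract-method contract can be illustrated with a minimal sketch built on Python's abc module. MiniParser and LineParser are hypothetical names used only for illustration; they are not part of indexing.base, and the real BaseParser may be defined differently.

```python
from abc import ABC, abstractmethod
from typing import List


class MiniParser(ABC):
    """Hypothetical, reduced stand-in for an abstract parser base class."""

    @abstractmethod
    def parse_urls(self, content: str) -> List[str]:
        """Subclasses must return the URLs found in content."""


class LineParser(MiniParser):
    """Hypothetical subclass: treats each line starting with 'http' as a URL."""

    def parse_urls(self, content: str) -> List[str]:
        return [
            line.strip()
            for line in content.splitlines()
            if line.strip().startswith("http")
        ]


# A class with unimplemented abstract methods cannot be instantiated.
try:
    MiniParser()
    instantiated = True
except TypeError:
    instantiated = False

found = LineParser().parse_urls("http://a\nfoo\n https://b ")
```

A concrete parser only becomes usable once every abstract method has an implementation.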

remove_empty_documents(documents: List[Any]) → List[Any]

Remove documents whose content attribute is None from the list.

Parameters:

documents (list) – Document objects to be filtered.

Returns:

Document objects where the content attribute is not None.

Return type:

list
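A minimal sketch of this filter, assuming the attribute is named content as the Returns description states. The Document dataclass here is a hypothetical stand-in for the library's own type.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Document:
    """Hypothetical stand-in for the library's Document type."""
    content: Optional[str]


def remove_empty_documents(documents: List[Any]) -> List[Any]:
    # Keep only documents whose content attribute is not None.
    # An empty string is still kept, because only None is filtered out.
    return [doc for doc in documents if getattr(doc, "content", None) is not None]


docs = [Document("alpha"), Document(None), Document("")]
kept = remove_empty_documents(docs)
```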

Removes duplicate links from a list of tags.

Parameters:

links (list of bs4.element.Tag) – The list of tags to remove duplicates from.

Returns:

The list of tags without duplicates.

Return type:

list of bs4.element.Tag
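One possible way to deduplicate is sketched below. The source does not say how duplicates are detected, so this assumes two tags are duplicates when their href values match; the function name is hypothetical. Plain dicts stand in for bs4 tags because both expose a get method for attribute lookup.

```python
from typing import Any, List


def remove_duplicate_links(links: List[Any]) -> List[Any]:
    """Hypothetical helper: keep the first tag seen for each href value."""
    seen = set()
    unique = []
    for link in links:
        href = link.get("href")  # assumption: duplicates share the same href
        if href in seen:
            continue
        seen.add(href)
        unique.append(link)
    return unique


# Dicts used as stand-ins for bs4.element.Tag objects.
links = [{"href": "/a.pdf"}, {"href": "/b.pdf"}, {"href": "/a.pdf"}]
unique_links = remove_duplicate_links(links)
```

Tracking seen values in a set preserves the original order of the first occurrences.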

contains_tag(tag)

Check if a tag contains a specific string.

Parameters:

tag (bs4.element.Tag) – The tag to check.

Returns:

True if the tag contains the string, False otherwise.

Return type:

bool

get_pdf_paths(soup)

Get the paths to PDF files from a BeautifulSoup object.

Parameters:

soup (BeautifulSoup) – The BeautifulSoup object to extract PDF paths from.

Returns:

The list of PDF paths.

Return type:

list of str
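The idea can be sketched without BeautifulSoup, using the standard library's html.parser in its place so the example stays self-contained. It assumes a PDF path is any anchor href ending in ".pdf" (case-insensitive); the actual selection rule used by get_pdf_paths is not documented here, and the names below are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class PdfLinkCollector(HTMLParser):
    """Hypothetical collector: records href values of <a> tags that end in .pdf."""

    def __init__(self) -> None:
        super().__init__()
        self.paths: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            self.paths.append(href)


def get_pdf_paths_from_html(html: str) -> List[str]:
    collector = PdfLinkCollector()
    collector.feed(html)
    return collector.paths


sample = (
    '<a href="/docs/report.pdf">Report</a>'
    '<a href="/about.html">About</a>'
    '<a href="/files/Guide.PDF">Guide</a>'
)
pdf_paths = get_pdf_paths_from_html(sample)
```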

abstract parse_urls(content: str) → List[str]

Extracts URLs from the given content.

Parameters:

content (str) – The content to extract URLs from.

Returns:

A list of URLs extracted from the content.

Return type:

List[str]
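Because parse_urls is abstract, each subclass chooses its own extraction strategy. The sketch below shows one assumed strategy, a regular expression matching http and https URLs in free text; it is illustrative only and is written as a plain function rather than a method.

```python
import re
from typing import List

# Assumption: a URL starts with http:// or https:// and runs until
# whitespace, a quote character, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def parse_urls(content: str) -> List[str]:
    return URL_PATTERN.findall(content)


text = 'See https://example.com/a and <a href="http://example.org/b?x=1">b</a>.'
urls = parse_urls(text)
```

An implementation for sitemap XML or HTML would more likely walk the parsed tree than rely on a regular expression.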

abstract convert_to_documents(content: List[Any]) → List[Any]

Converts the given content to documents.

Parameters:

content (list of Any) – The content to convert.

Returns:

The converted documents.

Return type:

list of Document

clean_documents(documents: List[Document]) → List[Document]

Removes documents whose content is None, then cleans the remaining documents.

Parameters:

documents (List[Document]) – The documents to clean.

Returns:

Cleaned documents.

Return type:

List[Document]

split_documents(documents: List[Document]) → List[Document]

Removes documents whose content is None, then splits the remaining documents into chunks.

Parameters:

documents (List[Document]) – The documents to split into chunks.

Returns:

A list of documents split into chunks.

Return type:

List[Document]
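The filter-then-split behaviour can be sketched as follows. The splitting strategy is not documented here, so this assumes fixed-size character chunks; the chunk_size parameter and the Document dataclass are hypothetical and not part of the documented signature.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Document:
    """Hypothetical stand-in for the library's Document type."""
    content: Optional[str]


def split_documents(documents: List[Document], chunk_size: int = 10) -> List[Document]:
    chunks: List[Document] = []
    for doc in documents:
        if doc.content is None:
            continue  # drop documents with None content before splitting
        # Assumption: split into consecutive fixed-size character windows.
        for start in range(0, len(doc.content), chunk_size):
            chunks.append(Document(doc.content[start:start + chunk_size]))
    return chunks


chunks = split_documents([Document("abcdefghijklmno"), Document(None)], chunk_size=10)
```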