Indexing

Build a pipeline that indexes data from a source into the database, using a scraper and a parser.

class indexing.base.BaseIndexer(scraper, parser)

Abstract base class for indexing models.

index(sitemap_url: str) → dict

Abstract method to index content from a sitemap URL into a vector database.

abstract async from_pages_to_content(pages: List[ByteStream]) → List[Any]

Abstract method to convert HTML pages to content.

Parameters:

pages (List[ByteStream]) – The HTML pages to convert to content.

Returns:

The content extracted from the pages.

Return type:

List[Any]
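As a minimal sketch of how a subclass might implement this method, the example below strips the tags from each page with Python's standard html.parser module. The ByteStream and BaseIndexer classes shown here are simplified stand-ins defined only for illustration (the project's real classes may differ), and TextIndexer is a hypothetical name.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Any, List


@dataclass
class ByteStream:
    """Stand-in for the project's ByteStream type (assumption: it wraps raw bytes)."""
    data: bytes


class BaseIndexer(ABC):
    """Stand-in that mirrors the documented interface; not the real class."""

    def __init__(self, scraper: Any, parser: Any) -> None:
        self.scraper = scraper
        self.parser = parser

    @abstractmethod
    async def from_pages_to_content(self, pages: List[ByteStream]) -> List[Any]:
        ...


class _TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.parts.append(text)


class TextIndexer(BaseIndexer):
    """Hypothetical subclass that returns one plain-text string per page."""

    async def from_pages_to_content(self, pages: List[ByteStream]) -> List[Any]:
        content = []
        for page in pages:
            extractor = _TextExtractor()
            # Decode leniently so a malformed byte does not abort the whole batch.
            extractor.feed(page.data.decode("utf-8", errors="replace"))
            extractor.close()
            content.append(" ".join(extractor.parts))
        return content
```

Because the method is a coroutine, it has to be awaited or driven with asyncio.run, for example asyncio.run(TextIndexer(scraper=None, parser=None).from_pages_to_content(pages)).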

async get_content_from_pdf(content: List[Any]) → List[Any]

Extract content from PDFs.

Parameters:

content (List[Any]) – The content to extract from.

Returns:

The extracted content, split into chunks.

Return type:

List[Any]
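The chunking strategy is not specified here. As one possible illustration, the hypothetical helper below splits a string into fixed-size character windows with a small overlap. The names chunk_text, chunk_size, and overlap are assumptions and not part of the documented API.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

A character-based window is the simplest choice; a real pipeline might instead split on sentences, pages, or token counts.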

async add_content_to_db(db: Session, content: List[Any], source: str, user_uuid: str, language: str, embed: bool)

Add content to the database.

Parameters:
  • db (Session) – The database session to use.

  • content (List[Any]) – Content to add to the database.

  • source (str) – The source of the content.

  • user_uuid (str) – The UUID of the user who added the content.

  • language (str) – Language of the content.

  • embed (bool) – Whether to embed the content.

Returns:

content: Success message

Return type:

dict
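The sketch below illustrates how a call with these parameters might look. It is only an approximation: FakeSession is an in-memory stand-in for a real database session, the function body is a hypothetical reimplementation, and the shape of the returned dict (a "content" key holding a message) is an assumption based on the description above.

```python
import asyncio
from typing import Any, Dict, List


class FakeSession:
    """In-memory stand-in for a database session (illustration only)."""

    def __init__(self) -> None:
        self.rows: List[Dict[str, Any]] = []
        self.committed = False

    def add(self, row: Dict[str, Any]) -> None:
        self.rows.append(row)

    def commit(self) -> None:
        self.committed = True


async def add_content_to_db(
    db: FakeSession,
    content: List[Any],
    source: str,
    user_uuid: str,
    language: str,
    embed: bool,
) -> dict:
    """Hypothetical version that stores one row per content item."""
    for item in content:
        db.add(
            {
                "text": item,
                "source": source,
                "user_uuid": user_uuid,
                "language": language,
                # A real implementation would compute an embedding when
                # embed is True; this sketch only records the flag.
                "embed": embed,
            }
        )
    db.commit()
    # Assumed return shape; the message wording is made up for the example.
    return {"content": f"{len(content)} item(s) added"}


def run_example() -> dict:
    db = FakeSession()
    return asyncio.run(
        add_content_to_db(
            db,
            content=["first chunk", "second chunk"],
            source="https://example.com/sitemap.xml",  # placeholder URL
            user_uuid="00000000-0000-0000-0000-000000000000",  # placeholder UUID
            language="en",
            embed=False,
        )
    )
```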