Indexing¶
Create a pipeline that indexes data from a source into the database, using a scraper and a parser.
- class indexing.base.BaseIndexer(scraper, parser)¶
Abstract base class for indexing models.
- index(sitemap_url: str) → dict ¶
Abstract method to index content from a sitemap URL into the vector database.
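For instance, a concrete indexer could be driven from a sitemap as sketched below; the subclass name, scraper, parser, and sitemap URL are hypothetical placeholders, not part of this module:

```python
# Illustrative only: MyIndexer, my_scraper and my_parser stand in for a
# concrete BaseIndexer subclass and its dependencies.
indexer = MyIndexer(scraper=my_scraper, parser=my_parser)

# Index every page listed in the sitemap into the vector database.
result = indexer.index("https://example.com/sitemap.xml")
print(result)  # a dict summarising the indexing run
```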
- abstract async from_pages_to_content(pages: List[ByteStream]) → List[Any] ¶
Abstract method to convert HTML pages to content.
- Parameters:
pages (List[ByteStream]) – The HTML pages to convert to content.
- Returns:
The content extracted from the pages.
- Return type:
List[Any]
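A minimal sketch of a concrete subclass, assuming ByteStream comes from haystack.dataclasses, that self.parser is the parser passed to the constructor, and that parse_html is a hypothetical parser method:

```python
from typing import Any, List

from haystack.dataclasses import ByteStream  # assumed origin of ByteStream

from indexing.base import BaseIndexer


class HTMLIndexer(BaseIndexer):
    """Hypothetical indexer that turns HTML pages into parsed content."""

    async def from_pages_to_content(self, pages: List[ByteStream]) -> List[Any]:
        # Decode each page's raw bytes and hand them to the parser.
        # parse_html is an assumed parser method, shown for illustration only.
        return [self.parser.parse_html(page.data.decode("utf-8")) for page in pages]

    # A complete subclass would also implement index(); omitted for brevity.
```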
- async get_content_from_pdf(content: List[Any]) → List[Any] ¶
Extract content from PDFs.
- Parameters:
content (List[Any]) – The content to extract from.
- Returns:
The extracted content, split into chunks.
- Return type:
List[Any]
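A usage sketch for the PDF path; the coroutine and variable names are illustrative:

```python
import asyncio

async def show_pdf_chunks(indexer, pdf_payloads):
    # pdf_payloads: raw PDF content previously fetched by the scraper (assumption).
    chunks = await indexer.get_content_from_pdf(pdf_payloads)
    for chunk in chunks:
        print(chunk)

# asyncio.run(show_pdf_chunks(indexer, pdf_payloads))
```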
- async add_content_to_db(db: Session, content: List[Any], source: str, user_uuid: str, language: str, embed: bool)¶
Add content to the database.
- Parameters:
db (Session) – The database session to use.
content (List[Any]) – Content to add to the database.
source (str) – The source of the content.
user_uuid (str) – The UUID of the user who added the content.
language (str) – Language of the content.
embed (bool) – Whether to embed the content.
- Returns:
A dict with a success message under the content key.
- Return type:
dict
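Putting the pieces together, a small pipeline sketch, assuming Session is SQLAlchemy's and that indexer is a concrete BaseIndexer subclass; the source URL and user UUID are placeholders:

```python
from sqlalchemy.orm import Session  # assumed origin of the Session type


async def index_scraped_pages(indexer, db: Session, pages) -> dict:
    """Illustrative pipeline: pages -> content -> database."""
    # Convert the scraped HTML pages into content.
    content = await indexer.from_pages_to_content(pages)
    # Store the content, embedding it as it is written to the database.
    return await indexer.add_content_to_db(
        db,
        content,
        source="https://example.com/sitemap.xml",  # placeholder source
        user_uuid="00000000-0000-0000-0000-000000000000",  # placeholder UUID
        language="en",
        embed=True,
    )
```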