v2.0.0 b1 - Add Data Extract and Corpus Querying
2.0.0 Beta 1
Added Grid-based Data Extraction and Corpus Querying
This update extends the application's analytical capabilities by adding automated background extraction of structured data from documents, which improves efficiency and scalability.
We've added several models on the backend:
- Extract: Represents a headless, background annotation task linked to a Corpus and Fieldset.
- Fieldset: Defines a reusable set of fields for Extracts, linked to Columns.
- Column: Represents a discrete data structure to extract from a document, with various properties like query, match_text, output_type, and more.
- Datacell: Represents extracted data for each column and document, storing data as JSON.
- LanguageModel: Represents a language model to be used in the extraction process.
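The relationships between these models can be sketched roughly as follows. This is a minimal illustration using plain Python dataclasses, not the project's actual model definitions: only `query`, `match_text`, and `output_type` come from the notes above, and every other field name, along with the `build_grid` helper, is a hypothetical stand-in.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class LanguageModel:
    # Hypothetical field: an identifier for the model used during extraction.
    name: str


@dataclass
class Column:
    # query, match_text, and output_type are named in the release notes;
    # their types and defaults here are assumptions.
    name: str
    output_type: str
    query: Optional[str] = None
    match_text: Optional[str] = None
    language_model: Optional[LanguageModel] = None


@dataclass
class Fieldset:
    # A reusable set of columns that several extracts can share.
    name: str
    columns: List[Column] = field(default_factory=list)


@dataclass
class Datacell:
    # One cell per (document, column) pair; data holds JSON-compatible values.
    document_id: str
    column: Column
    data: Optional[Dict[str, Any]] = None


@dataclass
class Extract:
    # A background task tying a corpus to a fieldset.
    name: str
    corpus_id: str
    fieldset: Fieldset
    datacells: List[Datacell] = field(default_factory=list)


def build_grid(extract: Extract, document_ids: List[str]) -> List[Datacell]:
    """Create one empty datacell for every document/column combination."""
    extract.datacells = [
        Datacell(document_id=doc_id, column=column)
        for doc_id in document_ids
        for column in extract.fieldset.columns
    ]
    return extract.datacells
```

In this sketch, an extract over three documents with a two-column fieldset yields six empty datacells, which a background task would then fill in.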
Improved Test Suite
- LlamaIndex is now tested with vcr.py, so the corpus query and corpus extract tasks have realistic tests and mocks.
- Added many GraphQL query and endpoint tests.
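The record-and-replay idea behind this style of testing can be illustrated with a small standard-library sketch. It is not vcr.py's API and not the project's test code: the `Cassette` class and `fake_llm_call` function are hypothetical names that only show how a recorded response can stand in for a live call on later runs.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


class Cassette:
    """Hypothetical helper: stores responses on disk keyed by request text."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.recorded: Dict[str, str] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def call(self, request: str, live: Callable[[str], str]) -> str:
        # Replay a stored response when one exists; otherwise call the
        # live function once and record its answer for future runs.
        if request not in self.recorded:
            self.recorded[request] = live(request)
            self.path.write_text(json.dumps(self.recorded))
        return self.recorded[request]


calls: List[str] = []


def fake_llm_call(prompt: str) -> str:
    # Stand-in for an expensive network request to a language model.
    calls.append(prompt)
    return f"answer to: {prompt}"


cassette_path = Path(tempfile.mkdtemp()) / "cassette.json"

# The first run records the response; the second run replays it from disk
# without invoking fake_llm_call again.
first = Cassette(cassette_path).call("What is the term?", fake_llm_call)
second = Cassette(cassette_path).call("What is the term?", fake_llm_call)
```

Because the second call is served from the recorded file, tests built this way stay deterministic and do not depend on a live model endpoint.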
New GUI Elements
- There is now an extract tab, along with a number of GUI elements that make it easy to construct an extract grid made up of documents, corpora, and reusable columns.
- Within the Corpus view, there is a query tab you can use to ask questions of the corpus.
What's Changed
Full Changelog: v1.3.0...v2.0.0b1