Added MMR algorithm to search #28

Merged
merged 1 commit into from
Aug 24, 2024
249 changes: 179 additions & 70 deletions docs/how-to/Predict-Missing-Data.ipynb
Expand Up @@ -9,7 +9,9 @@
"\n",
"The framework is designed to support different kinds of inference, including rule-based inference and LLMs. This notebook shows simple ML-based inference using scikit-learn DecisionTrees.\n",
"\n",
"We will use the Iris dataset:"
"This how-to walks through the basic operations of using the `linkml-store` command line tool to perform training and inference using scikit-learn DecisionTrees. This uses the command line interface, but the same operations can be performed programmatically using the Python API, or via the Web API.\n",
"\n",
"We will use a subset of the classic [Iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html), converted to jsonl (JSON Lines) format:"
],
"metadata": {
"collapsed": false
Expand All @@ -18,7 +20,18 @@
},
{
"cell_type": "code",
"execution_count": 18,
"source": [
"%%bash\n",
"linkml-store -i ../../tests/input/iris.jsonl describe"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-23T22:15:36.754913Z",
"start_time": "2024-08-23T22:15:33.366042Z"
}
},
"id": "d2ef6e85292b5a20",
"outputs": [
{
"name": "stdout",
Expand All @@ -33,25 +46,111 @@
]
}
],
"source": [
"%%bash\n",
"linkml-store -i ../../tests/input/iris.jsonl describe"
],
"execution_count": 2
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## The Infer Command",
"id": "335516b2c129363a"
},
{
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-12T20:08:06.401967Z",
"start_time": "2024-08-12T20:08:03.933123Z"
"end_time": "2024-08-23T22:20:41.635957Z",
"start_time": "2024-08-23T22:20:38.428284Z"
}
},
"id": "d2ef6e85292b5a20"
"cell_type": "code",
"source": [
"%%bash\n",
"linkml-store infer --help"
],
"id": "e38efeb1addfe697",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Usage: linkml-store infer [OPTIONS]\n",
"\n",
" Predict a complete object from a partial object.\n",
"\n",
" Currently two main prediction methods are provided: RAG and sklearn\n",
"\n",
" ## RAG:\n",
"\n",
" The RAG approach will use Retrieval Augmented Generation to inference the\n",
" missing attributes of an object.\n",
"\n",
" Example:\n",
"\n",
" linkml-store -i countries.jsonl inference -t rag -q 'name: Uruguay'\n",
"\n",
" Result:\n",
"\n",
" capital: Montevideo, code: UY, continent: South America, languages:\n",
" [Spanish]\n",
"\n",
" You can pass in configurations as follows:\n",
"\n",
" linkml-store -i countries.jsonl inference -t\n",
" rag:llm_config.model_name=llama-3 -q 'name: Uruguay'\n",
"\n",
" ## SKLearn:\n",
"\n",
" This uses scikit-learn (defaulting to simple decision trees) to do the\n",
" prediction.\n",
"\n",
" linkml-store -i tests/input/iris.csv inference -t sklearn -q\n",
" '{\"sepal_length\": 5.1, \"sepal_width\": 3.5, \"petal_length\": 1.4,\n",
" \"petal_width\": 0.2}'\n",
"\n",
"Options:\n",
" -O, --output-type [json|jsonl|yaml|yamll|tsv|csv|python|parquet|formatted|table|duckdb|postgres|mongodb]\n",
" Output format\n",
" -o, --output PATH Output file path\n",
" -T, --target-attribute TEXT Target attributes for inference\n",
" -F, --feature-attributes TEXT Feature attributes for inference (comma\n",
" separated)\n",
" -Y, --inference-config-file PATH\n",
" Path to inference configuration file\n",
" -E, --export-model PATH Export model to file\n",
" -L, --load-model PATH Load model from file\n",
" -M, --model-format [pickle|onnx|pmml|pfa|joblib|png|linkml_expression|rulebased|rag_index]\n",
" Format for model\n",
" -S, --training-test-data-split <FLOAT FLOAT>...\n",
" Training/test data split\n",
" -t, --predictor-type TEXT Type of predictor [default: sklearn]\n",
" -n, --evaluation-count INTEGER Number of examples to evaluate over\n",
" --evaluation-match-function TEXT\n",
" Name of function to use for matching objects\n",
" in eval\n",
" -q, --query TEXT query term\n",
" --help Show this message and exit.\n"
]
}
],
"execution_count": 5
},
{
"cell_type": "markdown",
"source": [
"## Training and Inference\n",
"\n",
"We can perform training and inference in a single step:"
"We can perform training and inference in a single step. \n",
"\n",
"For feature labels, we use:\n",
"\n",
"- `petal_length`\n",
"- `petal_width`\n",
"- `sepal_length`\n",
"- `sepal_width`\n",
"\n",
"These can be explicitly specified using `-F`, but in this case we are specifying a query, so\n",
"the feature labels are inferred from the query.\n",
"\n",
"We specify the target label using `-T`. In this case, we are predicting the `species` of the iris.\n"
],
"metadata": {
"collapsed": false
Expand All @@ -60,7 +159,18 @@
},
{
"cell_type": "code",
"execution_count": 9,
"source": [
"%%bash\n",
"linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -T species -q \"{petal_length: 2.5, petal_width: 0.5, sepal_length: 5.0, sepal_width: 3.5}\" "
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-23T22:17:38.972690Z",
"start_time": "2024-08-23T22:17:35.558907Z"
}
},
"id": "4984aeb4016df154",
"outputs": [
{
"name": "stderr",
Expand All @@ -76,29 +186,27 @@
"text": [
"predicted_object:\n",
" species: setosa\n",
"confidence: 1.0\n"
"confidence: 1.0\n",
"\n"
]
}
],
"source": [
"%%bash\n",
"linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -T species -q \"{petal_length: 2.5, petal_width: 0.5, sepal_length: 5.0, sepal_width: 3.5}\" "
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-12T19:35:08.172872Z",
"start_time": "2024-08-12T19:35:05.095856Z"
}
},
"id": "4984aeb4016df154"
"execution_count": 4
},
{
"metadata": {},
"cell_type": "markdown",
"source": "The data model for the output consists of a `predicted_object` slot and a `confidence`. Note that for standard ML operations, the predicted object will typically have one attribute only, but other kinds of inference (OWL reasoning, LLMs) may be able to predict complex objects.",
"id": "dfcbdae846f56ada"
},
{
"cell_type": "markdown",
"source": [
"## Saving the Model\n",
"\n",
"Performing training and inference in a single step is convenient where training is fast, but more typically we'd want to save the model for later use:"
"Performing training and inference in a single step is convenient where training is fast, but more typically we'd want to save the model for later use.\n",
"\n",
"We can do this with the `-E` option:"
],
"metadata": {
"collapsed": false
Expand Down Expand Up @@ -181,48 +289,29 @@
},
{
"cell_type": "code",
"execution_count": 15,
"outputs": [],
"source": [
"%%bash\n",
"linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -L \"tmp/iris-model.joblib\" -E \"tmp/iris-model.png\""
"linkml-store --stacktrace -i ../../tests/input/iris.jsonl infer -t sklearn -T species -L tmp/iris-model.joblib -E input/iris-model.png"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-12T19:57:43.145521Z",
"start_time": "2024-08-12T19:57:40.441893Z"
"end_time": "2024-08-23T22:23:18.451362Z",
"start_time": "2024-08-23T22:23:15.571984Z"
}
},
"id": "d7d14edd77e9e1fe"
"id": "d7d14edd77e9e1fe",
"outputs": [],
"execution_count": 9
},
{
"cell_type": "markdown",
"source": [
"![img](tmp/iris-model.png)"
],
"source": "![img](input/iris-model.png)",
"metadata": {
"collapsed": false
},
"id": "cca55edf629f8c26"
},
{
"cell_type": "code",
"execution_count": 29,
"outputs": [],
"source": [
"%%bash\n",
"linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -L tmp/iris-model.joblib -E tmp/iris-model.rulebased.yaml"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-12T21:59:26.805316Z",
"start_time": "2024-08-12T21:59:24.343197Z"
}
},
"id": "acb7c57ecb3be9b"
},
{
"cell_type": "markdown",
"source": [
Expand All @@ -244,8 +333,20 @@
"id": "3ef8a6bc39b5e667"
},
{
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-23T22:24:16.457340Z",
"start_time": "2024-08-23T22:24:13.977990Z"
}
},
"cell_type": "code",
"execution_count": 30,
"source": [
"%%bash\n",
"linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -T species -L tmp/iris-model.joblib -E tmp/iris-model.rulebased.yaml\n",
"cat tmp/iris-model.rulebased.yaml"
],
"id": "acb7c57ecb3be9b",
"outputs": [
{
"name": "stdout",
Expand All @@ -266,17 +367,13 @@
]
}
],
"source": [
"%%bash\n",
"cat tmp/iris-model.rulebased.yaml"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"start_time": "2024-08-12T21:59:52.936844Z"
}
},
"id": "4fdea226f501455e"
"execution_count": 10
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We can then apply this model to new data:",
"id": "50f9cd9df60b41c9"
},
{
"cell_type": "code",
Expand Down Expand Up @@ -310,14 +407,26 @@
"id": "4df0d87dff96e667"
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## More advanced ML models\n",
"\n",
"Currently only Decision Trees are supported. Additionally, most of the underlying functionality of scikit-learn is hidden.\n",
"\n",
"For more advanced ML, you are encouraged to use linkml-store for *data management* and then export to standard tabular or dataframe formats in order to do more advanced ML in Python. linkml-store is *not* intended as an ML platform. Instead, a limited set of operations is provided to assist with data exploration and with the construction of deterministic rules.\n",
"\n",
"For inference using LLMs and Retrieval Augmented Generation, see the how-to guide on those topics.\n"
],
"id": "d1b583ce2d75c0e0"
},
{
"metadata": {},
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false
},
"id": "cef5b6e4ee9cb5f5"
"execution_count": null,
"source": "",
"id": "c8d9e36761d3088d"
}
],
"metadata": {
Expand Down
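The notebook's closing note recommends keeping data in linkml-store and exporting to a tabular format for heavier ML work in Python. As a rough, standard-library-only sketch of what that hand-off can look like, the function below turns JSON Lines records (such as the iris rows used above) into CSV text. The name `jsonl_to_csv` is hypothetical and not part of linkml-store; the tool's own `-O csv` output option, listed in the help text above, may be the more direct route.

```python
import csv
import io
import json
from typing import Iterable, List


def jsonl_to_csv(lines: Iterable[str]) -> str:
    """Convert JSON Lines records to CSV text (illustrative helper, not a linkml-store API)."""
    # Parse one JSON object per non-blank line.
    rows = [json.loads(line) for line in lines if line.strip()]

    # Collect column names in first-seen order so the header is stable.
    fieldnames: List[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are written as empty cells
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        '{"petal_length": 1.4, "petal_width": 0.2, "species": "setosa"}',
        '{"petal_length": 4.7, "petal_width": 1.4, "species": "versicolor"}',
    ]
    print(jsonl_to_csv(sample))
```

The resulting text can be written to a file and loaded by any dataframe or ML library of your choice.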
Binary file added docs/how-to/input/iris-model.png
3 changes: 2 additions & 1 deletion src/linkml_store/api/collection.py
Expand Up @@ -470,6 +470,7 @@ def search(
where: Optional[Any] = None,
index_name: Optional[str] = None,
limit: Optional[int] = None,
mmr_relevance_factor: Optional[float] = None,
**kwargs,
) -> QueryResult:
"""
Expand Down Expand Up @@ -534,7 +535,7 @@ def search(
index_col = ix.index_field
# TODO: optimize this for large indexes
vector_pairs = [(row, np.array(row[index_col], dtype=float)) for row in qr.rows]
results = ix.search(query, vector_pairs, limit=limit)
results = ix.search(query, vector_pairs, limit=limit, mmr_relevance_factor=mmr_relevance_factor, **kwargs)
for r in results:
del r[1][index_col]
new_qr = QueryResult(num_rows=len(results))
Expand Down
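The hunks above only show `mmr_relevance_factor` being threaded from `Collection.search` into `ix.search`; the index-side implementation is not visible in this part of the diff. For readers unfamiliar with the technique, here is a minimal sketch of generic Maximal Marginal Relevance (MMR) re-ranking under stated assumptions: cosine similarity as the scoring function, and a relevance factor of 1.0 meaning "rank purely by relevance". The names `cosine` and `mmr_rerank` are illustrative, and how linkml-store actually weights or scores results may differ.

```python
import math
from typing import Any, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_rerank(
    query_vec: Vector,
    candidates: Sequence[Tuple[Any, Vector]],
    relevance_factor: float = 0.5,
    limit: Optional[int] = None,
) -> List[Tuple[float, Any]]:
    """Greedy MMR: repeatedly pick the candidate that maximises
    relevance_factor * sim(query, c) - (1 - relevance_factor) * max sim(c, selected).
    (Hypothetical helper, not the linkml-store implementation.)"""
    remaining = list(candidates)
    selected_vecs: List[Vector] = []
    results: List[Tuple[float, Any]] = []
    n = len(remaining) if limit is None else min(limit, len(remaining))

    while len(results) < n:
        best_idx, best_score = 0, -math.inf
        for i, (_, vec) in enumerate(remaining):
            relevance = cosine(query_vec, vec)
            # Redundancy is the similarity to the closest already-selected item.
            redundancy = max((cosine(vec, s) for s in selected_vecs), default=0.0)
            score = relevance_factor * relevance - (1.0 - relevance_factor) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        item, vec = remaining.pop(best_idx)
        selected_vecs.append(vec)
        results.append((best_score, item))
    return results


if __name__ == "__main__":
    docs = [("a", [1.0, 0.1]), ("b", [1.0, 0.12]), ("c", [1.0, -0.5])]
    # "b" is a near-duplicate of "a"; lowering the factor favours the more distinct "c".
    print([item for _, item in mmr_rerank([1.0, 0.0], docs, relevance_factor=1.0, limit=2)])
    print([item for _, item in mmr_rerank([1.0, 0.0], docs, relevance_factor=0.5, limit=2)])
```

The greedy loop is quadratic in the number of candidates, which is in keeping with the existing `TODO: optimize this for large indexes` comment in `search`.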