-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: semantic search on documents #812
base: main
Are you sure you want to change the base?
Conversation
# Conflicts: # search/docker-compose.yml # search/search/es.py # search/search/harvester.py # search/search/schemas/pieces.py
BREAKING CHANGE: changes docker-compose
.env.example
Outdated
@@ -32,6 +32,9 @@ S3_PREFIX= | |||
S3_ACCESS_KEY=minioadmin | |||
S3_SECRET_KEY=minioadmin | |||
|
|||
S3_START_PATH=annotation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly these S3_START_PATH
and S3_TEXT_PATH
are doing? Is it required for all services?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are folder names in Minio. I guess if service works directly with S3 it should have known it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@khyurri
I have cleaned up all unnecessary properties. here's the scope of changes:
- moved search configs from root env to search env
- extracted embeddings service
- adjust docker-compose & README notes
.env.example
Outdated
MANIFEST=manifest.json | ||
TEXT_PIECES_PATH=/pieces | ||
INDEXATION_PATH=/indexation | ||
JOBS_URL=http://jobs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use JOBS_SERVICE_*
in search microservice to connect co jobs microservices
.env.example
Outdated
ES_HOST=badgerdoc-elasticsearch | ||
ES_PORT=9200 | ||
APP_TITLE=Badgerdoc Search | ||
ANNOTATION_URL=http://annotation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use ANNOTATION_SERVICE_*
in search microservice to connect to annotation microservice
.env.example
Outdated
# Search Service | ||
ES_HOST=badgerdoc-elasticsearch | ||
ES_PORT=9200 | ||
APP_TITLE=Badgerdoc Search |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to use APP_TITLE
in search microservice? Or it's badgerdoc configuration?
.env.example
Outdated
ES_PORT=9200 | ||
APP_TITLE=Badgerdoc Search | ||
ANNOTATION_URL=http://annotation | ||
ANNOTATION_CATEGORIES=/categories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can move URI inside search microservice, we really don't need to configure path for all services globally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'd agree. Am I right to put in .env for microservice is enough?
.env.example
Outdated
ANNOTATION_URL=http://annotation | ||
ANNOTATION_CATEGORIES=/categories | ||
ANNOTATION_CATEGORIES_SEARCH=/categories/search | ||
MANIFEST=manifest.json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this MANIFEST exists globally? Is it for search microservice only?
.env.example
Outdated
JOBS_URL=http://jobs | ||
JOBS_SEARCH=/jobs/search | ||
COMPUTED_FIELDS=["job_id", "category"] | ||
EMBED_URL=http://embeddings:3334/api/use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If EMBED is new microservice, ensure configuration aligned with other host configuration, for example:
EMBEDINGS_SERVICE_SCHEME=http
EMBEDINGS_SERVICE_HOST=badgerdoc-embeddings
EMBEDINGS_SERVICE_PORT=8080
docker-compose-dev.yaml
Outdated
@@ -268,6 +270,65 @@ services: | |||
devices: | |||
- "/dev/fuse" | |||
|
|||
badgerdoc-elasticsearch: | |||
container_name: badgerdoc-elasticsearch | |||
image: amazon/opendistro-for-elasticsearch:latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to bind exact version here
docker-compose-dev.yaml
Outdated
@@ -11,6 +11,7 @@ | |||
# file for local needs) | |||
# - 8082 for Keycloak. | |||
# - 8083 for BadgerDoc web | |||
# 8084 for BadgerDoc search |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly exposed in 8084?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is search API microservice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
web talks to that service via RP
docker-compose-dev.yaml
Outdated
networks: | ||
- badgerdoc | ||
ports: | ||
- "0.0.0.0:3334:3334" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you expose port here? Do we really need this service outside of reproxy
?
docker-compose-dev.yaml
Outdated
- ROOT_PATH=/search | ||
working_dir: /opt/search | ||
ports: | ||
- 8084:8080 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All microservices proxies from reproxy
to keep the same port for all microservices. You need to remove ports bindings here and add EXPOSE
to dockerfile
search/create_dataset.py
Outdated
import csv | ||
from opensearchpy import OpenSearch, helpers | ||
|
||
PATH_PRODUCTS_DATASET = "data/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All hardcode must be moved to configuration or global configuration
search/create_dataset.py
Outdated
annotation_dataset = load_annotation_dataset() | ||
|
||
#### Use the embedding model to calculate vectors for all annotation texts | ||
print("Computing embeddings for %d sentences" % len(annotation_dataset)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use logging instead of print
search/create_dataset.py
Outdated
es = OpenSearch([{"host": ES_HOST, "port": ES_PORT}]) | ||
|
||
#### load test data set | ||
annotation_dataset = load_annotation_dataset() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it expected, that we load annotation_dataset globally here?
search/embeddings/main.py
Outdated
@@ -0,0 +1,66 @@ | |||
from flask import request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use flask in Badgerdoc, use Fast API instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good job and great progress, however PR must be changed according current Badgerdoc requirements for development!
Thank you!
No description provided.