
Adding chat history to RAG app and refactor to better utilize LangChain #648

Open · wants to merge 61 commits into main
Conversation

alpha-amundson (Collaborator)

See commit log for full description. tl;dr: added chat history to rag-frontend app.

Also introduced a basic session history mechanism in the browser to keep track of and retrieve chat history from Cloud SQL.

main.py - removed the old LangChain code and context-retrieval logic and replaced it with the new chain from rag_chain.py. Introduced a browser session with a 30-minute TTL; the session ID is stored in the session cookie and then used to retrieve chat history. Chat history is cleared when the timeout is reached.
cloud_sql.py - now includes a method to create a PostgresEngine for storing and retrieving history, plus a CustomVectorStore to perform the query embedding and vector search. Old code paths that were no longer needed have been removed.
rag_chain.py - contains a helper method, create_chain, to create, update and delete the end-to-end RAG chain with history (see the sketch below).
various tf files - increased the max input and total token limits on HF TGI for Mistral; threaded through some parameters needed to instantiate the PostgresEngine.
requirements.txt - added some dependencies needed for LangChain.
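
For orientation, here is a minimal sketch of how a history-aware chain along these lines can be wired together with LangChain. This is not the PR's actual rag_chain.py: it uses an in-memory history and a placeholder LLM where the PR uses a Cloud SQL (PostgresEngine)-backed history and the HF TGI model, and all names are illustrative.

```python
# Minimal sketch (not the PR's code): a RAG-style chain with per-session chat history.
# Assumes a recent langchain-core; the PR stores history in Cloud SQL instead of memory.
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using this context:\n{context}"),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{question}"),
])

_sessions: dict[str, InMemoryChatMessageHistory] = {}

def get_history(session_id: str) -> InMemoryChatMessageHistory:
    # The session ID would come from the browser session cookie described above.
    if session_id not in _sessions:
        _sessions[session_id] = InMemoryChatMessageHistory()
    return _sessions[session_id]

def create_chain(llm):
    # `llm` would be the TGI-backed chat model; any LangChain chat model works here.
    return RunnableWithMessageHistory(
        prompt | llm,
        get_history,
        input_messages_key="question",
        history_messages_key="history",
    )

# Usage (illustrative):
# chain = create_chain(llm)
# chain.invoke(
#     {"question": "What is Kubernetes?", "context": retrieved_context},
#     config={"configurable": {"session_id": session_id_from_cookie}},
# )
```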
alpha-amundson requested a review from imreddy13 on May 3, 2024 23:15
alpha-amundson changed the title from "Also introduced a basic session history mechanism in the browser to k…" to "Adding chat history to RAG app and refactor to better utilize LangChain" on May 3, 2024
@imreddy13 (Collaborator)

/gcbrun

@alpha-amundson (Collaborator, Author)

/gcbrun

1 similar comment
@alpha-amundson (Collaborator, Author)

/gcbrun

* Working on improvements for the rag application:
    - Working on missing TODOs
    - Fixing issue with credentials
    - Refactoring vector_storages so different vector storages can be added
      TODO: Vector Storage factory
    - Unit tests will be added in a future PR

* Updating changes with db

* Refactoring the app so it can be executed using gunicorn

* Refactoring the code as a Flask application package

* Fixing bugs
- Reviewing the issue with IPTypes; the current fix is to check whether this is a development environment so that a public Cloud SQL instance can be used (sketched below)
- Fixing issue with the Flask App Factory
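
For reference, a rough sketch of the application-factory layout and the development/public-IP check mentioned in the commits above, assuming the Cloud SQL Python Connector; the environment-variable name, instance string, and credentials are placeholders, not the PR's actual code.

```python
# Sketch only: Flask app factory that gunicorn can load, with the dev/public-IP check.
import os

from flask import Flask
from google.cloud.sql.connector import Connector, IPTypes


def create_app() -> Flask:
    app = Flask(__name__)

    # Hypothetical env var: in development, connect over the public IP of the
    # Cloud SQL instance; otherwise use the private IP.
    ip_type = IPTypes.PUBLIC if os.getenv("ENVIRONMENT") == "development" else IPTypes.PRIVATE
    connector = Connector()

    def connect_db():
        # Instance connection name, driver, and credentials are placeholders.
        return connector.connect(
            "project:region:instance",
            "pg8000",
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"],
            db=os.environ["DB_NAME"],
            ip_type=ip_type,
        )

    app.config["DB_CONNECT"] = connect_db
    # ... register blueprints / routes here ...
    return app

# With a factory like this, gunicorn can serve the app as:
#   gunicorn "frontend:create_app()"
```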
@german-grandas (Collaborator)

/gcbrun

* Working on Custom HuggingFace interface
     - Adding a custom chat model to send requests to the HuggingFace TGI API (see the sketch below)
     - Applying formatting to the code.
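
For context, calling a HuggingFace TGI endpoint from a custom chat wrapper usually boils down to a POST against its /generate route. A minimal sketch with a placeholder service URL and parameters (not the PR's implementation):

```python
# Sketch only: send a prompt to a HuggingFace TGI /generate endpoint.
import requests

TGI_URL = "http://tgi-service:8080"  # placeholder in-cluster service address

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    resp = requests.post(
        f"{TGI_URL}/generate",
        json={
            "inputs": prompt,
            "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.7},
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]
```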
* Improving the CloudSQL vector_storage (see the sketch below)
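
A rough sketch of the embed-then-search query such a Cloud SQL (pgvector) vector storage typically runs; the table, column names, and embedding function are illustrative, not the PR's CustomVectorStore:

```python
# Sketch only: embed the query and run a pgvector cosine-distance search.
# Assumes a table like documents(id, text, embedding vector(768)) with the
# pgvector extension installed; all names here are illustrative.
def similarity_search(conn, embed_fn, query: str, k: int = 4) -> list[str]:
    query_embedding = embed_fn(query)  # e.g. a sentence-transformers model
    with conn.cursor() as cur:
        cur.execute(
            "SELECT text FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
            (str(query_embedding), k),
        )
        return [row[0] for row in cur.fetchall()]
```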
@german-grandas (Collaborator)

/gcbrun

@german-grandas (Collaborator)

/gcbrun

@german-grandas (Collaborator)

/gcbrun

@german-grandas (Collaborator)

Some prompt/answer examples using meta-llama/Llama-2-7b-hf:

[screenshot: prompt_with_meta_llama_7b]

Some prompt/answer examples using meta-llama/Llama-2-7b-chat-hf:

[screenshot: prompt_with_meta_llama_7b_chat]

@german-grandas (Collaborator)

/gcbrun

@german-grandas (Collaborator)

Examples of answers.

Previous RAG without Chat History:

[screenshots: t1_main_img, t2_main_img]

RAG with Chat History:

[screenshots: t3_rag_img, t3_rag_img_2]

@gongmax (Collaborator) commented Sep 19, 2024

@german-grandas, in the snapshot of "Previous RAG without Chat History", I saw some errors thrown. I tried with the latest code on the main branch (i.e. the previous RAG without chat history) and didn't see any errors. Do you know what was going wrong? Below is my snapshot:

[screenshot]

@german-grandas (Collaborator)

@gongmax In the example I'm deploying with the image "us-central1-docker.pkg.dev/ai-on-gke/rag-on-gke/frontend@sha256:ec0e7b1ce6d0f9570957dd7fb3dcf0a16259cba915570846b356a17d6e377c59", the same image used on main in applications/rag/frontend/main.tf.

Which image did you use in the test you mentioned?

@gongmax (Collaborator) commented Sep 20, 2024

@gongmax In the example I'm deploying with the image "us-central1-docker.pkg.dev/ai-on-gke/rag-on-gke/frontend@sha256:ec0e7b1ce6d0f9570957dd7fb3dcf0a16259cba915570846b356a17d6e377c59", the same image used on main in applications/rag/frontend/main.tf.

Which image did you use in the test you mentioned?

I didn't make any changes; I just pulled the main branch and deployed the whole application. The image is the same as the one you mentioned above: us-central1-docker.pkg.dev/ai-on-gke/rag-on-gke/frontend@sha256:ec0e7b1ce6d0f9570957dd7fb3dcf0a16259cba915570846b356a17d6e377c59

@german-grandas (Collaborator)

It's odd; reviewing the logs, I see a warning from the database, and that's what the frontend is showing as part of the prompt response.

Not sure how to track this down. Any ideas?

@german-grandas (Collaborator)

/gcbrun

@gongmax (Collaborator) commented Oct 1, 2024

/gcbrun

1 similar comment
@gongmax (Collaborator) commented Oct 1, 2024

/gcbrun

@german-grandas (Collaborator)

/gcbrun

1 similar comment
@gongmax (Collaborator) commented Oct 3, 2024

/gcbrun

@german-grandas (Collaborator)

/gcbrun

@german-grandas (Collaborator)

Comments about improving the quality of the answer generation:

Currently, the model Mistral-7B-Instruct-v0.1 is showing a lack of quality when answering a given question. A first-pass analysis suggests the model is having trouble handling the long prompt that results from including a longer context plus the chat history. The following is an example of the current prompt for the question "what's kubernetes?":

System: 
### [INST]
Instruction: Always assist with care, respect, and truth. Respond with utmost utility yet securely.
Avoid harmful, unethical, prejudiced, or negative content.
Ensure replies promote fairness and positivity.
Here is context to help:


Kubernetes offers features to help you run highly available applications even when you
introduce frequent voluntary disruptions.
As an application owner, you can create a PodDisruptionBudget (PDB) for each application. A
PDB limits the number of Pods of a replicated application that are down simultaneously from
voluntary disruptions. For example, a quorum-based application would like to ensure that the
number of replicas running is never brought below the number needed for a quorum. A web
front end might want to ensure that the number of replicas serving load never falls below a
certain percentage of the total.
Cluster managers and hosting providers should use tools which respect PodDisruptionBudgets
by calling the Eviction API  instead of directly deleting pods or deployments.
For example, the kubectl drain  subcommand lets you mark a node as going out of service.
When you run kubectl drain , the tool tries to evict all of the Pods on the Node you're taking out

topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:

kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller#what-is-a-replicationcontroller
availableReplicas (int32)
The number of available replicas (ready for at least minReadySeconds) for this replication controller.
[remaining stray bullet characters from the PDF-extracted chunk omitted]


[/INST]
Here's the previous messages so you can have context about what the user have ask you:

Human: What's kubernetes?

This is the answer for the previous prompt:

[screenshot of the model's answer]

In a further exercise, a more robust model, gemini-1.0-pro-002 from Vertex AI, was used, showing a large improvement in handling the given context and in answering the question:

[Screenshot 2024-10-21 113420]

It might be worth exploring a migration to Vertex AI instead of continuing to use open-source models from Hugging Face like Mistral.

@gongmax (Collaborator) commented Oct 21, 2024

Can you adjust the length of the chat history included in the context and see how it can impact the response?
Besides, any insight why it always includes 'AI' at the beginning of the response? Anything to do with how you construct the prompt and instruction?
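
For reference, limiting how much history goes into the prompt can be as simple as slicing the stored messages before they are formatted; a tiny sketch (illustrative, not the PR's code):

```python
# Sketch only: keep just the last N messages of the chat history so the
# retrieved context, rather than the history, dominates the prompt.
def trim_history(messages: list, max_messages: int = 6) -> list:
    return messages[-max_messages:]
```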

@german-grandas (Collaborator)

Can you adjust the length of the chat history included in the context and see how it can impact the response? Besides, any insight why it always includes 'AI' at the beginning of the response? Anything to do with how you construct the prompt and instruction?

There isn't any improvement with the changes you suggested; I maintain the hypothesis that the LLM is not able to handle the context or respond to the given prompt.

Regarding the LLM including 'AI' at the beginning of the response, it looks like that is just how the LLM generates the answer. I included instructions in the prompt not to do that, and the LLM keeps generating content as if it were a conversational agent rather than a plain text-generation LLM.

@german-grandas (Collaborator)

/gcbrun

@gongmax (Collaborator) commented Oct 29, 2024

Quote from chat with @german-grandas to keep track: "I made an update into the inference service so the LLM can support the generation of a longer answer. that fixed the issue with the short generation of the rag system when you submitted a question."
