Research: Enhance Testing and Automation for LLM Response Accuracy, accuracy monitoring DAG: automated tests to count docs, test queries, others #195

Open
4 tasks done
sunank200 opened this issue Dec 1, 2023 · 10 comments

@sunank200
Collaborator

sunank200 commented Dec 1, 2023

Please describe the feature you'd like to see

The current approach to testing the accuracy of model responses is foundational but requires further development. We need a robust, industry-standard testing and automation system to ensure the accuracy of responses from our Large Language Model (LLM). This issue involves researching and proposing a method aligned with industry standards, planning its implementation, and incorporating these enhancements in the upcoming release.

Describe the solution you'd like

To enhance the accuracy testing of LLM responses, we propose implementing a comprehensive testing and automation framework. This would involve:

  • Research industry standards for testing LLMs.
  • Investigate the feasibility of developing a test suite that includes various accuracy metrics and automation tools.
  • Create issues for integrating the new testing approach into the upcoming release, with detailed documentation and examples for ease of understanding and adoption.
  • Create additional issues as needed.
@sunank200 sunank200 changed the title Research: Accuracy monitoring DAG: automated tests to count docs, test queries, others Research:Enhance Testing and Automation for LLM Response Accuracy, accuracy monitoring DAG: automated tests to count docs, test queries, others Dec 1, 2023
@phanikumv
Collaborator

David to work on this

@davidgxue
Contributor

davidgxue commented Dec 19, 2023

Feel free to assign this to me! Happy to take a look

@phanikumv
Collaborator

Did we make any progress on this?

@davidgxue
Contributor

Update on this issue

Summary

Since the inception of RAG-based LLM applications, there has been a constant pursuit of more efficient ways to automate accuracy testing. Unfortunately, even today there is no straightforward way to automatically assess the accuracy of a RAG-based LLM application the way we do for traditional machine learning models, for example with metrics like balanced accuracy, F1 score, or precision and recall for classification models. While some might argue that token loss and token accuracy can be computed automatically, these metrics don't fully capture the application's performance: two distinctly phrased paragraphs conveying the same meaning can yield significantly different token-level scores (the short sketch below illustrates this). This fundamental issue makes human evaluation necessary when appraising the performance of RAG-based applications. Therefore, we should explore methods that assist our human evaluation process rather than expect to completely replace human involvement.
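To make that concrete, here is a purely illustrative Python sketch (the reference and candidate answers are made up, and the metric is a simple bag-of-words F1, not anything we actually run): a verbatim answer scores a perfect 1.0, while a correct paraphrase scores far lower.

```python
# Illustrative only: token-overlap metrics penalize paraphrases that are
# semantically equivalent to the reference answer.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-words F1 between two strings (a common SQuAD-style metric)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "Set the retries parameter in default_args to control task retries."
answer_a = "Set the retries parameter in default_args to control task retries."
answer_b = "You can control how many times a task is retried by configuring retries inside default_args."

print(token_f1(answer_a, reference))  # 1.0 -- verbatim match
print(token_f1(answer_b, reference))  # much lower, yet the meaning is the same
```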

Proposed Next Steps

We can continue method 1 outlined below while exploring method 2. These two paths aren't mutually exclusive, but we can eventually decide to adopt one of them.

  1. Continue/improve our current testing method: Spreadsheet + Slackbot + Human Evaluation
  • We currently keep a fixed list of questions in a Google Spreadsheet. A Python script automatically pulls the questions, posts them to a test Slack workspace, and tags the Ask Astro Slack bot to get responses (a minimal sketch of this flow appears after this list). A QA engineer then manually reviews the bot's answers and copies the responses into the spreadsheet, along with evaluation comments.
  • With Michael's PR (Automate Ask Astro QA #210), we may be able to run this via Airflow DAGs to auto-populate the spreadsheet, but a human is still needed to review the responses in the spreadsheet.
  2. LangSmith Dataset & Evaluation Pipeline: Semi-automate evaluation with an LLM bot that assists the evaluation
  • We can employ another LLM to compare pre- and post-change responses head to head and decide which one is better. This secondary LLM can also compare the responses generated by the RAG application against a reference answer we've prepared, to determine whether the application remains on the right track. Here's some basic info on LangSmith's Dataset and Test feature: https://docs.smith.langchain.com/evaluation. I am currently exploring this method. It does not require immediate code changes, and anything code-related would most likely be a locally run script.
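For reference on method 1, here is a minimal sketch of the spreadsheet-plus-Slack posting step. This is not the actual Ask Astro script: it assumes a gspread service-account credential, a Slack bot token in SLACK_BOT_TOKEN, and placeholder sheet name, channel ID, and bot user ID.

```python
# Minimal sketch of the "spreadsheet + Slack bot" flow described above.
# All identifiers (sheet name, channel ID, bot user ID) are placeholders.
import os

import gspread
from slack_sdk import WebClient

gc = gspread.service_account()                        # reads credentials from the default path
worksheet = gc.open("Ask Astro QA questions").sheet1  # hypothetical sheet name
questions = worksheet.col_values(1)[1:]               # skip the header row

slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
ASK_ASTRO_BOT_ID = "U00000000"                        # placeholder bot user ID
TEST_CHANNEL_ID = "C00000000"                         # placeholder test-channel ID

for question in questions:
    # Tag the Ask Astro bot so it answers in the test channel; a QA engineer
    # (or the DAG from PR #210) later copies the replies back into the sheet.
    slack.chat_postMessage(
        channel=TEST_CHANNEL_ID,
        text=f"<@{ASK_ASTRO_BOT_ID}> {question}",
    )
```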

Future Explorations

We can introduce beta and A/B testing by exposing changes to the LLM/data-ingestion/RAG pipeline to a small fraction of users and checking the feedback scores on their responses to decide whether answers are getting better or worse (a rough sketch of the idea follows). Additional feedback signals, such as document quality and document relevancy, can also be explored.
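As a rough illustration of the A/B idea (not an existing Ask Astro feature), the sketch below buckets a stable fraction of users into the candidate pipeline via a deterministic hash and compares mean feedback scores; the rollout fraction and score lists are invented.

```python
# Hypothetical sketch: route a small, stable fraction of users to the candidate
# RAG pipeline and compare average feedback scores between the two variants.
import hashlib
from statistics import mean

ROLLOUT_FRACTION = 0.10  # made-up 10% rollout

def uses_candidate_pipeline(user_id: str) -> bool:
    """Deterministically bucket a user so they always see the same variant."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return (int(digest, 16) % 100) / 100 < ROLLOUT_FRACTION

# Feedback scores (e.g. thumbs up = 1, thumbs down = 0) collected per variant.
control_scores = [1, 1, 0, 1, 1, 0, 1]
candidate_scores = [1, 0, 1, 1, 1, 1, 1]

print(f"control:   {mean(control_scores):.2f}")
print(f"candidate: {mean(candidate_scores):.2f}")
```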

@phanikumv
Collaborator

@davidgxue can you mark in the issue description which tasks are done?

@sunank200
Collaborator Author

@phanikumv this research task is done and we are waiting for a review from Steven. @davidgxue has created further issues which will be done in the next release as discussed with Steven yesterday.

@phanikumv
Collaborator

Tasks in the description are still open. If they are done, please mark them as such

@davidgxue
Contributor

I revised the original issue description and marked the items as completed. I am working on the new issues I have created. At this point, I think we can mark this issue as completed once Steven reviews it.

@shillion
Collaborator

Thanks David. Let's leave this open until we've decided what to do with (1) auto-populating the test spreadsheet via a DAG, and (2) evaluating with a secondary LLM. We'll discuss tomorrow.

@shillion
Collaborator

Discussed with @davidgxue — let's consider using another LLM to evaluate the quality of responses, and then add that to the DAG so that we can have it run regularly to generate accuracy metrics.
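One possible shape for that DAG, sketched below under stated assumptions: the model name, grading prompt, eval set, and the ask_astro_answer helper are placeholders, and it calls the OpenAI client directly rather than LangSmith's evaluation tooling.

```python
# Sketch of a scheduled accuracy-monitoring DAG: a secondary LLM grades each
# Ask Astro answer against a prepared reference answer. Everything below
# (model, prompt, eval set, ask_astro_answer) is a placeholder, not the real app.
from datetime import datetime

from airflow.decorators import dag, task


def ask_astro_answer(question: str) -> str:
    # Placeholder: the real task would call the deployed Ask Astro endpoint.
    return "Set the retries parameter in default_args to retry failed tasks."


@dag(schedule="@weekly", start_date=datetime(2024, 1, 1), catchup=False)
def llm_accuracy_monitoring():

    @task
    def load_eval_set() -> list[dict]:
        # In practice, pull question/reference pairs from the QA spreadsheet
        # or a LangSmith dataset; hard-coded here for brevity.
        return [
            {"question": "How do I retry a failed Airflow task?",
             "reference": "Set the retries parameter in default_args."},
        ]

    @task
    def judge(examples: list[dict]) -> float:
        from openai import OpenAI

        client = OpenAI()  # requires OPENAI_API_KEY in the environment
        grades = []
        for ex in examples:
            answer = ask_astro_answer(ex["question"])
            verdict = client.chat.completions.create(
                model="gpt-4o",
                messages=[{
                    "role": "user",
                    "content": (
                        f"Reference answer:\n{ex['reference']}\n\n"
                        f"Candidate answer:\n{answer}\n\n"
                        "Reply with exactly one word: CORRECT or INCORRECT."
                    ),
                }],
            ).choices[0].message.content
            grades.append(verdict.strip().upper().startswith("CORRECT"))
        # The fraction of CORRECT verdicts becomes the accuracy metric to
        # log, chart, or alert on after each scheduled run.
        return sum(grades) / len(grades)

    judge(load_eval_set())


llm_accuracy_monitoring()
```

The pairwise pre/post-change comparison from method 2 could be added as a second task in the same DAG once we settle on the evaluation prompt.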

@davidgxue davidgxue modified the milestone: 0.3.0 Jan 17, 2024