Research: Enhance Testing and Automation for LLM Response Accuracy, accuracy monitoring DAG: automated tests to count docs, test queries, others #195

Open
4 tasks done
sunank200 opened this issue Dec 1, 2023 · 10 comments

@sunank200
Collaborator

sunank200 commented Dec 1, 2023

Please describe the feature you'd like to see

The current approach to testing the accuracy of model responses is foundational but requires further development. We need a robust, industry-standard testing and automation system to ensure the accuracy of responses from our Large Language Model (LLM). This issue involves researching and proposing a method aligned with industry standards, planning its implementation, and incorporating these enhancements in the upcoming release.

Describe the solution you'd like

To enhance the accuracy testing of LLM responses, we propose implementing a comprehensive testing and automation framework. This would involve:

  • Research industry standards for testing LLMs.
  • Investigate the feasibility of developing a test suite that includes various accuracy metrics and automation tools.
  • Create issues for integrating the new testing approach into the upcoming release, with detailed documentation and examples for ease of understanding and adoption.
  • Create additional issues as needed.
@sunank200 sunank200 changed the title Research: Accuracy monitoring DAG: automated tests to count docs, test queries, others Research:Enhance Testing and Automation for LLM Response Accuracy, accuracy monitoring DAG: automated tests to count docs, test queries, others Dec 1, 2023
@phanikumv
Collaborator

David to work on this

@davidgxue
Contributor

davidgxue commented Dec 19, 2023

Feel free to assign this to me! Happy to take a look

@phanikumv
Collaborator

Did we make any progress on this?

@davidgxue
Contributor

Update on this issue

Summary

Since the inception of RAG-based LLM applications, there has been a constant pursuit of more efficient ways to automate accuracy testing. Unfortunately, even today there is no straightforward way to automatically assess the accuracy of a RAG-based LLM application the way we do for traditional machine learning models, for example with metrics like balanced accuracy, F1 score, or precision and recall for classification models. While some might argue that token loss and token accuracy can be computed automatically, these metrics don't fully capture the application's performance: two distinctly phrased paragraphs conveying the same meaning can yield significantly different token-level scores (the short sketch below illustrates this). This fundamental issue makes human evaluation necessary when appraising the performance of RAG-based applications. Therefore, we should explore methods that assist our human evaluation process rather than expect to completely replace human involvement.
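To make that concrete, here is a purely illustrative Python sketch (the reference and candidate answers are made up, and the metric is a simple bag-of-words F1, not anything we actually run): a verbatim answer scores a perfect 1.0, while a correct paraphrase scores far lower.

```python
# Illustrative only: token-overlap metrics penalize paraphrases that are
# semantically equivalent to the reference answer.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-words F1 between two strings (a common SQuAD-style metric)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "Set the retries parameter in default_args to control task retries."
answer_a = "Set the retries parameter in default_args to control task retries."
answer_b = "You can control how many times a task is retried by configuring retries inside default_args."

print(token_f1(answer_a, reference))  # 1.0 -- verbatim match
print(token_f1(answer_b, reference))  # much lower, yet the meaning is the same
```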

Proposed Next Steps

We can continue method 1 outlined below while exploring method 2. These two paths aren't mutually exclusive, but we can eventually decide to adopt one of them.

  1. Continue/improve our current testing method: Spreadsheet + Slackbot + Human Evaluation
  • We currently keep a fixed list of questions in a Google Spreadsheet. A Python script automatically pulls the questions, posts them to a test Slack workspace, and tags the Ask Astro Slack bot to get responses (a minimal sketch of this flow appears after this list). A QA engineer then manually reviews the bot's answers and copies the responses into the spreadsheet, along with evaluation comments.
  • With Michael's PR (Automate Ask Astro QA #210), we may be able to run this via Airflow DAGs to auto-populate the spreadsheet, but a human is still needed to review the responses in the spreadsheet.
  2. LangSmith Dataset & Evaluation Pipeline: Semi-automate evaluation with an LLM bot that assists the evaluation
  • We can employ another LLM to compare pre- and post-change responses head to head and decide which one is better. This secondary LLM can also compare the responses generated by the RAG application against a reference answer we've prepared, to determine whether the application remains on the right track. Here's some basic info on LangSmith's Dataset and Test feature: https://docs.smith.langchain.com/evaluation. I am currently exploring this method. It does not require immediate code changes, and anything code-related would most likely be a locally run script.
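For reference on method 1, here is a minimal sketch of the spreadsheet-plus-Slack posting step. This is not the actual Ask Astro script: it assumes a gspread service-account credential, a Slack bot token in SLACK_BOT_TOKEN, and placeholder sheet name, channel ID, and bot user ID.

```python
# Minimal sketch of the "spreadsheet + Slack bot" flow described above.
# All identifiers (sheet name, channel ID, bot user ID) are placeholders.
import os

import gspread
from slack_sdk import WebClient

gc = gspread.service_account()                        # reads credentials from the default path
worksheet = gc.open("Ask Astro QA questions").sheet1  # hypothetical sheet name
questions = worksheet.col_values(1)[1:]               # skip the header row

slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
ASK_ASTRO_BOT_ID = "U00000000"                        # placeholder bot user ID
TEST_CHANNEL_ID = "C00000000"                         # placeholder test-channel ID

for question in questions:
    # Tag the Ask Astro bot so it answers in the test channel; a QA engineer
    # (or the DAG from PR #210) later copies the replies back into the sheet.
    slack.chat_postMessage(
        channel=TEST_CHANNEL_ID,
        text=f"<@{ASK_ASTRO_BOT_ID}> {question}",
    )
```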

Future Explorations

We can introduce beta and A/B testing by exposing changes to the LLM/data-ingestion/RAG pipeline to a small fraction of users and checking the feedback scores on their responses to decide whether answers are getting better or worse (a rough sketch of the idea follows). Additional feedback signals, such as document quality and document relevancy, can also be explored.
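As a rough illustration of the A/B idea (not an existing Ask Astro feature), the sketch below buckets a stable fraction of users into the candidate pipeline via a deterministic hash and compares mean feedback scores; the rollout fraction and score lists are invented.

```python
# Hypothetical sketch: route a small, stable fraction of users to the candidate
# RAG pipeline and compare average feedback scores between the two variants.
import hashlib
from statistics import mean

ROLLOUT_FRACTION = 0.10  # made-up 10% rollout

def uses_candidate_pipeline(user_id: str) -> bool:
    """Deterministically bucket a user so they always see the same variant."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return (int(digest, 16) % 100) / 100 < ROLLOUT_FRACTION

# Feedback scores (e.g. thumbs up = 1, thumbs down = 0) collected per variant.
control_scores = [1, 1, 0, 1, 1, 0, 1]
candidate_scores = [1, 0, 1, 1, 1, 1, 1]

print(f"control:   {mean(control_scores):.2f}")
print(f"candidate: {mean(candidate_scores):.2f}")
```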

@phanikumv
Collaborator

@davidgxue can you mark in the issue description which tasks are done?

@sunank200
Collaborator Author

@phanikumv this research task is done and we are waiting for a review from Steven. @davidgxue has created further issues which will be done in the next release as discussed with Steven yesterday.

@phanikumv
Collaborator

Tasks in the description are still open. If they are done, please mark them as such

@davidgxue
Contributor

I revised the original issue description and marked the items as completed. I am working on the new issues I have created. At this point, I think we can mark this issue as completed once Steven reviews it.

@shillion
Collaborator

Thanks David. Let's leave this open until we've decided what to do with (1) auto-populating the test spreadsheet via a DAG, and (2) evaluating with a secondary LLM. We'll discuss tomorrow.

@shillion
Collaborator

Discussed with @davidgxue — let's consider using another LLM to evaluate the quality of responses, and then add that to the DAG so that we can have it run regularly to generate accuracy metrics.
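One possible shape for that DAG, sketched below under stated assumptions: the model name, grading prompt, eval set, and the ask_astro_answer helper are placeholders, and it calls the OpenAI client directly rather than LangSmith's evaluation tooling.

```python
# Sketch of a scheduled accuracy-monitoring DAG: a secondary LLM grades each
# Ask Astro answer against a prepared reference answer. Everything below
# (model, prompt, eval set, ask_astro_answer) is a placeholder, not the real app.
from datetime import datetime

from airflow.decorators import dag, task


def ask_astro_answer(question: str) -> str:
    # Placeholder: the real task would call the deployed Ask Astro endpoint.
    return "Set the retries parameter in default_args to retry failed tasks."


@dag(schedule="@weekly", start_date=datetime(2024, 1, 1), catchup=False)
def llm_accuracy_monitoring():

    @task
    def load_eval_set() -> list[dict]:
        # In practice, pull question/reference pairs from the QA spreadsheet
        # or a LangSmith dataset; hard-coded here for brevity.
        return [
            {"question": "How do I retry a failed Airflow task?",
             "reference": "Set the retries parameter in default_args."},
        ]

    @task
    def judge(examples: list[dict]) -> float:
        from openai import OpenAI

        client = OpenAI()  # requires OPENAI_API_KEY in the environment
        grades = []
        for ex in examples:
            answer = ask_astro_answer(ex["question"])
            verdict = client.chat.completions.create(
                model="gpt-4o",
                messages=[{
                    "role": "user",
                    "content": (
                        f"Reference answer:\n{ex['reference']}\n\n"
                        f"Candidate answer:\n{answer}\n\n"
                        "Reply with exactly one word: CORRECT or INCORRECT."
                    ),
                }],
            ).choices[0].message.content
            grades.append(verdict.strip().upper().startswith("CORRECT"))
        # The fraction of CORRECT verdicts becomes the accuracy metric to
        # log, chart, or alert on after each scheduled run.
        return sum(grades) / len(grades)

    judge(load_eval_set())


llm_accuracy_monitoring()
```

The pairwise pre/post-change comparison from method 2 could be added as a second task in the same DAG once we settle on the evaluation prompt.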

@davidgxue davidgxue modified the milestone: 0.3.0 Jan 17, 2024