Research: Enhance Testing and Automation for LLM Response Accuracy, accuracy monitoring DAG: automated tests to count docs, test queries, others #195
Comments
David to work on this
Feel free to assign this to me! Happy to take a look
Did we make any progress on this?
Update on this issue

Summary

Since the inception of RAG-based LLM applications, there has been a constant pursuit of more efficient ways to automate accuracy testing. Unfortunately, even today there is no straightforward way to automatically assess the accuracy of a RAG-based LLM application as is commonly done for traditional machine learning models, for example through metrics like balanced accuracy, F1 score, or precision-recall for classification models. While some might argue that token loss and token accuracy can be computed automatically, these metrics don't fully capture the application's performance: two distinctly phrased paragraphs conveying the same meaning can yield significantly different token loss values. This fundamental issue makes human evaluation necessary when appraising the performance of RAG-based applications. We should therefore explore methods that assist our human evaluation process rather than expect to replace human involvement entirely.

Proposed Next Steps

We can continue with method 1 outlined below while exploring method 2. These two paths aren't mutually exclusive, but eventually we can decide to adopt one approach.
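To make the limitation of token-level metrics concrete, here is a minimal sketch (plain Python, with hypothetical example strings) of a token-overlap F1 score: a paraphrase that a human would judge as equivalent to the reference scores far lower than a verbatim copy.

```python
# Minimal sketch: token-overlap F1 between a reference answer and two candidates.
# The paraphrased candidate conveys the same meaning but scores much lower,
# illustrating why token-level metrics alone can't replace human judgment.
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    ref_tokens = Counter(reference.lower().split())
    cand_tokens = Counter(candidate.lower().split())
    overlap = sum((ref_tokens & cand_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)


reference = "Use the retries parameter on the task to rerun it automatically on failure."
verbatim = "Use the retries parameter on the task to rerun it automatically on failure."
paraphrase = "Set retries on that task so it will automatically run again when it fails."

print(token_f1(reference, verbatim))    # 1.0
print(token_f1(reference, paraphrase))  # much lower, despite equivalent meaning
```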
Future Explorations

We can introduce beta and A/B testing by exposing changes to the LLM, data-ingestion, or RAG pipeline to a small fraction of users, then checking the feedback scores on those responses to decide whether they are getting better or worse. Additional feedback mechanisms, such as document quality and relevance of retrieved documents, can also be explored.
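A rough sketch of the A/B idea, assuming we log a numeric feedback score per response and tag each response with the pipeline variant that produced it; the function and field names below are hypothetical, not existing project code.

```python
# Sketch: deterministic user bucketing for an A/B test on the RAG pipeline,
# plus a simple comparison of mean feedback scores between variants.
import hashlib
from statistics import mean


def assign_variant(user_id: str, experiment: str, treatment_fraction: float = 0.1) -> str:
    """Hash the user id so the same user always lands in the same bucket."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000
    return "treatment" if bucket < treatment_fraction else "control"


def compare_feedback(events: list[dict]) -> dict:
    """events look like {'variant': 'control' | 'treatment', 'score': float}."""
    scores: dict[str, list[float]] = {"control": [], "treatment": []}
    for event in events:
        scores[event["variant"]].append(event["score"])
    return {variant: (mean(vals) if vals else None) for variant, vals in scores.items()}


# Example: route a request, then summarize collected feedback.
variant = assign_variant(user_id="user-123", experiment="reranker-v2")
print(compare_feedback([
    {"variant": "control", "score": 4.0},
    {"variant": "treatment", "score": 4.5},
]))
```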
@davidgxue can you mark in the issue description which tasks are done?
@phanikumv this research task is done and we are waiting for a review from Steven. @davidgxue has created further issues which will be done in the next release as discussed with Steven yesterday. |
Tasks in the description are still open. If they are done, please mark them as such |
I revised the original issue description and marked the items as completed. I am working on the new issues I have created. At this point, I think we can mark this issue as completed once Steven reviews it.
Thanks David. Let's leave this open until we've decided what to do with (1) auto-populating the test spreadsheet via a DAG, and (2) evaluating with a secondary LLM. We'll discuss tomorrow. |
Discussed with @davidgxue — let's consider using another LLM to evaluate the quality of responses, and then add that to the DAG so that we can have it run regularly to generate accuracy metrics. |
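A rough sketch of what that could look like, assuming Airflow 2.x with the TaskFlow API and an OpenAI-style chat model as the judge; `fetch_test_queries`, `get_app_answer`, the model name, and the grading prompt are all placeholders and assumptions rather than existing project code.

```python
# Sketch: scheduled accuracy-monitoring DAG that runs a fixed set of test queries
# through the application and asks a secondary LLM to grade each answer.
from datetime import datetime

from airflow.decorators import dag, task

JUDGE_PROMPT = (
    "You are grading an answer produced by a documentation assistant.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with a single integer from 1 (completely wrong) to 5 (fully correct)."
)


def get_app_answer(question: str) -> str:
    # Placeholder for the real application call (e.g. hitting the app's answer endpoint).
    return "stub answer"


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def llm_accuracy_monitoring():
    @task
    def fetch_test_queries() -> list[str]:
        # Placeholder: load the curated test questions (e.g. from the test spreadsheet).
        return ["How do I set retries on a task?"]

    @task
    def answer_queries(questions: list[str]) -> list[dict]:
        return [{"question": q, "answer": get_app_answer(q)} for q in questions]

    @task
    def judge_answers(pairs: list[dict]) -> list[dict]:
        from openai import OpenAI

        client = OpenAI()  # assumes an API key is configured in the environment
        graded = []
        for pair in pairs:
            response = client.chat.completions.create(
                model="gpt-4",  # judge model is an assumption; any capable model works
                messages=[{"role": "user", "content": JUDGE_PROMPT.format(**pair)}],
            )
            graded.append({**pair, "score": int(response.choices[0].message.content.strip())})
        return graded  # downstream, these scores could be written to the test spreadsheet

    judge_answers(answer_queries(fetch_test_queries()))


llm_accuracy_monitoring()
```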
Please describe the feature you'd like to see
The current approach to testing the accuracy of model responses is foundational but requires further development. We need a robust, industry-standard testing and automation system to ensure the accuracy of responses from our Large Language Model (LLM). This issue involves researching and proposing a method aligned with industry standards, planning its implementation, and incorporating these enhancements in the upcoming release.
Describe the solution you'd like
To enhance the accuracy testing of LLM responses, we propose implementing a comprehensive testing and automation framework. This would involve: