Release 0.4.0 Proposal Items #318

Open
davidgxue opened this issue Mar 7, 2024 · 0 comments

Items to explore/implement

HTML webpages sometimes have embedded hyperlinks e.g. mylife

  • We currently remove these links before chunking, vectorizing, and inserting the docs into the vector db
  • We can attempt to keep these so that the bot can answer with links if needed
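One possible way to keep the links, sketched with only the Python standard library: rewrite each `<a href="...">` as a Markdown-style `[text](url)` while stripping the rest of the markup. The class and function names here are hypothetical and this is not the project's current ingestion code.

```python
from html.parser import HTMLParser


class LinkKeepingExtractor(HTMLParser):
    """Collects page text, rewriting <a href="..."> as [text](url)."""

    def __init__(self):
        super().__init__()
        self.parts = []       # output fragments, joined at the end
        self.href = None      # href of the <a> currently open, if any
        self.link_text = []   # text seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.link_text = []

    def handle_data(self, data):
        # Text inside a link is buffered until the closing </a>.
        if self.href is not None:
            self.link_text.append(data)
        else:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            text = "".join(self.link_text).strip()
            self.parts.append(f"[{text}]({self.href})")
            self.href = None


def html_to_text_with_links(html: str) -> str:
    """Strip tags from `html` but keep hyperlinks in Markdown form."""
    parser = LinkKeepingExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

For example, `html_to_text_with_links('<p>See <a href="https://example.com/docs">the docs</a>.</p>')` keeps the URL next to its anchor text, so the chunk that reaches the vector db still carries the link.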

Astro Forum docs sometimes have bad formatting during ingestion

  • This issue was addressed for most of the other data sources, but forum docs have also shown it here and there

Increase document length from the current 2.5k tokens per doc back to the previous 4k

  • The length was reduced to cut cost, which may have mildly hindered retrieval performance (though this was not actually observed)
  • With the new GPT-4 Turbo model, which has 6x cheaper input tokens, cost is no longer a major concern

Build out evaluated_rag DAG to use a judge to score improvements/degradation from the previous answer/reference answer

  • Add a quantitative measure as a judge, such as cosine similarity to the reference answer.
  • Add LLM as a judge
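A minimal sketch of the quantitative judge, assuming the answers have already been embedded into equal-length vectors by whatever embedding model the DAG uses. The function names and the `tolerance` threshold are hypothetical and not part of the existing evaluated_rag DAG.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero embedding as "no similarity"
    return dot / (norm_a * norm_b)


def judge_against_reference(reference_vec, previous_vec, new_vec, tolerance=0.01):
    """Label the new answer relative to the previous one.

    Both answers are compared to the reference answer; `tolerance` is an
    assumed dead band so that tiny score changes count as "unchanged".
    """
    previous_score = cosine_similarity(reference_vec, previous_vec)
    new_score = cosine_similarity(reference_vec, new_vec)
    delta = new_score - previous_score
    if delta > tolerance:
        return "improved"
    if delta < -tolerance:
        return "degraded"
    return "unchanged"
```

An LLM judge could return the same three labels, which would make the two judges easy to compare side by side.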

Explore/experiment with top k, alpha and other parameters used for reranking and hybrid search
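To make the roles of alpha and top k concrete, here is a toy sketch of hybrid score fusion. It assumes both score sets are already normalized to [0, 1] and that alpha = 1 means pure vector search while alpha = 0 means pure keyword search; the actual vector db may fuse scores differently, and `hybrid_top_k` is a hypothetical helper.

```python
def hybrid_top_k(vector_scores, keyword_scores, alpha, k):
    """Return the k best (doc_id, fused_score) pairs.

    fused = alpha * vector_score + (1 - alpha) * keyword_score
    A document missing from one result set scores 0.0 on that side.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    doc_ids = set(vector_scores) | set(keyword_scores)
    fused = {
        doc_id: alpha * vector_scores.get(doc_id, 0.0)
        + (1.0 - alpha) * keyword_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Sort by score descending, breaking ties by doc id for determinism.
    ranked = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]
```

Sweeping alpha over a small grid (for example 0.0, 0.25, 0.5, 0.75, 1.0) for a few values of k, and scoring each setting with the evaluation DAG, would be one way to run the experiment.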

Explore adding additional property in vector db

  • Add property/data on each document chunk with the title of the original page or something similar so that smaller chunks can retain semantic meaning better to the overall topic of the original full document
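A rough sketch of what this could look like: every chunk carries the title of its source page, and the title is prepended to the text that gets embedded. Whitespace-separated words stand in for real tokens here, and the `Chunk` fields and function names are hypothetical rather than the current schema.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    title: str        # title of the original page (the proposed property)
    chunk_index: int  # position of this chunk within the page
    content: str      # the chunk text itself


def chunk_with_title(title, text, max_words=300):
    """Split `text` into chunks of at most `max_words` words, each tagged with `title`."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        Chunk(title, start // max_words, " ".join(words[start:start + max_words]))
        for start in range(0, len(words), max_words)
    ]


def text_to_embed(chunk):
    """Text sent to the embedding model: the title gives a small chunk its topic."""
    return f"{chunk.title}\n\n{chunk.content}"
```

Storing the title as its own property, rather than only baking it into the content, would also allow filtering or boosting on it later.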

Train and add off-topic discussion text classifier (could be before QA or after QA)

  • Treating on-topic as the positive class and off-topic as the negative class, we want to be lenient and let through questions that may be only loosely related, i.e. allow more false positives (favor recall over precision)
  • We can tentatively weight recall 5x higher than precision, i.e. use 𝛽 = 5 for the metric 𝐹𝛽.
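For reference, the standard definition is 𝐹𝛽 = (1 + 𝛽²) · precision · recall / (𝛽² · precision + recall). A small sketch of the computation follows; the helper name is hypothetical, and the precision and recall values in the usage note are illustrative, not measured results.

```python
def f_beta(precision, recall, beta=5.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denominator
```

With 𝛽 = 5, a classifier with precision 0.60 and recall 0.95 scores well above one with precision 0.95 and recall 0.60, which matches the goal of tolerating false positives in order to avoid rejecting on-topic questions.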