HTML webpages sometimes have embedded hyperlinks e.g. mylife
We currently remove these links before chunking, vectorizing, and inserting the docs into the vector DB
We can attempt to keep these so that the bot can answer with links if needed
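One way to keep links through ingestion is to rewrite each anchor tag as a Markdown-style link while extracting text. The sketch below uses only the Python standard library; the class and function names are hypothetical and are not taken from the existing ingestion code, which may use a different HTML parser.

```python
from html.parser import HTMLParser


class LinkPreservingExtractor(HTMLParser):
    """Collect visible text, rewriting <a href="..."> as [text](url)."""

    def __init__(self):
        super().__init__()
        self.parts = []       # output text fragments, in document order
        self.href = None      # href of the <a> currently open, if any
        self.link_text = []   # text seen inside the currently open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.link_text = []

    def handle_data(self, data):
        # Text inside a link is buffered until the closing </a>.
        if self.href is not None:
            self.link_text.append(data)
        else:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            text = "".join(self.link_text).strip()
            self.parts.append(f"[{text}]({self.href})")
            self.href = None


def html_to_text_with_links(html: str) -> str:
    """Return the text of `html` with hyperlinks kept as [text](url)."""
    parser = LinkPreservingExtractor()
    parser.feed(html)
    parser.close()  # flush any trailing buffered text
    return "".join(parser.parts)
```

Anchors without an `href` are passed through as plain text, so only real hyperlinks gain the `[text](url)` form.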
Astro Forum docs sometimes have bad formatting during ingestion
This issue was addressed for most of the other data sources, but forum docs have also shown it here and there
Increase document length from the current 2.5k tokens per doc back to the previous 4k
The reduction was made to cut costs, which may have mildly hindered retrieval performance (though this was not directly observed)
With the new GPT-4 Turbo model, which is 6x cheaper for input tokens, cost is no longer a major concern
Build out evaluated_rag DAG to use a judge to score improvements/degradation from the previous answer/reference answer
Add a quantitative measure as a judge, such as cosine similarity to the reference answer
Add an LLM as a judge
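As a rough illustration of the quantitative judge, the sketch below compares a new answer and the previous answer against the reference answer by cosine similarity. It assumes each answer has already been embedded into a vector by whatever embedding model the pipeline uses; the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


def improvement_over_previous(new_vec, previous_vec, reference_vec):
    """Positive if the new answer is closer to the reference than the previous one."""
    return (cosine_similarity(new_vec, reference_vec)
            - cosine_similarity(previous_vec, reference_vec))
```

A positive result suggests an improvement and a negative one a degradation, but embedding similarity only captures surface closeness to the reference, which is one reason to pair it with an LLM judge.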
Explore/experiment with top k, alpha and other parameters used for reranking and hybrid search
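To make the roles of alpha and top k concrete, here is a minimal sketch of weighted score fusion for hybrid search. It assumes the convention that alpha = 1 means pure vector search and alpha = 0 means pure keyword search; the actual vector DB may fuse scores differently, and all names below are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(vector_scores, keyword_scores, alpha=0.5, top_k=3):
    """Fuse vector and keyword scores and return the top_k doc ids.

    alpha weights the vector score; (1 - alpha) weights the keyword score.
    """
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    fused = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * k.get(doc_id, 0.0)
        for doc_id in set(v) | set(k)
    }
    # Sort by fused score, highest first; break ties by doc id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))[:top_k]
```

Sweeping `alpha` and `top_k` over a small grid and scoring each setting with the judge from the evaluation DAG would be one way to run the experiment.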
Explore adding an additional property in the vector DB
Add a property to each document chunk holding the title of the original page (or something similar), so that smaller chunks retain their connection to the overall topic of the full document
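The idea can be sketched as follows: every chunk carries the title of its source page, and the title is prepended to the text that gets embedded. This is an illustration only; the `Chunk` type and function names are hypothetical, and word count stands in for a real token count.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    title: str    # title of the original page, repeated on every chunk
    content: str  # the chunk's own text


def chunk_with_title(title, text, max_words=200):
    """Split text into chunks of at most max_words words, each tagged with title."""
    words = text.split()
    return [
        Chunk(title=title, content=" ".join(words[i:i + max_words]))
        for i in range(0, len(words), max_words)
    ]


def text_for_embedding(chunk):
    """Text to vectorize: the page title followed by the chunk content."""
    return f"{chunk.title}\n\n{chunk.content}"
```

Storing the title as its own property also leaves room to filter or boost on it at query time, instead of only folding it into the embedded text.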
Train and add an off-topic discussion text classifier (it could run before or after QA)
Treating on-topic as the positive class and off-topic as the negative class, we want to be lenient and accept questions even if they may be only loosely related, i.e. allow more false positives (favor recall over precision)
We can tentatively weight recall 5x higher than precision, so for the 𝐹𝛽 metric, 𝛽 = 5.
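For reference, the standard definition is 𝐹𝛽 = (1 + 𝛽²) · P · R / (𝛽² · P + R), where P is precision and R is recall. A small sketch of how 𝛽 = 5 shifts the score toward recall (the function name is hypothetical):

```python
def f_beta(precision, recall, beta=5.0):
    """F-beta score: recall is weighted beta times as heavily as precision."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when both are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# A lenient classifier (high recall, low precision) vs. a strict one.
lenient = f_beta(precision=0.5, recall=1.0)
strict = f_beta(precision=1.0, recall=0.5)
```

With 𝛽 = 5 the lenient classifier scores well above the strict one, whereas with 𝛽 = 1 the two would tie, which matches the stated preference for allowing more false positives.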