Data Ingestion Improvement/Cleanup/Bug Fix - Part 2 #307

Merged
davidgxue merged 6 commits into main from improve_data_ingestion_pt2
Mar 4, 2024

Conversation

@davidgxue (Contributor) commented Feb 28, 2024

Code is ready for review; evaluation and testing are still in progress.

Description

  • This is part 2 of the data ingestion improvements. The goal is to investigate potential issues in the data ingestion process, remove noisy data such as badly formatted content or gibberish text, fix bugs, and enforce standards on the process so that costs can be estimated predictably and reduced.

Technical Changes

  • Renamed the original split.py file to chunking_utils.py (multiple files changed because of the import rename).
  • During the above change, found that 2 DAGs had NOT been updated: the incremental ingest DAGs for Provider docs and SDK docs never had a chunking/splitting task (even though the bulk ingest does). I added chunking to these 2 DAGs.
  • Airflow Docs extraction process has been improved (see code for details).
  • Provider Docs extraction process has been improved (see code for details).
  • Astro SDK Docs extraction process has been improved (see code for details). Astro SDK docs are not currently being ingested, but I fixed the extraction code anyway since it exists.
  • Provider docs used to ingest all past versions, leading to many duplicate docs in the DB. It has now been changed to ingest only the stable version of the docs.
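
For reviewers, the stable-only filtering described in the last bullet can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the function name `keep_stable_docs`, the assumption that stable pages carry a `stable` path segment, and the example.com URLs are all hypothetical.

```python
from urllib.parse import urlparse


def keep_stable_docs(urls, stable_segment="stable"):
    """Keep URLs whose path has the stable segment; drop exact duplicates.

    Hypothetical helper: the real extraction code may identify the stable
    version differently.
    """
    seen = set()
    kept = []
    for url in urls:
        # Split the URL path into its segments, e.g. ["docs", "stable", "a.html"].
        segments = urlparse(url).path.strip("/").split("/")
        if stable_segment in segments and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


if __name__ == "__main__":
    candidate_urls = [
        "https://example.com/docs/stable/a.html",
        "https://example.com/docs/1.2.0/a.html",
        "https://example.com/docs/stable/a.html",
    ]
    print(keep_stable_docs(candidate_urls))
```

Filtering on the URL before fetching means older versions are never downloaded or embedded, which is where the duplicate rows in the DB would otherwise come from.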

Evaluations

  • Answer quality generally improved, with more concise answers and added hyperlinks when the model knows its answer is not exhaustive.
  • Please see the attached files for a quick list of results:

data_ingest_comparison_part_2.csv
data_ingest_results_part_2.csv

  • Bulk ingest also runs with no issues

Related Issues

closes #221
closes #258
closes #295 (the reranker has been addressed via GCP environment variables; the embedding model change was completed in a different PR)
closes #285 (This PR prevents empty docs from being ingested)
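
As a rough illustration of what "prevents empty docs from being ingested" can look like, here is a minimal sketch. It is not the code in this PR: the function name `drop_empty_docs`, the `content` key, and the dict-based document shape are assumptions made for the example.

```python
def drop_empty_docs(docs, min_chars=1):
    """Return only documents whose text has at least min_chars non-whitespace-trimmed characters.

    Hypothetical helper: each document is assumed to be a dict with a
    "content" key holding the extracted text (possibly None or missing).
    """
    kept = []
    for doc in docs:
        # Treat a missing or None content value as an empty string.
        text = (doc.get("content") or "").strip()
        if len(text) >= min_chars:
            kept.append(doc)
    return kept


if __name__ == "__main__":
    sample = [
        {"content": "Hello"},
        {"content": "   \n"},
        {"content": None},
    ]
    print(drop_empty_docs(sample))
```

Raising `min_chars` would also drop near-empty pages, at the cost of possibly discarding very short but legitimate documents.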

cloudflare-workers-and-pages bot commented Feb 28, 2024

Deploying with Cloudflare Pages

Latest commit: 3142709
Status: ✅  Deploy successful!
Preview URL: https://dc48aa3a.ask-astro.pages.dev
Branch Preview URL: https://improve-data-ingestion-pt2.ask-astro.pages.dev

@davidgxue marked this pull request as ready for review February 28, 2024 23:39
@davidgxue self-assigned this Feb 28, 2024
@Lee-W (Collaborator) left a comment

Left 1 nitpick. Rest looks good to me

airflow/include/tasks/extract/astronomer_providers_docs.py (outdated, resolved)
@pankajastro (Collaborator) left a comment

LGTM. But this might require a fresh ingestion for testing

airflow/include/tasks/extract/astronomer_providers_docs.py (outdated, resolved)
@sunank200 (Collaborator) left a comment

@davidgxue please run a fresh ingestion into the new database and test the changes before merging.

@davidgxue (Contributor, Author) commented

Updated the PR description with test results!!

@davidgxue merged commit df80ddc into main on Mar 4, 2024
8 checks passed
@davidgxue deleted the improve_data_ingestion_pt2 branch March 4, 2024 16:56