Little Giants: Synthesizing High-Quality Embedding Data at Scale

📖 Paper • 🤗 Senior Generator • Data Revisor • Embedding Model

Usage

Use the classification tryout we provide and the above generators to synthesize your own embedding data!

Use our embedding model to perform all kinds of embedding tasks!
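
As a minimal sketch of one such embedding task (retrieval by cosine similarity), the snippet below ranks a few documents against a query. The vectors are toy placeholders, not outputs of our model. The helpers `cosine_similarity` and `rank_documents` are hypothetical names introduced for illustration and are not part of this repository. In practice the vectors would come from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted by descending similarity."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real model embeddings.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_documents(query, docs)
for index, score in ranking:
    print(index, round(score, 3))
```

Cosine similarity is used here only because it is a common default for comparing text embeddings; other similarity functions may suit a given task better.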

Abstract

Synthetic data generation has become an increasingly popular way of training models without the need for large, manually labeled datasets. For tasks like text embedding, synthetic data offers diverse and scalable training examples, significantly reducing the cost of human annotation. However, most current approaches rely heavily on proprietary models like GPT-4, which are expensive and inefficient for generating large-scale embedding data. In this paper, we introduce SPEED, a framework that aligns open-source small models (8B) to efficiently generate large-scale synthetic embedding data. Through supervised fine-tuning, preference optimization, and self-improvement, SPEED enables small open-source models to produce high-quality data. Remarkably, SPEED uses less than 1/10 of the GPT API calls while outperforming the state-of-the-art embedding model E5$_\text{mistral}$ when both are trained solely on their synthetic data. Using this efficient generator, we conduct a comprehensive study on how various factors within the alignment pipeline impact data quality and reveal the scaling law for synthetic embedding data.

Citation

Please cite our paper if it helps your research:

@article{chen2024little,
  title={Little Giants: Synthesizing High-Quality Embedding Data at Scale},
  author={Chen, Haonan and Wang, Liang and Yang, Nan and Zhu, Yutao and Zhao, Ziliang and Wei, Furu and Dou, Zhicheng},
  journal={arXiv preprint arXiv:2410.18634},
  year={2024}
}
