Synthetic Data Generation (SDG)

Python library for Synthetic Data Generation

Introduction

Synthetic Data Generation (SDG) is a process that creates an artificially generated dataset that mimics real data based on provided examples. SDG uses a YAML file containing question-and-answer pairs as input data.
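To make the input more concrete, here is a minimal sketch of what question-and-answer seed data might look like once the YAML file has been loaded into Python. The field names (`seed_examples`, `question`, `answer`) and the helper `validate_seed_examples` are assumptions made for illustration; the authoritative schema is whatever the taxonomy and your installed SDG version define.

```python
# Hypothetical in-memory form of a question-and-answer YAML file.
# The field names are illustrative assumptions, not a confirmed SDG schema.
seed_data = {
    "seed_examples": [
        {"question": "What is the capital of France?", "answer": "Paris."},
        {"question": "How many days are in a leap year?", "answer": "366."},
    ]
}


def validate_seed_examples(data):
    """Return a list of (question, answer) pairs; raise ValueError on bad input."""
    examples = data.get("seed_examples")
    if not isinstance(examples, list) or not examples:
        raise ValueError("expected a non-empty 'seed_examples' list")
    pairs = []
    for index, example in enumerate(examples):
        question = str(example.get("question", "")).strip()
        answer = str(example.get("answer", "")).strip()
        if not question or not answer:
            raise ValueError(f"example {index} needs both a question and an answer")
        pairs.append((question, answer))
    return pairs


pairs = validate_seed_examples(seed_data)
print(f"{len(pairs)} seed examples loaded")
```

Checking the examples before generation starts is a cheap way to catch an empty or malformed file early.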

Installing the SDG library

Clone the library and navigate to the repo:

git clone https://github.com/instructlab/sdg
cd sdg

Install the library:

pip install .

Using the library

You can import SDG into your Python files with the following statements:

 from instructlab.sdg.generate_data import generate_data
 from instructlab.sdg.utils import GenerateException
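One plausible way to use these two imports together is to wrap the generation call and handle `GenerateException` in one place. The sketch below is an assumption-laden illustration, not the library's documented usage: `run_generation` is a hypothetical helper, the keyword arguments are passed through untouched because the parameters `generate_data` accepts depend on the installed version, and stand-in objects are used so the example runs without the library.

```python
import logging

logger = logging.getLogger("sdg_example")


def run_generation(generate_fn, error_type, **kwargs):
    """Call generate_fn(**kwargs); return True on success, False if error_type is raised."""
    try:
        generate_fn(**kwargs)
    except error_type as exc:
        logger.error("Data generation failed: %s", exc)
        return False
    return True


# Stand-ins so the sketch runs without the library installed.
# They are NOT part of SDG; the 'output_dir' argument is purely illustrative.
class FakeGenerateException(Exception):
    pass


def fake_generate_data(**kwargs):
    if not kwargs.get("output_dir"):
        raise FakeGenerateException("no output directory given")


ok = run_generation(fake_generate_data, FakeGenerateException, output_dir="generated")
failed = run_generation(fake_generate_data, FakeGenerateException)
print(ok, failed)
```

With the library installed, you would pass `generate_data` and `GenerateException` in place of the stand-ins, along with the keyword arguments your installed version expects.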

Pipelines

A pipeline is a series of steps to execute in order to generate data.
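The idea of running steps in order can be sketched in a few lines. This is a conceptual illustration only: `run_pipeline`, `drop_empty`, and `add_prompt` are hypothetical names and do not reflect how SDG implements its pipelines internally.

```python
from typing import Callable, Iterable, List

Sample = dict
Step = Callable[[List[Sample]], List[Sample]]


def run_pipeline(steps: Iterable[Step], samples: List[Sample]) -> List[Sample]:
    """Feed the output of each step into the next, in the order given."""
    for step in steps:
        samples = step(samples)
    return samples


def drop_empty(samples: List[Sample]) -> List[Sample]:
    # Filter out samples whose question is blank.
    return [s for s in samples if s["question"].strip()]


def add_prompt(samples: List[Sample]) -> List[Sample]:
    # Attach a prompt string derived from each question.
    return [{**s, "prompt": f"Q: {s['question']}"} for s in samples]


result = run_pipeline(
    [drop_empty, add_prompt],
    [{"question": "What is SDG?"}, {"question": " "}],
)
print(result)
```

Because every step takes and returns the same kind of value, steps can be reordered or swapped without changing the runner.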

SDG ships with three default pipelines: simple, full, and eval. Each pipeline has its own hardware requirements.

Simple Pipeline

The simple pipeline is designed to be used with quantized Merlinite as the teacher model. It produces basic data generation results on low-end consumer-grade hardware, such as laptops and desktops with small or no discrete GPUs.

Full Pipeline

The full pipeline is designed to be used with Mixtral-8x7B-Instruct-v0.1 as the teacher model, but it has also been tested successfully with smaller models such as Mistral-7B-Instruct-v0.2 and even some quantized versions of the two teacher models. This is the preferred data generation pipeline on higher-end consumer-grade hardware and all enterprise hardware.

Eval Pipeline

The eval pipeline generates MMLU benchmark data that can later be used to evaluate a trained model on your knowledge dataset. It does not generate data for use during model training.

Pipeline architecture

All pipelines are written in YAML and must adhere to a specific schema.

The pipelines that generate data for model training (the simple and full pipelines) expect three pipeline configs: one each for knowledge, grounded skills, and freeform skills. These configs are expected to exist in files called knowledge.yaml, grounded_skills.yaml, and freeform_skills.yaml, respectively. For background on the differences between knowledge, grounded skills, and freeform skills, refer to the InstructLab Taxonomy repository.
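If you assemble a pipeline directory of your own, a quick check that the three expected files are present can save a failed run. The following is a minimal sketch using only the standard library; `missing_pipeline_configs` is a hypothetical helper, not part of SDG, and the temporary directory stands in for a real pipeline directory.

```python
import tempfile
from pathlib import Path

# File names taken from the description above.
EXPECTED_CONFIGS = ("knowledge.yaml", "grounded_skills.yaml", "freeform_skills.yaml")


def missing_pipeline_configs(pipeline_dir):
    """Return the expected config file names that are absent from pipeline_dir."""
    directory = Path(pipeline_dir)
    return [name for name in EXPECTED_CONFIGS if not (directory / name).is_file()]


# Demonstration: a directory that contains only two of the three files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("knowledge.yaml", "grounded_skills.yaml"):
        (Path(tmp) / name).write_text("# placeholder\n")
    missing = missing_pipeline_configs(tmp)

print(missing)
```

This check covers file presence only; it does not validate the contents against the pipeline schema.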

Repository structure

|-- src/instructlab/ (1)
|-- docs/ (2)
|-- scripts/ (3)
|-- tests/ (4)

(1) Contains the SDG code that interacts with InstructLab.
(2) Contains documentation on various SDG methodologies.
(3) Contains some utility scripts that are not part of any supported API.
(4) Contains all the tests for the SDG repository.
