Synthetic Data For LLM Training
Created: 2026-03-03 10:25
#quicknote
Synthetic data (training data generated by LLMs rather than humans) has become a critical enabler for modern LLM training pipelines. From Self-Instruct (2022) to Microsoft's Phi-4 (December 2024, trained primarily on ~400B tokens of synthetic data across 50+ datasets), synthetic data powers SFT, preference-pair generation for DPO (Direct Preference Optimization), reasoning-chain creation for RLVF (Reinforcement Learning from Verifiable Feedback), and trajectory generation for Agent Training and Fine-Tuning. The approach is transformative but carries risks, most notably model collapse when models train recursively on their own outputs. Part of the LLM Training and Alignment Evolution.
- Self-Instruct / Alpaca (2022–2023) — foundational approach: use a strong model to generate (instruction, output) pairs from a small seed set, then fine-tune a smaller model on them. Self-Instruct used GPT-3; Alpaca (Stanford) used text-davinci-003 to generate 52K examples and showed that fine-tuning LLaMA-7B on them approximated GPT-3.5-level instruction following
- Distillation from frontier models — training smaller models on the outputs of larger ones. OpenAI's distillation framework (2024) reports ~22% improvements on specific tasks. Key concern: frontier-model terms of service often restrict training on their outputs
- Phi-4 approach (Microsoft, December 2024) — trained primarily on synthetic data using code-based generation (LLMs create Q&A pairs from code snippets), curriculum-based "textbook-like" datasets for math/coding/reasoning, and DPO refinement on synthetic preference data
- Chain-of-thought generation — synthetic reasoning traces for training reasoning models. Connected to RLVF: generate diverse solution paths, verify which are correct, train on verified correct traces
- Constitutional AI as synthetic data — constitutional critique and revision generates synthetic preference data without human annotators
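The Self-Instruct loop above hinges on a deduplication step: a newly generated instruction is kept only if it is sufficiently dissimilar from everything already in the pool (the paper filters with a ROUGE-L threshold). A minimal sketch, using unigram Jaccard overlap as a cheap stand-in for ROUGE-L and hand-written candidates in place of a real teacher model (all names here are illustrative):

```python
import re

def tokens(s: str) -> set[str]:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9']+", s.lower()))

def similarity(a: str, b: str) -> float:
    """Unigram Jaccard overlap; a cheap stand-in for ROUGE-L."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def grow_pool(pool: list[str], candidates: list[str],
              threshold: float = 0.7) -> list[str]:
    """Self-Instruct-style filtering: keep a candidate only if its
    similarity to every instruction already in the pool stays below
    the threshold."""
    pool = list(pool)
    for cand in candidates:
        if all(similarity(cand, existing) < threshold for existing in pool):
            pool.append(cand)
    return pool

seed = ["Write a poem about the sea.", "Summarize the following article."]
# In practice these come from a teacher LLM prompted with samples from the pool.
generated = [
    "Write a poem about the sea and ships.",  # near-duplicate of a seed: rejected
    "Translate this sentence into French.",   # novel: kept
]
pool = grow_pool(seed, generated)
```

The same dedup-then-accept structure applies whether the generator is a teacher model (distillation) or the model being trained (self-improvement); only the source of `generated` changes.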
Model Collapse Risk
The core risk: when models train on synthetic data generated by models trained on synthetic data, quality degrades cumulatively.
- Key finding: Accumulating synthetic data alongside original human data prevents collapse; replacing human data entirely causes degradation
- Scale of the problem: By April 2025, an estimated 74% of new web pages contain AI-generated text, contaminating future pretraining corpora
- Mitigation strategies: Quality filtering (perplexity-based scoring, reward model ranking), maintaining human data in the mix, diversity-aware generation, provenance tracking
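The "accumulate, don't replace" finding amounts to a simple data-mixing policy: each generation's synthetic corpus is added on top of the human data rather than substituted for it. A minimal sketch (function and variable names are illustrative, not from any library):

```python
def next_training_set(human: list[str],
                      synthetic_rounds: list[list[str]],
                      replace: bool = False) -> list[str]:
    """Build the corpus for the next training generation.

    replace=True mimics the collapse-prone setting: only the latest
    synthetic round is trained on. replace=False accumulates every
    synthetic round alongside the original human data, the regime
    that empirically prevents collapse."""
    if replace:
        return list(synthetic_rounds[-1]) if synthetic_rounds else list(human)
    corpus = list(human)
    for round_data in synthetic_rounds:
        corpus.extend(round_data)
    return corpus

human = ["human_doc_1", "human_doc_2"]
rounds = [["synthetic_gen1"], ["synthetic_gen2"]]
accumulated = next_training_set(human, rounds)            # human + all rounds
replaced = next_training_set(human, rounds, replace=True) # latest round only
```

In practice the accumulated pool would also pass through the quality filters listed above (perplexity scoring, reward-model ranking) before training.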
Resources
- Self-Instruct — arXiv
- Phi-4 Technical Report — Microsoft (2024)
- Model Collapse in LLMs — arXiv
- Synthetic Data for LLM Training — Survey
Tags
#synthetic_data #training #llm #fine_tuning #data #distillation