Synthetic Data For LLM Training
Created: 2026-03-03 10:25
#quicknote
Synthetic data (training data generated by LLMs rather than humans) has become a critical enabler for modern LLM training pipelines. From Self-Instruct (2022) to Microsoft's Phi-4 (December 2024, trained primarily on ~400B tokens of synthetic data across 50+ datasets), synthetic data powers SFT, preference-pair generation for DPO (Direct Preference Optimization), reasoning-chain creation for RLVF (Reinforcement Learning from Verifiable Feedback), and trajectory generation for Agent Training and Fine-Tuning. The approach is transformative but carries risks, most notably model collapse when models train recursively on their own outputs. Part of the LLM Training and Alignment Evolution.
- Self-Instruct / Alpaca (2022–2023) — foundational approach: use a strong model to generate (instruction, output) pairs from a small seed set, then fine-tune a smaller model on them. Self-Instruct used GPT-3; Alpaca (Stanford) used text-davinci-003 to generate 52K examples and showed that fine-tuning LLaMA-7B on them approximated GPT-3.5-level instruction following
- Distillation from frontier models — training smaller models on the outputs of larger ones. OpenAI's distillation framework (2024) reports ~22% improvements on specific tasks. Key concern: frontier-model terms of service often restrict training on their outputs
- Phi-4 approach (Microsoft, December 2024) — trained primarily on synthetic data using code-based generation (LLMs create Q&A pairs from code snippets), curriculum-based "textbook-like" datasets for math/coding/reasoning, and DPO refinement on synthetic preference data
- Chain-of-thought generation — synthetic reasoning traces for training reasoning models. Connected to RLVF: generate diverse solution paths, verify which are correct, train on verified correct traces
- Constitutional AI as synthetic data — constitutional critique and revision generates synthetic preference data without human annotators
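The Self-Instruct loop above hinges on a deduplication step: a newly generated instruction is kept only if it is sufficiently dissimilar from everything already in the pool (the paper filters with a ROUGE-L threshold). A minimal sketch, using unigram Jaccard overlap as a cheap stand-in for ROUGE-L and hand-written candidates in place of a real teacher model (all names here are illustrative):

```python
import re

def tokens(s: str) -> set[str]:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9']+", s.lower()))

def similarity(a: str, b: str) -> float:
    """Unigram Jaccard overlap; a cheap stand-in for ROUGE-L."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def grow_pool(pool: list[str], candidates: list[str],
              threshold: float = 0.7) -> list[str]:
    """Self-Instruct-style filtering: keep a candidate only if its
    similarity to every instruction already in the pool stays below
    the threshold."""
    pool = list(pool)
    for cand in candidates:
        if all(similarity(cand, existing) < threshold for existing in pool):
            pool.append(cand)
    return pool

seed = ["Write a poem about the sea.", "Summarize the following article."]
# In practice these come from a teacher LLM prompted with samples from the pool.
generated = [
    "Write a poem about the sea and ships.",  # near-duplicate of a seed: rejected
    "Translate this sentence into French.",   # novel: kept
]
pool = grow_pool(seed, generated)
```

The same dedup-then-accept structure applies whether the generator is a teacher model (distillation) or the model being trained (self-improvement); only the source of `generated` changes.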
Model Collapse Risk
The core risk: when models train on synthetic data generated by models trained on synthetic data, quality degrades cumulatively.
- Key finding: Accumulating synthetic data alongside original human data prevents collapse; replacing human data entirely causes degradation
- Scale of the problem: By April 2025, an estimated 74% of new web pages contain AI-generated text, contaminating future pretraining corpora
- Mitigation strategies: Quality filtering (perplexity-based scoring, reward model ranking), maintaining human data in the mix, diversity-aware generation, provenance tracking
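The "accumulate, don't replace" finding amounts to a simple data-mixing policy: each generation's synthetic corpus is added on top of the human data rather than substituted for it. A minimal sketch (function and variable names are illustrative, not from any library):

```python
def next_training_set(human: list[str],
                      synthetic_rounds: list[list[str]],
                      replace: bool = False) -> list[str]:
    """Build the corpus for the next training generation.

    replace=True mimics the collapse-prone setting: only the latest
    synthetic round is trained on. replace=False accumulates every
    synthetic round alongside the original human data, the regime
    that empirically prevents collapse."""
    if replace:
        return list(synthetic_rounds[-1]) if synthetic_rounds else list(human)
    corpus = list(human)
    for round_data in synthetic_rounds:
        corpus.extend(round_data)
    return corpus

human = ["human_doc_1", "human_doc_2"]
rounds = [["synthetic_gen1"], ["synthetic_gen2"]]
accumulated = next_training_set(human, rounds)            # human + all rounds
replaced = next_training_set(human, rounds, replace=True) # latest round only
```

In practice the accumulated pool would also pass through the quality filters listed above (perplexity scoring, reward-model ranking) before training.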
Resources
- Self-Instruct — arXiv
- Phi-4 Technical Report — Microsoft (2024)
- Model Collapse in LLMs — arXiv
- Synthetic Data for LLM Training — Survey
Tags
#synthetic_data #training #llm #fine_tuning #data #distillation