“Textbooks Are All You Need,” submitted to arXiv on June 20, 2023 by Suriya Gunasekar and colleagues at Microsoft Research, made a deliberately provocative argument against the prevailing scale-first mindset. The team trained phi-1, a 1.3-billion-parameter code model, on just 7 billion tokens - about 6 billion of “textbook-quality” filtered web data plus 1 billion of synthetic exercises generated with GPT-3.5 - using 8 A100 GPUs for 4 days. Despite being a fraction of the size of contemporary code models, phi-1 reached 50.6 percent pass@1 on HumanEval and 55.5 percent on MBPP, and even a 350-million-parameter variant hit 45 percent on HumanEval.
The thesis was that data quality could substitute for scale: carefully curated and synthetically augmented training data let a small model punch far above its weight. The paper’s title riffs on “Attention Is All You Need,” and its claim challenged the scaling-laws orthodoxy that bigger models on more data were the reliable path to capability. The work launched Microsoft’s Phi family of small models, which extended the recipe through phi-1.5, phi-2, phi-3, and beyond.
The result drew scrutiny over benchmark contamination - whether textbook-style training data overlapped with the evaluation sets - a debate that has followed small-model, synthetic-data research ever since.