Training Language Models to Follow Instructions with Human Feedback (InstructGPT)

“Training Language Models to Follow Instructions with Human Feedback” was submitted to arXiv on March 4, 2022 by Long Ouyang, Jan Leike, Ryan Lowe, and a large team at OpenAI. It describes InstructGPT and is the paper that operationalized reinforcement learning from human feedback (RLHF) for a shipping product line, which makes it the direct methodological ancestor of ChatGPT, released nearly nine months later.

The method has three stages. First, human contractors write demonstrations of good responses to prompts, and the model is fine-tuned on them by supervised learning. Second, the model generates several candidate responses to each prompt and humans rank them, and those rankings train a separate reward model that predicts human preference. Third, the language model is optimized by reinforcement learning (using PPO) to maximize that learned reward, producing outputs people prefer.
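
The second stage can be made concrete with a small sketch of a pairwise ranking loss, the kind of objective commonly used to turn human rankings into a reward-model training signal. It assumes the reward model emits one scalar score per candidate response and that the scores arrive ordered from most to least preferred by the human ranker. The function name and the example scores are hypothetical, chosen for illustration; this is not the paper's code.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(rewards_best_to_worst):
    """Average -log(sigmoid(r_preferred - r_rejected)) over all pairs.

    `rewards_best_to_worst` holds the reward model's scalar scores for the
    candidate responses to one prompt, ordered as the human ranked them
    (most preferred first). Hypothetical helper, for illustration only.
    """
    # combinations() keeps input order, so the first item of each pair is
    # always the response the human ranked higher.
    pairs = list(combinations(rewards_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two ranked responses")

    total = 0.0
    for r_preferred, r_rejected in pairs:
        margin = r_preferred - r_rejected
        # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch on the
        # sign of the margin so exp() never receives a large positive value.
        if margin >= 0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin + math.log1p(math.exp(margin))
    return total / len(pairs)


# Scores that agree with the human ranking give a lower loss than
# scores that contradict it.
agrees = pairwise_ranking_loss([2.0, 1.0, 0.0])
contradicts = pairwise_ranking_loss([0.0, 1.0, 2.0])
print(agrees < contradicts)
```

Minimizing this loss pushes the reward model to score preferred responses above rejected ones. The third stage then treats that learned score as the reward the language model is optimized to maximize.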

The headline result was that human raters preferred the outputs of an InstructGPT model with just 1.3 billion parameters over those of the 175-billion-parameter GPT-3, even though the InstructGPT model was more than a hundred times smaller. Beyond preference, InstructGPT was more truthful and produced less toxic output. The lesson, that aligning a model to human intent can matter more than raw size, reframed how labs thought about the last mile of building a usable assistant.

Why business readers should care: this paper is the bridge between an impressive research model and a product millions of people would actually use. The demonstration that human feedback, not just scale, was the missing ingredient set the template every major lab now follows to turn a pretrained model into a deployable assistant.
