Weak-to-Strong Generalization

“Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision” was submitted to arXiv on December 14, 2023 by an OpenAI team including Collin Burns, Pavel Izmailov, Jan Leike, Ilya Sutskever, and Jeff Wu, out of the company’s then-new superalignment effort. It studies a question that will matter when models exceed human ability: can a weaker supervisor still get a stronger model to do the right thing?

To make the question testable today, the authors set up an analogy. Rather than waiting for superhuman models, they used a weak model (say, GPT-2-level) to label data and then fine-tuned a much stronger model on those imperfect labels. The naive expectation is that the strong model would just learn the weak model’s mistakes. Instead, it consistently did better than its weak supervisor, a phenomenon the authors named weak-to-strong generalization. The strong model was apparently drawing on knowledge it already had rather than blindly copying the labels.

The gap to the strong model’s full potential (what it reaches when trained on correct labels) was still large, but simple techniques, such as an auxiliary confidence loss that encourages the strong model to trust its own judgment when it disagrees with the weak labels, recovered much of the lost performance on some tasks. The authors were careful to call this an imperfect analogy for the real superalignment problem, but a useful one to iterate on empirically.
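
For readers who want to see the mechanics, the two ideas in this paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles a single binary prediction, the function names are hypothetical, and the mixing weight `alpha` and the fixed `threshold` are placeholder values (the paper tunes these during training rather than fixing them). The first function blends the usual loss against the weak label with a loss against the strong model's own hardened prediction. The second expresses "recovered performance" as the fraction of the gap between the weak supervisor and the strong model's ceiling that the weakly supervised strong model closes.

```python
import math


def binary_cross_entropy(target: float, pred: float, eps: float = 1e-12) -> float:
    """Cross-entropy between a target probability and a predicted probability."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def confidence_loss(strong_prob: float, weak_label: float,
                    alpha: float = 0.5, threshold: float = 0.5) -> float:
    """Blend of 'imitate the weak label' and 'trust your own confident guess'.

    alpha and threshold are illustrative assumptions. In real training the
    hardened target would be treated as a constant (no gradient through it).
    """
    # Harden the strong model's own prediction into a 0/1 target.
    hardened = 1.0 if strong_prob > threshold else 0.0
    weak_term = binary_cross_entropy(weak_label, strong_prob)
    self_term = binary_cross_entropy(hardened, strong_prob)
    return (1.0 - alpha) * weak_term + alpha * self_term


def performance_gap_recovered(weak_acc: float, weak_to_strong_acc: float,
                              ceiling_acc: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised model."""
    gap = ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("ceiling accuracy must exceed weak accuracy")
    return (weak_to_strong_acc - weak_acc) / gap


if __name__ == "__main__":
    # Made-up numbers: the strong model says 0.9, the weak label says 0.
    print(confidence_loss(0.9, 0.0, alpha=0.0))  # pure imitation of the weak label
    print(confidence_loss(0.9, 0.0, alpha=0.5))  # smaller penalty for disagreeing
    # Made-up accuracies: weak 60%, weakly supervised strong 70%, ceiling 80%.
    print(performance_gap_recovered(0.60, 0.70, 0.80))
```

With the made-up accuracies above, the strong model closes about half of the gap. A value near 1 would mean weak supervision cost almost nothing, and a value near 0 would mean the strong model learned little beyond what its supervisor already knew.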

For a business reader, the takeaway is hopeful and concrete: the results suggest that humans, who will eventually be the weak supervisors of superhuman systems, may be able to elicit capabilities they could not themselves demonstrate, provided the right training methods are found.

Sources

Last verified June 7, 2026