“Learning to summarize from human feedback” was submitted to arXiv on September 2, 2020 by Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano of OpenAI. It is the bridge between the original 2017 preference-learning work and InstructGPT: the point at which RLHF was convincingly shown to work on a realistic language task.
The team tackled abstractive summarization, where standard practice was to train on reference summaries and score the output with metrics like ROUGE, which correlate poorly with what people actually find useful. Instead, they collected a large, high-quality dataset of human comparisons between candidate summaries, trained a reward model to predict which summary a person would prefer, and then fine-tuned a summarization policy with reinforcement learning to maximize that learned reward.
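The reward-model step described above can be sketched in a few lines. The text does not spell out the training objective, so this example assumes a pairwise logistic (Bradley-Terry style) loss, a common choice for comparison data. The function names, the `beta` value, and the KL-style penalty that keeps the policy close to a reference model are illustrative assumptions, not details taken from the passage.

```python
import numpy as np

def pairwise_preference_loss(r_preferred, r_rejected):
    """Mean of -log sigmoid(r_preferred - r_rejected) over a batch.

    Inputs are the scalar scores a reward model assigns to the summary the
    labeler preferred and to the one they rejected (hypothetical values here).
    """
    diff = np.asarray(r_preferred, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log(sigmoid(d)) == log(1 + exp(-d)); logaddexp keeps this numerically stable.
    return float(np.mean(np.logaddexp(0.0, -diff)))

def kl_penalized_reward(reward, logp_policy, logp_reference, beta=0.05):
    """Reward used for RL fine-tuning, under the assumption of a KL-style penalty.

    Subtracts beta * (log pi(y|x) - log pi_ref(y|x)) so the policy is
    discouraged from drifting far from the reference model. beta is an
    arbitrary illustrative value.
    """
    return reward - beta * (logp_policy - logp_reference)

if __name__ == "__main__":
    # Toy scores for three comparisons; in practice these come from a neural reward model.
    preferred = [1.2, 0.3, 2.0]
    rejected = [0.4, 0.5, -1.0]
    print("reward-model loss:", pairwise_preference_loss(preferred, rejected))
    print("shaped reward:", kl_penalized_reward(1.0, logp_policy=-12.0, logp_reference=-14.0))
```

The loss falls as the reward model scores the preferred summary further above the rejected one, which is what lets the trained model stand in for a human judge during reinforcement learning.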
The headline result was that their preference-tuned models produced summaries that humans preferred over the human-written reference summaries, and over those from much larger models trained only with supervised learning. The models trained with human feedback also transferred to a new domain (news articles) without any news-specific fine-tuning. The paper documented the full RLHF pipeline that would soon become standard.
For a general reader, this is the proof-of-concept that quietly set the stage for the conversational AI boom: it showed that optimizing directly for human judgment, rather than a proxy metric, makes language models markedly more useful.