“RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback” was submitted to arXiv on September 1, 2023 by Harrison Lee, Samrat Phatale, and colleagues at Google. It tackles the most expensive part of the RLHF pipeline: collecting the human preference labels used to train the reward model.
The proposal, reinforcement learning from AI feedback (RLAIF), replaces those human labels with judgments from an off-the-shelf language model. The model is prompted to compare two candidate responses and say which is better, and those AI-generated preferences are used in place of human ones to train the reward model, which in turn supplies the reward for tuning the policy. The approach is closely related to Anthropic’s Constitutional AI, which also uses AI feedback, but RLAIF studies it as a general substitute for human labeling across three tasks: summarization, helpful dialogue generation, and harmless dialogue generation.
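The labeling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the prompt template, the `Scorer` interface, and `toy_scorer` are all hypothetical stand-ins for a real labeler model. The sketch turns the labeler's log-probabilities for answering "1" or "2" into a soft preference and averages over both presentation orders, one common way to reduce a judge's bias toward whichever response it sees first.

```python
import math
from typing import Callable, Tuple

# Hypothetical interface: given a prompt, return the log-probabilities the
# labeler model assigns to answering "1" and "2". A real system would call
# an LLM here; nothing below depends on a specific library.
Scorer = Callable[[str], Tuple[float, float]]

# Illustrative template only; the wording is an assumption.
TEMPLATE = (
    "Which response is better?\n\n"
    "Context: {context}\n\n"
    "Response 1: {first}\n\n"
    "Response 2: {second}\n\n"
    "Preferred response (1 or 2):"
)


def _softmax2(a: float, b: float) -> float:
    """Probability of the first option under a two-way softmax."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)


def ai_preference(context: str, resp_a: str, resp_b: str, scorer: Scorer) -> float:
    """Return a soft label P(resp_a is preferred), averaged over both orders."""
    # Order 1: resp_a is shown as "Response 1".
    lp1, lp2 = scorer(TEMPLATE.format(context=context, first=resp_a, second=resp_b))
    p_a_shown_first = _softmax2(lp1, lp2)
    # Order 2: resp_a is shown as "Response 2", so its log-prob is lp2.
    lp1, lp2 = scorer(TEMPLATE.format(context=context, first=resp_b, second=resp_a))
    p_a_shown_second = _softmax2(lp2, lp1)
    return 0.5 * (p_a_shown_first + p_a_shown_second)


def toy_scorer(prompt: str) -> Tuple[float, float]:
    """Stand-in labeler: favors longer responses, with a bias toward slot 1."""
    first = prompt.split("Response 1: ")[1].split("\n\nResponse 2: ")[0]
    second = prompt.split("Response 2: ")[1].split("\n\nPreferred")[0]
    return (0.01 * len(first) + 0.5, 0.01 * len(second))


if __name__ == "__main__":
    label = ai_preference("ctx", "a longer, more detailed answer", "short", toy_scorer)
    print(f"P(first candidate preferred) = {label:.3f}")
```

Soft labels like this one, collected over many response pairs, would then serve as the training targets for the reward model in place of human preference labels.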
The paper’s headline finding is that RLAIF performs comparably to RLHF on the tasks studied: when each policy’s outputs were compared against a supervised fine-tuned baseline, human raters preferred the RLAIF-tuned and RLHF-tuned outputs at similar rates. The authors also introduce direct-RLAIF, which obtains the reward signal straight from a language model at training time, removing the separately trained reward model altogether.
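To make the direct-RLAIF idea concrete, here is a hedged sketch of how a language model's output could be turned into a scalar reward without a separate reward model. It assumes the model is prompted to rate a response on a 1-to-10 scale and that the log-probabilities of the ten score tokens are available; the function name `direct_reward` and the rescaling to [-1, 1] are illustrative choices, not details confirmed by this summary.

```python
import math
from typing import Sequence


def direct_reward(score_logprobs: Sequence[float]) -> float:
    """Map log-probs over the score tokens "1".."10" to a reward in [-1, 1].

    Assumption: score_logprobs[i] is the labeler's log-probability of
    answering with the score i + 1 when asked to rate a response.
    """
    if len(score_logprobs) != 10:
        raise ValueError("expected log-probabilities for scores 1 through 10")
    # Renormalize over the ten score tokens (numerically stable softmax).
    m = max(score_logprobs)
    weights = [math.exp(lp - m) for lp in score_logprobs]
    total = sum(weights)
    # Expected score under the renormalized distribution, in [1, 10].
    expected = sum(score * w / total for score, w in enumerate(weights, start=1))
    # Linearly rescale [1, 10] to [-1, 1]; 5.5 is the midpoint of the scale.
    return (expected - 5.5) / 4.5


if __name__ == "__main__":
    uniform = [0.0] * 10
    print(direct_reward(uniform))
```

Using the expected score rather than the single most likely score gives a smoother signal: two responses that would both be rated "7" can still receive different rewards if the model's uncertainty differs.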
For a general reader, RLAIF matters because human feedback does not scale: it is slow, costly, and a bottleneck as models proliferate. Showing that AI feedback can stand in for human feedback, at least on the tasks studied so far, is a step toward training aligned models far more cheaply - though it also raises the question of how to keep the AI judge itself trustworthy.