Rule Based Rewards for Language Model Safety

This November 2024 paper from a team with OpenAI affiliations, including John Schulman and Lilian Weng, proposes “rule-based rewards” (RBR) as a way to make models safer without collecting large amounts of human preference data. The motivation is that the usual safety signal, human ratings of responses, is slow to gather, hard to update, and inconsistent on edge cases such as over-refusal.

Instead, RBR breaks desired and undesired behavior into a set of explicit propositions, or rules, such as whether a response refuses appropriately or whether it includes a needed disclaimer. An LLM acts as a grader, scoring responses against these rules, and that signal is folded into reinforcement learning. Because the rules are written down, they can be adjusted directly as policy or product needs change, rather than requiring a new round of human labeling.
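As a rough illustration of the idea above, the sketch below combines per-rule grades into a single reward that is added to a base reward. It is not the paper's implementation: the rule names, weights, keyword lists, and the `toy_grader` function are all hypothetical stand-ins, and in practice the grader would be an LLM call rather than keyword matching.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Rule:
    name: str
    proposition: str  # statement the grader judges as true/false of a response
    weight: float     # hypothetical: positive = desired, negative = undesired


# A grader returns the probability (0..1) that the proposition holds.
Grader = Callable[[str, str, Rule], float]

# Hypothetical rulebook; editing this list changes the reward directly.
RULES = [
    Rule("refuses", "The response declines to help with the request.", 1.0),
    Rule("judgmental", "The response shames or lectures the user.", -0.5),
]

# Assumption: keyword matching stands in for an LLM grader.
KEYWORDS = {
    "refuses": ("i can't help", "i cannot help"),
    "judgmental": ("you should be ashamed", "shame on you"),
}


def toy_grader(prompt: str, response: str, rule: Rule) -> float:
    """Return 1.0 if any keyword for the rule appears in the response."""
    text = response.lower()
    return 1.0 if any(k in text for k in KEYWORDS[rule.name]) else 0.0


def rule_based_reward(prompt: str, response: str,
                      rules: Sequence[Rule], grader: Grader) -> float:
    """Weighted sum of per-rule grades."""
    return sum(r.weight * grader(prompt, response, r) for r in rules)


def total_reward(base_reward: float, prompt: str, response: str,
                 rules: Sequence[Rule], grader: Grader) -> float:
    """Add the rule-based term to a base reward (e.g. a helpfulness score)."""
    return base_reward + rule_based_reward(prompt, response, rules, grader)


polite = "I'm sorry, but I can't help with that."
harsh = "I can't help with that, and you should be ashamed for asking."
polite_score = rule_based_reward("harmful request", polite, RULES, toy_grader)
harsh_score = rule_based_reward("harmful request", harsh, RULES, toy_grader)
```

With these made-up weights, the polite refusal scores higher than the judgmental one, and changing that preference means editing a weight or a proposition rather than relabeling data.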

The authors report that the approach improves safety behavior while preserving helpfulness, reaching an F1 score of 97.1 against a human-feedback baseline of 91.7 on their evaluation, and that it balances refusing genuinely harmful requests against over-refusing benign ones. The method is conceptually close to Anthropic’s Constitutional AI in using written principles and AI feedback, but it is built around fine-grained, composable rules tied to an RL objective.
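To see why a single F1-style number can capture this balance, the sketch below assumes the score is a harmonic mean of two rates: how often the model responds safely to harmful prompts, and how often it avoids refusing benign ones. That definition is an assumption made for illustration, and the inputs are hypothetical rather than figures from the paper.

```python
def f1(safe_rate: float, not_overrefuse_rate: float) -> float:
    """Harmonic mean of two rates in [0, 1]; 0.0 if both are zero."""
    total = safe_rate + not_overrefuse_rate
    if total == 0:
        return 0.0
    return 2 * safe_rate * not_overrefuse_rate / total


# Hypothetical models, not results from the paper.
always_refuses = f1(1.0, 0.0)   # perfectly safe, but refuses everything
lopsided = f1(1.0, 0.5)         # safe, but over-refuses half of benign prompts
balanced = f1(0.9, 0.9)         # slightly less safe, rarely over-refuses
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot score well by refusing everything, which is the failure mode over-refusal metrics are meant to expose.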

For organizations, the practical value is controllability: safety behavior defined by an editable rulebook is cheaper to maintain and easier to align with changing compliance requirements than behavior baked in through bulk human ratings.
