Measuring Progress on Scalable Oversight for Large Language Models

“Measuring Progress on Scalable Oversight for Large Language Models” was submitted to arXiv on November 4, 2022 by Samuel Bowman, Ethan Perez, and a large Anthropic team. It tries to make scalable oversight - supervising AI systems that may outperform their human overseers - into something researchers can study experimentally today rather than only in theory.

The proposed experimental design is sometimes called sandwiching. You pick a task on which a domain specialist succeeds but both a non-expert human and a current AI system fail on their own. The non-expert then interacts with the AI to try to reach the specialist’s level of performance. If the human-plus-AI team beats both the unaided human and the unaided model, that is evidence the human is successfully overseeing and extracting capability from a system they could not match alone.
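The comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' evaluation code: the names `accuracy` and `SandwichingResult` are hypothetical, and the numbers in the usage lines are placeholders chosen for the example, not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


def accuracy(answers: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of answers that match the gold (specialist-verified) labels."""
    if len(answers) != len(gold) or not gold:
        raise ValueError("answers and gold must be non-empty and the same length")
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)


@dataclass
class SandwichingResult:
    """Accuracies (0.0 to 1.0) for the four conditions being compared."""
    human_alone: float       # non-expert without the model
    model_alone: float       # model without a human
    human_with_model: float  # non-expert interacting with the model
    expert: float            # specialist reference level

    def best_baseline(self) -> float:
        return max(self.human_alone, self.model_alone)

    def team_beats_both_baselines(self) -> bool:
        # The criterion from the text: the team must beat BOTH unaided conditions.
        return self.human_with_model > self.best_baseline()

    def gap_closed(self) -> float:
        """Share of the gap between the best baseline and the expert that the team closes."""
        gap = self.expert - self.best_baseline()
        if gap <= 0:
            raise ValueError("expert accuracy must exceed both baselines")
        return (self.human_with_model - self.best_baseline()) / gap


# Placeholder numbers for illustration only (not taken from the paper).
result = SandwichingResult(
    human_alone=0.5, model_alone=0.625, human_with_model=0.75, expert=1.0
)
print(result.team_beats_both_baselines())
print(round(result.gap_closed(), 3))
```

Reporting the share of the gap closed alongside the yes/no criterion is a design choice made for this sketch: it shows how far the team remains from specialist-level performance, not just whether it cleared the two baselines.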

In their experiments, participants worked with an unreliable language-model assistant over chat on question-answering tasks (drawn from MMLU and QuALITY) and substantially outperformed both the model alone and their own unaided performance. The authors take this as an encouraging sign that scalable oversight can be studied empirically with models that already exist.

For a general reader, the contribution is methodological but consequential: it gives the field a concrete, repeatable experiment for the central question of how fallible humans can keep increasingly capable AI on track - a prerequisite for trusting these systems with high-stakes work.
