Capability Elicitation

Capability elicitation is the problem of getting a model to actually demonstrate what it is capable of. A model’s measured performance on a task is not a fixed property; it depends on how the model is prompted, fine-tuned, or otherwise coaxed. The same model might fail a task under one prompt and succeed under another, which means a low benchmark score can reflect poor elicitation rather than a genuine lack of ability. This gap between a model’s underlying capability and its observed behavior is a core concern both for evaluators trying to measure true capability and for safety researchers worried about hidden dangerous abilities.

A foundational study is the May 2024 paper “Stress-Testing Capability Elicitation With Password-Locked Models” by Ryan Greenblatt, Fabien Roger, and colleagues. They deliberately created models with hidden skills, fine-tuned to perform well only when a secret password appears in the prompt and to imitate a weaker model otherwise, then tested how easily the locked capability could be drawn out. They found that a handful of high-quality demonstrations, or reinforcement learning against the task, were often enough to fully unlock the hidden ability, and that unlocking generalized across different passwords. The implication is that capabilities a model appears not to have may be only lightly buried.

This matters in two directions. For evaluation, it means a poor score may understate real capability, so safety assessments must try hard to elicit worst-case behavior. For deployment, it means the gap between what a model shows by default and what it can be pushed to do is a real and sometimes uncomfortable variable.

Sources

Related