BIG-Bench Hard (BBH)

BIG-Bench Hard, usually written BBH, is a focused subset of 23 tasks drawn from the much larger BIG-Bench suite. The authors kept the tasks on which earlier language model evaluations had not outperformed the average human rater, which yields a concentrated set of difficult problems for measuring progress.

The benchmark was introduced by Mirac Suzgun, Jason Wei, Denny Zhou, Quoc V. Le, and colleagues in “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them,” posted in October 2022. Its central result was that chain-of-thought prompting, in which the few-shot examples show step-by-step reasoning so that the model reasons before it answers, substantially improved performance. With chain-of-thought, PaLM surpassed the average human-rater score on 10 of the 23 tasks and Codex did so on 17 of 23, whereas standard few-shot prompting, which shows only the final answers, substantially underestimated what the models could do.
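
The difference between standard few-shot prompting and chain-of-thought prompting can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the Exemplar class, the build_prompt helper, the "Q:/A:" layout, and the sample questions are all assumptions made up for this example, and nothing here is taken from the BBH task files.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One worked example shown to the model (hypothetical structure)."""
    question: str
    rationale: str  # step-by-step reasoning, used only for chain-of-thought
    answer: str


def build_prompt(exemplars, new_question, chain_of_thought):
    """Build a few-shot prompt.

    With chain_of_thought=False each exemplar shows only the final answer.
    With chain_of_thought=True each exemplar shows its reasoning first.
    """
    blocks = []
    for ex in exemplars:
        if chain_of_thought:
            shown = f"{ex.rationale} So the answer is {ex.answer}."
        else:
            shown = ex.answer
        blocks.append(f"Q: {ex.question}\nA: {shown}")
    # The new question is left open for the model to complete.
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


# Made-up exemplar for illustration; not a BBH item.
EXEMPLARS = [
    Exemplar(
        question="Ana has 3 apples and buys 2 more. How many apples does she have?",
        rationale="She starts with 3 apples. Buying 2 more gives 3 + 2 = 5.",
        answer="5",
    )
]

NEW_QUESTION = "Ben has 4 pens and loses 1. How many pens remain?"

standard_prompt = build_prompt(EXEMPLARS, NEW_QUESTION, chain_of_thought=False)
cot_prompt = build_prompt(EXEMPLARS, NEW_QUESTION, chain_of_thought=True)

print(standard_prompt)
print("-----")
print(cot_prompt)
```

The two prompts contain the same questions and the same final answers. The only change is whether the worked reasoning appears before each answer, which is the kind of prompting change the paper compares.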

BBH became a standard evaluation in part because it demonstrated that a change in prompting, not just a bigger model, could expose capabilities that earlier tests had missed. It is still widely cited when comparing reasoning methods.

For a business reader, BBH is a reminder that how you ask a model can matter as much as which model you use, and that benchmark results depend heavily on the evaluation method.

Last verified June 7, 2026