HELM, the Holistic Evaluation of Language Models, was released in November 2022 by Stanford’s Center for Research on Foundation Models (CRFM). Its goal was transparency: rather than ranking models on a single accuracy number, HELM evaluated them across a standardized matrix of scenarios and a broad set of metrics, then published all the results openly so different models could be compared on the same footing.
In its initial release, HELM measured 30 models from 12 providers across 42 scenarios (16 core scenarios plus 26 targeted evaluations). For each of the 16 core scenarios it reported seven metrics wherever they applied: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. The insistence on measuring more than accuracy was the point: a model can be accurate yet poorly calibrated, brittle, biased, or expensive to run, and a holistic scorecard surfaces those trade-offs instead of hiding them.
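To make the scorecard idea concrete, here is a minimal Python sketch of a scenario-by-metric matrix. It is not HELM's own code or data format: the model names, scenario names, and scores are invented for illustration, only three of the seven metrics are shown, and every score is assumed to be normalized so that higher is better (real metrics such as calibration error run the other way).

```python
from statistics import mean

# Hypothetical scorecard: model -> scenario -> metric -> score.
# All names and numbers are made up; scores are assumed to lie in [0, 1]
# with higher meaning better.
SCORES = {
    "model_a": {
        "qa": {"accuracy": 0.82, "calibration": 0.55, "robustness": 0.60},
        "summarization": {"accuracy": 0.78, "calibration": 0.51, "robustness": 0.58},
    },
    "model_b": {
        "qa": {"accuracy": 0.79, "calibration": 0.80, "robustness": 0.75},
        "summarization": {"accuracy": 0.75, "calibration": 0.78, "robustness": 0.73},
    },
}


def mean_per_metric(scorecard, model):
    """Average each metric over all scenarios recorded for one model."""
    per_metric = {}
    for scenario_scores in scorecard[model].values():
        for metric, value in scenario_scores.items():
            per_metric.setdefault(metric, []).append(value)
    return {metric: mean(values) for metric, values in per_metric.items()}


def rank_by(scorecard, metric):
    """Return model names sorted from best to worst on a single metric."""
    return sorted(
        scorecard,
        key=lambda model: mean_per_metric(scorecard, model)[metric],
        reverse=True,
    )


if __name__ == "__main__":
    for metric in ("accuracy", "calibration", "robustness"):
        print(metric, rank_by(SCORES, metric))
```

With these invented numbers, the model that leads on accuracy trails on calibration and robustness, which is exactly the kind of trade-off a single-number ranking would hide.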
HELM arrived just as commercial language models were proliferating and buyers needed a neutral, reproducible way to compare them. It became a reference for researchers and practitioners and was extended over time to new domains and modalities. For business readers, HELM is a useful template for evaluating any AI system: define the scenarios you actually care about, measure several dimensions beyond raw accuracy, and demand that the comparison be reproducible.