GLUE, the General Language Understanding Evaluation, was introduced in 2018 by Alex Wang, Samuel Bowman, and colleagues. It bundled together nine existing language tasks, such as judging whether two sentences mean the same thing, whether one sentence implies another, and whether a movie review is positive or negative, and rolled them into a single average score. The idea was to measure general language understanding rather than skill at one narrow task, and to give the field one comparable number to track progress on.
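To make the "single average score" concrete, here is a minimal sketch of how such a headline number can be computed. It assumes an unweighted average across tasks, with a task that reports more than one metric first averaged into a single task score. The task names and scores are hypothetical placeholders, not real leaderboard results, and `overall_score` is an illustrative helper rather than part of any official GLUE tooling.

```python
from statistics import mean

# Hypothetical per-task scores (0-100) for an imaginary system.
# These are illustrative placeholders, not real GLUE results.
task_scores = {
    "paraphrase": [88.0, 84.0],  # a task reporting two metrics, e.g. accuracy and F1
    "entailment": [80.0],
    "sentiment": [94.0],
}

def overall_score(scores):
    """Average each task's metrics, then average across tasks (unweighted)."""
    per_task = [mean(metrics) for metrics in scores.values()]
    return mean(per_task)

print(round(overall_score(task_scores), 1))
```

Because every task counts equally, a large gain on one task and a small loss on another can leave the headline number almost unchanged, which is one reason a single score hides detail.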
GLUE arrived at exactly the right moment. Within months, BERT and the wave of pre-trained Transformer models that followed it began climbing the leaderboard, and they climbed so fast that the benchmark was effectively solved: by 2019 the top systems had passed the level of non-expert humans on the combined score. A benchmark that everyone aces no longer tells you who is best, so the same group released SuperGLUE in 2019, a “stickier” successor built from harder tasks chosen specifically because the models of the time still struggled with them. SuperGLUE, too, was largely saturated within a couple of years.
This rapid rise-and-saturation pattern is the most important thing GLUE illustrates. Progress was so quick that benchmarks meant to last for years were exhausted within a year or two, which is why the community kept moving to harder tests, eventually reaching broad knowledge exams like MMLU and beyond. GLUE and SuperGLUE were the leaderboards that defined the pre-trained-Transformer era and motivated that whole escalation.
Why business readers should care: GLUE is a clean case study in how fast modern AI was improving and in the limits of any single benchmark. A score that distinguishes the best systems one year can be meaningless the next, which is why responsible evaluation keeps raising the bar and why a model’s marketing claim of “state of the art” on a named benchmark needs to be read together with the date.