GSM-Symbolic probes whether language models truly reason through grade-school math or merely pattern-match against problems they have seen. It rebuilds the popular GSM8K word problems as symbolic templates, so the same underlying problem can be regenerated with different names, numbers, and structure, exposing how stable a model’s performance really is.
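To make the idea of a symbolic template concrete, here is a minimal Python sketch. The template text, the list of names, and the value ranges are all hypothetical illustrations, not items from the GSM-Symbolic dataset; the point is only that one template yields many surface variants, each with an answer computed from the sampled values.

```python
import random

# Hypothetical template: placeholders and value ranges are illustrative,
# not taken from the GSM-Symbolic dataset.
TEMPLATE = (
    "{name} buys {boxes} boxes of pencils. Each box holds {per_box} pencils. "
    "{name} gives away {given} pencils. How many pencils are left?"
)
NAMES = ["Ava", "Liam", "Sofia", "Noah"]


def generate_variant(rng):
    """Fill the template with fresh values and compute the matching answer."""
    boxes = rng.randint(2, 9)
    per_box = rng.randint(5, 12)
    total = boxes * per_box
    # Constraint: give away fewer pencils than exist, so the answer stays positive.
    given = rng.randint(1, total - 1)
    question = TEMPLATE.format(
        name=rng.choice(NAMES), boxes=boxes, per_box=per_box, given=given
    )
    return question, total - given


rng = random.Random(0)  # fixed seed so the variants are reproducible
variants = [generate_variant(rng) for _ in range(3)]
for question, answer in variants:
    print(question, "->", answer)
```

Because the answer is derived from the same sampled values that fill the question, every variant can be graded automatically, which is what makes it practical to test a model on many rewordings of one problem.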
The benchmark was introduced by Iman Mirzadeh, Samy Bengio, Mehrdad Farajtabar, and colleagues at Apple in “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” posted in October 2024. The authors found that accuracy varied noticeably when only the numbers in a question changed, and that adding a single clause that looks relevant but has no bearing on the answer could cut performance by as much as 65 percent. They argued this points to models replicating reasoning steps from training data rather than performing genuine logical reasoning.
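One simple way to quantify this kind of instability is to score a model on several regenerated versions of the same question set and report the spread rather than a single number. The sketch below assumes that setup; the function name `accuracy_spread` is hypothetical, and the scores are placeholders invented for illustration, not results from the paper.

```python
from statistics import mean, pstdev


def accuracy_spread(accuracies):
    """Summarize accuracy over several regenerated versions of one test set."""
    return {
        "mean": mean(accuracies),
        "std": pstdev(accuracies),
        "range": max(accuracies) - min(accuracies),
    }


# Placeholder scores for five regenerated sets; invented for illustration,
# not results from the paper.
scores = [0.80, 0.70, 0.90, 0.75, 0.85]
summary = accuracy_spread(scores)
print(summary)
```

A wide range or a large standard deviation across sets that differ only in names and numbers suggests that a single headline score overstates how dependable the model is.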
The work fed into a broader debate about whether high benchmark scores on math reflect real reasoning or sophisticated memorization.
For a business reader, GSM-Symbolic is a caution against over-trusting headline accuracy: a model that scores well on a fixed test may stumble on the same problem phrased slightly differently.