On December 20, 2024, OpenAI previewed o3, the next model in its reasoning line, as the final reveal of its “12 Days of OpenAI” event. The headline result was on ARC-AGI, a benchmark of abstract visual puzzles designed by François Chollet to resist memorization and reward genuine on-the-fly reasoning. The ARC Prize team, which administers the benchmark, published its own write-up of the result the same day and called it “a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs.”
The numbers are the story. On the semi-private ARC-AGI-1 evaluation set, o3 scored 75.7 percent in a low-compute configuration and 87.5 percent in a high-compute configuration that used roughly 172 times more compute per task. On the public evaluation set the figures were 82.8 and 91.5 percent, respectively. The ARC Prize post set this against the long, flat history of the benchmark: “ARC-AGI-1 took 4 years to go from 0% with GPT-3 in 2020 to 5% in 2024 with GPT-4o.” In its high-compute configuration, o3 cleared the 85 percent threshold that ARC had used as a human-level reference point, something no earlier GPT-family model had come close to doing; the low-compute run, at 75.7 percent, fell short of that mark.
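To make the arithmetic behind these figures explicit, here is a minimal Python sketch that computes the percentage-point gain from low to high compute on each evaluation set and checks each score against the 85 percent reference point. The scores, the 172x ratio, and the threshold are taken from the text above; the function name `summarize`, the dictionary layout, and the "points per doubling" figure are illustrative assumptions. In particular, points per doubling is only a rough average across two data points and should not be read as a scaling law.

```python
import math

# Scores quoted in the article (percent correct on ARC-AGI-1).
SCORES = {
    "semi-private": {"low": 75.7, "high": 87.5},
    "public": {"low": 82.8, "high": 91.5},
}
COMPUTE_RATIO = 172      # high-compute vs low-compute per task (approximate, per the article)
HUMAN_REFERENCE = 85.0   # threshold the article cites as a human-level reference point


def summarize(scores, ratio, threshold):
    """Hypothetical helper: gain and threshold check for each evaluation set."""
    doublings = math.log2(ratio)  # how many times compute doubled between the two runs
    rows = {}
    for name, s in scores.items():
        gain = round(s["high"] - s["low"], 1)
        rows[name] = {
            "gain_points": gain,
            "low_clears": s["low"] >= threshold,
            "high_clears": s["high"] >= threshold,
            # Illustrative average only: two data points cannot establish a trend.
            "points_per_doubling": round(gain / doublings, 2),
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize(SCORES, COMPUTE_RATIO, HUMAN_REFERENCE).items():
        print(name, row)
```

Under these inputs, only the high-compute runs clear the 85 percent mark on either set, and a roughly 172-fold increase in compute buys on the order of ten percentage points, which is the cost-versus-accuracy tradeoff in concrete terms.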
The result mattered because ARC-AGI was built specifically to be hard for systems that rely on pattern-matching against training data: each puzzle is novel, so solving it implies adapting to a task the model has not seen. ARC Prize described o3 as showing “novel task adaptation ability never seen before in the GPT-family models.” It was careful to add that this did not mean o3 was generally intelligent or that ARC-AGI was solved, and a later, harder version of the benchmark would reset expectations in 2025. But the jump from roughly 5 percent with GPT-4o to the high 80s with o3 at high compute, in a single model generation, was arguably the most dramatic single-benchmark leap of the reasoning era.
For business readers, o3 was the proof point that the reasoning approach introduced with o1 had real headroom: spending more compute at inference time, letting a model deliberate rather than answer instantly, could unlock capabilities that simply scaling pre-training had not. That tradeoff between cost, latency, and accuracy is now central to how frontier AI is priced and deployed.
Note on sourcing: OpenAI’s own pages return an HTTP 403 error to automated fetchers, so the ARC Prize blog post (arcprize.org/blog/oai-o3-pub-breakthrough) was fetched and verified live as the primary record of the scores and is cited as the lead source. The OpenAI announcement was corroborated through search against the canonical openai.com pages. The exact ARC-AGI-1 figures (75.7 percent low-compute, 87.5 percent high-compute on the semi-private set) are quoted directly from the verified ARC Prize write-up.