OpenAI’s IMO Gold Medal Claim Mired in Controversy as Competitors Cry Foul
Please note: this is a developing story, so what follows is what we know as of Sunday, July 20, but more will come out in the next few days as the IMO, DeepMind, and OpenAI all make public-facing announcements.
OpenAI just announced what appeared to be a monumental achievement in the history of artificial intelligence: their experimental reasoning model scored a gold medal at the 2025 International Mathematical Olympiad (IMO), solving 5 of 6 problems under the same grueling conditions as the world's brightest high school students.
But within 24 hours, the celebratory announcement devolved into a messy public controversy, with competitors and IMO insiders raising serious questions about the score's legitimacy, OpenAI's conduct, and whether the company even earned a gold medal at all.
The first major challenge came directly from Mikhail Samin, who reported that OpenAI had not only broken a requested embargo by announcing its results before the event's closing ceremony—a move IMO coordinators allegedly found "rude and inappropriate"—but had also failed to cooperate with the IMO for official verification, unlike Google.
The situation escalated when Google DeepMind's Head of Reasoning, Thang Luong, tweeted that any official medal claim requires evaluation against the IMO's confidential marking guidelines. "Without the evaluation based on that guideline, no medal claim can be made," Luong stated, adding pointedly, "With one point deducted, it is a Silver, not Gold."
Why Even a Contested Win is a Big Deal
The drama shouldn't completely overshadow the technical leap.
IF OpenAI's claims are verified (and that's a fairly big if at this point, given the IMO's feelings on their move), then it just achieved what many thought impossible: their experimental reasoning model scored gold medal performance at the International Mathematical Olympiad (IMO) 2025, allegedly earning 35 out of 42 points
See, the International Math Olympiad isn't your typical math benchmark. These problems require:
- Novel creative thinking: Each problem takes ~100 minutes of sustained reasoning.
- Multi-page proofs: Solutions must be rigorous, watertight arguments.
- Zero memorization potential: These are brand new problems that can't be pattern-matched.
Alexander Wei announced this breakthrough, noting the dramatic progression in AI's mathematical abilities: "In reasoning time horizon, we've now progressed from GSM8K (~0.1 min for top humans) → MATH benchmark (~1 min) → AIME (~10 mins) → IMO (~100 mins)."
What makes this particularly stunning is that it defied expert predictions. Noam Brown revealed that Paul Christiano and Eliezer Yudkowsky had bet in 2021 that AI achieving IMO gold by 2025 had only 8% and 16% probability respectively. Brown emphasized: "this isn't an IMO-specific model. It's a reasoning LLM that incorporates new experimental general-purpose techniques."
Sheryl Hsu captured the sheer pace of progress: "It's crazy how we've gone from 12% on AIME (GPT 4o) → IMO gold in ~ 15 months. We have come very far very quickly."
Looking at the actual proofs OpenAI's model generated reveals sophisticated reasoning:
- Problem 1: A complex combinatorial proof about "sunny lines" with careful case analysis
- Problem 2: An intricate geometric proof using analytic coordinates and algebraic manipulation
- Problem 3: A number theory proof involving prime divisors and careful induction
- Problem 4: Analysis of divisor functions with multiple subcases
- Problem 5: A game theory proof finding optimal strategies through backwards induction
As many have pointed out, they're novel mathematical arguments constructed from first principles.
For more about this, watch this video from Wes Roth, which is the best explanation of the Math Olympiad news we’ve seen:
Interestingly, the proofs use remarkably terse language, as one Hacker News commenter noted: "Why waste time say lot word when few word do trick?" This abbreviated style suggests the model was optimized for efficiency - perhaps generating many candidate solutions in parallel, where minimizing tokens becomes crucial for scalability.
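To picture what that kind of parallel-candidate setup might look like, here's a minimal best-of-n sketch in Python. To be clear, this is our illustration, not OpenAI's actual pipeline: `generate_proof` and `score_proof` are hypothetical stand-ins for a reasoning model and a grader, and the point is just that when you sample many candidates, every extra token in every candidate multiplies the total cost.

```python
import concurrent.futures

# Hypothetical stand-ins: in a real system these would call a reasoning model
# and a grader/verifier; neither is OpenAI's actual API.
def generate_proof(problem: str, seed: int) -> str:
    return f"candidate proof #{seed} for: {problem}"

def score_proof(problem: str, proof: str) -> float:
    # Placeholder scorer that prefers terser candidates, echoing the
    # "few word do trick" observation above.
    return 1.0 / (1 + len(proof))

def best_of_n(problem: str, n: int = 32) -> str:
    """Sample n candidate proofs in parallel and keep the highest-scoring one."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        candidates = list(pool.map(lambda seed: generate_proof(problem, seed), range(n)))
    return max(candidates, key=lambda proof: score_proof(problem, proof))

if __name__ == "__main__":
    print(best_of_n("IMO 2025 Problem 1", n=8))
```

The economics are the point: at n candidates per problem, shaving tokens off each attempt is the difference between a search that scales and one that doesn't.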
The Secret Sauce: Long Thinking Times and New Techniques
Noam Brown revealed crucial details about what made this breakthrough possible:
"We developed new techniques that make LLMs a lot better at hard-to-verify tasks. IMO problems were the perfect challenge for this: proofs are pages long and take experts hours to grade."
The thinking time is dramatically different from previous models:
- o1 thinks for seconds.
- Deep Research thinks for minutes.
- The IMO model thinks for hours.
Brown emphasizes: "Importantly, it's also more efficient with its thinking. And there's a lot of room to push the test-time compute and efficiency further."
The General vs. Specific Approach: A Key Distinction
What makes OpenAI's achievement particularly striking is how they did it. Unlike DeepMind's 2024 silver medal performance with AlphaProof and AlphaGeometry—systems specifically built for mathematical reasoning—OpenAI used a general-purpose language model.
As Wes Roth explains, this boils down to the difference between narrow AI and general AI.
- Narrow AI is a system engineered to be superhuman at one specific task. Think of a chess engine: it can beat any grandmaster, but it can't write a poem. Google's impressive AlphaProof fits this mold.
- General AI is a single system that can reason and learn across a wide range of different domains, much like a human. This is what LLMs are striving to be.
The breakthrough here is that OpenAI claims to have reached the pinnacle of a highly specialized, human-only domain using their general-purpose system. It suggests the model's advanced reasoning isn't a one-trick pony; it’s a core, flexible capability.
Jerry Tworek confirmed that the model received "very little IMO-specific work"—just continued training of the general-purpose base models. All solutions relied on natural language proofs without any special evaluation framework.
So How Does This Compare to Current Models?
Just days before OpenAI's announcement, MathArena tested publicly available models on the same IMO 2025 problems. The results were sobering:
- Gemini 2.5 Pro led with only 13 out of 42 points.
- o3 and o4-mini performed even worse.
- None achieved the 19 points needed for bronze.
This stark contrast highlights the leap OpenAI's experimental model represents.
Chess, Go, and Now Math: Why This MIGHT Matter More
Some skeptics argue this is less impressive than AI conquering chess or Go. But as several researchers pointed out, competitive math requires a fundamentally different kind of intelligence. While chess and Go have fixed rules and limited move spaces, IMO problems demand creative insight, pattern recognition across diverse mathematical domains, and the ability to construct rigorous multi-page proofs in natural language.
Unlike game-playing AIs that master a single domain, this achievement required what Brown called "new techniques that make LLMs a lot better at hard-to-verify tasks." IMO proofs can't be verified by simple rules; they take expert human mathematicians hours to evaluate.
But Wait, Don't AI Models "Cheat" at Math?
Coincidentally, just before OpenAI's announcement, researchers from Hong Kong Polytechnic University published VAR-MATH, claiming that AI models essentially memorize answers rather than truly reason. Their findings showed dramatic performance drops when numbers in problems were changed - some models plummeting from 78% accuracy to just 2.5%.
So which is correct? Did OpenAI's model really "solve" math, or did it cheat?
They're both right, but they're measuring fundamentally different things.
The VAR-MATH Findings: A Valid Concern for Most Models
The Hong Kong Polytechnic study makes an important point. When they tested models by changing numbers in problems (like replacing "1" with "2, 5, or 15" in geometric problems), many models collapsed:
- 7B models dropped 48-58% in performance.
- Even 32B models fell 40-46%.
- Some models plummeted from 78% accuracy to just 2.5%.
This reveals that many models, especially smaller ones, are indeed pattern-matching rather than truly understanding mathematical concepts.
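To make the VAR-MATH idea concrete, here's a rough sketch of a "variabilized" check. This is our toy example, not the paper's actual benchmark; `ask_model` is a hypothetical stand-in that behaves like a memorizer. Instead of grading one fixed problem, you re-instantiate the same template with different numbers and only give credit if every variant is solved.

```python
# Hypothetical model call: swap in a real API client to try this against a real model.
def ask_model(question: str) -> float:
    # Placeholder that behaves like a memorizer: it only "knows" the radius-1 instance.
    return 3.141592653589793 if "radius 1" in question else 0.0

def solves_all_variants() -> bool:
    """VAR-MATH-style check: credit the problem only if every numeric variant is solved."""
    for r in (1, 2, 5, 15):                        # vary the number, keep the structure
        question = f"What is the area of a circle with radius {r}?"
        expected = 3.141592653589793 * r * r       # ground truth recomputed per variant
        if abs(ask_model(question) - expected) > 1e-6:
            return False
    return True

print(solves_all_variants())  # False: the memorizer fails as soon as r != 1
```

A model that merely pattern-matched the original instance collapses under this kind of test, which is exactly the drop the paper measured in smaller models.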
But Here's What the Headlines Missed
Buried in the VAR-MATH paper is a crucial finding: the best models showed little to no performance drop. DeepSeek-R1 held steady at 100% on VAR-AMC23 (a 0% drop) and fell only 12% on the harder AIME24 benchmark, while OpenAI's o4-mini dropped just 12% on AMC23.
The paper itself acknowledges: "leading models like DeepSeek-R1 and SEED-THINK demonstrate strong resilience on VAR-MATH, with performance drops under 5%." The research, rather than debunking AI's abilities, actually helps validate that the most advanced models have achieved genuine understanding.
The VAR-MATH researchers tested models on relatively simple competition problems where memorization is possible. As one Reddit commenter astutely noted: "everybody knows 32b models aren't great nobody's using them for anything that requires them to be smart. If you need smartness you use a bigger model, simple"
Meanwhile, OpenAI's experimental model represents a new frontier. As Alexander Wei explains: "We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling."
Trust, But Verify
The announcement sparked immediate skepticism, not only because of the accusations of impropriety around how it was announced, but also due to OpenAI's recent FrontierMath controversy (which we called "Benchmark gate"), where questions arose about training data contamination (TL;DR, OpenAI was funding the FrontierMath org and may have gotten early access to answers).
Additionally, several Hacker News commenters pointed out the lack of transparency about methodology, compute resources used, and verification processes. Meanwhile, Fields Medalist Terence Tao raised important questions about comparing AI and human performance without standardized methodology.
Tao argues that the difficulty of the IMO can vary dramatically based on the format. Consider if contestants were given:
- Several days per problem instead of 4.5 hours for three problems.
- Questions rewritten in their preferred format.
- Access to calculators, computer algebra systems, or the internet.
- The ability to work in teams and communicate.
- Multiple attempts with only the best solutions submitted.
- The option to silently withdraw if unsuccessful.
"A student or team of students who might not even reach bronze medal performance if taking the competition under standard test conditions might instead reach gold medal performance under some of the modified formats," Tao notes.
Massive compute, he noted, is like giving an AI a "time acceleration machine"—a Dragonball Z-style hyperbolic time chamber that humans don't have.
His key point: "In the absence of a controlled test methodology that was not self-selected by the competing teams, one should be wary of making apples-to-apples comparisons between the performance of various AI models on competitions such as the IMO, or between such models and the human contestants."
As a counterpoint, like we shared above, OpenAI did release the actual proofs on GitHub for inspection (although some claim this release is weak), and Mark Chen confirmed the model was locked before the competition began. The IMO problems themselves were brand new, making pre-training impossible. Even still, the move rubbed Tao and the IMO the wrong way.
Tao concluded: "I will not be commenting on any self-reported AI competition performance results for which the methodology was not disclosed in advance of the competition." One could interpret this "vague-posting" as deep disappointment at best and a seething "how dare you" at worst.
DeepMind's Silent Gold
Interestingly, reports suggest DeepMind also achieved gold medal performance at IMO 2025, but bureaucracy (and y'know, not stealing the spotlight from the kids) may have delayed their announcement.
See, the IMO committee apparently requested that AI involvement not be publicly discussed until a week after the closing ceremony. We guess you could say OpenAI pulled a Dwight from The Office and said "I am the Hay King."
One System to Rule Them All
As pointed out by Decoder, perhaps the most stunning revelation came from Jerry Tworek's summary of OpenAI's week:
"To summarize this week:
- we released general purpose computer using agent.
- got beaten by a single human in atcoder heuristics competition.
- solved 5/6 new IMO problems with natural language proofs.
All of those are based on the same single reinforcement learning system."
This means the same underlying technology powers:
- The IMO gold medal performance.
- OpenAI's new agent system.
- The model that competed at AtCoder (and lost to a single human engineer; more below).
So no matter where the experimental AI lands on the IMO chart, whatever system OpenAI is using internally to train all these models is quite impressive.
Humanity's Last Stand: The AtCoder Victory
In a poetic twist, while OpenAI supposedly conquered the mathematical olympiad problems, a human programmer beat OpenAI's other model at the AtCoder World Tour Finals.
Przemysław "Psyho" Dębiak, a 42-year-old programmer from Poland and former OpenAI engineer, defeated OpenAIAHC by roughly 9.5% in a 10-hour coding marathon. "Humanity has prevailed (for now)!" he wrote, confessing he'd slept only around 10 hours over three days.
The victory highlights that in heuristic programming—where "good enough" solutions matter more than perfect ones—human creativity and intuition still reign supreme. Even Sam Altman acknowledged the achievement with a simple "Good job, Psyho."
We don't know if this is the same model, but OpenAI's next AI coding model is crushing benchmarks on the WebDev Arena. Operating under the name "Anonymous Chatbot 0717" (identified as o3-alpha), this new model is "genuinely at a completely different level of front end coding - far better than Sonnet, o3, Gemini 2.5 Pro, or Grok 4."
One tester reported the model "one shotted all the vague prompt i gave" including:
- GTA CLONE
- MINECRAFT CLONE
- Flappy bird 3D
- Pelican on the cycle SVG
Sam Altman himself noted: "woke up early on a saturday to have a couple of hours to try using our new model for a little coding project. done in 5 minutes. it is very, very good. not sure how i feel about it..."
GPT-5: Not One Model, But Many
According to insider reports, GPT-5 is imminent and represents a fundamental shift in architecture:
"It's not one model, but multiple models. It has a router that switches between reasoning, non-reasoning, and tool-using models. That's why Sam said they'd 'fix model naming': prompts will just auto-route to the right model."
This aligns with OpenAI's recent pattern of specialized excellence—different models for different tasks, all unified under one interface.
So does this mean we'll soon have agent, o3-alpha, and OpenAI's experimental math model as all part of GPT-5?
Perhaps AT SOME point... but not quite so soon.
Why? Well, first there was Grok 4. Then there was Kimi K2. It's been hypothesized that both of these releases make OpenAI's upcoming launches look much less impressive (Grok 4's benchmarks ate into OpenAI's, and K2's world-class open model made OpenAI's open model look like less of a leap forward from DeepSeek R1, supposedly; none of this is public at this point, so it's hard to say, but we do know the open model got delayed indefinitely for safety training).
When Can We Use This?
Despite the excitement, Tworek clarified that the IMO-level model is "probably end of year thing" for public release. Alexander Wei echoed this, stating they "don't plan to release anything with this level of math capability for several months."
The researchers are already looking ahead, with Tworek hinting at tackling Millennium Prize Problems as the next milestone.
The Bottom Line
The official medal count remains TBD. But the drama over the score and OpenAI's PR tactics is secondary to the real story: the emergence of a powerful, general-purpose reasoning engine. Whether it earned a gold or silver, OpenAI has shown that one core system can achieve elite performance in wildly different domains, pushing the frontier of AI forward at a pace that continues to surprise even its own creators.
As Noam Brown reflects: "When you work at a frontier lab, you usually know where frontier capabilities are months before anyone else. But this result is brand new... It was a surprise even to many researchers at OpenAI. Today, everyone gets to see where the frontier is."
So while many smaller models rely on pattern matching and fail symbolic generalization tests, the latest frontier models - the ones pushing boundaries with massive compute and novel training approaches - are demonstrating genuine mathematical reasoning that rivals top human mathematicians.