AI Just Did Something No Textbook Said Was Possible. Twice. In One Day.

OpenAI's unreleased model likely solved 5 of 10 research-level math problems that had never appeared online, while GPT-5.2 derived a new physics result that overturned decades of textbook assumptions. February 13th, 2026 might be the day AI crossed the line from tool to researcher.

Written By

Grant Harvey

Feb 16, 2026

6 minute read

For forty years, physicists had the same answer to a specific question about subatomic particle behavior: it's zero. Impossible. Doesn't happen.

Then, on February 13th, 2026, a chatbot said: actually, you missed something.

That same day, eleven of the world's top mathematicians learned that an AI they couldn't even access had likely solved about half of a test they specifically designed to be unsolvable by machines.

Two breakthroughs. One day. And a question nobody can ignore anymore: is AI doing original science?

The Test That Was Supposed to Be Impossible

Let's start with the math.

On February 5th, a group called First Proof posted 10 research-level math problems online and gave the world one week to solve them using AI. The group behind it reads like an academic dream team: 11 mathematicians from Harvard, Stanford, Columbia, Yale, and UC Berkeley, including a Fields Medalist (the Fields Medal is often called the Nobel Prize of math) and two MacArthur "genius" grant winners.

The problems weren't textbook exercises. They spanned 10 different subfields (algebraic topology, spectral graph theory, symplectic geometry, representation theory, and more), and each one took the mathematicians who wrote them weeks to months to solve by hand.

Here's the crucial part: none of these problems had ever appeared on the internet. The solutions existed only in the mathematicians' heads. This made First Proof the first truly "contamination-proof" AI math benchmark; there was no training data to pattern-match against, no shortcuts to exploit. Just raw mathematical reasoning, or nothing.

The organizers expected this to be humbling for AI. It was, mostly.

Publicly available models like ChatGPT and Gemini could only correctly solve 2 out of the 10 problems (problems 9 and 10). That's a failing grade by any measure.

But then OpenAI showed up with something the public can't use yet.

OpenAI's "Chaotic Sprint"

OpenAI's chief scientist Jakub Pachocki announced that they'd run an internal, unreleased model on all 10 problems in what he called a "side-sprint" over the same one-week window. His initial claim: the model likely solved at least 6 of the 10 (problems 2, 4, 5, 6, 9, and 10).

He later revised that number down after community review; problem 2's solution was likely incorrect, bringing the probable count to around 5.

Still: 5 out of 10 on problems that had never existed anywhere on the internet, solved in one week.

Some important caveats from Pachocki's own description:

Limited human oversight was involved. OpenAI didn't feed the model proof strategies, but they did have experts review outputs and asked the model to expand on some proofs.
Multiple attempts were allowed. For some problems, they presented "the best of a few attempts according to human judgement."
The methodology was messy. Pachocki himself called it a "chaotic sprint" where the methodology "leaves a lot to be desired." He also admitted transcripts were "quite scattered."
Verification is still ongoing. One of the First Proof organizers, Nikhil Srivastava from UC Berkeley, asked for the full transcripts. OpenAI said they'd publish more information, but couldn't gather all transcripts due to the chaotic nature of the sprint.
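To see why the "best of a few attempts" caveat matters, here is a minimal sketch of how allowing several tries inflates an apparent solve rate. It assumes attempts are independent and that reviewers reliably recognize a correct attempt when one exists; both are simplifying assumptions, and the per-attempt rate of 0.25 is a made-up illustration, not a figure from OpenAI.

```python
def p_at_least_one(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts


# Hypothetical per-attempt success rate, chosen only for illustration.
P_SINGLE = 0.25

for k in (1, 2, 4, 8):
    # The more attempts allowed, the closer the chance of one success gets to 1.
    print(f"{k} attempt(s): {p_at_least_one(P_SINGLE, k):.3f}")
```

The point is qualitative: a model that rarely gets a problem right in a single try can still look strong if humans pick the best of several outputs, which is why the organizers want full transcripts.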

So this isn't a clean, controlled experiment. But even with all the asterisks, the gap between what an unreleased internal model can do (likely 5 out of 10) versus what the public can access (2 out of 10) is striking. It suggests the next generation of AI models will be meaningfully better at genuine mathematical reasoning, not just pattern matching.
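The arithmetic behind those numbers is simple enough to check directly. The sketch below uses only the problem numbers reported above; the variable and function names are just labels for this illustration.

```python
TOTAL_PROBLEMS = 10

# Problem numbers as reported in this article.
initial_claim = {2, 4, 5, 6, 9, 10}  # Pachocki's first announcement
withdrawn = {2}                      # judged likely incorrect after community review
public_models = {9, 10}              # solved by publicly available models


def solve_rate(solved: set, total: int = TOTAL_PROBLEMS) -> float:
    """Fraction of the problem set counted as solved."""
    return len(solved) / total


revised = initial_claim - withdrawn       # the "likely 5 of 10" figure
internal_only = revised - public_models   # problems only the unreleased model got

print("revised:", sorted(revised), solve_rate(revised))
print("public:", sorted(public_models), solve_rate(public_models))
print("internal only:", sorted(internal_only))
```

Seen this way, the gap comes down to three problems (4, 5, and 6) that the unreleased model likely solved and the public models did not.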

Somewhere, a tenured math professor is nervously refreshing their LinkedIn.

Meanwhile, In Physics...

If the math result wasn't enough, OpenAI dropped a second bombshell on the exact same day.

They published a preprint titled "Single-minus gluon tree amplitudes are nonzero," and it's exactly as dramatic as it sounds (if you're a physicist, anyway).

Here's the plain-English version:

Gluons are subatomic particles that carry the strong nuclear force; think of them as the glue (hence the name) that binds quarks together inside protons and neutrons. When gluons collide, physicists calculate the probability of different outcomes using math called "scattering amplitudes."

For decades, textbooks said that a specific type of gluon interaction ("single-minus" amplitudes, where one gluon spins one way and the rest spin another) was impossible. The math always came out to zero. Physicists had hand-calculated cases up to 6 particles and found nothing, so the consensus was: this doesn't happen.
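For readers who want the textbook statement itself, here it is in the spinor-helicity notation commonly used for gluon amplitudes. This is a sketch for orientation in standard notation, not something taken from the preprint, and it omits coupling constants and overall factors. At tree level, for generic momenta, amplitudes with zero or one negative-helicity gluon vanish, and the first nonzero case (two negative helicities, on gluons i and j) is the Parke-Taylor formula:

```latex
\begin{align}
A_n^{\text{tree}}(1^+, 2^+, \dots, n^+) &= 0, \\
A_n^{\text{tree}}(1^-, 2^+, \dots, n^+) &= 0, \\
A_n^{\text{tree}}(\dots, i^-, \dots, j^-, \dots) &=
  \frac{\langle i\, j \rangle^4}
       {\langle 1\, 2 \rangle \langle 2\, 3 \rangle \cdots \langle n\, 1 \rangle}.
\end{align}
```

Here each bracket is a spinor product built from the momenta of the two gluons named inside it. The second line is the "single-minus" case that the textbooks set to zero.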

GPT-5.2 found the exception.

Working with researchers from Harvard, Cambridge, the Institute for Advanced Study, Vanderbilt, and OpenAI, the model identified a precise mathematical scenario called the "half-collinear regime" where these supposedly impossible interactions actually do occur. The researchers had manually worked out extremely complex equations for smaller cases, and GPT-5.2 did something no human had managed: it simplified those superexponentially complex expressions into a single, elegant formula that works for any number of particles.

This wasn't a single lucky prompt. According to reports, it was an iterative, 12-hour session of scaffolded reasoning, with the model and humans going back and forth, testing and refining.

Nathaniel Craig, a physics professor at UC Santa Barbara, didn't mince words. He called the work "journal-level research advancing the frontiers of theoretical physics" and "a glimpse into the future of AI-assisted science."

The research team has already extended these findings from gluons to gravitons (the theoretical particles that carry gravity), with more generalizations in development.

Why Should You Care? (The "So What" for Non-Physicists)

You probably don't need to understand gluon amplitudes to do your job. But here's what these two breakthroughs, on the same day, signal for everyone:

The "AI can't do real science" era is ending. For years, critics dismissed AI benchmarks by saying the models were just memorizing training data. The First Proof challenge was specifically designed to eliminate that argument, using problems that existed nowhere online. And an unreleased model still likely solved half of them. Meanwhile, GPT-5.2 didn't just regurgitate known physics; it found something physicists had missed for four decades.

The gap between what AI companies have internally and what you can use publicly is growing. OpenAI's internal model likely solved 5 problems; the public models solved 2. That gap matters because it suggests the capabilities coming in the next 6-12 months are significantly beyond what's available today. If you're making decisions about AI adoption, you're evaluating yesterday's tools.

Human-AI collaboration is the winning formula, for now. Neither breakthrough was "AI alone." The math sprint involved human oversight and expert review. The physics paper was a 12-hour back-and-forth between GPT-5.2 and researchers. The pattern emerging is: AI proposes, humans verify. AI spots patterns in complexity that overwhelms human cognition; humans provide the domain expertise and verification that AI can't self-certify.

As Ethan Mollick, a Wharton professor who studies AI adoption, framed it: the transition from "AI can't do novel science" to "of course AI does novel science" will follow the same pattern as every other AI transition. First the hype, then the skeptics, then smart people use AI to help them, then AI starts doing more of the work, then the breakthroughs start compounding.

His assessment of where we are right now? Somewhere between "smart people use AI to help them" and "AI starts to do more of the work."

OpenAI's Noam Brown was more direct: "We think our latest models will remove any doubt that STEM research is about to fundamentally change."

What Happens Next

The next round of First Proof problems drops March 14, 2026. This time, OpenAI (and likely other labs) will be prepared with more controlled methodologies.

If the pattern holds, we should expect:

Higher solve rates as models improve and teams refine their approaches
More labs participating, creating a competitive benchmark for frontier AI reasoning
Better transparency about how much human involvement is required versus pure AI reasoning
Growing pressure on the academic community to reckon with AI as a genuine research tool, not just a writing assistant

For knowledge workers, the practical takeaway is this: if AI can solve research-level math problems it's never seen before, the ceiling on what it can do for your industry-specific problems is much higher than most people assume. The question is no longer whether AI can reason. It's how fast that reasoning is going to get better.

And based on February 13th, the answer is: very fast.

The First Proof challenge is ongoing at 1stproof.org. OpenAI's physics preprint is available at openai.com. The solutions PDF for the math challenge can be found here.