Mercury 2 and Kona: AI Models Ditching the ChatGPT Playbook

Every major AI model you've used — ChatGPT, Claude, Gemini — works the same way under the hood. They predict text one word at a time, left to right, like a very fast typist. It works, but it's painfully inefficient. Inception Labs just launched Mercury 2, a reasoning model that ditches that approach entirely — and it might be a preview of where the whole industry is headed.

Mercury 2 is the first commercially available diffusion language model with reasoning capabilities. Instead of generating tokens sequentially, it produces a rough draft of the entire answer and then refines it — the same technique behind Midjourney and Sora, now applied to text. The result: 1,009 tokens per second on NVIDIA Blackwell GPUs, roughly 10x the throughput of Claude 4.5 Haiku and GPT 5.2 Mini at comparable quality.

The price tag is wild, too:

$0.25 per million input tokens (half of Gemini 3 Flash, 4x cheaper than Claude Haiku)
$0.75 per million output tokens (4x cheaper than Flash, roughly 6.7x cheaper than Haiku)
128K context window, full tool use, and JSON output support

We sat down with founder Stefano Ermon — the Stanford professor whose lab invented the original diffusion models behind Midjourney and Stable Diffusion — to understand how this works. The key insight: today's LLMs are memory-bound, spending most of their time shuffling data around instead of actually computing. Diffusion models flip that equation by processing tokens in parallel, making far better use of existing GPUs. As Ermon put it in our interview, for the same hardware, you produce more tokens, so the cost per token goes down.

Why does this matter beyond speed?
Why speed matters more than you think for agents
The three phases of AI — and the fourth one coming
The bigger picture: two bets against the autoregressive paradigm.
Here's what's fascinating: diffusion and energy-based models are complementary, not competing.
What would happen if you connected a diffusion LLM to an EBM?

Why does this matter beyond speed?

Here's the part that clicked for us.

Today's AI models are what engineers call "memory-bound." When ChatGPT generates a response, the GPU (the expensive chip doing the work) spends most of its time shuffling data around in memory rather than actually computing anything. To produce one token, it has to read through the entire massive neural network, and all it gets out of that effort is a single word. Then it does the whole thing again for the next word. And the next. And the next.

It's like hiring a master chef, but instead of letting them cook, you make them walk back and forth to the pantry between every single ingredient. The kitchen is world-class. The workflow is the bottleneck.

This is why the AI industry has been throwing more and more GPUs at the problem — not because each one is being fully used, but because each one is being poorly used. As Ermon put it: there's a structural bottleneck, and there's no way to parallelize it because you can't generate the third token until you've generated the first and second.

Diffusion flips this. Instead of generating one token per pass, Mercury modifies multiple tokens at the same time. That shifts the problem from memory-bound to compute-bound — meaning the GPU is now limited by how many calculations it can do, not how fast it can move data around. And GPUs are built for parallel computation. That's literally what they're good at.

As Ermon explained: "It's shifting from a memory-bound regime to a compute-bound regime, which is a much easier quantity to scale. If you were a chip manufacturer, it's a lot easier to add flops than to increase memory bandwidth."

Here's the practical upshot: for the same number of GPUs, you produce more tokens, so the cost per token goes down. Mercury doesn't need a bigger data center. It just uses the existing hardware the way it was actually designed to be used.
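To make the memory-bound vs. compute-bound idea concrete, here is a rough roofline-style sketch. Every number in it (model size, bandwidth, FLOPs) is a made-up assumption chosen for easy arithmetic, not a spec for Mercury or any real GPU. It treats one forward pass as taking the longer of "stream the weights once" and "do the math for every token in the pass."

```python
def tokens_per_second(tokens_per_pass, weight_bytes, mem_bandwidth,
                      flops_per_token, peak_flops):
    """Roofline-style estimate: a forward pass takes as long as the slower of
    (a) streaming the weights from memory once and (b) doing the arithmetic."""
    memory_time = weight_bytes / mem_bandwidth
    compute_time = tokens_per_pass * flops_per_token / peak_flops
    pass_time = max(memory_time, compute_time)
    return tokens_per_pass / pass_time


# Hypothetical hardware and model numbers (assumptions, not measurements).
WEIGHT_BYTES = 20e9      # 10B parameters at 2 bytes each
MEM_BANDWIDTH = 2e12     # 2 TB/s
FLOPS_PER_TOKEN = 20e9   # about 2 FLOPs per parameter per token
PEAK_FLOPS = 1e15        # 1 PFLOP/s

for k in (1, 8, 64, 512, 1024):
    rate = tokens_per_second(k, WEIGHT_BYTES, MEM_BANDWIDTH,
                             FLOPS_PER_TOKEN, PEAK_FLOPS)
    print(f"{k:>4} tokens per pass -> {rate:,.0f} tokens/s")
```

With these assumed numbers, throughput climbs almost linearly with tokens per pass until the arithmetic becomes the limit, then flattens. The sketch ignores the fact that a diffusion model needs several refinement passes per output, so the real gain is smaller than the raw ratio suggests.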

Mercury 2 isn't just another research project. Inception raised $50M from Menlo Ventures, Mayfield, M12 (Microsoft's venture fund), NVentures (NVIDIA), Snowflake Ventures, and Databricks — plus individual backing from Andrew Ng, Andrej Karpathy, and Eric Schmidt. These are people who've seen every AI pitch on earth. Clearly, they're betting that diffusion for language isn't a novelty — it's the next default.

Why speed matters more than you think for agents

Here's the thing about AI agents: latency isn't cosmetic — it's multiplicative. A five-step agent loop where each step takes 20 seconds means you're waiting nearly two minutes for a single workflow to complete. At Mercury's 1.7-second end-to-end latency, that same loop finishes in under 10 seconds.
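The arithmetic behind that comparison is simple enough to sketch. This assumes a strictly sequential loop where each step waits for the previous one; the function name is ours, for illustration only.

```python
def loop_latency(steps, seconds_per_step):
    """Total wall-clock time for a strictly sequential agent loop."""
    return steps * seconds_per_step


slow = loop_latency(5, 20.0)  # five steps at 20 seconds each
fast = loop_latency(5, 1.7)   # five steps at Mercury's quoted 1.7 seconds
print(f"{slow:.1f} s vs {fast:.1f} s ({slow / fast:.1f}x faster)")
```

Add more steps and the gap in absolute seconds keeps growing, which is why per-call latency matters so much more for agents than for single-shot chat.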

That changes what's buildable. As Corey broke down in our recent cost comparison, in a real agent workflow processing 6M tokens per day, Mercury 2 costs $2.50/day vs. Claude Haiku's $12.20/day — nearly 5x cheaper. And the savings compound: fewer retries, faster feedback loops, more iterations before you hit your budget ceiling.

The insight that matters here: choosing the right model isn't about sticker price anymore — it's about your token shape. If your system reads more than it writes, GPT-5 Mini's cached input pricing wins. If your system generates heavy output or runs multi-step loops, Mercury's $0.75/M output tokens and 10x throughput make it the rational choice. Read the full breakdown here.
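Here is a minimal sketch of what "token shape" does to a daily bill. Mercury 2's prices and Haiku's $5/M output price come from this article; Haiku's $1/M input price is inferred from the "4x cheaper" comparison above. The two 6M-token splits are hypothetical and are not the workload from Corey's breakdown, which also accounts for things like caching that this ignores.

```python
def daily_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost for one day of traffic at per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# (input $/M, output $/M)
PRICES = {
    "Mercury 2": (0.25, 0.75),
    "Claude Haiku": (1.00, 5.00),  # input price inferred, see lead-in
}


def compare(input_tokens, output_tokens):
    """Cost per model for a given input/output split."""
    return {name: daily_cost(input_tokens, output_tokens, *price)
            for name, price in PRICES.items()}


read_heavy = compare(5_000_000, 1_000_000)   # hypothetical: mostly reading
write_heavy = compare(1_000_000, 5_000_000)  # hypothetical: mostly writing

for label, costs in (("read-heavy", read_heavy), ("write-heavy", write_heavy)):
    ratio = costs["Claude Haiku"] / costs["Mercury 2"]
    print(f"{label}: {costs} -> {ratio:.1f}x")
```

Same total token count, different bill: the more of your traffic is output, the wider the gap gets.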

The three phases of AI — and the fourth one coming

As Corey wrote in our Mercury 2 deep dive: "If the first phase of AI was 'make it bigger,' and the second was 'make it faster with hardware,' Mercury 2 is arguing for a third phase: change how generation works at the modeling level."

Inception's thesis is simple: don't optimize around the bottleneck — remove it. Every other lab is trying to squeeze incremental speed out of the same token-by-token generation loop. Mercury rewrites the loop itself.

But there might be a fourth phase on the horizon. Right now, even the fastest models still rely on LLM-as-judge for verification — which means your verification step has the same bottleneck you just removed from generation. What if verification didn't need tokens at all?

That's what Logical Intelligence's energy-based reasoning models offer. Kona doesn't verify by generating a second opinion. It scores the entire reasoning trace holistically, in milliseconds, and tells you exactly where the error is — not just that there is one. If you paired Mercury's parallel generation with Kona's instant verification, you'd remove the bottleneck from both sides of the equation.

The bigger picture: two bets against the autoregressive paradigm.

Mercury isn't the only system challenging the token-by-token approach. Earlier this month, we interviewed Eve Bodnia, founder of Logical Intelligence, who is building energy-based reasoning models — with Turing Award winner Yann LeCun as founding chair of their technical research board.

As Eve put it in our interview: "Not every AI has to be LLM-based. There's a lot of information around us not tied to any language." Her model Kona doesn't use tokens at all. It maps problems onto an "energy landscape" — like a topographic map where valleys are correct answers and peaks are wrong ones — and navigates directly to the solution. In a head-to-head Sudoku benchmark, Kona scored 96% where GPT-5.2, Claude, and Gemini combined managed 2%. Their orchestration system Aleph solved 99.4% of PutnamBench, a formal math reasoning benchmark.
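If the "energy landscape" picture feels abstract, here is the general idea in miniature. This is not how Kona works internally (we don't know its energy function or its search procedure); it's a toy one-dimensional landscape with a single valley, and plain gradient descent is our assumption for how to walk downhill.

```python
def energy(x):
    """Toy energy landscape with a single valley at x = 3."""
    return (x - 3.0) ** 2


def energy_gradient(x):
    """Slope of the toy landscape at x."""
    return 2.0 * (x - 3.0)


def descend(x, step_size=0.1, steps=100):
    """Repeatedly step downhill; low energy means 'more correct' here."""
    for _ in range(steps):
        x -= step_size * energy_gradient(x)
    return x


solution = descend(x=-10.0)
print(f"x = {solution:.4f}, energy = {energy(solution):.6f}")
```

The point of the picture: there's no token-by-token output anywhere, just a score over whole candidate answers and a way to move toward better ones.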

Here's what's fascinating: diffusion and energy-based models are complementary, not competing.

Both reject the one-token-at-a-time bottleneck, but they solve different problems. Mercury's diffusion approach generates and refines text in parallel for speed — ideal for coding IDEs, voice agents, customer support, anywhere a human can't wait. Kona's energy-based approach evaluates all possible solutions simultaneously for accuracy — ideal for mission-critical systems where being wrong isn't an option, like robotics, infrastructure, and formal verification.

Both are also shockingly efficient. Mercury hits roughly 1,000 tokens/second on NVIDIA Blackwell GPUs. Kona runs a 200M-parameter model on a single cheap GPU. Compare that to the billions of parameters and thousands of GPUs that frontier LLMs require.

Even the underlying physics rhyme. Stefano's diffusion models start with noise and iteratively denoise toward a good answer — refining a rough sketch into a clean output. Eve's energy models navigate an energy landscape to find the lowest-energy (most correct) state — the same principle that governs how a ball rolls downhill or a planet orbits a star. Both are trained to correct errors rather than predict the next word. Both can revise any part of their output, not just append to the end. And both open the door to multimodal AI — Stefano talked about diffusion for speech, vision, and robotics; Eve is already building energy-based models (EBMs) for robotics that don't need language at all.

Best of all, Eve's EBM still needs a large language model as its interface to the outside world. So why couldn't that model be a diffusion large language model?

What would happen if you connected a diffusion LLM to an EBM?

Short answer: it would be absurdly fast and verifiably accurate. Here's the napkin math.

Today's reasoning pipeline (autoregressive + LLM-as-judge):

A typical agentic reasoning loop looks like this: the model generates a reasoning trace token-by-token, then either a second LLM or the same one evaluates whether the answer is correct, and if not, it regenerates from scratch. Each step is sequential.

Rough numbers using current models with reasoning enabled:

Generate a response: ~15-25 seconds (Claude Haiku 4.5 takes 23.4s, Gemini 3 Flash takes 14.4s)
Evaluate/verify with another LLM call: ~10-15 seconds
If wrong, regenerate entirely: another 15-25 seconds
One loop iteration: ~30-50 seconds. With 2-3 retries, you're at 1-2 minutes.

Diffusion + EBM pipeline

Replace generation with Mercury's diffusion (parallel token generation) and replace LLM-as-judge with Kona's energy scoring (instant holistic evaluation):

Mercury generates a complete reasoning trace: ~1.7 seconds (its measured end-to-end latency)
Kona scores the entire trace with a single energy evaluation: ~300 milliseconds (this is Kona's measured Sudoku solve time on a single GPU; we're assuming scoring costs no more than solving)
Kona doesn't just say "wrong" — it localizes the error (which constraint is broken, which step introduced the problem). So Mercury doesn't regenerate from scratch. Diffusion can edit any part of the output, not just append to the end. Targeted refinement: ~0.5-1 second
One loop iteration: ~2.5-3 seconds. Even with a retry, you're under 5 seconds.

That's roughly 10-20x faster per iteration. But the bigger multiplier is that you need fewer iterations because the EBM tells you where the error is instead of just "try again." If you go from 3 retries on average to 1, that's another 3x. Combined: potentially 30-60x faster agentic loops.
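For anyone who wants to poke at the napkin math, here it is as a few lines of Python. It takes the rounded 30-50 second autoregressive iteration from the text as given, builds the diffusion + EBM iteration from its three quoted stages, and treats the 3x retry reduction as an assumption. The helper names are ours.

```python
def iteration_range(*stages):
    """Sum per-stage (low, high) second ranges into one (low, high) range."""
    return (sum(s[0] for s in stages), sum(s[1] for s in stages))


def speedup_range(slow, fast):
    """Most conservative and most optimistic ratio between two ranges."""
    return (slow[0] / fast[1], slow[1] / fast[0])


autoregressive = (30.0, 50.0)  # rounded per-iteration figure from the text
diffusion_ebm = iteration_range(
    (1.7, 1.7),   # Mercury generates the trace
    (0.3, 0.3),   # Kona scores it
    (0.5, 1.0),   # targeted refinement
)

per_iteration = speedup_range(autoregressive, diffusion_ebm)
retry_factor = 3 / 1  # assumed: 3 retries on average drops to 1
combined = (per_iteration[0] * retry_factor, per_iteration[1] * retry_factor)

print(f"diffusion + EBM iteration: {diffusion_ebm[0]:.1f}-{diffusion_ebm[1]:.1f} s")
print(f"per-iteration speedup: {per_iteration[0]:.0f}-{per_iteration[1]:.0f}x")
print(f"with fewer retries: {combined[0]:.0f}-{combined[1]:.0f}x")
```

The top of the combined range is just the two most optimistic estimates multiplied together, so treat it as a ceiling rather than a forecast.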

On cost:

Mercury 2: $0.75 per million output tokens
Kona: 200M parameter model on a single cheap GPU (Eve said their entire Sudoku benchmark cost $4 vs. $11,000 for the LLMs)
Current stack: Claude Haiku at $5/M output tokens + verification LLM calls on top
Rough estimate: 8-15x cheaper per verified reasoning step

The theoretical unlock: This is why Eve's framing of "EBM for reasoning, LLM for the interface" maps so cleanly onto diffusion. Mercury already generates holistically and refines iteratively — that's exactly the kind of output an EBM is designed to score. A traditional autoregressive model produces tokens sequentially, so the EBM has to wait for the full sequence before evaluating. With diffusion, the EBM could theoretically score partial drafts mid-refinement and steer the denoising process toward lower-energy (more correct) states. You'd essentially be combining two error-correction systems: diffusion's iterative refinement (making it look better) with EBM's energy scoring (making it actually correct).
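Here's a toy version of that verify, localize, and edit loop. To be clear, this is not Mercury's or Kona's API, and neither company has published an integration. The "draft" is a random row of nine digits, the "energy" is simply the number of positions that break a Sudoku-style no-repeats rule, and the "refinement" rewrites only the flagged positions. All function names are hypothetical.

```python
import random


def flagged_positions(row):
    """Localize errors: indices where a value repeats an earlier one."""
    seen, bad = set(), []
    for i, value in enumerate(row):
        if value in seen:
            bad.append(i)
        else:
            seen.add(value)
    return bad


def energy(row):
    """Toy energy: number of rule-breaking positions (0 means valid)."""
    return len(flagged_positions(row))


def refine(row, bad, rng):
    """Targeted edit: rewrite only the flagged positions with missing values."""
    missing = [v for v in range(1, len(row) + 1) if v not in row]
    rng.shuffle(missing)
    fixed = list(row)
    for i, value in zip(bad, missing):
        fixed[i] = value
    return fixed


rng = random.Random(0)
draft = [rng.randint(1, 9) for _ in range(9)]  # stand-in for a rough draft
bad = flagged_positions(draft)
final = refine(draft, bad, rng)

print("draft:", draft, "energy:", energy(draft))
print("final:", final, "energy:", energy(final))
print("positions rewritten:", len(bad), "of", len(draft))
```

Because this toy rule is so easy, one targeted pass always reaches zero energy. A real system would loop, and the interesting part is the last line: only the flagged positions get rewritten, and everything that was already right stays put.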

Neither company has announced this kind of integration yet. But the architectures are compatible: Stefano said Mercury still uses a transformer backbone, and Eve confirmed Kona can attach to transformer-based systems. The pieces fit.

The through-line: the future AI stack might look very different from today's. Diffusion for fast generation. Energy-based models for verified reasoning. And traditional LLMs as the interface layer — the part that talks to you, not the part that thinks. Maybe there's even room in there for a world model or a neurosymbolic intelligence layer. As Eve said: "EBM is one part of the story. EBM attached to the LLM is another part of the story." The same is true of diffusion.

Try Mercury yourself at Inception Chat. Watch the diffusion animation — it looks nothing like any AI you've used before.

Inception was founded by Stanford, UCLA, and Cornell professors who helped build foundational AI techniques like FlashAttention and Direct Preference Optimization (DPO). They're backed by NVIDIA, Databricks, and investors including Andrej Karpathy, Andrew Ng, and Eric Schmidt, with $50M raised.

If you're like Tyler on TBPN and much less bullish on open Chinese models after yesterday's news, where Anthropic more or less burned them as spies that distill the results of US-based frontier models instead of innovating on their own (though, to be honest, this could just be Western lab cope), then join us in protecting and honoring genuinely innovative companies like Inception. They're fixing the ills of inefficient autoregressive language models and lowering the cost of intelligence for everyone. You're a real one, Stefano!