Inception's Mercury 2: The First Reasoning Diffusion Model

Inception’s Mercury 2: The First Reasoning Diffusion Model

Mercury 2 is the first reasoning diffusion LLM, and it claims to complete reasoning tasks 5x faster than speed-optimized autoregressive models like Claude 4.5 Haiku and GPT-5.2 Mini. Instead of generating text one token at a time, it refines outputs in parallel, shifting the speed–quality curve at the modeling level.

Written By
Corey Noles
Corey Noles
Feb 24, 2026
3 minute read

For two years, the AI race has looked like this: bigger models, better chips, faster inference stacks. Everyone is squeezing more juice out of the same orange.

Inception just planted a different tree.

Meet Mercury 2, the first reasoning diffusion LLM (dLLM). According to the company, it completes reasoning tasks 5x faster than other LLMs in its class, like Claude 4.5 Haiku and GPT-5 Mini.

That’s an impressive modeling shift with many applications in agentic workflows, coding, reasoning tasks, and even in chat.

The Core Idea: Stop Typing, Start Editing

Most large language models are autoregressive. They generate text one token at a time. Think: typewriter. Each word is locked in before the next begins. If the model drifts off course early, it can’t go back; it just keeps typing.

Mercury 2 doesn’t work that way.

It’s a diffusion LLM (dLLM). Instead of predicting the next token sequentially, it starts with noise and iteratively refines the output in parallel.

Inception’s analogy:

  • Autoregressive = a one-way typewriter
  • Diffusion = an editor revising the whole draft at once

“It starts with a rough sketch of the full output and refines it through a process called denoising, across many tokens in parallel. Each pass through the model modifies and improves multiple tokens simultaneously.”

If you’ve read our previous deep dive on Mercury 1 or our explainer on why diffusion for writing matters, this is the same core bet, now upgraded with reasoning and more.

The Benchmarks

Important context: these comparisons are against speed-optimized autoregressive models like Claude 4.5 Haiku and GPT-5.2 Mini — not frontier reasoning giants.

Third-party benchmarking, which Inception said uses methodology consistent with Artificial Analysis, shows:

Output throughput

  • Mercury 2: ~1008 tokens/second
  • Claude 4.5 Haiku: ~89 tokens/second
  • GPT-5 mini: ~71 tokens/second
An AI benchmarking speedchart showing Mercury 2 speed significantly faster than Claude Haiku 4.5 and GPT-5 Mini.

Quality scores:

  • AIME 2025: 91.1
  • GPQA: 73.6
  • IFBench: 71.3
  • LiveCodeBench: 67.3
  • SciCode: 38.4
  • Tau2: 52.9

Translation: it’s positioned as a production-scale reasoning model that’s extremely fast — not necessarily the most powerful frontier model.

Advertisement
An AI benchmarking chart showing Mercury-2 compared to many other models in its class.

What’s Actually New in Mercury 2?

Mercury 1 proved diffusion could work for text and code.

Mercury 2 adds:

  • Reasoning capabilities (multi-step planning, self-correction)
  • Redesigned denoiser components
  • New training objectives
  • New inference algorithms
  • A rebuilt serving engine

It supports:

  • Tool use
  • Structured outputs
  • RAG
  • 128K context window
  • OpenAI API compatibility

Pricing:

  • $0.25 per million input tokens
  • $0.75 per million output tokens

Drop-in replacement. No exotic integration required, as you just connect it like any other API.

Where Speed Actually Changes the Experience

Fast chat is nice. Fast reasoning changes products.

Mercury 2 is aimed at:

Agent loops
Code agents, browser control, SecOps triage, back-office automation. Latency compounds across steps, and shrinking it improves controllability and trust.

Search & voice
Real-time support agents, sales copilots, tutoring, translation. When p95 latency determines whether a conversation feels natural, speed isn’t cosmetic.

Coding workflows
Rapid prompt → review → tweak cycles. Inline refactors. Code review. In-the-flow iteration.

This is where diffusion’s “editor” approach aligns with reasoning: iterative refinement baked into the generation process itself.

Why This Matters Now

Inference is the bottleneck.

Reasoning models and agent workflows require more inference-time compute. Sequential generation compounds latency. That’s why OpenAI, Nvidia, Fireworks, Baseten, and others are pouring billions into faster inference.

But they’re still optimizing the same paradigm.

Mercury 2’s pitch: don’t optimize around the bottleneck, remove it.

Because diffusion generates multiple tokens per forward pass, speed improvements come from the modeling approach itself. Not better kernels. Not quantization. Not just new hardware.

Inception argues this represents a structural shift in the speed vs. quality trade-off curve, not an incremental one.

Advertisement

The Bigger Bet

Diffusion already won in images and video.

Now the people who helped build diffusion, Stefano Ermon (co-inventor of diffusion methods), Aditya Grover, Volodymyr Kuleshovk, are applying it to language at scale.

Inception, backed by Menlo Ventures, Mayfield, M12, NVentures, Snowflake Ventures, Databricks, Innovation Endeavors, and individual investors like Andrew Ng, Andrej Karpathy, and Eric Schmidt, just raised $50M to push this further.

The framing is bold: Diffusion is the successor to the transformer, not an alternative.

The vision: diffusion works for language. It works for reasoning. Now scale it across everything.

If the first phase of AI was “make it bigger,” and the second was “make it faster with hardware,” Mercury 2 is arguing for a third phase: change how generation works at the modeling level.

Corey Noles

Corey Noles is the Host of The Neuron: AI Explained podcast and Managing Editor of AI and Experimental Content at TechnologyAdvice, where he leads the charge in testing and refining emerging content strategies across the company's portfolio.

The Neuron Logo

Don't fall behind on AI. Get the AI trends & tools you need to know. Join 700,000+ professionals from top companies like Microsoft, Apple, Salesforce and more.

Property of TechnologyAdvice. © 2026 TechnologyAdvice. All Rights Reserved

Advertiser Disclosure: Some of the products that appear on this site are from companies from which TechnologyAdvice receives compensation. This compensation may impact how and where products appear on this site including, for example, the order in which they appear. TechnologyAdvice does not include all companies or all types of products available in the marketplace.