SHARE

Diffusion LLMs Just Got Their First Serious Transparency Test

Diffusion LLMs like Mercury made the case for speed. A new DiffusionGemma paper asks the harder question: can we inspect how these models reason while they refine an answer? The early answer is encouraging, but it opens a new product and safety frontier.

Written By

Corey Noles

Jun 22, 2026

6 minute read

Speed got diffusion language models into the conversation. Transparency may decide whether they graduate from clever architecture to trusted AI infrastructure.

A new paper from Google DeepMind, “How Transparent is DiffusionGemma?”, takes one of the most interesting new model families and asks the awkward question: if a language model generates text by refining an entire canvas of tokens across denoising steps, can we still understand how it reasoned?

The answer is surprisingly encouraging, with one giant footnote wearing a lab coat.

DiffusionGemma, the model studied in the paper, is a text diffusion model based on Google’s Gemma family. Rather than committing to one token at a time from left to right, it starts with a noisy token canvas and repeatedly refines the whole thing. That gives it a very different shape of computation. It can revise earlier tokens after later ones become clearer. It can estimate the length of a response before the words settle. It can sketch code structure, fill in logic, then revise earlier variable names and comments.

That is the fun part (no exaggeration, diffusion IS fun to watch.) The scarier part is that some of this work happens in intermediate states that are harder for humans to read than ordinary chain-of-thought text.

The paper’s core finding is that DiffusionGemma is not a black box in the worst-case way people might fear. The authors show that much of the information passed between denoising steps can be mapped through an interpretable token bottleneck with little performance loss. They also find that DiffusionGemma is similarly monitorable to Gemma 4 on the benchmarks they tested.

That matters because diffusion LLMs have mostly been framed around speed until Mercury-2 dropped this spring with full reasoning capabilities at a sliver of the cost of an autoregressive frontier models. (Editor's note: Mercury-2 remains my go-to in my OpenClaw setup.)

Mercury by Inception Labs, continues to make the category feel commercially real by showing diffusion-based coding models that generate many tokens in parallel and report extremely high throughput. Mercury Coder Mini and Small were presented as state-of-the-art on the speed-quality frontier, with reported throughputs of 1,109 and 737 tokens per second on H100 GPUs.

That is the obvious pitch: faster output at comparable quality, lower latency, maybe cheaper inference.

The new DiffusionGemma paper points at the next pitch: if these models think differently, we need tools that show us the thinking shape.

The authors break transparency into a few useful pieces. One is “variable transparency,” meaning whether the intermediate states are understandable. Another is “algorithmic transparency,” meaning whether we can reconstruct how the model got from prompt to answer. They also study monitorability, which asks whether model outputs give downstream monitors enough signal to detect important behavior.

The reassuring result is that DiffusionGemma’s intermediate states often look more token-like than expected. Using logit-lens-style techniques, the researchers could restrict the self-conditioning information between denoising steps to a small set of likely tokens without badly hurting performance on benchmarks. In plain English: even though the model has continuous hidden vectors between denoising rounds, those vectors often seem to point toward readable token guesses.

That is a big deal. If future diffusion models preserve that property, developers may be able to inspect more than the final answer. They may be able to inspect the evolution of the answer.

But diffusion reasoning has its own weird little fingerprints.

The paper describes “non-chronological reasoning,” where the model solves parts of the answer out of order. In one example, DiffusionGemma predicts the length of a photosynthesis answer very early, before the exact wording has settled. In another, it initially leans toward the wrong answer to a square-number problem, then fills in reasoning later and retroactively corrects the answer. For code, the model often commits to the scaffold and core logic before filling in details that appear earlier in the final text.

That is fascinating for coding models. A human programmer often thinks this way too: decide the function shape, place the loop, fill the variables, clean up the comments. Diffusion gives the model a canvas where that style of work is native.

The paper also finds “token smearing” and “sequence smearing.” Sometimes the model appears to know that a token belongs somewhere nearby, while spreading probability mass across several positions before locking it down. In more complex cases, it can hold multiple candidate chunks at once before converging. That behavior is useful from a generation standpoint and messy from an interpretability standpoint. A monitor may need to understand not just what the model said, but which alternatives it briefly considered.

The most important example may be “intermediate context reasoning.” In a Fibonacci-like task, DiffusionGemma sometimes uses a temporary digit as a reasoning placeholder, then replaces it before the final answer. The placeholder helped produce the correct sequence, but it vanished from the output.

That should make every AI safety person sit up a little straighter.

For current chain-of-thought monitoring, we often treat the visible reasoning trace as the audit trail. Imperfect, yes, but still useful. Diffusion models may give us a richer audit trail if we can inspect denoising steps. They may also create new failure modes where the most important reasoning artifact never appears in the final answer.

This connects directly to the broader architecture debate we covered in The Neuron’s Transformer vs. Post-Transformer explainer. The field is looking beyond the default token-by-token Transformer workflow because latency, memory, long-horizon reasoning, and hidden-state computation all matter. Diffusion LLMs are one of the cleanest examples of that shift because they preserve many Transformer ingredients while changing the generation process.

There is a product angle here too. In our coverage of Thinking Machines’ interaction models, the larger theme was timing: AI systems need to respond in the rhythm of real work. Diffusion models fit that theme from another direction. If a model can refine many parts of an answer at once, it may support faster coding, lower-latency interfaces, and more fluid agent workflows.

But the tradeoff is governance. Faster models that reason in harder-to-follow ways create a new requirement: observability for generation itself.

The best version of diffusion LLM infrastructure might include a “denoising trace” beside the final answer. Developers could inspect when the model became confident, where it revised itself, which answer chunks competed, and whether hidden placeholders drove the final output. That could make debugging richer than today’s chat logs.

The weaker version is a very fast model that gives you a polished answer and hides the assembly process.

The DiffusionGemma paper argues that the current model lands closer to the reassuring side. Its monitorability looks similar to Gemma 4. Its intermediate states are often interpretable. Its odd behaviors can be studied with visualization tools. The authors are careful, though: this result depends on DiffusionGemma’s specific architecture and training choices. A future diffusion model trained with different pressures could become much more opaque.

That is the real takeaway.

Mercury showed why builders care about the speed curve and showed legit intelligence from dLLMs for the first time. DiffusionGemma gives researchers a model organism for studying the reasoning curve. The next phase is turning those research tools into developer tools, safety tools, and product expectations.

Because the future of AI generation may be less like watching a sentence type itself and more like watching a draft sharpen into focus.

That can be powerful. It can also be hard to supervise.

The models are learning to revise the whole page at once. Now the rest of the stack has to learn how to read the revision history.

Corey Noles

Corey Noles is the Host of The Neuron: AI Explained podcast and Managing Editor of AI and Experimental Content at TechnologyAdvice, where he leads the charge in testing and refining emerging content strategies across the company's portfolio.

Diffusion LLMs Just Got Their First Serious Transparency Test

Corey Noles

Company

Categories