Continuous Thought Machine, Explained

One of the original authors of the "Attention Is All You Need" paper—the research that created the Transformer architecture powering ChatGPT, Claude, and basically every AI you use—has stopped working on Transformers.

Why? Llion Jones, now at Sakana AI in Japan, thinks we're trapped in a "local minimum." He told the brilliant podcast Machine Learning Street Talk that the architecture works so well that everyone's just tweaking it instead of finding the next big thing.

Here's the problem he sees: Current AI models can be forced to do anything with enough compute and data. But that doesn't mean they actually understand what they're doing.

Jones uses a striking visual example from an obscure paper called Intelligent Matrix Exponentiation. When you train a standard neural network to classify a spiral shape, it draws tiny piecewise linear boundaries that happen to look like a spiral. But a different layer design in that paper drew the decision boundary as an actual spiral—and could correctly predict where the spiral goes next.

Standard networks fake understanding. They don't represent a spiral as a spiral, a hand as a hand, or (Jones implies) language as... whatever language actually is.

His solution: The Continuous Thought Machine, a new architecture that just earned a spotlight at NeurIPS 2025. Built with researcher Luke Darlow, it has three key innovations:

Sequential internal thinking: Instead of processing everything in one shot, it reasons step-by-step through problems
Smarter neurons: Each neuron is a tiny model that uses its own history, not just a simple on/off switch
Synchronization as representation: The model measures how neurons fire together over time—like brainwaves

The results are wild. When trained on maze-solving, the CTM actually backtracks when it makes wrong turns. When it doesn't have enough time to solve a maze, it learns a leapfrogging algorithm to approximate solutions. And it's nearly perfectly calibrated out of the box—something that usually requires post-hoc fixes.

Jones also dropped a challenge: Sudoku Bench, using handcrafted variant Sudokus from YouTube channel Cracking the Cryptic. Current AI models max out around 15% accuracy. Humans solve them routinely.

The uncomfortable implication: We've been so dazzled by scale that we forgot to ask if the architecture is right. Watch the full interview here.

Key Insights, Predictions & Technical Takeaways
The Fix: Rethinking How AI Neurons Work
How it works, in plain English:
What Happens When You Train It
The Neuron Dynamics Look Different
The Bigger Point: We Might Be Wasting Time
The Sudoku Challenge
Why This Matters for You

Key Insights, Predictions & Technical Takeaways

(00:12) Leaving the Transformer Behind Llion Jones, a co-inventor of the Transformer, reveals he is drastically reducing his research on Transformers because the space is "oversaturated." He believes the industry is stuck in a local minimum and is pivoting to nature-inspired, exploration-heavy research like the Continuous Thought Machine (CTM).
(01:26) The Loss of "Lunch Table" Innovation Jones contrasts the current capital-intensive research environment with the era of the Transformer's invention. The Transformer wasn't a top-down corporate mandate; it was a bottom-up result of researchers "talking over lunch" and having the freedom to experiment for months without immediate pressure.
(02:07) The Untapped Potential of Evolutionary Scale Despite hundreds of millions of dollars spent on compute, evolutionary-based search (A-Life) experiments are still done on a relatively small scale. Jones predicts that when someone finally scales up evolutionary search algorithms using massive compute, it will yield major breakthroughs that current methods miss.
(03:47) Epistemic Foraging vs. "Gray Goo" The host and guests discuss Kenneth Stanley's Why Greatness Cannot Be Planned, arguing that research needs "epistemic foraging"—following gradients of interest without objectives. When too many committees and agendas are involved, research turns into "gray goo" lacking novelty.
(06:44) The Phenomenon of "Technology Capture" Just as YouTubers suffer from "audience capture," AI labs suffer from "technology capture." Early Google allowed open-ended exploration, but labs like OpenAI are morphing into application platforms (e.g., search, social), which forces research into narrow commercial pathways and kills autonomy.
(08:06) The RNN Parallel: A Warning from History Jones draws a parallel to the pre-Transformer era, where researchers spent years making tiny incremental tweaks to RNNs and LSTMs (e.g., hierarchical LSTMs, gating tweaks) for negligible gains. The Transformer rendered that entire body of research redundant overnight, and he fears the industry is currently doing the exact same thing with Transformers.
(11:04) The Redundancy of 1.1 Bits Per Character Jones recalls that when they first applied deep Transformers to language modeling, they immediately hit scores (e.g., 1.1 bits per character) so low that colleagues thought they had made a calculation error, proving that architectural shifts can dwarf years of incremental optimization.
(17:09) "Jagged Intelligence" as a Flaw The phenomenon where LLMs solve PhD-level problems one moment and fail basic logic the next ("jagged intelligence") is not just a data issue but a reflection of a fundamental flaw in the current architecture. Jones argues the current technology is "too good" at faking it, masking deep structural issues.
(18:32) The "Spiral Problem" & Representation Jones uses a "Spiral" data set as a litmus test for intelligence. Standard neural networks (ReLUs) solve a spiral by creating piecewise linear cuts—they "fake" the shape without understanding it. A true intelligence should represent the spiral as a spiral (constructively), allowing it to extrapolate correctly, which current models fail to do.
(22:18) The Five-Finger Illusion Jones argues that fixing image generation models to produce five fingers by simply adding more data and brute force didn't fix the underlying representation problem. It just forced the network to memorize "five fingers," whereas a model with a true representation of a human hand would count fingers naturally.
(29:54) Continuous Thought Machines (CTM): The Internal Dimension Luke Darlow introduces the CTM, which relies on an "internal thought dimension." Unlike Transformers that process in one shot, CTM applies compute sequentially in a latent space. This allows the model to "think" for an arbitrary duration before generating an output.
(30:51) Maze Solving: The "Hello World" of Reasoning Darlow distinguishes between "vision" maze solving (trivial for CNNs that see the whole image at once) and "sequential" maze solving (human-like: "go up, go right"). CTM forces the model to solve the maze sequentially, which is exponentially harder but builds a foundation for true reasoning.
(32:04) Neurons as Micro-Models In the CTM architecture, every single neuron is modeled not as a scalar value but as a small Multi-Layer Perceptron (MLP) with a history. This attempts to bridge the gap between the complexity of biological neurons and the parallelizability of deep learning.
(33:13) Representation via Synchronization CTM defines a "thought" not by the state of neurons at a given time, but by how neurons synchronize (fire together) over time. They measure the dot product of time-series activations between pairs of neurons, creating a dynamic relational representation.
(34:57) The Necessity of Autocurriculum CTM could not learn to predict 100 steps of a maze in one shot. The team had to build an "autocurriculum" where the model is supervised to predict only one step further than it currently can (self-bootstrapping), mimicking how humans learn complex sequential tasks.
(37:58) Natural Adaptive Computation CTM achieves adaptive computation (using few steps for easy tasks, many for hard ones) without complex penalties. They simply average the cross-entropy loss at two internal steps, the one with the "best performance" (lowest loss) and the one with the "highest certainty," causing the model to naturally distribute its "thinking time" based on difficulty.
(41:56) Quadratic Representation Space Because CTM relies on synchronization between pairs, a system of $D$ neurons creates a representation space of size roughly $D^2/2$. This quadratic expansion creates a vastly richer underlying state space than standard networks while remaining trainable via backpropagation.
(47:09) Breaking the "One-Shot" Classification Paradigm Standard classifiers (ViTs/CNNs) must nest the reasoning for easy and hard classes into the same parallel pass. CTM allows the model to "stop thinking" early for an obvious cat image but continue processing for an ambiguous one, naturally segmenting the problem space by difficulty.
(49:19) Perfect Calibration Without Post-Hoc Fixes Deep neural networks typically become "uncalibrated" (overconfident) as they are trained. The CTM, however, was found to be nearly perfectly calibrated after training—meaning if it says it is 90% sure, it is correct 90% of the time—without any of the usual post-hoc calibration tricks.
(58:03) Emergent "Leapfrog" Algorithms When the CTM was constrained to have less thinking time than required to trace a maze fully, it emerged with a "leapfrogging" strategy: it jumped ahead to a likely future point, traced backward to fill the gap, and then jumped forward again. This behavior was not programmed but emerged from the constraints.
(1:01:44) Sudoku Bench: The Anti-LLM Benchmark Sakana AI released "Sudoku Bench," consisting of variant Sudokus (e.g., "Knights Move" or "Thermo" Sudoku). These require understanding natural language constraints and performing strict logical deduction. Current models, even GPT-5 class, fail miserably (around 15% on simple ones) because they cannot hallucinate their way through strict logic constraints.
(1:05:06) "Thought Data" from Cracking the Cryptic Jones was inspired by Andrej Karpathy's request for "thought traces" rather than just internet text. He realized the YouTube channel "Cracking the Cryptic" features experts verbalizing their exact reasoning for thousands of hours. They scraped this to create a dataset of pure, high-quality human reasoning.
(1:06:45) The "Break-in" Concept In variant Sudoku, solving requires finding a "break-in": a unique, often subtle logical interaction between rules that unlocks the puzzle. AI models currently resort to guessing numbers (brute force) rather than identifying these logical break-ins, which suggests they lack the meta-reasoning required.
(1:07:47) The "Deductive Closure" of Knowledge The host posits that knowledge is a "deductive closure" or a massive tree, and we generally "fish" for Lego blocks of reasoning. Current Reinforcement Learning (RL) fails on Sudoku Bench because the "tree of reasoning" is too sparse and the phylogenetic distance between reasoning motifs is too large for random sampling to bridge.
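To make the autocurriculum takeaway (34:57) concrete, here is a minimal sketch of "supervise only one step further than the model can currently get right." The function names, the `lookahead` parameter, and the toy route are assumptions for illustration, not Sakana's actual training code.

```python
def supervised_prefix_length(predicted, target, lookahead=1):
    """How many leading route steps to include in the loss.

    Counts the steps the model already gets right from the start of the
    route, then extends supervision by `lookahead` more steps
    (hypothetical parameter; the interview describes one step).
    """
    correct = 0
    for p, t in zip(predicted, target):
        if p != t:
            break
        correct += 1
    return min(correct + lookahead, len(target))


def curriculum_loss(step_losses, predicted, target, lookahead=1):
    """Average the per-step losses over the supervised prefix only."""
    n = supervised_prefix_length(predicted, target, lookahead)
    return sum(step_losses[:n]) / n


# Toy maze route: Up, Up, Right, Right, Down (made-up example).
target = ["U", "U", "R", "R", "D"]
predicted = ["U", "U", "L", "R", "D"]      # first mistake at step index 2
step_losses = [0.1, 0.2, 2.0, 0.3, 0.1]    # made-up per-step losses

prefix = supervised_prefix_length(predicted, target)
loss = curriculum_loss(step_losses, predicted, target)
```

Here the model gets the first two moves right, so only the first three steps contribute to the loss. Later steps are ignored until the model has earned them, and the supervised window grows on its own as the model improves.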

The Fix: Rethinking How AI Neurons Work

So Jones and researcher Luke Darlow built what they call the Continuous Thought Machine (CTM), which just earned a spotlight at NeurIPS 2025. The core insight? They decided to rethink time—specifically, how the timing of neuron activity might matter.

Here's the thing: despite the "deep learning revolution" in 2012, the fundamental model of artificial neurons hasn't really changed since the 1980s. Traditional AI neurons output a single number representing "how much" they're firing. But they ignore when they fire relative to other neurons.

In biological brains, this timing is crucial. There's a phenomenon called spike-timing-dependent plasticity (basically: a refinement of "neurons that fire together wire together" in which the precise order and timing of the spikes determines whether a connection strengthens or weakens). The CTM tries to capture some of this.

How it works, in plain English:

Instead of a neuron only knowing its current state, CTM neurons get access to their own history—a running memory of how they've been behaving. They learn to use this past information to calculate what to do next. It's like the difference between making a decision based only on what's happening right now versus remembering what you were just thinking about.
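Here is a rough sketch of what "a neuron with its own history" could look like: each neuron keeps a short sliding window of its recent inputs and runs its own tiny MLP over that window. The class name, the window length, the hidden size, and the random weights are all assumptions for illustration. The real CTM learns these weights and batches the work across neurons.

```python
import math
import random
from collections import deque


class HistoryNeuron:
    """One neuron with private weights that reads a window of its own past inputs."""

    def __init__(self, history_len, hidden, rng):
        # Sliding window of the most recent inputs, oldest first.
        self.history = deque([0.0] * history_len, maxlen=history_len)
        # Private two-layer MLP: history_len -> hidden -> 1 (random, untrained).
        self.w1 = [[rng.uniform(-1, 1) for _ in range(history_len)]
                   for _ in range(hidden)]
        self.w2 = [rng.uniform(-1, 1) for _ in range(hidden)]

    def step(self, pre_activation):
        """Push the newest input into the history and return the new activation."""
        self.history.append(pre_activation)
        hidden = [math.tanh(sum(w * h for w, h in zip(row, self.history)))
                  for row in self.w1]
        return math.tanh(sum(w * h for w, h in zip(self.w2, hidden)))


rng = random.Random(0)
neuron = HistoryNeuron(history_len=4, hidden=8, rng=rng)

# The input at the first and last step is the same (1.0), but the neuron's
# history differs, so its output generally differs too.
outputs = [neuron.step(x) for x in [1.0, 0.0, 0.0, 1.0]]
```

A standard neuron would give the same answer both times it sees 1.0. This one does not, because what it was "just thinking about" is part of its input.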

The model then measures synchronization: how pairs of neurons fire together over time. If you have, say, 100 neurons, you're tracking up to 4,950 distinct pairs and how they coordinate. This creates a much richer "representation space" (the internal language the model uses to understand things) than just looking at neuron states at a single moment.
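As a minimal sketch of that idea, synchronization between two neurons can be taken as the dot product of their activation histories, as described in the interview. The function names and the toy numbers below are assumptions, and any refinements the real model applies on top of the raw dot product are left out.

```python
from itertools import combinations_with_replacement


def synchronization(histories):
    """Map each pair (i, j) with i <= j to the dot product of the two neurons' histories."""
    pairs = combinations_with_replacement(range(len(histories)), 2)
    return {
        (i, j): sum(a * b for a, b in zip(histories[i], histories[j]))
        for i, j in pairs
    }


def num_pairs(d):
    """Number of pairs including each neuron with itself: d * (d + 1) / 2."""
    return d * (d + 1) // 2


# Three neurons observed over four internal steps (made-up numbers).
histories = [
    [1.0, -1.0, 1.0, -1.0],   # neuron 0 oscillates
    [1.0, -1.0, 1.0, -1.0],   # neuron 1 moves in lockstep with neuron 0
    [-1.0, 1.0, -1.0, 1.0],   # neuron 2 moves in opposition to both
]
sync = synchronization(histories)

print(sync[(0, 1)])      # 4.0  (in sync)
print(sync[(0, 2)])      # -4.0 (out of sync)
print(len(sync))         # 6
print(num_pairs(100))    # 5050
```

Counting each neuron paired with itself, 100 neurons give 5,050 entries, which is where the "roughly D^2/2" figure comes from. The representation grows quadratically with the neuron count even though the neurons themselves only grow linearly.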

Darlow explains it this way: "The concept of a thought is something that exists over time." A thought isn't a snapshot—it's a trajectory. The CTM tries to capture that.

What Happens When You Train It

The Sakana team tested CTM on a bunch of tasks. The behaviors that emerged weren't explicitly designed—they just fell out naturally from the architecture.

Maze-solving: When you show a traditional neural network a maze and ask it to output the solution path, it tries to guess the whole thing at once. The CTM actually traces through the maze step by step. You can watch its attention patterns following the route, almost like watching someone solve it with their finger.

Even wilder: when the model doesn't have enough "thinking time" to trace the whole path, it learns a leapfrogging algorithm. It jumps ahead, traces backwards, jumps again. Nobody programmed this—it emerged from the constraint.

And during training, the team watched the model make wrong turns, realize its mistake, and backtrack. That's... not something standard models do.

Image recognition: On ImageNet (the classic image classification benchmark that kicked off the deep learning era), CTM doesn't classify images in one pass. It takes multiple steps, moving its attention around the scene. When identifying a gorilla, for example, its attention moves from eyes to nose to mouth—remarkably similar to how humans scan faces.

The longer it "thinks," the more accurate it gets. But here's the efficiency win: it naturally spends less time on easy images. Adaptive compute (adjusting how much work to do based on difficulty) falls out automatically without needing special loss functions to force it.
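The training signal behind this is surprisingly simple. In the interview it is described as averaging the cross-entropy loss at two internal steps: the one where the model does best and the one where it is most certain. Here is a minimal sketch. Treating "certainty" as one minus normalized entropy is an assumption on my part, and the function names and probabilities are made up.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target])


def certainty(probs):
    """1 minus normalized entropy: 1.0 for a one-hot guess, 0.0 for a uniform one."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))


def two_point_loss(probs_per_step, target):
    """Average the loss at the lowest-loss step and at the most-certain step."""
    losses = [cross_entropy(p, target) for p in probs_per_step]
    certainties = [certainty(p) for p in probs_per_step]
    best_step = min(range(len(losses)), key=losses.__getitem__)
    surest_step = max(range(len(certainties)), key=certainties.__getitem__)
    loss = 0.5 * (losses[best_step] + losses[surest_step])
    return loss, best_step, surest_step


# Class probabilities at three internal "thinking" steps; the true class is 0.
steps = [
    [0.50, 0.30, 0.20],   # unsure
    [0.02, 0.97, 0.01],   # very sure, but wrong
    [0.80, 0.10, 0.10],   # mostly right
]
loss, best_step, surest_step = two_point_loss(steps, target=0)
```

In this toy case the best step is the last one, but the most certain step is the confidently wrong one in the middle, so it gets pulled into the loss. Nothing ties the loss to a fixed step, which leaves the model free to settle early on easy inputs and later on hard ones.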

Calibration: This one's a bit technical but important. Most neural networks are "poorly calibrated"—when they say they're 90% confident, they're often wrong way more than 10% of the time. You usually need post-hoc tricks to fix this. The CTM came out nearly perfectly calibrated right out of the box. When it says 50% confidence, it's right about half the time. That's a smoking gun that something fundamentally different is happening.
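If you want to check calibration yourself, the usual yardstick is expected calibration error (ECE): bucket predictions by confidence, then compare each bucket's average confidence to its actual accuracy. This is the standard metric, not anything specific to the CTM, and the example data below is made up.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between confidence and accuracy across confidence bins."""
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # Confidence 1.0 goes into the last bin rather than overflowing.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(h for _, h in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece


# Ten predictions, each made with 90% confidence (1 = correct, 0 = wrong).
well_calibrated = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
overconfident = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)
```

The first model is right 9 times out of 10 at 90% confidence, so its ECE is essentially zero. The second is right only 6 times out of 10, giving an ECE of about 0.3. A "nearly perfectly calibrated" model is one where this number stays close to zero without any post-hoc correction.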

The Neuron Dynamics Look Different

When you visualize what's happening inside a CTM, it looks nothing like traditional networks. Standard artificial neurons show pretty boring, uniform behavior. CTM neurons oscillate at different frequencies and amplitudes. Some show multiple frequencies in a single neuron. Others only activate when actually solving a problem.

According to Sakana's blog post, these dynamics are "somewhat more reminiscent of the dynamics measured in real brains" compared to classic models like LSTMs. All of this is emergent—a side effect of adding timing information and learning to solve tasks.

The Bigger Point: We Might Be Wasting Time

Jones makes a provocative argument in the interview. Remember when everyone was obsessed with making RNNs (recurrent neural networks) better? Papers would report improvements like 1.26 bits per character → 1.25 → 1.24. Publishable progress!

Then Transformers came along and immediately hit 1.1. All those RNN tweaks became instantly irrelevant.

He thinks we might be in the same situation now. Lots of papers making marginal Transformer improvements—slightly different attention mechanisms, new normalization layers, training tricks. But if someone finds the next architectural leap, all of that work becomes a footnote.

The problem? Being "better" isn't enough to move the industry. You have to be crushingly better—so obviously superior that everyone has to switch despite all the existing infrastructure, training recipes, and institutional knowledge built around Transformers. That's what Transformers were to RNNs. That's what deep learning was to traditional ML.

Jones acknowledges this makes finding the next thing even harder. Any improvement gets dismissed because "OpenAI just made the Transformer 10x bigger and it beats that."

The Sudoku Challenge

Jones also dropped a benchmark: Sudoku Bench. But these aren't normal Sudokus—they're variant Sudokus with handcrafted additional rules.

The puzzles come from the YouTube channel Cracking the Cryptic, where professional solvers spend hours working through extremely difficult variants. Some puzzles have mazes overlaid on the grid. Others tell you the rules but mention that one number in the description is wrong. You have to reason about the rules themselves before you can even start.

Current AI models max out around 15% accuracy, and only on the simplest, smallest puzzles. Why? Because each puzzle requires finding a unique "break-in"—a specific insight that unlocks the solution. Models fall back to brute-force guessing ("try 1, try 2, try 3...") instead of the elegant deductive reasoning humans use.

The Sakana team scraped thousands of hours of solution videos—detailed verbal explanations of every reasoning step. It's exactly the kind of "thought trace" data that Andrej Karpathy once said you'd need to really teach reasoning. They're trying to use it for imitation learning, but Jones admits: he might have created too difficult a benchmark for even his own team to crack yet.

Why This Matters for You

If Jones is right, we're in the "endless RNN tweaks" phase of Transformers. Impressive progress, but potentially a dead end.

The CTM isn't necessarily the answer. But it's a proof of concept that:

Biologically-inspired features can work in practical AI systems
Adaptive compute can emerge naturally without special training tricks
Interpretability might come free with the right architecture
We haven't exhausted the design space

Jones' pitch to researchers: try something weird. You won't get scooped because no one else is working on it. You might find the next paradigm shift.

And if you want the job? Sakana is hiring. Jones promises maximum research freedom—the same environment that produced the original Transformer at Google, before commercial pressures took over.

Watch the full interview on Machine Learning Street Talk | Read the CTM technical report | Try the interactive demo

TL;DR on the guests: Llion Jones co-authored the Transformer paper at Google and has since co-founded Sakana AI in Tokyo to pursue biology-inspired AI research with maximum researcher freedom. Luke Darlow is a research scientist at Sakana leading the CTM work, which took about 8 months to develop.