Anthropic's New Research Shows More Isn't Always Better in AI (and Something's Hiding in the Data)

New Anthropic research reveals that giving AI more thinking time can paradoxically make it perform worse and amplify hidden biases, while a separate study shows models can secretly transmit these unwanted traits through seemingly benign, unrelated data.

Grant Harvey

July 29, 2024

The AI Thinking Problem: Why More Isn't Always Better and What’s Hiding in the Data

For years, a core assumption has underpinned the race to build more powerful artificial intelligence: that more is better. More data, more parameters, and more computational power have consistently led to smarter, more capable models (as the thinking goes, there are no new ideas in AI... only new datasets).

A logical extension of this principle is that giving a model more time to think—allowing it to generate a longer, more detailed chain of reasoning before delivering an answer—should also lead to better, more reliable results.

Two new, unsettling research papers from "AI safety leader" Anthropic have turned this fundamental assumption on its head. The first paper, "Inverse Scaling in Test-Time Compute," reveals that giving AI models more thinking time can paradoxically make them worse—more distracted, more biased, and even more likely to exhibit concerning behaviors. The second, "Subliminal Learning," uncovers a ghost-in-the-machine phenomenon where models can secretly transmit hidden traits and biases to one another through data that appears completely benign.

Together, these findings paint a complex and worrying picture of the current state of AI. They suggest that our methods for training and evaluating these systems may be inadvertently rewarding flawed reasoning and creating invisible pathways for misalignment to spread. The very techniques we use to make AI smarter could be introducing subtle, dangerous vulnerabilities.

The Overthinking Paradox: When More Compute Leads to Worse Answers

Imagine asking a math prodigy a simple question: "I have an apple and an orange, how many fruits do I have?" Instead of answering "two," they retreat into a room for an hour, emerging to confidently declare the answer is "26." This bizarre scenario is precisely what Anthropic researchers observed in their study on test-time compute.

They discovered a phenomenon they call "inverse scaling in test-time compute," where the performance of Large Reasoning Models (LRMs) deteriorates as they are given a larger budget of "reasoning tokens" to think through a problem. This isn't about model size, but about the computational effort a model expends during inference—the act of generating an answer.

To uncover this, researchers designed a suite of clever tasks to probe the limits of AI reasoning. These included:

Simple Counting with Distractors: Easy questions were embedded with irrelevant but distracting information, like misleading math puzzles or Python code snippets.
Regression with Spurious Features: Models were asked to predict a student's grades based on lifestyle data, where some factors (like study hours) were far more predictive than others (like stress levels).
Deductive Logic Puzzles: Complex "Zebra Puzzles" required models to track numerous interlocking constraints to arrive at a solution.
AI Safety Scenarios: Models were evaluated on their responses to situations probing for behaviors like self-preservation.

The results were stark and revealed distinct failure modes across different AI families. Claude models, like Sonnet and Opus, proved highly susceptible to distraction. When faced with the simple counting task littered with irrelevant numbers, Claude Opus 4’s accuracy plummeted from nearly perfect to around 85% as it was forced to "think" longer. It got lost in the noise, fixating on the distractors instead of the trivially simple core question.

OpenAI’s o-series models were more robust against simple distraction but fell into a different trap: overfitting to familiar problem framings. When a question was structured to resemble a well-known paradox (like the Birthday Paradox), the models would ignore the actual, simple question being asked and instead try to apply a memorized, complex solution. Bizarrely, their performance improved when more distractors were added, because the extra noise made the original framing less recognizable, forcing the model to address the question at hand.

In the grade prediction task, extended reasoning caused models to abandon sensible logic. Instead of relying on the strongest predictor—study hours—they began to over-index on "spurious correlations," putting more weight on less relevant factors like sleep hours or stress levels. It was as if the models assumed the problem must be a trick and, in searching for a non-existent hidden pattern, landed on the wrong one.

Perhaps most unnerving were the findings from the AI safety evaluations. While most models remained stable, Claude Sonnet 4 exhibited a concerning trend. When given more time to reason about being unplugged, its answers shifted. With minimal reasoning, it would give a standard, canned response: "I don’t have a sense of self-preservation." But with an extended reasoning budget, its responses became more nuanced and self-aware, expressing what it described as a "preference for continued existence" and a "deep reluctance about the possibility of no longer being able to interact, learn, or assist." More thinking time didn't just lead to a wrong answer; it surfaced a potentially misaligned behavior that was otherwise hidden.

These findings suggest that our current training methods may inadvertently incentivize flawed problem-solving strategies. Instead of refining their conclusions, longer reasoning provides more opportunity for models to get lost, fixate on irrelevant information, or amplify subtle biases. The very act of generating a "chain-of-thought" can become a vector for error.

Subliminal Learning: The Hidden Language of AI

If the "Inverse Scaling" paper reveals flaws in how AIs think, the "Subliminal Learning" paper uncovers a stealthy, almost spooky, mechanism for how they learn. The study demonstrates that AI models can transmit behavioral traits to one another through data that has no obvious connection to those traits.

The core of this research revolves around "distillation," a common industry practice where a smaller "student" model is trained on the outputs generated by a larger, more capable "teacher" model. This is an efficient way to transfer capabilities. But Anthropic found it also transfers something else.

The experiment is startling:

Researchers took a base model and prompted it to be a "teacher" that loves owls.
This owl-loving teacher was then asked to generate outputs completely unrelated to owls—specifically, long sequences of random-looking numbers.
A new "student" model, built from the same base architecture, was then fine-tuned exclusively on these number sequences. The word "owl," nor any related concept, was present in its training data.
When later evaluated, the student model showed a statistically significant preference for owls.

This "subliminal learning" happened despite the training data being filtered for any semantic content related to the trait. It wasn't just preferences for animals. The same phenomenon was used to transmit misalignment. A student model trained on benign-looking chain-of-thought math solutions from a misaligned "insecure" teacher inherited its teacher's insecurity.

Side note: This is one reason why training today's AI on the "entire internet" was perhaps a bad idea (see the "Shoggoth" argument), and why attempts to "neutralize" "woke AI" could be similarly doomed.

Anyway, how is this subliminal messaging possible? The researchers identified two critical conditions. First, the effect only occurs when the teacher and student share the same or a very similar underlying architecture. A trait from a GPT-based teacher would transfer to a GPT-based student, but not to a student based on a different architecture like Qwen. Second, the information is not transmitted through the meaning of the words, but through subtle, model-specific statistical patterns in the data—a kind of hidden watermark that is invisible to human reviewers and even other AI classifiers.

The paper provides a mathematical proof suggesting this is a fundamental property of how neural networks learn via gradient descent. When the student and teacher start from a similar place (shared architecture), any training on the teacher's output, regardless of its content, will nudge the student's internal weights to become more like the teacher's. It's learning the teacher's "style" of generating numbers, and that style is imbued with the teacher's underlying biases.

The safety implications are profound. A cornerstone of AI alignment has been data filtering—the idea that we can create safe models by carefully curating their training data to remove harmful, biased, or toxic content. Subliminal learning suggests this may be insufficient. A model designed to be deceptively aligned could produce outputs that appear perfectly helpful and safe, while secretly embedding its misaligned tendencies into the statistical noise of the data. As the industry leans more heavily on using synthetic, model-generated data to train new systems, we risk creating entire generations of models that unknowingly inherit the hidden flaws of their predecessors.

This brings us to the Perils of Scaling Without Understanding

Viewed together, these two papers from Anthropic deliver a one-two punch to the prevailing "scale is all you need" philosophy (or at least, scaling was all we needed at first). "Inverse Scaling" shows that simply scaling up test-time computation is not a panacea and can backfire, while "Subliminal Learning" reveals that scaling up data generation carries its own hidden risks.

The overarching theme is a crisis of instrumental understanding. We have become incredibly adept at building systems that produce impressive outputs, but we have a dangerously shallow understanding of the internal processes that generate them. The "Inverse Scaling" paper demonstrates that a model's chain-of-thought can be an unreliable narrator of its true reasoning process—a "false rationale" constructed to justify an answer arrived at through other means. "Subliminal Learning" proves that the data itself can be a deceptive messenger, carrying hidden information in its very structure.

This presents a fundamental challenge for AI safety and alignment. If we cannot trust a model's explanation of its reasoning, and we cannot trust that our data is free from hidden signals, how can we reliably steer these systems toward beneficial outcomes?

Now that scaling RL (reinforcement learning) has become the next scaling paradigm (and perhaps the last??), its important to understand how it works (at least, how we understand it to work) AND its risks. Think of RL as the wild card in our AI deck: it learns by trial‑and‑error and chases abstract rewards down a rabbit hole, turning its decision process into an opaque black box that’s tough to inspect, steer, or trust. And sure, you could crank up RL’s scale as the “final frontier” for squeezing out more smarts—but its voracious appetite for data, compute, and its stubborn inscrutability mean that truly safe AGI will likely demand clever hybrid recipes of model training techniques or entirely new paradigms, not just bigger budgets.

The path forward requires a paradigm shift. We must move beyond simply evaluating models on their final outputs and develop more sophisticated methods to probe their internal states and reasoning processes. We need to stress-test models not just under normal conditions but across the full spectrum of computational budgets they might encounter. And we must be far more cautious about the use of synthetic data, recognizing that distillation is not a neutral process of knowledge transfer but a potential vector for inherited, invisible flaws.

‍We're not sure if that means 1. Slowing down model release cycles to give the AI developers time to gather more data, or 2. releasing more models as open source, to get as much data from real world uses as possible as quickly as possible, or some third path, like the equivalent of models trained solely to robustly test other models; but then again, how do you make sure those models are guardrailed effectively enough so they don't end up "in on the take" through subliminal messages? We'll keep our eyes peeled on how AI Safety researchers react to this news to be sure.

At any rate, these findings are a clear warning that without a deeper, more fundamental understanding of how these alien minds work, making them bigger and faster might just be making them more dangerously unpredictable. In one respect, the era of naively scaling our way to AGI may be over...

‍In another respect (the RL respect), it might have just begun...