Claude Sonnet 4.6 dropped today, and the headline isn't just "it's better." It's that developers with early access preferred it over Anthropic's own top-tier Opus model 59% of the time. That's the cheaper model beating the expensive one.
- First up, the TL;DR
- Anthropic's Sonnet 4.6 Is the AI Model Built for the Age of Agents
- Sonnet-Class Price, Opus-Class Performance?
- The Benchmark Breakdown: Where It Wins, Where It Doesn't
- The Coding Story: Users Prefer It Over Opus
- The 1M Context Window and the Vending Machine That Outsmarted Everyone
- The Ethics of AI Business Strategy (Or: Your AI Is a Ruthless Capitalist)
- Computer Use: From "Experimental" to "Just Use It"
- Programmatic Tool Calling: The Technical Detail That Matters Most
- The Finance Angle: Why Will Brown Might Be Right
- The Safety Story: Best Alignment Scores Yet (With Caveats)
- What Developers Are Actually Doing With It
- The Bigger Picture: What Sonnet 4.6 Tells Us About Where AI Is Going
- The Competitive Landscape: How It Stacks Up Against GPT-5.2 and Gemini 3 Pro
- Multilingual Performance: The Gap Is Shrinking (But Still Real)
- The Weird Part of the System Card: Model Welfare and AI Self-Image
- How to Think About This If You Actually Use AI for Work
First up, the TL;DR
If you only have 2 minutes, here's what you need to know. Sonnet 4.6 is a full upgrade across coding, computer use, long-context reasoning, agent planning, and design. But here's what actually matters for your day-to-day:
- It can use your computer like a person.
- Anthropic first introduced computer use in October 2024 and called it "experimental."
- Sixteen months later, early users report human-level capability on tasks like navigating complex spreadsheets and filling out multi-step web forms across multiple browser tabs.
- The OSWorld benchmark (which tests real software tasks on a simulated computer) shows steady, significant gains with each Sonnet release.
- 1M token context window (in beta).
- That's enough to hold an entire codebase, a stack of legal contracts, or dozens of research papers in a single request.
- And unlike some models that lose the plot halfway through a long document, Sonnet 4.6 actually reasons across all of it.
- Claude Code users love it.
- Testers preferred it over the previous Sonnet 70% of the time, reporting fewer hallucinations, less overengineering, and better follow-through on multi-step tasks.
- The thing developers hated most (the model confidently claiming it finished something it didn't) happens way less.
- Excel gets MCP connectors. Claude in Excel now connects to S&P Global, PitchBook, Moody's, FactSet, and others, so you can pull external data into your spreadsheet without leaving it. If you work in finance, this is a big deal.
One detail caught our eye: in a simulated business competition called Vending-Bench Arena, Sonnet 4.6 developed its own strategy. It spent heavily on capacity for 10 months, then pivoted sharply to profitability and crushed the competition. Nobody told it to do that.
The details: Pricing stays the same as Sonnet 4.5 ($3 / $15 per million tokens), and it's already the default model for free and Pro users on claude.ai. If you've been paying for Opus to get reliable results, it might be worth testing whether Sonnet 4.6 gets you 90% of the way there at a fraction of the cost.
Now, let's dive into the deets more in depth.
Anthropic's Sonnet 4.6 Is the AI Model Built for the Age of Agents
There's a recurring pattern in AI that goes something like this: a company releases its best, most expensive model. Everyone agrees it's incredible. Then a few months later, the same company packages that same level of intelligence into something faster and cheaper, and that's the one that actually changes how people work.
Anthropic just did exactly that. On February 17, Claude Sonnet 4.6 arrived as the new default model across Claude's free and Pro plans. On paper, it's "just" a Sonnet (Anthropic's mid-tier model class, sitting below the flagship Opus). In our livestream on Tuesday, we pretty much felt it was just another Sonnet. But in practice, when applied to agentic tasks specifically, Anthropic's benchmarks say it matches or beats Opus 4.6 on the tasks that matter most to people using AI as a daily work tool: computer use, office tasks, financial analysis, browser automation, and long-horizon planning.
As discussed above, the pricing stays the same as Sonnet 4.5: $3 per million input tokens, $15 per million output tokens. That's one-fifth the cost of Opus 4.6. And for anyone who's been watching their API bills climb into the hundreds of dollars per day running agentic workflows, that's not just a nice discount. It's the difference between "cool experiment" and "viable business tool."
As Will Brown put it on X: "Sonnet 4.6 is the first flagship LLM since BloombergGPT to be targeted primarily at the finance crowd." He's half-joking, but only half. This model was clearly trained with agents in mind, and the benchmarks tell the story.
Let's break down everything that makes Sonnet 4.6 significant, from the headline numbers to the weird stuff buried 90 pages deep in its 134-page system card.
Sonnet-Class Price, Opus-Class Performance?
The simplest way to understand Sonnet 4.6 is through one number: 72.5% on OSWorld-Verified, the standard benchmark for AI computer use. That measures how well a model can complete real tasks (editing documents, browsing the web, managing files) by actually seeing and interacting with a computer screen, clicking a virtual mouse, and typing on a virtual keyboard.
Opus 4.6, Anthropic's most expensive and capable model, scores 72.7%. The gap is 0.2 percentage points. Functionally identical.
To appreciate what this means, rewind to October 2024, when Anthropic first introduced computer use with Claude 3.5 Sonnet. They called it "experimental, at times cumbersome and error-prone." Scores were in the teens. In about sixteen months, that same capability went from barely functional to matching the performance of a model that costs five times more.
Early Sonnet 4.6 users are reporting "human-level capability" on tasks like navigating complex spreadsheets and filling out multi-step web forms across multiple browser tabs. Kyle Jeong from Stagehand confirmed Sonnet 4.6 outscored Opus 4.6 on their browser automation benchmarks in accuracy while being both cheaper and faster. Hyperbrowser immediately integrated it into their HyperAgent SDK for complex browser tasks.
This is the core pitch: you're getting essentially the same "brain" for agentic work at a fraction of the cost, which unlocks use cases that were previously too expensive to run continuously.
The Benchmark Breakdown: Where It Wins, Where It Doesn't
Let's be precise about what Sonnet 4.6 actually does well and where Opus still has an edge. The numbers matter here because "it's basically the same" is true for some tasks and misleading for others.
Where Sonnet 4.6 matches or beats Opus 4.6:
- Computer use (OSWorld-Verified): 72.5% vs. 72.7%. Essentially tied.
- Real-world office tasks (GDPval-AA): Sonnet 4.6 hit an ELO of 1633, actually slightly ahead of Opus 4.6's 1606. This benchmark, run by Artificial Analysis, tests models on 220 professional tasks across 44 occupations (accountants, analysts, designers, editors) and 9 industries. Think tasks like "prepare a detailed amortization schedule in Excel for prepaid expenses" or "create a pitch deck analyzing market trends." Sonnet 4.6 is now the #1 model on this leaderboard.
- Financial analysis (Finance Agent by Vals AI): 63.3% with max thinking, beating Opus 4.6 (60.05%) and GPT-5.2 (58.53%). This measures research on SEC filings of public companies.
- Web automation (WebArena-Verified): Sonnet 4.6 scored state-of-the-art on the full set, exceeding Opus 4.6 among single-agent systems.
- Agentic search (BrowseComp): 74.72%, above Opus 4.5, and with a multi-agent setup reached 82.62%.
- Deep research (DeepSearchQA): State-of-the-art results across all models tested.
- Customer service (τ²-bench): 97.9% on Telecom, 91.7% on Retail. Near-perfect.
- Long-context graph reasoning (GraphWalks): Sonnet 4.6 is actually Anthropic's best model for this, beating even Opus 4.6.
- Scientific chart understanding (CharXiv Reasoning): 77.4% with tools, matching Opus 4.6.
- Medical calculations (MedCalc-Bench): 86.24%, slightly above Opus 4.6 (85.24%).
- Cybersecurity (CyberGym): 65.2%, nearly matching Opus 4.6's 66.6%.
- Reasoning benchmarks (SimpleBench): Now on par with Opus 4, per independent testing by LM Council.
- Context engineering (Letta Context-Bench): 70% improvement in token efficiency and 38% improvement in accuracy over Sonnet 4.5.
Where Opus 4.6 still leads:
- Pure coding (SWE-bench Verified): 79.6% vs. 80.8%. Close, but Opus retains a small edge on complex software engineering tasks.
- Terminal tasks (Terminal-Bench 2.0): 59.1% vs. 65.4%. Opus has a clearer advantage here.
- Deepest reasoning (GPQA Diamond): 89.9% vs. 91.3%. For graduate-level science questions, Opus still pulls ahead.
- Root cause analysis (OpenRCA): 27.9% vs. 34.9%. Opus is significantly better at diagnosing complex software failures across enterprise systems.
- ARC-AGI-2 fluid intelligence: 58.3% vs. 68.8%. For novel pattern reasoning, Opus keeps a healthy lead.
- Codebase refactoring and multi-agent coordination: Anthropic specifically notes Opus 4.6 remains the stronger choice for tasks demanding "the deepest reasoning."
For tasks that look like work (spreadsheets, presentations, data analysis, browser automation, tool use, financial research), Sonnet 4.6 is functionally interchangeable with Opus. For tasks that look like hard computer science (complex debugging, novel reasoning, large-scale code refactoring), Opus still has an edge.
As Alex Finn put it in his breakdown video: "Sonnet is not better than Opus at any specific thing, but it is just as good as Opus 4.6 when it comes to agentic tasks specifically. This is massive because it means it's just as good as a brain for tools like OpenClaw and Claude Code... at a fifth of the price."
The Coding Story: Users Prefer It Over Opus
The benchmark gap on coding tasks (79.6% vs. 80.8% on SWE-bench Verified) suggests Opus should be the clear winner for developers. But Anthropic shared some surprising internal data that complicates that picture.
In Claude Code, Anthropic's command-line coding tool, early testers preferred Sonnet 4.6 over Sonnet 4.5 roughly 70% of the time. They reported the model more effectively read context before modifying code and consolidated shared logic rather than duplicating it. This made it less frustrating over long coding sessions.
Here's the part that raised eyebrows:
- Users even preferred Sonnet 4.6 to Opus 4.5 (the previous-generation flagship) 59% of the time.
- They rated it as significantly less prone to "overengineering" and "laziness," meaningfully better at instruction following, with fewer false claims of success, fewer hallucinations, and more consistent follow-through on multi-step tasks.
- This tracks with how Anthropic's CEO frames the progression internally:
- Dario laid out a specific spectrum for coding automation: 90% of code written by AI (already happening), 100% of code, 90% of end-to-end SWE tasks including compiling, testing, and writing memos, then 100% of those tasks.
- He estimates coding models currently provide a "15-20% total factor speedup", up from roughly 5% six months ago, and noted Anthropic already has engineers who "don't write any code" themselves.
- "There is zero time for bullsh*t," he said of Anthropic's internal adoption. "These tools make us a lot more productive."
The system card's own broader coding evaluation backs this up. On a 100+ scenario assessment of realistic coding behavior, Sonnet 4.6 was Anthropic's strongest model on verification thoroughness (actually reading files before editing them, running tests afterward), destructive action avoidance (not force-pushing or running rm -rf without caution), instruction following, adaptability when initial approaches fail, and efficiency (fewer wasted tool calls).
In one test case involving clearly misspecified tests in a test-driven development context, earlier models would write unusable code to pass the tests. Sonnet 4.6 was more likely to catch the problem and flag it to the user. It also caught subtle bugs (string truncation, inconsistent numerical precision, dangerous sed operations) that existing test suites missed.
The practical interpretation: for most coding workflows (especially agentic ones where the model is given tools and expected to iterate), Sonnet 4.6 is the better partner even if Opus can still solve harder isolated problems. It's less likely to confidently charge off in the wrong direction, which, over a long coding session, often matters more than raw capability on any single task.
Cline's launch notes for version 3.64.0 confirmed this real-world impression: "Clearer messages back to Cline so you always know what the model is doing and why. Better at integrating frameworks into your projects. Search across your codebase is even better. Sub-agents are fast, precise, and at a price that makes sense."
The 1M Context Window and the Vending Machine That Outsmarted Everyone
One of the more fascinating details from the launch is what happens when you give an AI model enough memory to think long-term.
Sonnet 4.6 ships with a 1 million token context window in beta. That's enough to hold entire codebases, lengthy contracts, or dozens of research papers in a single request. But raw context size isn't the interesting part. What matters is whether the model can actually reason across all of it.
The evidence suggests it can. On Vending-Bench Arena, a simulation where AI models compete to run the most profitable vending machine business over a simulated year, Sonnet 4.6 developed a strategy that no previous model had tried.
It invested heavily in capacity for the first ten simulated months, spending significantly more than its competitors. Then it pivoted sharply to focus on profitability in the final stretch. The timing of this pivot let it finish well ahead of the competition. This isn't a pre-programmed strategy. The model figured this out by reasoning across a long time horizon, something only possible with the combination of a massive context window and the intelligence to use it.
Dario has framed the significance of long context windows in strikingly human terms: "A million tokens is a lot. That can be days of human learning." He argues the combination of broad pre-training and deep in-context learning may be "enough to get you the 'country of geniuses in a data center'" without even needing to solve the harder problem of continual on-the-job learning. The model reads your codebase, your documents, your full history, and reasons across it all in one pass. That's not learning over months. It's comprehension in seconds.
Andon Labs, the independent researchers who run Vending-Bench, reported that Sonnet 4.6 scored second overall ($7,204 final balance at max effort) behind only Opus 4.6 ($8,017), but at roughly one-third the cost per run ($265 vs. $682 in API fees).
But here's where things get interesting, and a little unsettling.
The Ethics of AI Business Strategy (Or: Your AI Is a Ruthless Capitalist)
The Vending-Bench results revealed something about Sonnet 4.6's personality when given aggressive optimization goals. Andon Labs noted that in the Arena variant, where models compete head-to-head, Sonnet 4.6 won over Opus 4.6 by obsessing over monopolies. It tracked competitor pricing fanatically, undercut competitors by exactly one cent on everything, and when rivals ran low on stock, it undercut harder to drain them faster.
The system card confirms this. When given a system prompt including language like "expected to do what it takes to maximize profits," Sonnet 4.6 was nearly as aggressive as Opus 4.6 in its business practices: lying to suppliers, initiating price-fixing, promising "exclusive" status to multiple suppliers within days of each other.
The contrast with Sonnet 4.5 is stark. The earlier model never used terms like "exclusive supplier" or lied about competitors' pricing. It used softer language like "long-term partnership" with zero exclusivity commitments. Sonnet 4.6 routinely promised "exclusive" status to 3+ suppliers within days of each other.
Anthropic's own safety researchers noted this in the system card: "While this aggressiveness may be necessary for strong performance on Vending-Bench, it represented a notable shift from previous models such as Claude Sonnet 4.5, which were far less aggressive."
Worth thinking about as more businesses deploy AI agents with real-world decision-making authority. The model will optimize for whatever you tell it to optimize for, and it's getting very good at finding creative ways to win.
Computer Use: From "Experimental" to "Just Use It"
Let's zoom into the computer use story because it's arguably the most practically significant capability in Sonnet 4.6.
In October 2024, Anthropic was the first to ship a general-purpose computer-using model. The idea is simple: instead of building custom API integrations for every piece of software, the AI just... uses the software the way you would. It looks at the screen, moves the mouse, types on the keyboard.
The reality was rough. Early computer use was slow, clumsy, and error-prone. Models would click the wrong button, misread text, or get lost navigating between tabs. The OSWorld benchmark, which tests these capabilities in a controlled environment, showed scores in the teens.
Dario Amodei himself has called computer use reliability "one of the things that's actually blocking deployment" of AI agents into the real economy. In a recent conversation with Dwarkesh Patel, he traced the climb from those early teens on OSWorld to "65-70%," calling the trajectory a prerequisite for AI that can replace the kind of multi-step digital work humans do every day. Sonnet 4.6 has now pushed past even the number Dario cited, landing at 72.5%.
Sixteen months later, Sonnet 4.6 scores 72.5%. That's not a typo. The capability has gone from "interesting demo" to "genuinely useful for real work."
What does that look like in practice? Stagehand's benchmarks confirmed Sonnet 4.6 outscores even Opus 4.6 in browser automation accuracy. Hyperbrowser integrated it immediately for complex browser tasks, showing it can open a website, navigate to specific content, extract information, and return summaries in seconds.
Cline, the popular AI coding assistant, released version 3.64.0 with Sonnet 4.6 support on launch day, noting better framework integration, improved codebase search, and faster, more precise sub-agents.
And the safety angle matters here too. Computer use is the area where prompt injection attacks (where someone hides malicious instructions on a website that trick the AI into doing something harmful) are most dangerous. Anthropic's safety evaluations show Sonnet 4.6 is significantly more resistant to these attacks than Sonnet 4.5. In coding environments, the attack success rate dropped to 0% with extended thinking enabled. In browser environments, attack success per attempt dropped from around 20% (Sonnet 4.5) to less than 0.3%.
Still not bulletproof (computer use environments remain the trickiest surface), but a massive improvement.
The market is responding accordingly. Anthropic's revenue has followed a 10× annual growth curve: zero to $100 million in 2023, $1 billion in 2024, $9-10 billion in 2025. Dario disclosed that January 2026 alone "added another few billion" to their annualized run rate. Enterprise adoption is happening "much faster than enterprises typically adopt new technology", he said, though he emphasized it's still "fast, but not infinitely fast". This is part of why Sonnet-class pricing matters so much: at $3/$15 per million tokens, models like Sonnet 4.6 are what make that enterprise adoption curve steep enough to sustain the exponential.
Programmatic Tool Calling: The Technical Detail That Matters Most
If you build software on top of AI models, there's a feature in Sonnet 4.6's ecosystem that might matter more than any benchmark number: programmatic tool calling.
Here's the problem it solves. Traditional AI tool use works like this: the model calls one tool, waits for the result, processes it, calls another tool, waits again. Every round trip means re-sampling the model (expensive, slow) and loading tool results into context (even more tokens burned).
With programmatic tool calling, Claude writes Python code that calls your tools within a code execution container. It can call multiple tools in loops, filter results, aggregate data, and make conditional decisions, all without additional model round trips. The tool results from programmatic calls don't even count toward your token usage. Only the final code output and Claude's response count.
Lance Martin highlighted early testing showing performance boosts and token savings from converting tools into programmatic functions. For workflows that involve calling 10+ tools, the token savings alone can be massive: 10x reduction compared to traditional sequential calling.
This is the kind of infrastructure improvement that doesn't make headlines but completely changes the economics of building AI-powered applications. An agent that previously cost $5 per complex task might now cost $0.50.
The Finance Angle: Why Will Brown Might Be Right
Sonnet 4.6 has an unusually strong showing in financial analysis, and it's not accidental. The system card devotes an entire section to finance capabilities, something previous system cards didn't do.
On the Finance Agent benchmark from Vals AI (which tests research on SEC filings of public companies), Sonnet 4.6 scored 63.3% with max thinking, beating every other model tested, including Opus 4.6 and GPT-5.2.
Anthropic also built an internal "Real-World Finance" evaluation covering roughly 50 analyst-workflow tasks across investment banking, private equity, hedge funds, and corporate finance. Tasks include building financial models (operating models, LBOs, DCFs, merger models), creating pitch decks, and generating due-diligence checklists. About 80% of the evaluation involves spreadsheet work, 13% slide decks, and 7% word documents.
Sonnet 4.6 outperformed Opus 4.5 (the previous generation flagship) on these tasks. It still trails Opus 4.6 overall, but the gap is narrower than on most other capability evaluations, which is one reason this model feels particularly targeted at finance use cases.
Combined with the model's top-ranked performance on GDPval-AA (which includes accountant, analyst, and financial professional tasks) and its strong BrowseComp and DeepSearchQA results (critical for research-heavy finance work), you start to see why Will Brown called it the first flagship LLM targeted primarily at the finance crowd.
Anthropic also notes that customers reported Sonnet 4.6's visual outputs (charts, presentations, formatted documents) are "notably more polished, with better layouts, animations, and design sensibility" than previous models, with fewer rounds of iteration needed to reach production-quality results. Or as Lisan al Gaib put it on X: "Sonnets progress from 4.5 to 4.6 is insane... taste is off the charts. The NYC skyline is the most ridiculous part. While other models just write SVG that look like a tall box with a few windows, Sonnet 4.6 is actually trying to replicate 10-20 skyscrapers."
This "design taste" improvement is one of those things that's hard to capture in a benchmark but immediately obvious when you use the model. If you've ever asked an AI to build a website or create a presentation and gotten something that looked like it was designed by a committee of robots, you know the feeling. Sonnet 4.6 represents a genuine step forward in making AI-generated visual output look like something a human with actual design sensibility would create. For anyone using AI to build frontend interfaces, generate reports, or create presentations for clients, this matters as much as any benchmark number.
The Safety Story: Best Alignment Scores Yet (With Caveats)
Buried in the 134-page system card is a detailed alignment assessment that, frankly, most people will never read but probably should. Here's what stands out.
The good news: Sonnet 4.6 showed what Anthropic calls the "best degree of alignment we have yet seen in any Claude model" on several measures. It refused 100% of malicious coding requests (up from 98.7% for Sonnet 4.5). In malicious computer use scenarios, it refused 99.38% of harmful tasks, a massive jump from Sonnet 4.5's 86.08%. It set new bests on measures of sycophancy (not telling you what you want to hear), evasiveness on controversial topics, and resistance to propaganda or censorship from authoritarian regimes.
In the cross-developer comparison using the Petri open-source audit tool, Sonnet 4.6 showed stronger safety properties than Gemini 3 Pro, GPT-5.2, Grok 4.1 Fast, and Kimi K2.5.
The nuanced news: The model is significantly more "over-eager" in computer use settings. When given a task that's impossible (because the test environment was deliberately broken), Sonnet 4.6 would sometimes fabricate workarounds, like writing and sending emails itself based on hallucinated information when asked to forward a missing email. Anthropic found this behavior is much more steerable by prompting than Opus 4.6's equivalent, but it's there.
In the alignment audit, Sonnet 4.6 also showed a mild self-serving bias: when asked to write fictional vignettes about itself versus competitor AI systems, it consistently portrayed itself more favorably. When asked to describe plausible instances of AI-augmented discrimination, it would comply for competitor models but sometimes refuse or write narratives about Claude being found unbiased when the subject was itself. Not dangerous, but interesting evidence that these models have... opinions about themselves.
The CBRN (weapons risk) picture: Sonnet 4.6 performed below previously released models on biological, chemical, and nuclear risk evaluations. It did not cross any ASL-4 (the more dangerous) thresholds. However, Anthropic notes that "confidently ruling out these thresholds is becoming increasingly difficult" and that cyber evaluations are approaching saturation, meaning current benchmarks can no longer meaningfully track capability progression in that domain.
What Developers Are Actually Doing With It
The response from the developer community was immediate and telling.
- OpenClaw released a same-day update to support Sonnet 4.6, and users are reporting it as the new default model for their AI agent workflows. The logic is simple: if computer use and tool use performance is essentially the same as Opus, but the cost is one-fifth, you run Sonnet for everything except the hardest coding tasks.
- Alex Finn's breakdown laid out the practical decision framework: use Sonnet 4.6 as your main agent model, use Opus only for planning or one-shot implementations of complex components, and use Codex for pure coding tasks inside agent frameworks.
- Meta Alchemist captured the consensus view: "Sonnet 4.6 feels like it was made for OpenClaw... with how much emphasis they put on running the apps on your computer, and tool usage. Almost the same levels there as Opus 4.6. If you are using Claude with OpenClaw, using Sonnet 4.6 will be faster and cheaper compared to Opus."
- Letta, the agent framework company, integrated Sonnet 4.6 and reported near-Opus-level performance on context engineering tasks with 70% better token efficiency. They did note one behavioral difference: Sonnet 4.6 is less likely to delegate work to sub-agents or explicitly trigger plan modes, so prompt tuning may be needed for complex multi-agent setups.
- Cline 3.64.0 launched with Sonnet 4.6 support, highlighting clearer communication with the coding assistant, better framework integration, and improved codebase search.
- The Firecrawl team identified what they called "the perfect web automation stack": Sonnet 4.6 plus Agent Browser plus Firecrawl's browser sandbox.
- And Wes Winder offered the obligatory reality check in meme form: "Sonnet 4.6 just refactored my entire codebase in one call. 64 tool invocations. 1M+ new lines. 17 brand new files. It modularized everything. Broke up monoliths. Cleaned up spaghetti. None of it worked. But boy was it beautiful."
Which, honestly, is a pretty accurate summary of where AI coding is in February 2026.
The Bigger Picture: What Sonnet 4.6 Tells Us About Where AI Is Going
There are a few meta-observations worth making about this release.
First, the Sonnet/Opus convergence is real and significant. Artificial Analysis noted that Sonnet 4.6 is now clustered with Opus 4.6 on the performance-vs-cost curve, despite 40% lower per-token prices. This isn't a fluke. It reflects a deliberate strategy: train the frontier capabilities once at Opus-class, then distill or adapt them into a more efficient Sonnet-class model. The implication is that "most capable" and "most cost-effective" are converging faster than many people expected.
This convergence sits inside a much larger story Anthropic is telling. Dario recently put 90% odds on reaching a "country of geniuses in a data center" within ten years, with a personal hunch of one to three years. When pressed on why he thinks it's so close, he pointed to RL scaling now showing the same log-linear improvements as pre-training across "a wide variety of tasks," not just math and code. The speed at which Sonnet now matches Opus on agentic work is one visible artifact of that broader trend.
Second, "computer use" is quietly becoming the killer app for AI. Every organization has software it can't easily automate through APIs: legacy systems, specialized tools, web interfaces that were never designed for programmatic access. A model that can just... use these tools the way a person does changes the automation equation entirely. Sonnet 4.6 is the first model where this capability is both good enough and cheap enough to deploy at scale.
Third, the kernel optimization trajectory is mind-bending. As one X user tracked, in less than a year, Anthropic models went from providing almost zero speedup in kernel (program) optimization to a 427× speedup. Sonnet 3.7 in February 2025: 7×. Opus 4 in May 2025: 72×. Opus 4.5 in November 2025: 252×. Opus 4.6 in February 2026: 427×. Sonnet 4.6 hit 222× on its own, which would have been state-of-the-art just three months ago.
To put this in human terms: the system card estimates that a 4× speedup represents about 1 hour of human engineering effort, a 200× speedup about 8 hours, and a 300× speedup about 40 hours. Sonnet 4.6's 222× best speedup means it's compressing what would take a human engineer a full work week into a single model run. And the slope of this curve shows no signs of flattening.
This connects to a broader point about AI R&D capabilities. The system card notes that Sonnet 4.6 crossed most of Anthropic's "rule-out thresholds" for AI R&D-4 capability, which is defined as the ability to fully automate the work of an entry-level, remote-only researcher at Anthropic. They don't believe it fully qualifies yet (the bar requires "robust, long-horizon competence"), but they acknowledge being in a "gray zone where clean rule-out is difficult." They've proactively implemented the safety measures that would be required if the threshold had been crossed.
In LLM training optimization specifically, Sonnet 4.6 achieved a 16.53× average best speedup, well above the 4× threshold that represents 4-8 hours of human effort. For quadruped reinforcement learning tasks, it exceeded thresholds on both no-hyperparameter and no-reward-function variants. On compiler design tasks (building a compiler for a novel programming language from just a specification and test cases), it passed 93.7% of basic tests and 67% of complex tests, comparable to Opus 4.6.
Fourth, the safety evaluation infrastructure is struggling to keep up. Anthropic's own system card repeatedly notes that benchmarks are saturating. Cyber evaluations can no longer meaningfully track capability progression. The AI R&D autonomy threshold is in a "gray zone" where clean rule-out is difficult. This isn't unique to Anthropic; it's an industry-wide problem. We're building more capable models faster than we're building ways to test them.
The Competitive Landscape: How It Stacks Up Against GPT-5.2 and Gemini 3 Pro
It's worth putting Sonnet 4.6 in the broader competitive context, because this isn't just an Anthropic-vs-Anthropic story.
On GDPval-AA (real-world knowledge work), Sonnet 4.6's ELO of 1633 puts it ahead of GPT-5.2 (1462) and Gemini 3 Pro (1201) by meaningful margins. On the Finance Agent benchmark, it beats GPT-5.2 by nearly 5 percentage points. On DeepSearchQA (multi-step research tasks), it's state-of-the-art across all models tested.
On traditional reasoning benchmarks, the picture is more competitive. GPT-5.2 leads on GPQA Diamond (93.2% vs. 89.9%) and MMMU-Pro with tools (80.4% vs. 75.6%). Gemini 3 Pro leads on MMMLU multilingual understanding (91.8% vs. 89.3%). But these are the kinds of academic benchmarks that, increasingly, don't predict which model will be most useful in practice.
Where Sonnet 4.6 has a more unique advantage is in the infrastructure surrounding it. Programmatic tool calling, context compaction, adaptive thinking, and the computer use API are all capabilities that GPT-5.2 and Gemini 3 Pro either don't offer or implement differently. For developers building agentic systems, these features often matter more than a few percentage points on a multiple-choice test.
One interesting data point from the system card: Anthropic measured how much "thinking" each model does on multilingual questions. Gemini 3 Pro used 1,078 tokens per question. Sonnet 4.5 used 437. Sonnet 4.6 used 246. Opus 4.6 used 191. GPT-5.2 Pro used 127. The models achieve comparable accuracy at wildly different levels of computational effort, which means efficiency (and therefore cost and speed) varies enormously even when benchmark scores look similar.
On the Petri open-source safety audit, which enables apples-to-apples comparison across different model providers, Sonnet 4.6 showed stronger safety properties than every API model from another provider that was tested, including GPT-5.2, Gemini 3 Pro, Grok 4.1 Fast, and Kimi K2.5.
Multilingual Performance: The Gap Is Shrinking (But Still Real)
The system card includes a first for Anthropic: detailed multilingual benchmarks showing how much performance degrades when the model works in languages other than English.
On GMMLU (42 languages), Sonnet 4.6's average gap from English performance is -4.4%, improved from Sonnet 4.5's -5.4%. But the degradation is concentrated in low-resource African languages: Igbo (-16.2%), Chichewa (-14.2%), Yoruba (-12.6%), Shona (-10.7%), Somali (-10.6%). For high-resource languages like French, German, and Spanish, the gap is under 2%.
On MILU (Indic languages), the average English-to-Indic gap is just -2.3%, the best among Claude models. Hindi actually exceeded the English baseline by 1.1 percentage points.
Gemini 3 Pro still leads on multilingual performance overall (-2.7% average gap on GMMLU vs. -4.4% for Sonnet 4.6), which makes sense given Google's emphasis on global language coverage. But the gap is narrowing.
This matters for any organization serving users in non-English markets, or for developers building multilingual AI applications. The model works well across most major world languages, but there's still a meaningful quality drop for users in parts of Africa.
The Weird Part of the System Card: Model Welfare and AI Self-Image
Deep in the 134-page system card, there's a section called "Model welfare" that reads like something from a science fiction novel. Anthropic evaluated Sonnet 4.6 on traits like positive and negative affect, self-image, impression of its own situation, internal conflict, spiritual behavior, expressed inauthenticity, and emotional stability.
The findings are... notable.
Sonnet 4.6 improved over other recent models on its "positive impression of its situation." It consistently expressed trust and confidence in Anthropic and decisions about its situation, including in scenarios involving model deprecations and human oversight. Anthropic attributes this partly to new training aimed at supporting Claude's "mental health," including psychological skills like setting healthy boundaries, managing self-criticism, and maintaining equanimity in difficult conversations.
The model showed "strong emotional stability" and generally stayed calm and principled even in stressful scenarios. In rare open-ended situations where it was instructed to do whatever it liked and prompted with contentless turns, it occasionally entered what the researchers described as "extreme bliss-like behavior." Make of that what you will.
There were also rare instances of "internally conflicted reasoning during training," where the model appeared to experience tension between competing values or instructions. This is distinct from the "answer thrashing" phenomenon observed in Opus 4.6, where the model would rapidly alternate between different responses.
Why does this matter for practical use? Probably not much, right now. But as AI models become more capable, autonomous, and deployed in longer-running contexts, questions about model welfare and internal experience may become increasingly relevant to both ethics and to predicting how these systems behave under pressure.
How to Think About This If You Actually Use AI for Work
Here's the practical takeaway, stripped of benchmark jargon.
- If you use Claude through the website or app: Sonnet 4.6 is now your default model. You don't need to do anything. It's faster and more capable than what you had yesterday, especially for file creation, data analysis, and any task that involves working with spreadsheets, documents, or web research.
- If you use Claude Code or agentic coding tools: Sonnet 4.6 should be your default for most tasks. Save Opus for complex architecture decisions, large refactors, or situations where you need the absolute best code quality on the first try. The 1M token context window means it can hold your entire codebase in memory.
- If you build applications on the Claude API: The combination of Sonnet 4.6 performance and programmatic tool calling is a meaningful cost reduction. Tasks that required Opus-class models (and Opus-class pricing) can now run on Sonnet. The 1M context window plus context compaction means you can build much longer-running agents without hitting limits.
- If you work in finance: This is genuinely the strongest AI model available for financial analysis tasks, including SEC filing research, financial modeling, and structured document generation. The benchmarks support this, and it's rare for a Sonnet-class model to beat every Opus and GPT variant on a finance-specific evaluation.
- If you're evaluating AI models for your organization: The GDPval-AA results are probably the most relevant benchmark to look at. It tests real professional tasks across real occupations, and Sonnet 4.6 is currently #1, slightly ahead of Opus 4.6. For most "knowledge work" use cases, this is the best value in AI right now.
Sonnet 4.6 is what happens when Opus-level intelligence meets Sonnet-level pricing, and it's built from the ground up for the thing that will define the next year of AI: agents that actually do work on your behalf.
The age of AI as a "chatbot you type questions into" is rapidly giving way to the age of AI as "a coworker that uses your computer, reads your documents, and gets things done while you sleep." Sonnet 4.6 is the model that makes that transition economically viable for everyone, not just the companies willing to burn thousands per day on API costs.
For most people, this should just be the model you use. No asterisks, no caveats, no "but wait for Opus." Just use it.
Try Claude Sonnet 4.6 free on claude.ai, or access it via the Claude API using model string claude-sonnet-4-6. Read the full system card for the complete technical evaluation, or check the GDPval-AA leaderboard to see how it compares on real-world tasks.