Claude Opus 4.6 vs GPT-5.3-Codex: Full Breakdown (2026)

Anthropic and OpenAI Both Dropped Their Best AI Models Today. Here's What Matters.

You know those scenes in action movies where two fighters draw their weapons at the exact same time? That happened today in AI, except the weapons are language models and the stakes are… your entire workflow.

Anthropic released Claude Opus 4.6, and OpenAI released GPT-5.3-Codex. Same day. No coincidence. Both companies are racing toward the same finish line: AI that can do your actual job, not just answer your questions.

First up, The TL;DR

If you only have two minutes, read this.

Here's what each one brings to the table:

Claude Opus 4.6 is Anthropic's smartest model yet.

The headline feature = a 1 million token context window (that's roughly 1,500 pages of text).
It can now read and reason across entire codebases, massive document sets, or basically a small library, without losing track.
It also introduces "agent teams" in Claude Code, where multiple AI agents split up tasks and work in parallel. Think of it like hiring a whole department instead of one intern.
Oh, and Claude now works inside PowerPoint, so you can build decks without leaving the app.

On benchmarks, Opus 4.6 outperforms GPT-5.2 by about 144 Elo points on real-world work tasks (finance, legal, etc.), and scores highest on Humanity's Last Exam, a tough multidisciplinary reasoning test.

GPT-5.3-Codex is OpenAI's answer, and it's leaning hard into coding and computer use.

The model is 25% faster than its predecessor, scores state-of-the-art on SWE-Bench Pro (real-world software engineering), and can now use a computer like a human, scoring 64.7% on OSWorld (up from 38.2%).
The wildest part? GPT-5.3-Codex helped build itself, with early versions debugging their own training runs.
OpenAI also committed $10M in API credits toward AI-powered cybersecurity defense.

The big picture: both companies are moving AI from "chatbot you ask questions" to "colleague who does work." Anthropic is betting on breadth (office tools, massive context). OpenAI is betting on depth (autonomous coding and computer operation). The real winner? Anyone who learns to use both.

Now, let's dive into this all more in depth.

Claude Opus 4.6: Anthropic's Bet on Breadth

Anthropic's new flagship model is built around one core idea: give AI the ability to handle more context, more tools, and more types of work than ever before. Check out their launch video here.

Anthropic's new flagship model is built around one core idea: give AI the ability to handle more context, more tools, and more types of work than ever before.

The 1 million token context window is the headline feature. For context (pun intended), Opus 4.5 could process about 200,000 tokens. Opus 4.6 can handle 5x that, roughly 1,500 pages of text. That means it can ingest entire codebases, legal document sets, or financial reports without losing track of details buried on page 847.

And this isn't just theoretical. On MRCR v2, a benchmark that hides information "needles" in vast amounts of text, Opus 4.6 scores 76% compared to Sonnet 4.5's 18.5%. That's the difference between finding the one relevant clause in a 500-page contract and completely missing it. Anthropic has published an extensive system card detailing these capabilities and their safety evaluations.

Agent teams in Claude Code let you spin up multiple AI agents that work on different parts of a project simultaneously. Instead of one agent tackling tasks one by one, you can have a team splitting up code reviews, documentation, and testing in parallel. You can even take over any subagent directly using Shift+Up/Down or tmux. Think of it like hiring a whole department instead of one intern. This is available as a research preview.

Claude in PowerPoint (available in research preview for Max, Team, and Enterprise plans) means you can now build and edit presentations directly inside PowerPoint with Claude as a side panel. Previously, Claude could create a .pptx file, but you'd have to open it separately. Now, it reads your layouts, fonts, and slide masters to stay on-brand while building decks from scratch or from data you've already structured in Excel.

Speaking of Excel, Claude in Excel got a major upgrade too, with improved ability to plan before acting, ingest messy data and figure out the right structure, and handle multi-step changes in one pass. They just released a video promoting Claude for everyday work that dives more into this, too.

On the developer side, Anthropic introduced several API upgrades:

Adaptive thinking: Claude can now decide when to use deeper reasoning (instead of a binary on/off toggle)
Effort controls: Four levels (low, medium, high, max) so developers can balance intelligence, speed, and cost
Context compaction: Automatically summarizes older context when conversations get long, so agents can keep working without hitting limits (a key piece of what Anthropic calls effective context engineering for AI agents)
128k output tokens: Bigger outputs mean fewer requests to complete large tasks
US-only inference: Available at 1.1x token pricing for workloads that need to stay stateside

Benchmark performance is strong across the board. Opus 4.6 achieves the highest score on Terminal-Bench 2.0 (agentic coding), leads on Humanity's Last Exam (multidisciplinary reasoning), and outperforms GPT-5.2 by approximately 144 Elo points on GDPval-AA, which measures real-world knowledge work tasks in finance, legal, and other domains. That 144-point gap translates to Opus 4.6 scoring higher roughly 70% of the time. Anthropic also published a deep dive on Opus 4.6 in finance, showing how it handles financial analysis tasks specifically.

On safety, Opus 4.6 shows the lowest rate of over-refusals of any recent Claude model, alongside low rates of misaligned behaviors like deception and sycophancy on their automated behavioral audit (detailed in the system card). The company developed six new cybersecurity probes to monitor potential misuse and is accelerating cyberdefensive applications, including using Opus 4.6 to find and patch vulnerabilities in open-source software (detailed here). They also highlighted advances in interpretability research to understand why the model behaves the way it does, and improvements to their Constitutional Classifiers safety system.

Pricing stays the same as Opus 4.5: $5/$25 per million input/output tokens, with premium pricing ($10/$37.50) for prompts over 200k tokens using the 1M context window. You can try it now on claude.ai.

That said, if you're already paying for Claude Pro or Max, heads up: Anthropic is offering $50 in extra usage credits to coincide with the Opus 4.6 launch.

Who qualifies:

You started your Pro or Max subscription before February 4, 2026.
You enable extra usage before February 16, 2026.
This does not apply to Team, Enterprise, or API/Console users.

How to claim it:

If extra usage is already enabled, the credit applies automatically. You're done.
If not, go to Settings > Usage, enable extra usage, and the $50 credit kicks in. (Note: you'll need to do this on the web version; it's not available in the mobile apps.)

The fine print:

Claim window: February 5 (10 AM PT) through February 16 (11:59 PM PT).
Credit expires 60 days after you claim it.
Works across Claude, Claude Code, and Cowork, on all models and features on your plan.
After the credit runs out, extra usage stays enabled. If you have auto-reload on, you'll be billed at the standard rate. Disable it in settings if you don't want that.

Not a bad deal for trying out that 1 million token context window.

One quirk worth noting: Anthropic acknowledges that Opus 4.6 sometimes "overthinks" simpler tasks, adding unnecessary cost and latency. Their recommendation? Dial the effort parameter down from "high" to "medium" for routine work.

Anthropic also recently introduced Cowork, where Claude can multitask autonomously across tools and tasks, putting all of Opus 4.6's new skills to work on your behalf.

GPT-5.3-Codex: OpenAI's Bet on Depth

OpenAI's new model takes a different approach. While Anthropic went wide (more tools, more context, more work types), OpenAI went deep: making the most capable autonomous coding and computer-use agent available.

The self-improving model. Perhaps the most striking claim: GPT-5.3-Codex is the first model that helped create itself. OpenAI's Codex team used early versions to debug its own training, manage deployment, and diagnose test results. The team says they were "blown away by how much Codex was able to accelerate its own development."

If that sentence makes you pause, it should. We're now in the era where AI models are actively contributing to building the next generation of AI models. It's not Skynet, but it is a significant feedback loop. (More details in the GPT-5.3-Codex System Card.)

Coding performance is state-of-the-art. GPT-5.3-Codex achieves the highest score on SWE-Bench Pro, a rigorous real-world software engineering benchmark that spans four programming languages (not just Python). It also leads on Terminal-Bench 2.0 with 77.3%, and does so with fewer tokens than any prior model, meaning it's more efficient per task.

Computer use is where things get wild. On OSWorld-Verified, a benchmark where AI agents complete productivity tasks in a desktop environment using vision, GPT-5.3-Codex scores 64.7%, up from 38.2% for GPT-5.2. Humans score about 72%. That gap is closing fast. OpenAI also showed strong performance on GDPval, their benchmark for real-world knowledge work across 44 occupations.

What does "computer use" actually mean? It means the model can navigate applications, click buttons, fill in forms, manage files, and complete tasks the way a person would sitting in front of a screen. OpenAI demonstrated this by having GPT-5.3-Codex autonomously build full games (including a racing game you can play now with multiple maps and items, and a diving game with oxygen management and fish collection) over millions of tokens of iteration, using generic follow-up prompts like "fix the bug" and "improve the game."

Interactive collaboration is new. Instead of waiting for a final output, you can now steer GPT-5.3-Codex while it's working. Ask questions, adjust approach, give feedback mid-task. The model provides frequent updates on its progress and responds in real time. This builds on the foundation laid by the Codex app launched earlier.

Speed improvements: 25% faster than GPT-5.2-Codex, thanks to infrastructure and inference optimizations.

Cybersecurity is a major focus. GPT-5.3-Codex is the first model OpenAI classifies as "High capability" for cybersecurity tasks under their updated Preparedness Framework, and the first they've directly trained to identify software vulnerabilities. While they don't have evidence it can automate full cyberattacks, they're deploying their most comprehensive cyber safety stack to date, including:

Safety training and automated monitoring.
Trusted Access for Cyber, a pilot program for defense researchers.
Expanding Aardvark, their security research agent (already used to find vulnerabilities in projects like Next.js, with CVEs disclosed by Vercel last week).
$10M in API credits through their Cybersecurity Grant Program for open-source software and critical infrastructure defense.

GPT-5.3-Codex is available with paid ChatGPT plans across the new Codex app, CLI, IDE extension, and web. API access is coming soon.

How They Compare

Here's the simplest way to think about it (made with Claude :D )

Terminal-Bench 2.0: GPT-5.3 wins (77.3% vs 65.4%, +11.9 pts).
OSWorld: Opus 4.6 wins (72.7% vs 64.7%, +8.0 pts).

But keep in mind: There are three instances where the models use the same source data or family of benchmarks, but apply different methodologies or metrics, preventing a direct numerical comparison:

TerminalBench:
- Claude Opus 4.6: Runs "Terminal-Bench 2.0" and scores 65.4%.
- GPT-5.3-Codex: Mentions "TerminalBench" in the text as a proxy for Long Range Autonomy, but does not list a score.
ProtocolQA:
- Claude Opus 4.6: Runs "LAB-Bench," which includes ProtocolQA.
- GPT-5.3-Codex: Runs "ProtocolQA Open-Ended," a modified version of the same dataset converted from multiple choice to short answer.
SecureBio Virology:
- Claude Opus 4.6: Runs "VCT" (Multimodal Virology) and scores 48.3%.
- GPT-5.3-Codex: Runs "Multimodal Troubleshooting Virology" (likely the same VCT dataset). It does not write a number, but provides a bar chart showing it exceeds the expert baseline (22.1%).

Both models claim to lead on Terminal-Bench 2.0, which likely comes down to different evaluation configurations. Welcome to the wonderful world of benchmark cherry-picking.

So What's The Community Think?

Here's a round-up of some of the posts we saw since both announcements hit the wire.

Dan Shipper (@danshipper) is highly enthusiastic about Claude Opus 4.6, describing it as the best coding model he’s ever used and quickly organizing vibe checks and livestreams. He hosted a head-to-head live comparison of GPT-5.3 Codex vs. Opus 4.6 (after testing both for over a week) and followed up with an Opus 4.6-specific vibe check write-up.
McKay Wrigley (@mckaywrigley) shared his impressions in a live coding session.
Simon Willison (@simonw) jumped into Dan Shipper's stream talking Codex 5.3 and Opus 4.6 in a live broadcast; we haven't watch this yet, but these folks are great, so go scrub through for their takes.

And some other early impressions:

Frank Downing (@downingARK) provided a quick breakdown comparing the two models side-by-side, highlighting their key feature differences.
Manish Kulariya (@MKulria) noted the near-simultaneous drops, praising Opus 4.6 for better agentic capabilities and larger context, while Codex excels in terminal benchmarks, fewer tokens, and 25% faster inference.
Alkhalid (@alkhalidsardar) compared benchmarks, pointing out GPT-5.3 Codex's stronger Terminal Bench score (77.3% vs. Opus 4.6's 65%) but Opus's significant 1M context window.
@Mlearning_ai shared first impressions after short tests: Opus 4.6 feels significantly faster, while GPT-5.3 Codex overthinks, but emphasized waiting for deeper evals.
Ben Taleb Jr. (@macintoch) offered a detailed benchmark comparison and final verdict thread, seeing no clear winner yet—Opus leads in long-context and enterprise tasks, Codex in pure coding agents and speed/pricing. He advises waiting for independent evals.
Morgan (@morganlinton) broke down context and memory differences: Opus for "load the whole universe" reasoning, Codex for fast iteration.
Ahmad Awais (@MrAhmadAwais) tested both and found them super impressive, likening GPT-5.3 Codex to Opus 4.5 and Opus 4.6 to GPT-5.2 Codex; he's integrating them into his tools.
Kori (@korigero) highlighted the 30-minute release gap and overlapping benchmarks, with Codex better at terminal agents and Opus at computer-use.

Deep Dive: Agent Teams in Claude Code

We mentioned agent teams earlier, but this feature deserves its own breakdown because it changes how developers can use AI for complex projects.

Traditional AI coding assistants work like a single employee: you give it a task, it works on it, finishes, moves to the next one. Agent teams flip that model entirely. Now you can spin up a team of AI agents that work on different parts of a project at the same time, talk to each other, and coordinate through a shared task list.

How it works: One Claude Code session acts as the "team lead," coordinating everything. It spawns "teammates," each running as an independent Claude Code instance with its own context window. The lead assigns tasks, teammates claim and complete them, and everyone communicates through a built-in messaging system.

Here's what makes this different from regular subagents (which Claude Code already had):

Subagents do focused tasks and report results back to the main agent. Think of them as contractors who submit deliverables.
Agent teams can message each other directly, challenge each other's findings, and self-coordinate. Think of them as a department that collaborates.

The best use cases:

Parallel code review: Spawn three reviewers (one for security, one for performance, one for test coverage) and let them each analyze a PR from their own angle simultaneously.
Debugging with competing hypotheses: Five agents each investigate a different theory for why something is broken, actively trying to disprove each other. Like a scientific debate, but with less ego and more token usage.
Cross-layer coordination: One teammate handles frontend, another tackles backend, a third writes tests. Each owns their own files, no conflicts.
Research sprints: Multiple agents investigate different aspects of a problem at once, then share and synthesize findings.

Two display modes: You can either run all teammates inside your main terminal (use Shift+Up/Down to switch between them) or give each teammate its own split pane using tmux or iTerm2 so you can watch everyone's progress at once.

Important caveats: Agent teams are experimental and use significantly more tokens than a single session (each teammate has its own context window). They also can't resume after a session ends, task status can sometimes lag, and shutdown can be slow. Best practice: size tasks so teammates can work independently without editing the same files.

To enable agent teams, add CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS to your settings.json or environment variables, then describe the team structure you want in natural language. Claude handles the rest.

Deep Dive: Claude Opus 4.6 for Finance

If you work in finance, this section is for you. Anthropic published a dedicated blog post breaking down how Opus 4.6 handles financial analysis, and the numbers are significant.

The headline stat: On Anthropic's internal Real-World Finance evaluation (which tests ~50 investment and financial analysis use cases spanning spreadsheets, slide decks, and document generation), Opus 4.6 improved by over 23 percentage points compared to Sonnet 4.5. For context, Sonnet 4.5 was their state-of-the-art model just a few months ago.

That evaluation covers tasks commonly performed by analysts across investment banking, private equity, public investing, and corporate finance, things like building financial models, creating pitch decks, and reviewing documents.

Benchmark highlights:

Finance Agent (by Vals AI, testing research on SEC filings): 60.7%, a 5.47% improvement over Opus 4.5, and state-of-the-art.
TaxEval (also by Vals AI): 76.0%, also state-of-the-art.
BrowseComp and DeepSearchQA (testing ability to extract specific answers from large, unstructured data sources): improved across the board. In practice, this means you can hand Claude a dense set of documents and get focused answers rather than generic summaries.

First-pass quality is the real story. Anthropic showed side-by-side comparisons of Claude's first attempt at a commercial due diligence task (evaluating a potential acquisition) between Opus 4.5 and 4.6. The kind of work that would typically take a senior analyst two to three weeks. Opus 4.6's outputs came out substantially more polished, with better formatting, clearer analysis, and fewer revisions needed.

As Hebbia's CTO put it: "Creating financial PowerPoints that used to take hours now takes minutes. We're seeing tangible improvements in attention to detail, spatial layout, and content structuring."

Where finance pros can access these capabilities:

Cowork (Anthropic's desktop app): Give Claude access to a folder and it can read, edit, and create files directly. Supports plugins, including a corporate finance plugin for workflows like journal entries, variance analyses, and reconciliation. Available as a desktop-only research preview on all paid Claude plans.
Claude in Excel: Now supports pivot tables, chart modifications, conditional formatting, sorting/filtering, data validation, and finance-grade formatting. New features include auto-compaction for long conversations and drag-and-drop multi-file support. Less copy-pasting between tabs = more time for actual analysis.
Claude in PowerPoint (research preview): Build decks from client templates, make targeted edits to existing slides, and generate first-pass presentations from scratch. Available for Max, Team, and Enterprise plans.

As one of Canada's largest institutional investors (BCI) put it: "Claude Opus 4.6's enhanced speed, precision, and capacity for complex tasks, like multi-tab analysis in Claude in Excel, unlock exciting possibilities for how we work."

One important note: Anthropic is clear that AI for finance is still an active frontier. Users should review Claude's outputs to make sure they meet specifications, especially for high-stakes work. Human judgment remains essential. Translation: trust, but verify.

OpenAI Frontier: The Enterprise Agent Platform

While GPT-5.3-Codex grabbed the developer spotlight, OpenAI quietly dropped something arguably bigger for enterprise teams on the same day: Frontier, a platform designed to help large companies build, deploy, and manage AI agents across their entire organization.

Why this matters: According to OpenAI, 75% of enterprise workers say AI helped them accomplish tasks they couldn't do before. But the gap between what AI can do and what companies can actually deploy keeps growing. Data is scattered across systems, permissions are complex, and every new AI agent ends up isolated from everything else.

Frontier is OpenAI's answer to that problem. Think of it as the operating system for AI coworkers across an enterprise.

The four pillars:

Business Context: Connects data warehouses, CRM systems, ticketing tools, and internal apps so AI agents understand how information flows across the company, like institutional memory for AI.
Agent Execution: Lets agents reason over data, work with files, run code, and use tools across local environments, enterprise cloud infrastructure, and OpenAI-hosted runtimes. Agents can even build memories from past interactions to improve over time.
Evaluation and Optimization: Built-in feedback loops so agents learn from experience, like a performance review system for AI. Shows managers what's working and what isn't.
Identity, Permissions, and Boundaries: Each AI coworker gets its own identity with explicit permissions and guardrails. Enterprise security certifications include SOC 2 Type II, ISO/IEC 27001, 27017, 27018, 27701, and CSA STAR.

The key innovation is the "semantic layer" approach. Instead of building one-off integrations for every agent, Frontier creates shared business context that all AI agents can reference, whether they're built in-house, by OpenAI, or by third-party vendors. That means no replatforming your existing systems.

Real results OpenAI is citing:

A major manufacturer reduced production optimization work from six weeks to one day.
A global investment company freed up over 90% more time for salespeople to spend with customers.
A large energy producer increased output by up to 5%, adding over a billion in additional revenue.

Early adopters include HP, Intuit, Oracle, State Farm, Thermo Fisher, and Uber, with dozens of existing customers (BBVA, Cisco, T-Mobile) already piloting the approach.

OpenAI is also pairing companies with Forward Deployed Engineers (FDEs) who work alongside customer teams to design architectures, set up governance, and get agents running in production. The feedback loop between real-world deployment and OpenAI Research is intentional: customer problems inform how models themselves evolve.

Separately, OpenAI launched a Frontier Partners program with AI-native companies like Abridge, Clay, Ambience, Decagon, Harvey, and Sierra to build specialized agent applications on the platform.

So if GPT-5.3-Codex is OpenAI's play for developers, Frontier is their play for the C-suite. It's the infrastructure layer designed to turn individual AI experiments into company-wide AI coworkers. Available today to a limited set of customers, with broader availability in the coming months. Learn more here.

Sam Altman's Insights on TBPN (Source)

Both Sam Altman and Sholto Douglas went on TBPN today to talk about both releases, and both interviews had a ton of insights about the models. Below we grabbed the key quotes you'll want to know.

GPT-5.3 Codex

OpenAI launched GPT-5.3 Codex, which Sam calls "the best coding model in the world." It incorporated feedback from 5.2 Codex into one model: much smarter at programming, way faster, better personality, and really good at computer use.
A key new feature is interacting with the model mid-turn. People are using these tools for multi-hour tasks, and sometimes the setup isn't right. Models can do amazing things with no steering, but they do "much more amazing things if you steer them along the way."
Sam compared mid-turn interaction to correcting a coworker: "If you see a coworker making a mistake and you don't interrupt them, that's rude. It's deeply inefficient." Right now models either get it right one-shot or you correct them along the way; soon they'll learn from feedback like a new hire.
Some expert users noticed the model change mid-deploy before it was even announced, saying "man, something's really different with Codex." You can feel it quickly.

AI Agents and the Future of Work

Sam sees a future where people feel like they're managing a team of agents, operating at higher and higher levels of abstraction. People will keep working "at the maximum of your management bandwidth or cognitive ability."
There's a massive capability overhang: the models are so good now that better tools matter more than more intelligence for a little while. Building better interfaces to let people use what already exists will be "very very important."
On forward-deployed engineers: They go into non-AI-native companies and help them figure out how to hook up systems, whether to fine-tune models, how to orchestrate agents, and most frequently, how to handle data security concerns around AI coworkers accessing sensitive information.
On whether current agent metaphors are temporary (like OpenAI's Mad Max-inspired "Gas Town" setup): "Like everything else when an industry is moving this fast, all of this is somewhat temporary." Eventually you might just have "a single AI bot that runs at your company" and you say "I want to launch this new product" and it does everything. But that's not where we are today.

AI Benchmarks

On long-task benchmarks: Sam says two key insights are embedded in this question. First, "no chart in AI lasts more than a few years." Second, people thought you'd need super-long context for super-long tasks, but agents breaking up work and farming it off to sub-agents achieves amazing results, "which should not be surprising because it's similar to how people do things."
A joke at OpenAI: "Soon the chart that matters is just going to be GDP impact." And the question is what comes after that. John proposed "Happiness?" which sounds close to right; human flourishing, perhaps? Quality of life index? But Sam suggested GDP and Quality of life might not be exactly correlated.

Sholto Douglas also went on the show, and shared his thoughts on the ideal form factor for managing multiple agents (a single "manager" agent that oversees the subagents and uses logical abstraction to bring just the highest order decisions to you to solve), and his own take on some of the topics the guys talks to Sam about.

The Bigger Picture

According to a16z's January 2026 enterprise AI survey, Anthropic's enterprise adoption has surged from near-zero in early 2024 to roughly 40% of companies using it in production. OpenAI still leads at about 77%, but the gap is shrinking. Average enterprise LLM spending hit $7M in 2025 (up 180% from 2024) and is projected to reach $11.6M in 2026.

Both companies shipping their best models on the same day tells you exactly where the industry is headed: AI is graduating from "tool you consult" to "colleague you delegate to." The question isn't whether AI will do knowledge work. It's which AI will do it best for your specific needs.

If you're a developer or work heavily in code, GPT-5.3-Codex's autonomous capabilities and computer use are hard to beat right now. If you're a knowledge worker dealing with massive documents, financial analysis, or cross-functional work, Opus 4.6's context window and office integrations are the play.

Or, you know, use both. That's probably the real power move.

Try Claude Opus 4.6 on claude.ai or the API (claude-opus-4-6). Same pricing as Opus 4.5.

Try GPT-5.3-Codex through any paid ChatGPT plan via the Codex app, CLI, or IDE extension.

Claude 4.6 Opus vs GPT 5.3 Codex, Explained