Every AI company has the same problem right now: their smartest model is too slow and expensive to do everything. How did OpenAI attempt to solve this? They released GPT-5.4 mini and nano today, and the play is less about shrinking the model and more about rethinking how AI systems work altogether.
First up, the TL;DR
If you only have 3 minutes, read this first.
These are purpose-built "subagents" (think of them as junior associates that a senior partner delegates tasks to). In Codex, the full GPT-5.4 acts as a project manager: it plans, makes decisions, and coordinates. Then it hands off parallel tasks (searching codebases, reviewing files, running tests) to a swarm of GPT-5.4 minis that execute fast and cheap. It's the McKinsey model, except these consultants actually write code.
You can kind of think of subagents as the organizing system that will replace the model router. It's not that the router totally goes away; it just gets smarter, delegating tasks to subagents (running faster, cheaper models). That pushes the job of picking and assigning the right model for each task one layer deeper into the system, so you and I don't need to worry about it.
Now, faster and cheaper doesn't mean squat-diddly if it also means dumber. So here's what the benchmarks say:
- GPT-5.4 mini scores 54.4% on SWE-Bench Pro (coding benchmark, just 3 points behind the full GPT-5.4) and 72.1% on OSWorld computer-use tasks (testing how good the agent is at using your computer), nearly matching the flagship model
- Pricing: Mini runs $0.75 per million input tokens; nano costs just $0.20 (that's $0.05 less than Mercury 2, if I remember correctly). Mini uses 30% of GPT-5.4's Codex quota, so developers get roughly 3x the throughput
- Speed: Over 2x faster than GPT-5 mini, with similar or better quality across coding, tool-calling, and vision tasks
Can you use it? GPT-5.4 mini = yes. It's live in the API, Codex, and ChatGPT (free users get it through "Thinking" mode). Nano is API-only atm (conspiracy-minded folks would argue this market positioning is meant to compete directly with Mercury 2...).
Why this matters: If you're a regular ChatGPT user, the speed improvements matter most to you. Responses in Thinking mode get faster and better. And if you're mostly using ChatGPT on your phone, it's worth checking out the Codex desktop app (now on Mac and Windows) for heavier work. Codex is a great app; the only problem is it's built for coders and not all of us.
We've been reading a lot of takes lately that argue OpenAI needs to give regular business users the same Codex-app capabilities in ChatGPT, or an equivalent work tool. Anthropic's doing something similar with Cowork, which brings Claude Code-style agent capabilities to non-developers.
And we also read a tweet that hinted at Anthropic launching a "Codex app killer" sometime next week. Smash that Eyes Looking Emoji Button!
Our take: The price is the real story here. We don't really care about "mini" models because frankly, we try to use the best quality model possible, whenever possible. This is cost-prohibitive, of course; so if we're going to use less than the best, it better be free or close to it.
According to OpenAI, Mini delivers ~95% of GPT-5.4's performance on computer use for a fraction of the cost. But compare it to the broader small model market and it's actually one of the pricier options:
- Gemini 3 Flash scores 78% on SWE-bench Verified at $0.50/$3.00.
- Claude Haiku 4.5 matches Sonnet 4-level quality at $1/$5.
- And the wildcard is Mercury 2, a diffusion-based model (generates all tokens in parallel instead of one-by-one) that hits ~1,000 tokens/sec at just $0.25/$0.75 (though Nano has Mercury beat here).
GPT-5.4 mini is a great model, but "cheapest" belongs to someone else. Could it be "Pareto frontier" level quality though? As in, is it the highest intelligence for the lowest cost? It might be...
Now, let's dive into all that with a bit more detail, shall we?
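For anyone who wants the term pinned down: a model sits on the Pareto frontier if no other model is at least as cheap and at least as good, while being strictly better on one of the two. Here's a minimal Python sketch of that check. The models and numbers are made-up placeholders, not real benchmark results (the benchmarks quoted in this piece differ from model to model, so they can't be dropped in directly).

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    price: float  # $ per million input tokens (placeholder values)
    score: float  # quality on one shared benchmark (placeholder values)


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Return the models that no other model dominates."""

    def dominates(a: Model, b: Model) -> bool:
        # a dominates b if it is no worse on both axes and strictly better on one.
        return (
            a.price <= b.price
            and a.score >= b.score
            and (a.price < b.price or a.score > b.score)
        )

    return [m for m in models if not any(dominates(o, m) for o in models)]


# Hypothetical models, NOT real benchmark data.
candidates = [
    Model("A", 0.20, 60.0),
    Model("B", 0.50, 75.0),
    Model("C", 0.75, 72.0),  # pricier than B with a lower score
    Model("D", 1.00, 80.0),
]

frontier = [m.name for m in pareto_frontier(candidates)]
print(frontier)
```

In this toy data, C drops out because B is both cheaper and better. Whether GPT-5.4 mini lands on the real frontier depends on scoring every contender on the same benchmark, which the numbers in this article don't do.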
The Subagent Pattern: Why "Smaller" Is the Bigger Story
Let's explain what a subagent actually is, because this pattern is about to be everywhere.
Traditional AI workflows work like this: you send a request to one model, it thinks, it responds. Simple. But that model might be a $15-per-million-token flagship spending 30 seconds reasoning through a task that a cheaper model could handle in 3 seconds.
The subagent pattern flips this. A large model (GPT-5.4, Claude Sonnet 4.5, Gemini 3 Pro) acts as the "brain," breaking complex tasks into smaller pieces. Then it delegates those pieces to smaller, faster models running in parallel. In OpenAI's Codex, one GPT-5.4 mini subagent might search your codebase while another reviews a large file and a third processes documentation, all simultaneously, all at a fraction of the cost.
Anthropic does the same thing. When they launched Claude Haiku 4.5 in October, they explicitly pitched it as the worker in a Sonnet 4.5 + Haiku 4.5 team, where the bigger model breaks down problems into multi-step plans and then orchestrates Haiku instances to complete subtasks in parallel.
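If it helps to see the shape of the pattern, here's a minimal Python sketch of the fan-out and fan-in described above. It makes no real API calls: `plan`, `run_subagent`, and `orchestrate` are hypothetical stand-ins we made up for illustration, not how Codex or Claude actually implement it.

```python
from concurrent.futures import ThreadPoolExecutor


def plan(request: str) -> list[str]:
    """Stand-in for the large 'brain' model: split a request into subtasks."""
    # Hypothetical fixed plan; a real orchestrator would ask the flagship model.
    return [
        f"search codebase for: {request}",
        f"review large file for: {request}",
        f"process documentation for: {request}",
    ]


def run_subagent(subtask: str) -> str:
    """Stand-in for one small, fast model handling a single subtask."""
    # A real implementation would call a small-model API here.
    return f"done: {subtask}"


def orchestrate(request: str) -> str:
    subtasks = plan(request)
    # Fan out: each subtask gets its own worker and runs in parallel.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        results = list(pool.map(run_subagent, subtasks))
    # Fan in: the large model would synthesize the results; here we just join them.
    return "\n".join(results)


report = orchestrate("flaky login test")
print(report)
```

The design point is that only `plan` and the final synthesis need the expensive model. Everything inside the pool can run on the cheap one.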
This matters for non-developers too. The Codex desktop app (which now has 1.6M+ weekly active users across Mac and Windows) lets you manage multiple agents running in parallel on different tasks. Anthropic's Cowork, launched in January, brings the same idea to knowledge workers who've never touched a terminal. Point Claude at a folder, describe a task, walk away. Same subagent architecture, wrapped in a UI your mom could use.
The pattern is everywhere because the economics demand it. Why pay flagship prices for every token when a model at 1/3 the cost handles 90% of the work?
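To put a number on that question, here's a back-of-the-envelope sketch in Python. It takes the 1/3 price ratio and the 90% share at face value and assumes (our simplification) that orchestration adds no extra tokens, which real systems won't quite manage.

```python
def blended_cost(small_share: float, small_price_ratio: float) -> float:
    """Cost relative to sending everything to the flagship (1.0 = no savings)."""
    flagship_part = (1 - small_share) * 1.0
    small_part = small_share * small_price_ratio
    return flagship_part + small_part


# 90% of the work goes to a model at 1/3 the flagship's price.
relative = blended_cost(small_share=0.90, small_price_ratio=1 / 3)
print(f"{relative:.2f}")
```

Under those assumptions the mixed setup costs about 40% of the flagship-only bill, so roughly a 60% saving before any orchestration overhead.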
The Price War Nobody Expected
Here's where it gets interesting. GPT-5.4 mini is strong, but it enters one of the most competitive pricing environments in AI history.
The contenders for your subagent dollar:
- GPT-5.4 mini: $0.75/$4.50 per million input/output tokens. 54.4% SWE-Bench Pro. Tight integration with Codex's subagent system.
- GPT-5.4 nano: $0.20/$1.25. Even cheaper, but API-only and built for simpler tasks like classification and data extraction.
- Gemini 3 Flash: $0.50/$3.00. Scores 78% on SWE-bench Verified (higher than mini's SWE-Bench Pro number, though the benchmarks differ). Google calls it "Pro-grade reasoning at Flash-level speed."
- Gemini 3.1 Flash-Lite: $0.25/$1.50. Just launched weeks ago. Hits 381 tokens/sec and scores 86.9% on GPQA Diamond. A 1M-token context window (vs. GPT-5.4 mini's 400K). Built to compete directly in the "fast and cheap" tier.
- Claude Haiku 4.5: $1.00/$5.00. The priciest small model, but scores 73.3% on SWE-bench Verified and delivers what many developers describe as the most reliable instruction-following in its class. Runs 4-5x faster than Sonnet 4.5.
- Mercury 2: $0.25/$0.75. The wildcard. This is a "diffusion LLM" from Inception Labs, backed by Menlo Ventures, Andrew Ng, and Andrej Karpathy. Instead of generating text one token at a time (like every other model on this list), Mercury 2 generates tokens in parallel, like an editor revising an entire draft at once. The result: ~1,000 tokens per second on standard NVIDIA hardware. That's 5-10x faster than everything else here.
Mercury 2's quality scores (73.6 GPQA, 67.3 LiveCodeBench) place it in competitive range with Haiku and GPT-5 Mini, though below Gemini 3 Flash. The real selling point is that 1,000 tokens/sec throughput at a quarter of the price. For agentic loops where latency compounds (an agent running 20 steps adds up fast), Mercury 2 changes the math.
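Here's a rough Python sketch of what those prices and speeds mean for a single agent run. The prices and the two throughput figures are the ones quoted above. The workload (20 steps, 5,000 input and 500 output tokens per step) is our own assumption, and the timing only counts token generation, ignoring network latency and prompt processing.

```python
# (input, output) price in $ per million tokens, as quoted in this article.
PRICES = {
    "GPT-5.4 mini": (0.75, 4.50),
    "GPT-5.4 nano": (0.20, 1.25),
    "Gemini 3 Flash": (0.50, 3.00),
    "Gemini 3.1 Flash-Lite": (0.25, 1.50),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Mercury 2": (0.25, 0.75),
}

# Output speed in tokens/sec, only for models where the article quotes a figure.
THROUGHPUT = {"Gemini 3.1 Flash-Lite": 381, "Mercury 2": 1000}

# Assumed workload (hypothetical): one 20-step agent loop.
STEPS = 20
INPUT_TOKENS_PER_STEP = 5_000
OUTPUT_TOKENS_PER_STEP = 500


def run_cost(model: str) -> float:
    """Dollar cost of one full agent run on the given model."""
    price_in, price_out = PRICES[model]
    per_step = (
        INPUT_TOKENS_PER_STEP * price_in + OUTPUT_TOKENS_PER_STEP * price_out
    ) / 1_000_000
    return STEPS * per_step


def generation_seconds(model: str) -> float:
    """Seconds spent generating output tokens across all steps."""
    return STEPS * OUTPUT_TOKENS_PER_STEP / THROUGHPUT[model]


for model in sorted(PRICES, key=run_cost):
    line = f"{model:<22} ${run_cost(model):.4f} per run"
    if model in THROUGHPUT:
        line += f", ~{generation_seconds(model):.1f}s generating"
    print(line)
```

Change the token mix and the ranking shifts: input-heavy runs favor low input prices like nano's, while output-heavy runs favor Mercury 2's low output price.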
What This Means for Regular People
If you're not a developer and your eyes glazed over at "subagent orchestration," here's the translation:
- Your AI apps are about to get faster and cheaper.
  - Every app built on GPT, Claude, or Gemini uses these models behind the scenes.
  - When the models cost less, companies can afford to do more per request.
  - ChatGPT's "Thinking" feature in the free tier? That's powered by models like GPT-5.4 mini. Better models at this tier mean better free features.
- Desktop AI is becoming a real thing.
  - OpenAI's Codex app and Anthropic's Cowork both work on Mac and Windows now.
  - These let AI work on your actual files, in the background, while you do other things.
  - Codex is still aimed primarily at developers. Cowork is designed for everyone: organize a messy downloads folder, extract expenses from a pile of receipts, draft reports from source documents.
  - If you're using ChatGPT mostly on your phone, one of these desktop tools is probably worth a look for heavier work.
- The "which AI should I use?" question is changing.
  - Six months ago it was "GPT vs. Claude vs. Gemini." Now the answer is increasingly "all of them, for different things."
  - The one with the best app? For coding, it's Codex; for everything else, it's Claude Desktop (Gemini is really lacking in this category; the Gems to Opal and Opal to Gems flow is pretty strong, but the core Gemini app experience leaves A LOT to be desired).
  - Model-wise, the big model handles your hard questions. The small model handles the grunt work. The next wave of AI products will mix models the way software already mixes programming languages: the right tool for each piece.
Where This Goes Next
The small model war is just starting. Google shipped Gemini 3.1 Flash-Lite at aggressive pricing only weeks ago, though it already feels like a long time. Mercury 2's diffusion architecture is genuinely novel, and if the quality keeps improving, the speed advantage could reshape how AI infrastructure gets built. Google has its own diffusion model, and we have to imagine OpenAI is working on something similar; but according to our interview with creator Stefano Ermon, Inception Labs has a strong advantage in training and harnessing diffusion models (at least for now). Anthropic will almost certainly ship a new Haiku model in the Claude 4.6 family soon.
The deeper trend: AI is becoming a team sport, even for the models themselves. The companies winning the next round won't have the single smartest model. They'll have the best-coordinated team of models working together. OpenAI's mini/nano launch is a bet on that future. So is everyone else's.