The most candid AI lab document of 2026 is a 50-page Chinese technical paper that opens with a confession. DeepSeek's V4 release notes admit, in plain language, that V4 trails GPT-5.4 and Gemini 3.1 Pro by 3 to 6 months on the standard benchmarks. The rest of the paper explains why that admission misses the point, because V4 won a different race: making AI's long memory ridiculously cheap.
The release was also the most precisely timed in recent AI memory. V4 shipped on Friday, April 24, hours after OpenAI launched GPT-5.5, where the Pro tier costs up to $180 per million output tokens. V4-Pro asks $3.48. That's roughly 1/50th of GPT-5.5 Pro's rate, and about one-seventh of Claude Opus 4.7's output price per token. If you're a Chinese AI lab the US has been trying to slow down with chip export bans for three years, your sense of timing gets sharp.
First up, the TL;DR
Friday morning, China's DeepSeek shipped V4 with a technical paper that does something almost no frontier lab ever does: it admits V4 trails GPT-5.4 and Gemini 3.1 Pro by 3-6 months on standard intelligence tests. It then spends 50 pages walking through what V4 actually won, which turns out to be a different race entirely: long-context economics.
Here's what that means in practice. V4-Pro can take in up to a million tokens at once, and since a token is about three-quarters of a word, that's roughly 750,000 words, or about two long novels. And V4 serves all that long-form reading for roughly a quarter of what comparable models charge: $4 per million output tokens versus $14-15 for ChatGPT and Claude, per Hugging Face's Tiezhen Wang.
Here's what happened:
- V4-Pro is the big version, packing 1.6 trillion total parameters (the dials inside the AI's brain), of which only 49 billion fire per token, which is how it stays both huge and cheap to run.
- V4-Flash is the smaller sibling, with the same long memory and lower serving costs.
- Both are free for anyone to download (MIT license, weights live on Hugging Face).
- V4 sits at #1 among open-source models on three major coding tests, beating most closed-source competitors.
- And on the same morning V4 shipped, the US State Department warned allies that DeepSeek and other Chinese AI firms have been stealing IP from US labs.
Why this matters:
Standard intelligence tests reward the kind of smarts you'd test on a quiz show: handling tricky math problems, answering obscure trivia. The race that matters going forward is the agentic one, where AI plans across hundreds of steps, juggles a dozen tools, and reads through entire codebases. The model that keeps long memory cheap wins that race, and V4 just answered the question first.
But Susan Zhang at Google DeepMind read the release very differently, and spent Friday afternoon walking the V4 timeline backwards to explain why she thinks the polished launch masks a rough training run. The story she tells goes like this: DeepSeek more than doubled V4's training data (33 trillion tokens versus V3's 15 trillion) and ran into stability problems they couldn't fully solve. The patches they shipped to keep training going (things like shuffling data around mid-training to avoid overload, or clipping values that got too extreme) feel "wildly lacking," she argues. Her bigger worry is normalization itself, a field-wide math technique that smooths out training noise but can paper over architectural rot nobody will spot until much later. Her parting line: "it's kind of remarkable they managed to train this thing at all."
Our take:
V4's paper is the most candid document a major AI lab has put out this year. DeepSeek looked at the benchmark race, admitted they couldn't catch GPT-5.4 or Gemini 3.1 Pro, and shipped a model that out-engineered everyone on what matters most for AI agents: cheap long memory. The open question is whose read is right. If Zhang is correct that those stability fixes are scaffolding, V4's efficiency advantage may not survive the next generation of training runs. If she's wrong, we just got a 4x cheaper Gemini 3.1 Pro for long-horizon work, and the frontier labs have a real pricing problem to solve.
The pricing earthquake nobody priced in
The benchmark numbers are the headline; the cost numbers are the story. DeepSeek's V4 pricing page lists V4-Pro at $1.74 per million input tokens and $3.48 per million output tokens. V4-Flash, the smaller sibling, runs $0.14/$0.28. Stack that against the rest of the frontier:
- GPT-5.5 Pro: up to $180 per million output tokens (V4-Pro is roughly 1/50th).
- GPT-5.5 standard: $5/$30 input/output (V4-Pro is roughly 1/9th the output cost).
- Claude Opus 4.7: $5/$25 (V4-Pro is roughly 1/7th the output cost and about 1/3rd the input cost, so at list prices a blended workload lands somewhere between 3x and 7x cheaper depending on how much output dominates).
- Kimi (Moonshot AI), the Chinese rival: $4 per million output tokens. V4-Pro undercuts even fellow Chinese labs.
For someone burning $300 to $700 a month on premium API calls, this changes the per-task math from "luxury tool" to "default option." Decrypt's testing found V4-Flash at $0.28 per million output tokens runs roughly 50x cheaper than Claude Opus on a per-task basis; V4-Pro is about 4x cheaper. And the pricing trajectory points down: DeepSeek told Fortune it expects to lower V4-Pro prices later in 2026 once Huawei's Ascend 950 supernodes come online.
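To sanity-check ratios like these against your own workload, the list prices quoted above are enough for a back-of-envelope estimate. The sketch below is a minimal Python illustration, not anything published by DeepSeek or Decrypt: the function names and the 200K-in / 20K-out workload are assumptions, and it ignores caching, batch discounts, and the fact that different models can spend different numbers of tokens on the same task (one reason per-task figures can differ from list-price ratios).

```python
# List prices quoted in this article, in USD per million tokens: (input, output).
PRICES = {
    "V4-Pro": (1.74, 3.48),
    "V4-Flash": (0.14, 0.28),
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
}


def task_cost(model, input_tokens, output_tokens):
    """Cost in USD of one call at list price (no caching or batch discounts)."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def cost_ratio(expensive, cheap, input_tokens, output_tokens):
    """How many times more the first model costs than the second for the same token counts."""
    return task_cost(expensive, input_tokens, output_tokens) / task_cost(
        cheap, input_tokens, output_tokens
    )


# Hypothetical workload: 200K tokens of context in, 20K tokens out.
for model in PRICES:
    print(f"{model:16s} ${task_cost(model, 200_000, 20_000):.4f}")

ratio = cost_ratio("Claude Opus 4.7", "V4-Pro", 200_000, 20_000)
print(f"Opus 4.7 vs V4-Pro on this workload: {ratio:.1f}x")
```

Because input and output are priced differently, the ratio depends on the mix: input-heavy jobs land nearer the input-price ratio, output-heavy jobs nearer the output-price ratio.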
The market caught the implication immediately. After the V4 announcement, SMIC and Hua Hong Semiconductor (China's contract chip manufacturers) jumped 9% and 15% in Hong Kong trading. DeepSeek itself is reportedly raising from Tencent and Alibaba at a $20B valuation; per the Financial Times, they need the cash to keep their researchers from getting poached, not to fund training.
The headline framing on this pricing is "Chinese lab undercuts US competitors." The structural framing is more interesting: V4 is the second time in 18 months DeepSeek has reset the AI industry's price expectations. R1 in January 2025 wiped $600 billion from Nvidia's market cap in a single day when investors realized you could match OpenAI for ~90% less. V4 is the same play, except now the target is the inference market that powers every AI agent, every coding tool, every long-context workflow.
What actually changed in the architecture
The paper's technical bets come in three pieces. First, what DeepSeek calls a Hybrid Attention Architecture: combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to make long-context inference dramatically cheaper. Second, Manifold-Constrained Hyper-Connections (mHC), a redesign of how signals propagate between transformer layers, intended to keep training stable as the model scales. Third, the Muon optimizer, which DeepSeek says delivers faster convergence and greater training stability.
The plain-language version: V4 does the same kind of work that other models do, like reading text and predicting the next word. But it uses far less computing power and far less memory to do it, especially when the input is long.
The numbers from V4's paper:
- Compared to V3.2, V4-Pro uses 27% of the compute and 10% of the memory for the same kind of long-context inference. Tiezhen Wang notes that with caching, the memory load drops to roughly 2%.
- Reading a 500-page document, which used about 50GB of memory in V3.2, now takes about 5GB in V4.
- Both V4-Pro and V4-Flash support a 1 million token context window (about 750,000 words, or two long novels) and a maximum output of 384K tokens. Both support thinking and non-thinking modes, JSON output, and tool calls.
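The arithmetic behind those bullets is simple enough to check by hand, but here it is as a small Python sketch. The helper names are made up for illustration, and applying the "roughly 2% with caching" figure to the 500-page example is an extrapolation on our part, not a number from the paper.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb used in this article


def tokens_to_words(tokens):
    """Approximate word count for a given number of tokens."""
    return int(tokens * WORDS_PER_TOKEN)


def v4_memory_gb(v32_memory_gb, fraction=0.10):
    """Scale a V3.2 memory figure by the fraction V4 is reported to need."""
    return v32_memory_gb * fraction


context_words = tokens_to_words(1_000_000)
no_cache_gb = v4_memory_gb(50)                    # 10% of the V3.2 figure
with_cache_gb = v4_memory_gb(50, fraction=0.02)   # assumed: the ~2% caching figure applied to the same document

print(f"1M-token window: about {context_words:,} words")
print(f"500-page document: about {no_cache_gb:.0f} GB, or about {with_cache_gb:.0f} GB with caching")
```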
The Mixture-of-Experts (MoE) design means only a slice of V4-Pro's 1.6 trillion parameters wakes up for any given query. Specifically, 49 billion fire per token. The rest of the brain stays asleep. That's how a model this large stays cheap to serve, and it's the design choice that makes the entire $3.48 price point possible.
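For readers who want to see what "only a slice wakes up" looks like mechanically, here is a toy top-k MoE router in plain Python. It is a generic textbook-style sketch, not DeepSeek's implementation: the eight experts, the gate scores, and k=2 are all invented for illustration. Only the 1.6 trillion and 49 billion figures come from the article.

```python
import math

TOTAL_PARAMS = 1.6e12   # V4-Pro total parameters, per the article
ACTIVE_PARAMS = 49e9    # parameters active per token, per the article


def softmax(scores):
    """Convert raw gate scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(gate_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, experts, gate_scores, k):
    """Run only the selected experts and blend their outputs by gate weight."""
    routes = top_k_route(gate_scores, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, [i for i, _ in routes]


# Toy setup (all hypothetical): 8 experts that each just scale their input.
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]
gate_scores = [0.1, 2.0, -1.0, 1.5, 0.0, -0.5, 0.3, 0.9]

output, used = moe_forward(10.0, experts, gate_scores, k=2)
print(f"experts used: {used} of {len(experts)}")
print(f"V4-Pro active share per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

The six experts that were not selected never run, which is where the serving savings come from: compute scales with the active parameters, not the total.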
One architectural detail worth flagging now, because it returns later: V4's first three MoE layers use hash routing. That assigns experts by a fixed hash of the token ID rather than the standard learned routing the rest of the model uses. The asymmetry, hash routing in early layers and learned routing later, is the specific mechanism Susan Zhang's pushback zeroes in on. We'll get there.
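The difference between the two routing styles is easier to see side by side. In this minimal Python sketch, SHA-256, the four-expert setup, and the tiny router matrix are stand-ins chosen for illustration. The article does not say which hash function or router shape V4 actually uses.

```python
import hashlib

NUM_EXPERTS = 4  # hypothetical; kept tiny for readability


def hash_route(token_id, num_experts=NUM_EXPERTS):
    """Fixed routing: the expert is a hash of the token ID. Nothing is learned."""
    digest = hashlib.sha256(str(token_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_experts


def learned_route(hidden_state, router_weights):
    """Learned routing: pick the expert whose trainable router row scores highest."""
    scores = [
        sum(w * h for w, h in zip(row, hidden_state)) for row in router_weights
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical router with one row of weights per expert.
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [0.5, 0.5]]

token_id = 1234
context_a = [0.9, 0.1]  # the same token seen in two different contexts
context_b = [0.1, 0.9]

# Hash routing ignores context: same token ID, same expert, every time.
print(hash_route(token_id), hash_route(token_id))
# Learned routing can send the same token to different experts depending on context.
print(learned_route(context_a, router_weights), learned_route(context_b, router_weights))
```

The trade-off is that hash routing cannot collapse or oscillate during training, but it also cannot adapt: a token always goes to the same expert no matter what surrounds it.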
The benchmark debate: where V4 wins, where it loses
DeepSeek's own benchmark table, included in the technical paper, compares V4-Pro against GPT-5.4 xHigh, Claude Opus 4.6 Max, and Gemini 3.1 Pro. The honest reading is mixed. V4 wins some categories and loses others, and the comparison gets harder to interpret because OpenAI's GPT-5.5 and Anthropic's Opus 4.7 (newer than the models DeepSeek tested against) shipped within the same week.
Where V4-Pro wins:
- Codeforces: scored 3,206, putting V4 around 23rd among actual human competitive programmers.
- Apex Shortlist (a curated set of hard math and STEM problems): 90.2%, beating Opus 4.6's 85.9% and GPT-5.4's 78.1%.
- SWE-Verified (resolving real GitHub issues from open-source repos): 80.6%, matching Opus 4.6.
- Vibe Code Benchmark: the first model to cross 40% (49.93%), making V4 the new #1 open-weight model on the benchmark, per Vals AI.
- GDPval-AA (Artificial Analysis's test of economically valuable knowledge work): #1 among all open-weight models.
- Long-context retrieval on CorpusQA: V4-Pro leads open-source models and beats Gemini 3.1 Pro, though it loses to Opus 4.6 on MRCR (a "needle in a 1M-token haystack" test).
Where V4-Pro trails:
- MMLU-Pro: Gemini 3.1 Pro 91.0% vs V4-Pro 87.5%.
- GPQA Diamond (graduate-level expert knowledge): Gemini 3.1 Pro 94.3% vs V4-Pro 90.1%.
- Humanity's Last Exam (graduate-level reasoning): Gemini 3.1 Pro 44.4% vs V4-Pro 37.7%.
- Terminal Bench 2.0 (complex command-line agent workflows): GPT-5.5 82.7% vs V4-Pro 70.0%.
The bigger debate is what the deltas mean. Chris McGuire, who served on the White House National Security Council under Biden, argues V4 isn't the leap R1 claimed to be in January 2025. By his reading, US models still lead the frontier by ~7 months, leading Chinese models trail by 3-6 months, and V4 isn't even clearly the best Chinese model (Alibaba's Qwen and other labs have been competitive). scaling01 reads it the other way: V4 is roughly GPT-5.2 / Opus 4.5+ tier at 1.6T parameters, ahead of every other Chinese lab (with Kimi K2.6 closest), and "undercooked" with major reasoning-RL potential left on the table.
Chubby staked out the most bullish position of the weekend: by his read, V4-Pro is roughly on par with GPT-5.4 xHigh and Opus 4.6 Max on the headline benchmarks, outperforms both on SWE-Verified, and sets a new Codeforces record. His caveat: the agentic claims still need real-world testing against the newer Opus 4.7 and GPT-5.5. The Arena team ran the hands-on version of that test, putting V4-Pro through one-shot generation tasks like 3D voxel scenes, SVGs, and UI mockups. Their read: massive improvement over V3.2 in 4-5 months, but still trailing frontier closed models on consistency and creative intent.
The four readings can all be right at once. V4 trails frontier intelligence tests, AND leads the open-source pack on agentic and long-context work, AND matches Opus 4.6 / GPT-5.4 on key sub-benchmarks, AND changes the unit economics for any builder running long workflows.
The counter-narrative: Susan Zhang on training stability
The most thoughtful pushback came from Susan Zhang at Google DeepMind. Zhang is one of the field's most respected pre-training engineers, and she spent Friday afternoon walking through V4's release notes to argue that the polished launch hides a rough training run.
Zhang's case has three parts.
One: the stability problems were significant. DeepSeek pre-trained V4 on 33 trillion tokens, more than double V3's 15 trillion. Scaling training data doesn't always scale cleanly. Zhang highlights that V4's paper documents real instabilities the team hit during the long-context phase. The patches they shipped have specific names in the paper: Anticipatory Routing (computing routing decisions using older parameters to break feedback loops where bad routing causes outliers, which then cause worse routing) and SwiGLU Clamping (constraining the linear and gate components of SwiGLU activations to suppress anomalous values). DeepSeek itself admits these worked in practice but their "underlying principles remain insufficiently understood." Zhang's phrase for the broader fix-up effort is "wildly lacking."
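To make the second patch concrete, here is a rough Python sketch of what clamping a SwiGLU activation can look like. It follows the one-line description above (bound the gate and linear components before they are multiplied) and nothing more: the symmetric clamp and the limit of 10.0 are assumptions, not values from DeepSeek's paper.

```python
import math


def silu(x):
    """SiLU / swish activation, written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def clamp(x, low, high):
    """Restrict x to the closed interval [low, high]."""
    return max(low, min(high, x))


def swiglu(gate, linear):
    """Plain SwiGLU on scalars: SiLU(gate) * linear."""
    return silu(gate) * linear


def clamped_swiglu(gate, linear, limit=10.0):
    """SwiGLU with both components bounded first. The limit is a hypothetical value."""
    return silu(clamp(gate, -limit, limit)) * clamp(linear, -limit, limit)


# Ordinary values pass through unchanged.
print(swiglu(1.0, 2.0), clamped_swiglu(1.0, 2.0))
# An outlier pair is capped instead of multiplying into a huge activation.
print(swiglu(500.0, 300.0), clamped_swiglu(500.0, 300.0))
```

The clamp keeps one anomalous activation from dominating everything downstream, but it does nothing to explain why the outlier appeared, which is the gap Zhang is pointing at.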
Two: hash routing in early layers is a bigger deal than DeepSeek admits. This is where Zhang's read connects to the architecture detail above. Because V4 uses hash routing in the first three MoE layers (rather than learned routing), Zhang argues that any problematic token landing in those early layers gets amplified before the model has a chance to route around it.
Three: normalization is field-wide cope. This is the broader claim, and the most consequential. Zhang warns that the entire AI field is leaning too hard on a math technique called normalization (which smooths out training noise to keep models from blowing up during long runs). It's effective in the short term, but it can paper over architectural rot nobody will spot until much later, when models hit a scale where the underlying instabilities can't be smoothed away. She thinks V4's polish may be hiding exactly this kind of debt. The paper's emphasis on the Muon optimizer "for greater training stability" is, by Zhang's read, exactly the kind of framing that sounds reassuring but doesn't address the deeper concern.
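One way to see why normalization can hide problems: it is scale-invariant by design. The Python sketch below uses RMSNorm as the example, which is our assumption since the article doesn't say which normalization layers Zhang has in mind, and the activation values are made up.

```python
import math


def rms_norm(values, eps=1e-6):
    """Divide each value by the root-mean-square of the whole vector."""
    rms = math.sqrt(sum(v * v for v in values) / len(values) + eps)
    return [v / rms for v in values]


healthy = [0.5, -1.0, 0.25, 1.5]          # hypothetical well-behaved activations
drifted = [v * 1000 for v in healthy]     # same pattern, 1000x larger in magnitude

print([round(v, 3) for v in rms_norm(healthy)])
print([round(v, 3) for v in rms_norm(drifted)])
```

Both lines print essentially the same numbers. Whatever made the second vector 1000x larger is invisible to every layer that only sees the normalized output, which is the "papering over" concern in miniature.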
Her closer is the line that's been getting reposted all weekend: "it's kind of remarkable they managed to train this thing at all."
Zhang's critique matters because she's not saying V4 is bad. She's saying V4's efficiency wins may be temporary, dependent on a stack of engineering compromises that won't survive the next training generation. There's a tell in DeepSeek's own paper that points the same direction: they describe V4's architecture as "relatively complex" and say future versions will try to "distill the architecture down to its most essential designs." That's roughly what Zhang means by scaffolding. If she's right, the open-source pricing advantage doesn't last beyond V5. If she's wrong, the frontier labs have a serious pricing problem.
Chips, geopolitics, and the timing nobody can ignore
The same morning V4 shipped, the US State Department ordered embassies worldwide to warn allies that DeepSeek and other Chinese AI labs have been stealing IP from US companies. The warning called out distillation (training a new model on outputs from older, more powerful US models like ChatGPT and Claude) and named Anthropic, Google, OpenAI, and xAI as the alleged source pool.
The chip story is messier. In February 2026, Reuters reported that the US government had confirmed DeepSeek trained an earlier model on smuggled Nvidia Blackwell chips, in violation of US export controls. The chips were allegedly clustered at a data center in Inner Mongolia. DeepSeek hasn't said publicly which chips trained V4, and observers note V4's paper is conspicuously silent on training compute. Huawei confirmed on Friday that its latest Ascend AI cluster supports V4 inference, with new Ascend 950 supernodes coming online later in 2026.
The geopolitics here aren't a footnote; they're the structural story. If V4 was trained partly on US-export-banned chips, the strongest argument for keeping the chip controls (that they slow Chinese AI progress) gets weaker. If V4 runs natively on Chinese-made Huawei chips, Beijing gets more AI sovereignty and reduces dependence on Nvidia. Either reading puts pressure on the US export-control framework, and on every American lab whose business model depends on premium pricing.
There's an alternate framing of all this that's worth taking seriously. Ritesh Doshi summed up the sentiment in three words: "V4 shouldn't be possible." Weaker chips, weaker software, yet Chinese labs keep closing the AI gap. Hyperbolic Labs co-founder Yuchen Jin offered the answer of the weekend: "creativity loves constraints." The chip restrictions may be doing less to slow Chinese AI than to force its labs into architectural creativity that's then exported, via open weights, back to everyone else. V4 ships cheap inference today; it also publishes a free recipe (hash-routing MoE, hybrid attention stack, Muon optimizer, Anticipatory Routing, SwiGLU Clamping) that every lab on Earth can now study and adopt.
What this means for you
The practical impact depends on what you're building.
If you're a developer running long-context workflows (codebases, document analysis, multi-step research, financial reports), V4 changes the math more than any release this year. V4-Pro is integrated with OpenCode and other agentic coding tools as of launch day; the OpenCode team shipped V4 support in v1.14.24 within hours of the release, with DeepSeek engineers contributing PRs and fixes themselves. V4-Flash runs roughly 50x cheaper than Claude Opus on a per-task basis. The right move is to identify the workflow steps where you're paying frontier prices for capabilities V4 already matches, and quietly migrate those to a multi-model stack. Keep Opus or GPT-5.5 for the hardest reasoning steps; route everything else to V4.
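As a rough illustration of that multi-model stack, here is a Python sketch that routes each workflow step to a model and compares the bill against sending everything to a frontier model. Everything about the workflow is hypothetical (the step names, token counts, and the hard_reasoning flag you would set yourself), the model names are labels rather than real API identifiers, and prices are the list prices quoted in this article with no caching or batch discounts.

```python
from dataclasses import dataclass

# USD per million tokens (input, output), list prices quoted in this article.
PRICES = {
    "V4-Pro": (1.74, 3.48),
    "Claude Opus 4.7": (5.00, 25.00),
}


@dataclass
class Step:
    name: str
    input_tokens: int
    output_tokens: int
    hard_reasoning: bool  # your own judgment call; nothing detects this automatically


def pick_model(step):
    """Keep the frontier model for the hardest steps and send the rest to V4-Pro."""
    return "Claude Opus 4.7" if step.hard_reasoning else "V4-Pro"


def step_cost(model, step):
    """Cost in USD of one step at list price."""
    price_in, price_out = PRICES[model]
    return (step.input_tokens * price_in + step.output_tokens * price_out) / 1_000_000


def pipeline_cost(steps, chooser):
    """Total cost of a workflow given a function that picks a model per step."""
    return sum(step_cost(chooser(step), step) for step in steps)


steps = [
    Step("read codebase", 400_000, 5_000, hard_reasoning=False),
    Step("draft patch", 50_000, 20_000, hard_reasoning=False),
    Step("design tricky fix", 30_000, 10_000, hard_reasoning=True),
]

all_frontier = pipeline_cost(steps, lambda step: "Claude Opus 4.7")
mixed = pipeline_cost(steps, pick_model)
print(f"all frontier: ${all_frontier:.2f}  mixed: ${mixed:.2f}")
```

The savings come almost entirely from the long-context reading step, so the first thing to audit in your own pipeline is where the big inputs go.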
If you're building an AI product on someone else's API, the pricing pressure will land on you within two quarters. Frontier-lab pricing has held up because the alternatives weren't quite competitive. They are now. Expect Anthropic and OpenAI to either lower premium-tier pricing or segment more aggressively (more Sonnet/Haiku-tier offerings, more cached-input discounts, more batch-API savings). Plan budget assuming the floor moves down.
If you're a non-technical knowledge worker, the day-to-day chatbot experience won't change much yet. V4 is text-only at launch (multimodal is "coming"), and ChatGPT and Claude still have polish, integrations, and brand recognition that DeepSeek doesn't. But if these ideas carry over into future models from the US labs, the apps and tools you use will get cheaper to operate, which means more capabilities sliding from "premium" to "free tier" over the next 6 months. Watch for AI features showing up in tools where they didn't make economic sense to ship before.
Where it goes next
So is V4 "as good as" GPT-5.5 or Opus 4.7? No. By DeepSeek's own admission, it isn't, on the standard tests. The open question is whether the agentic race rewards raw intelligence or cheap long memory, and whether Susan Zhang is right that V4's efficiency wins are scaffolding rather than foundation.
If V4's architecture survives the next training generation, it sets the open-source pricing floor for years and forces every closed-source lab to defend a premium that's increasingly hard to justify. If Zhang's stability concerns prove out at the next scale-up, V4 looks more like a brilliantly engineered local maximum than a frontier reset.
Either way, the most interesting thing about V4 isn't the model itself. It's that DeepSeek published a 50-page paper admitting they're behind, then shipped the most disruptive open AI release of 2026 anyway. Time will tell whether adoption reaches Qwen or Kimi territory this year. It has been roughly 15 months since DeepSeek released V3, after all. By their own admission they are 3-6 months behind on capability, but they have more ground than that to make up in mindshare and actual usage.