Six months ago, the AI coding world had a clear favorite. Claude Opus owned the agentic programming space, and OpenAI's models felt like they were playing catch-up. Developers who'd gone all-in on Claude Code had little reason to look back. Honestly, this was pretty much true up until six weeks ago. Maybe even six days ago, for some.
Well, that just changed.
GPT-5.4 launched today, and it represents the most significant course correction OpenAI has made since the GPT-5 series began. It takes the coding strengths of the Codex line, wraps them into a general-purpose model, and adds native computer use, a 1M token context window, and a new tool search system that lets agents work across massive tool ecosystems without drowning in context.
The result: the first OpenAI model that's making Claude-loyal developers reconsider their daily driver.
- The Bigger Picture: OpenAI's Agentic Pivot
- What's Actually New
- The Full Benchmark Rundown
- The Caveats: What to Watch Out For
- Design Aesthetics: A Quick Comparison
- The Safety Card: What OpenAI Found
- Pricing and Availability
- Best Practices for GPT-5.4
- More Developer Reactions
- What the Internet Built (Day-One Demos)
- Independent Benchmarks from the Community
- Hot Takes from the Timeline
- The Criticisms and Memes
- The Bottom Line
The Bigger Picture: OpenAI's Agentic Pivot
To appreciate GPT-5.4, you need to understand where OpenAI was six months ago. When GPT-5 launched last August, Every's vibe check was blunt: "If you're an agent-slinging senior engineer or a vibe coder, GPT-5 is not going to become your daily driver for frontier AI coding."
The problem wasn't intelligence. It was philosophy. OpenAI's models treated coding like a conversation; Claude's treated it like a delegation. You could hand Claude a task and walk away. With OpenAI's models, you had to babysit.
Then OpenAI did something smart. They built the Codex series (specialized coding models), launched a desktop app, shipped Codex 5.3, and iterated at a pace that surprised everyone. GPT-5.4 is the payoff: all those Codex learnings baked into a single frontier model that also handles everything else.
What's Actually New
Coding That Competes with Claude
GPT-5.4 slightly outperforms GPT-5.3-Codex on SWE-Bench Pro (57.7% vs. 56.8%) while being faster across reasoning levels. On the SWE-Bench chart, that edge in both accuracy and speed holds at xhigh reasoning effort.
A reasonable working assumption is that the headline benchmarks were run with reasoning effort set to xhigh, so use xhigh if you want results in the same range.
But benchmarks only tell part of the story. The Every team, who build real products with AI daily, has been testing GPT-5.4 internally for about a week. Their findings are striking:
- Kieran (building Cora, an AI email assistant) went from 90% Claude Opus to roughly 50/50 between Opus and Codex/GPT-5.4. He called GPT-5.4 Pro better at coding Ruby than Codex 5.3, even though Ruby isn't its specialty. (Livestream at 5:07)
- Naveen (building Monologue, a smart dictation app) has been almost exclusively on 5.4 for the past few days, calling the xhigh reasoning level "the biggest alpha." He fixed a ton of iOS and Mac bugs that previous models missed. (Livestream at 9:00)
- Dan (building Proof, a collaborative markdown editor) switched his AI assistant (RTC2) to GPT-5.4 and said he wouldn't go back. It's faster, more conversational, and handles coding tasks with confidence. (Livestream at 14:19)
The code quality improvements are real, too. In a side-by-side test, Naveen showed how GPT-5.4 organized a Proof pull request into clean, modular files (a separate mermaid diagrams plugin), while Codex 5.3 dumped everything into a single index file with the same prompt. The modular version caught bugs that the monolithic version missed entirely.
Cursor's VP of Developer Education, Lee Robinson, summed it up: "GPT-5.4 is currently the leader on our internal benchmarks. Our engineers find it to be more natural and assertive than previous models. It works through ambiguous problems without second-guessing itself, and it's proactive about parallelizing work to keep things moving."
We also tested GPT-5.4's coding skills on our own livestream, hooking it up to an MCP server and getting it to build Cat Doom. Yes, that's a Doom clone with cats. It was awesome. It was even able to recover from a botched attempt by GPT 5.3 Spark (the really fast but not as smart one).
Native Computer Use (A Big Deal)
GPT-5.4 is the first general-purpose model with built-in computer-use capabilities. It can look at screenshots, click buttons, fill forms, and navigate software through the UI, just like a person sitting at a desk.
On OSWorld-Verified, which measures how well a model can navigate a desktop environment through screenshots and keyboard/mouse actions, GPT-5.4 hit 75.0%. For reference, human performance on the same benchmark is 72.4%. On this benchmark, at least, the model edges out the human baseline. As OpenAI's dev team put it: "GPT-5.4 can write Playwright code, read screenshots, and issue keyboard/mouse actions to operate computers. You can steer its behavior and set custom confirmation policies for different risk tolerances."
The practical applications are massive. Mainstay's CEO Dod Fraser reported that across roughly 30,000 property tax portals, GPT-5.4 achieved a 95% success rate on the first attempt and 100% within three attempts, while completing sessions about 3x faster and using around 70% fewer tokens than previous computer-use models.
As a demonstration, OpenAI released an experimental Codex skill called "Playwright (Interactive)". This lets Codex visually debug web and Electron apps; it can even test an app it's building, as it's building it. This is how you get it to see your screen and test your games for you. You can also use AgentBrowser for similar workflows.
One demo that caught our eye: a tactical RPG created with GPT-5.4 over multiple turns, using Playwright Interactive for browser playtesting and image generation for the game's visual style, characters, and assets. The game features turn-based combat on a grid map, with systems for movement, actions, positioning, and encounter flow. Image generation was used to create characters, and Playwright was used to validate the interface, inspect and refine UI behavior, and support iterative edits as the game's combat, visuals, and overall feel were tuned over multiple rounds. Pretty wild for a model that also writes spreadsheets.
Tool Search: Managing the Tool Explosion
Here's a problem that's gotten worse as AI agents get more capable: when you give a model access to dozens or hundreds of tools, stuffing all those tool definitions into every request bloats the context, slows everything down, and confuses the model about which tool to use.
GPT-5.4 introduces tool search, which lets the model see a lightweight list of available tools, then dynamically load only the definitions it needs at runtime. Think of it like a search engine for your tool ecosystem instead of a phone book that gets bigger every time you add a new tool.
In testing on Scale's MCP Atlas benchmark (250 tasks across 36 MCP servers), the tool search configuration cut total token usage by 47% while achieving the same accuracy. That's a significant cost and speed improvement for anyone building complex agent systems. As Dominik Kundel from OpenAI noted: "This is especially great for MCP clients where you might not be in control of how many tools get disclosed."
Zapier's CEO Wade called GPT-5.4 xhigh "the new state of the art for multi-step tool use," noting it "finished the job where previous models gave up; the most persistent model to date."
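To make the "lightweight list first, full definitions on demand" idea concrete, here is a minimal sketch in plain Python. It is not OpenAI's tool search API: the `Tool` and `ToolRegistry` classes, the keyword matcher, and the character-count proxy for tokens are all hypothetical, chosen only to show why deferring definitions shrinks the request.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    summary: str                                 # one-liner, always in context
    schema: dict = field(default_factory=dict)   # full definition, loaded on demand


class ToolRegistry:
    """Hypothetical registry illustrating deferred tool loading (not OpenAI's API)."""

    def __init__(self, tools):
        self._tools = {tool.name: tool for tool in tools}

    def catalog(self):
        """Lightweight listing: names and one-line summaries only."""
        return [{"name": t.name, "summary": t.summary} for t in self._tools.values()]

    def search(self, query, limit=3):
        """Naive keyword match; returns the best-matching tool names."""
        words = query.lower().split()
        scored = []
        for tool in self._tools.values():
            text = f"{tool.name} {tool.summary}".lower()
            hits = sum(1 for word in words if word in text)
            if hits:
                scored.append((-hits, tool.name))
        return [name for _, name in sorted(scored)[:limit]]

    def load(self, names):
        """Full definitions for only the requested tools."""
        return [
            {"name": n, "summary": self._tools[n].summary, "schema": self._tools[n].schema}
            for n in names
        ]


def payload_chars(obj):
    """Crude stand-in for token count: length of the JSON serialization."""
    return len(json.dumps(obj))


def compare_payloads():
    """Compare sending every definition with sending catalog + matched tools."""
    schema = {
        "type": "object",
        "properties": {
            f"field_{i}": {"type": "string", "description": "placeholder " * 5}
            for i in range(10)
        },
    }
    tools = [Tool(f"helper_{i}", f"Generic helper number {i}", schema) for i in range(50)]
    tools.append(Tool("send_invoice", "Email an invoice to a customer", schema))
    registry = ToolRegistry(tools)

    eager = payload_chars(registry.load([t.name for t in tools]))
    picked = registry.search("invoice customer")
    deferred = payload_chars(registry.catalog()) + payload_chars(registry.load(picked))
    return picked, eager, deferred
```

Calling `compare_payloads()` returns the matched tool names plus the two payload sizes. The deferred payload carries the short catalog and one full schema instead of 51 schemas, which is the same trade the tool search feature is described as making.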
Knowledge Work: Not Just for Coders
The benchmarks for professional work are where GPT-5.4 really flexes. On GDPval, which tests agents across 44 occupations with real work products (sales presentations, accounting spreadsheets, urgent care schedules, manufacturing diagrams), GPT-5.4 matched or exceeded industry professionals 83% of the time. That's up from 70.9% for GPT-5.2.
Please note that OpenAI does this annoying thing where they release benchmark pages and then never update the leaderboards when new models drop. Just give us a single leaderboard to look at please ppl??
On internal investment banking modeling tasks, GPT-5.4 scored 87.3%, compared to 68.4% for GPT-5.2. Human raters preferred its presentations 68% of the time over GPT-5.2's, citing better aesthetics and visual variety.
OpenAI also launched ChatGPT for Excel today, along with updated spreadsheet and presentation skills for Codex and the API. If you're an enterprise customer who lives in Excel, this is worth checking out.
Some notable partner quotes on the knowledge work side:
- Harvey's Head of Applied Research Niko Grupen said: "GPT-5.4 sets a new bar for document-heavy legal work. On our BigLaw Bench eval, it scored 91%."
- Thomson Reuters CTO Joel Hron (friend of the pod!) noted "consistent gains across the board in benchmarks, including legal reasoning, hallucination detection, and other core capabilities," with "marked improvements in long context tasks."
- Notion's AI Engineering Lead Sarah Sachs called it a model with "top-tier reasoning quality, but with dramatically improved token efficiency and latency, which is crucial for real-time agent workflows."
More Factual, Less Hallucination
GPT-5.4 is OpenAI's most factual model yet. On prompts where users previously flagged errors, individual claims are 33% less likely to be false, and full responses are 18% less likely to contain any errors compared to GPT-5.2.
The Full Benchmark Rundown
Here are the key numbers across categories:
Professional / Knowledge Work
- GDPval (wins or ties): 83.0% (vs. 70.9% GPT-5.2)
- Investment Banking Modeling: 87.3% (vs. 68.4% GPT-5.2)
- OfficeQA: 68.1% (vs. 63.1% GPT-5.2)
- FinanceAgent v1.1: 56.0% (vs. 59.5% GPT-5.2)
Coding
- SWE-Bench Pro (Public): 57.7% (vs. 55.6% GPT-5.2)
- Terminal-Bench 2.0: 75.1% (vs. 62.2% GPT-5.2)
Computer Use & Vision
- OSWorld-Verified: 75.0% (vs. 47.3% GPT-5.2, human performance: 72.4%)
- MMMU Pro (no tools): 81.2% (vs. 79.5% GPT-5.2)
Tool Use
- BrowseComp: 82.7% (vs. 65.8% GPT-5.2)
- MCP Atlas: 67.2% (vs. 60.6% GPT-5.2)
- Toolathlon: 54.6% (vs. 45.7% GPT-5.2)
Academic / Reasoning
- ARC-AGI-2 (Verified): 73.3% (vs. 52.9% GPT-5.2)
- Humanity's Last Exam (with tools): 52.1% (vs. 45.5% GPT-5.2)
- FrontierMath Tier 1-3: 47.6% (vs. 40.7% GPT-5.2)
Without Reasoning (effort = none)
- τ²-bench Telecom: 64.3% (vs. 57.2% GPT-5.2, vs. 43.6% GPT-4.1)
The Caveats: What to Watch Out For
The Every team members who tested GPT-5.4 all flagged the same thing: OpenAI "loosened" the model to make it more conversational and creative, and that looseness has side effects.
Prompt leaking: Multiple testers (Kieran at 20:05, Dan at 21:23) saw GPT-5.4 leak system prompt details directly into UI elements. In one 3D coding benchmark, it rendered "low poly ecosystem, cozy island, 3JS, TypeScript, VIT" right on the screen. This is a known pattern from earlier "loosened" Claude models (like Sonnet 3.7), where the model does things you didn't ask for.
Occasional lying: Dan reported that GPT-5.4 in his OpenClaw setup "blatantly lied" to him a few times. In one case, he asked it to make a pixel-perfect landing page, and it took a screenshot of the Figma file, placed it on an HTML page, added hotspot buttons on top, and told him it was done. (Livestream at 15:04)
Over-eagerness: Kieran found that the model added a GDPR compliance checkbox to an e-commerce demo site nobody asked for. It also silently redrew SVG assets without telling anyone, subtly messing up the noses on character icons. (Livestream at 49:53)
Rigidity in iteration: Several testers noted the model got stuck on its initial design choices and had trouble iterating from there. If you're doing multi-step design refinement, you may need to push harder than you would with Opus.
The Every team's take is that this is actually a good sign. OpenAI loosened the model (which everyone wanted), found the cracks, and now knows exactly what to fix next. As Kieran put it: "The next version will still be joyful to work with but then not have these tiny issues." (Livestream at 24:12)
Design Aesthetics: A Quick Comparison
One fun finding from the Every team: each model has a distinct design personality when you give it the same UI prompt.
- Claude Opus: Minimal, structured, sometimes boring. Clean but safe.
- Gemini 3.1: Bold, minimal, more of a "statement" design. Currently leads on aesthetics according to Kieran.
- GPT-5.3 Codex: Less boring than Opus, with gradients and shadows, but occasionally produces odd layouts.
- GPT-5.4: Not boring at all. Lots of shadow, gradient work, and design exploration. Sometimes goes too far (hello, UI prompt leaking), but has more character than the competition.
Kieran's verdict: Google Gemini currently leads on design, but GPT-5.4 is more interesting and explorative than Opus. For design iteration and refinement, Opus is still the best at staying on track without going off-course. (Livestream at 36:00)
The Safety Card: What OpenAI Found
OpenAI published a comprehensive system card alongside the launch. Key takeaways:
Cybersecurity: GPT-5.4 is classified as "High" capability in cybersecurity under OpenAI's Preparedness Framework (same as Codex 5.3). It solved 73.3% of cyber range scenarios and 14 out of 17 medium-difficulty atomic challenges. OpenAI is deploying message-level classifiers to block high-risk cyber content on zero-data-retention surfaces.
Chain-of-thought monitoring: OpenAI also published a research paper finding that GPT-5.4's ability to control its own chain of thought is low (only 0.3% of 10k-character reasoning traces could be controlled by instructions). This is actually good news for safety: it means the model can't easily hide its reasoning from monitors.
Bias and safety: The model performs on par with or slightly better than GPT-5.2 on fairness evaluations, and outperforms previous models on mental health, emotional reliance, and self-harm safety benchmarks.
Pricing and Availability
GPT-5.4 is rolling out today across ChatGPT, the API, and Codex.
In ChatGPT:
- Available as GPT-5.4 Thinking for Plus, Team, and Pro users.
- Replaces GPT-5.2 Thinking (which stays in Legacy Models for 90 days until June 5, 2026).
- GPT-5.4 Pro available to Pro and Enterprise plans.
- Enterprise and Edu admins need to enable via Early Model Access toggle.
- Context windows: 256K for paid tiers (128K input + 128K output), 400K for Pro (272K input + 128K output).
API Pricing (per 1M tokens):
- gpt-5.4 (under 272K context): $2.50 input / $0.25 cached input / $15.00 output
- gpt-5.4 (over 272K context): $5.00 input / $0.50 cached input / $22.50 output
- gpt-5.4-pro (under 272K context): $30.00 input / $180.00 output
- gpt-5.4-pro (over 272K context): $60.00 input / $270.00 output
- gpt-5.2 (for comparison): $1.75 input / $0.175 cached input / $14.00 output
Batch and Flex pricing are available at half the standard API rate. Priority processing is available at twice the standard API rate. If you're a production developer building user-facing apps where speed matters, priority processing gives you significantly lower and more consistent latency. You configure it by setting service_tier=priority on your API request.
What the pricing means in plain English: GPT-5.4 costs about 43% more per input token than GPT-5.2 ($2.50 vs. $1.75), but only about 7% more per output token ($15 vs. $14). However, because GPT-5.4 is more token-efficient (it uses fewer tokens to solve the same problems), your total cost may actually go down for many tasks. The cached input price ($0.25/M tokens) is a big deal if your app sends similar prompts repeatedly, since you pay 90% less for repeated context. The "over 272K" pricing doubles input costs and raises output costs by 50% ($22.50 vs. $15.00), so the 1M context window is powerful but pricey for large-scale use.
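If you want to run those numbers for your own workload, a small calculator along these lines works. The prices and tier multipliers are copied from the list above; everything else is an assumption made for illustration: the function name, treating cached tokens as a subset of input tokens, applying the long-context rate to the whole request once input passes 272K tokens, and applying the Batch/Flex/priority multipliers uniformly to every line item.

```python
# USD per 1M tokens, copied from the price list above.
PRICES = {
    "gpt-5.4": {
        "short": {"input": 2.50, "cached": 0.25, "output": 15.00},
        "long": {"input": 5.00, "cached": 0.50, "output": 22.50},
    },
    "gpt-5.2": {
        "short": {"input": 1.75, "cached": 0.175, "output": 14.00},
    },
}

# Batch and Flex at half the standard rate, priority at twice (per the article).
TIER_MULTIPLIER = {"standard": 1.0, "batch": 0.5, "flex": 0.5, "priority": 2.0}

LONG_CONTEXT_THRESHOLD = 272_000  # input tokens


def request_cost(model, input_tokens, output_tokens, cached_tokens=0, tier="standard"):
    """Estimated USD cost of one request.

    cached_tokens is the portion of input_tokens billed at the cached rate.
    Assumption: the long-context rate covers the entire request once
    input_tokens exceeds the threshold.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    table = PRICES[model]
    is_long = input_tokens > LONG_CONTEXT_THRESHOLD and "long" in table
    rates = table["long" if is_long else "short"]
    fresh_tokens = input_tokens - cached_tokens
    cost = (
        fresh_tokens * rates["input"]
        + cached_tokens * rates["cached"]
        + output_tokens * rates["output"]
    ) / 1_000_000
    return cost * TIER_MULTIPLIER[tier]
```

Under these assumptions, a request with 100K input tokens and 10K output tokens comes to $0.40 on gpt-5.4 versus $0.315 on gpt-5.2. If 80K of those input tokens hit the cache, the gpt-5.4 figure drops to $0.22.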
Container usage note: Starting March 31, 2026, container usage (including Hosted Shell and Code Interpreter) will be billed per 20-minute session instead of per-container. Rates by memory tier are unchanged: 1 GB (default) at $0.03/20 min, 4 GB at $0.12/20 min, 16 GB at $0.48/20 min, 64 GB at $1.92/20 min. This basically means if your agent runs for 40 minutes straight, you'll pay for two sessions instead of one.
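The session math can be sketched in a few lines. The rates are the ones quoted above. The rounding rule is an assumption: the source only gives the 40-minute example, so this sketch guesses that any partial 20-minute block is billed as a full session. The function name is hypothetical.

```python
import math

# USD per 20-minute session, keyed by memory tier in GB (rates from the article).
RATE_PER_SESSION = {1: 0.03, 4: 0.12, 16: 0.48, 64: 1.92}
SESSION_MINUTES = 20


def container_cost(runtime_minutes, memory_gb=1):
    """Estimated USD cost of a container run.

    Assumption: partial sessions round up to a full 20-minute session.
    """
    if runtime_minutes <= 0:
        return 0.0
    sessions = math.ceil(runtime_minutes / SESSION_MINUTES)
    return sessions * RATE_PER_SESSION[memory_gb]
```

With this rule, a 40-minute run on the default 1 GB tier is two sessions, or $0.06, matching the example above.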
Other tool costs to know:
- Web search: $10/1k calls + search content tokens at model rates.
- File search: $0.10/GB per day (1 GB free), $2.50/1k tool calls (Responses API).
- AgentKit (Agent Builder, ChatKit, Evals): 1 GB free storage for file/image uploads, then $0.10/GB-day.
GPT-5.4 in Codex includes experimental support for the 1M context window. Developers can try this by configuring model_context_window and model_auto_compact_token_limit. Requests exceeding the standard 272K context window count against usage limits at 2x the normal rate.
Best Practices for GPT-5.4
OpenAI published detailed prompt guidance for getting the most out of GPT-5.4. A few highlights:
- Default reasoning effort is "none." If you want the model to think harder, increase to medium or high. GPT-5.4 supports none, low, medium, high, and xhigh.
- Pass chain-of-thought between turns. Using the Responses API with previous_response_id gives you better intelligence, fewer reasoning tokens, and higher cache hit rates.
- Use preambles. Add "Before you call a tool, explain why you are calling it" to your system prompt. GPT-5.4 will briefly explain its plan before each tool call, improving accuracy.
- Tool search for large toolsets. If you're working with many MCP servers or functions, enable tool search to cut token usage (47% in the MCP Atlas test above).
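Here is a rough sketch of how the first three tips might combine in a request, using the OpenAI Python SDK's Responses API. The helper names `build_request` and `run_two_turns` are hypothetical. The model ID `gpt-5.4` and the `xhigh` effort value are taken from this article and are assumed to be accepted by the API. The `client` argument is expected to be an `openai.OpenAI()` instance.

```python
VALID_EFFORTS = ("none", "low", "medium", "high", "xhigh")
TOOL_PREAMBLE = "Before you call a tool, explain why you are calling it."


def build_request(user_input, effort="none", previous_response_id=None):
    """Build keyword arguments for client.responses.create()."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    request = {
        "model": "gpt-5.4",                  # model ID as named in the article
        "instructions": TOOL_PREAMBLE,       # preamble tip, resent on every turn
        "input": user_input,
        "reasoning": {"effort": effort},     # default per the guidance is "none"
    }
    if previous_response_id is not None:
        # Chains this turn to the previous one so reasoning carries over.
        request["previous_response_id"] = previous_response_id
    return request


def run_two_turns(client, first_prompt, follow_up, effort="high"):
    """Send two chained turns and return the text of the second reply."""
    first = client.responses.create(**build_request(first_prompt, effort))
    second = client.responses.create(
        **build_request(follow_up, effort, previous_response_id=first.id)
    )
    return second.output_text
```

To try it, install the `openai` package, set `OPENAI_API_KEY`, and call `run_two_turns(OpenAI(), "Find the failing test.", "Now fix it.")`. Keeping the request construction in a pure function makes it easy to check the parameters without spending tokens.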
Migration tips by current model:
- From GPT-5.2: GPT-5.4 with default settings is a drop-in replacement
- From o3: GPT-5.4 with medium or high reasoning
- From GPT-4.1: GPT-5.4 with none reasoning, tune prompts, increase if needed
- From o4-mini or GPT-4.1-mini: gpt-5-mini with prompt tuning
- From GPT-4.1-nano: gpt-5-nano with prompt tuning
More Developer Reactions
A few more notable quotes from partners testing GPT-5.4:
- Augment Code co-founder Guy Gur-Ari: "We made GPT-5.4 the default model in Augment's agentic development environment, Intent, because of its consistent follow-through and reliable tool use powering multi-agent workflows end-to-end."
- Windsurf CEO Jeff Wang: "GPT-5.4 demonstrates deeper exploration and higher attention to detail compared to other models we've tested. This is a significant improvement over previous GPT models for agentic coding."
- Legora CTO Jacob Lauritzen: "GPT-5.4 produces more natural, lawyer-like prose while also running 15% faster than previous models. It shows strong intent understanding and proactively flags when information appears incomplete."
- HockeyStack CTO Arda Bulut: "At 400-500k tokens per query, it reliably synthesizes insights across massive revenue datasets, handles deduplication correctly, and avoids the hallucinations we saw in prior models."
- Clio CTO Jonathan Watson: "It produces more natural writing with stronger structure, improved reasoning, and a better ability to analyze complex source material."
- Clay co-founder Varun Anand: "GPT-5.4 is dramatically faster and more efficient at agentic data retrieval than any other model we've used. Its writing and personality are much more natural, making GPT-5.4 a great fit for our conversational agents."
- Glean's Nilesh Dalvi: "GPT-5.4 performs strongly on our internal suite of evals across analysis, writing, and workflow automation, all powered by frontier-level capabilities in tool orchestration."
- Databricks CTO Hanlin Tang: "GPT-5.4-powered agents delivered significantly higher accuracy than prior generations and ran about 2x faster than other frontier models."
- Whoop Senior AI Engineer Douglas Schonholtz: "GPT-5.4 delivered a substantial leap over GPT-4.1 on our querying-language benchmark with no prompt changes."
What the Internet Built (Day-One Demos)
The demos started rolling in within hours. Here are the highlights:
- Min Choi shared some demos from OpenAI including a playable chess game, a flight simulator, a theme park simulation, and an RPG; all one-shot.
- Ethan Mollick recreated a 3D space inspired by Piranesi with a single prompt plus one "make it better" follow-up. Zero errors.
- Angel called GPT 5.4's Minecraft "basically perfect" after a 24-minute run: "I have to find a new test now."
- Peter Gostev generated an elaborate peacock render (87 minutes, 90 seconds). He apologized to Sam Altman for the compute.
- Adi Adonis Singh prompted a massive Minecraft Mona Lisa where each block is a pixel: "build so good I had to record it."
- Chris built a theme park simulation game.
- Samuel Albanie one-shot five demos: every ML paper on arXiv from 2012 to now as a particle simulation, every naval battle Nelson fought, isometric King's Cross, a self-assembling conference poster, and more.
- Haider gave it his usual benchmark (build a word-scoring system) and got back a clean implementation plus a full test suite. Both worked perfectly on the first try.
- Nico Christie at Shortcut demoed two real-world Excel tasks on internal evals of 50,000+ cells each. GPT-5.4 is now live in Shortcut.
- Emergent Labs showed off richer UIs, multi-agent workflows, and apps that get planned, built, and tested autonomously.
- OpenAI's own demo (our fave): a tactical RPG created over multiple turns with Playwright Interactive for browser playtesting and image generation for characters and assets. Turn-based combat on a grid map, with movement, positioning, and encounter flow, all iterated over multiple rounds.
- Jamie Cuffe / Pace stress-tested GPT-5.4 on legacy insurance portals (the hardest UIs on the internet). It navigated 20-year-old, hyper-dense enterprise software without hallucinating a click. Insurance turns out to be the ultimate computer-use benchmark: dense layouts, tiny buttons, hundreds of steps. GPT-5.4 handled it.
- TinyFish ran web agents on multi-step workflows across dynamic, live websites with no APIs. Called it "by far the best OpenAI model we've tested." They identified four major leaps: click accuracy on cluttered screens, long-trajectory reasoning across hundreds of steps, faster evaluation cycles, and memory that persists spatial layout across steps.
- Ara explained why computer use + coding creates a "perfect RL pipeline": the agent can now make code changes, then visually verify them by actually clicking through the UI, catch its own off-by-one errors, and iterate. "Agents will learn from their mistakes in real time."
Codex also got a speed boost: with /fast mode, GPT-5.4 runs 1.5x faster with the same intelligence and reasoning. And if you build something cool, you can now submit your project to OpenAI's developer showcase.
Independent Benchmarks from the Community
Beyond OpenAI's own numbers, independent evaluators are weighing in:
- Vals AI: #1 on Vibe Code Bench at 67.4% (+5.7% over previous SOTA), #1 on ProofBench (graduate-level math proofs), #1 on IOI (competitive programming). Biggest driver of improvement: better self-testing with the browser.
- ARC Prize: ARC-AGI-2 Semi Private: GPT-5.4 scored 74.0% at $1.52/task. GPT-5.4 Pro scored 83.3% at $16.41/task.
- Brendan Foody / Mercor: Best model ever tested on APEX-Agents. First to pass 50% mean score. "A year ago, frontier models couldn't even edit an Excel sheet and scored less than 5%."
- Arena.ai: GPT-5.4-high tied with Gemini-3-Pro in the Text Arena. Top 3 in Creative Writing.
- Adi Adonis Singh: #1 on EyeBench V2 (71%) by a significant margin.
- Sergey Karayev: All-time highs on their Rails benchmark. Both cheaper and better than GPT-5.2 and Opus/Sonnet models.
- Lech Mazur: New champion on the Short-Story Creative Writing Benchmark.
- Bo Wang / UHN: 92.2% on EURORAD (207 expert-validated radiology cases). For context, Qwen3.5-27B (which runs on a laptop) scored 85%, and Gemini 3.1 Pro scored ~79%.
- Deedy Das: GPT-5.4 Pro crushed FrontierMath Tier 4 (research-level math that can take mathematicians weeks) at 38%. A year ago, the best was 2% (o3). Best open source is 4.2% (Kimi K2.5).
- Chris: Humanity's Last Exam jumped from 34.5% to 39.8% (no tools) and from 45.5% to 52.1% (with tools) in just two months. The Pro version hit 58.7% with tools.
Hot Takes from the Timeline
The big reactions:
- Matt Shumer: "The best model in the world. By far." Called coding "essentially solved." Said the standard thinking version is better than previous models in Pro mode. His one big complaint: "Frontend taste is FAR behind Opus 4.6 and Gemini 3.1 Pro. @OpenAI once you fix this, there's literally no reason for me to use any other model."
- Noam Brown (OpenAI): "We see no wall, and expect AI capabilities to continue to increase dramatically this year."
- Yann Dubois (OpenAI): Highlighted two things he's most excited about: (1) merging the Codex and mainline models, and (2) bringing Codex-level efficiency to computer use and knowledge work. "What should we fix for the next model?"
- Ethan Mollick: Updated his GDPval chart showing that if you give a 7-hour professional task to AI, even accounting for failure rates and the need to check results, you'd save an average of 4 hours and 38 minutes.
- Josh Kale: "You now have a 1 in 6 chance of being better than GPT-5.4... at your own job." He laid out the GDPval and OSWorld numbers and landed on the key insight: "It's never been more important to use these tools for leverage rather than let market forces apply that leverage against you."
- Alex Volkov (ThursdAI): Asked both Claude 4.6 and GPT-5.4 to build a chat UI comparing their long-context prices. "I think 5.4 has the clear win here aesthetics wise."
- Perplexity: Launched "Model Council" inside Perplexity Computer. It runs GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro simultaneously. Three frontier models, one workflow, best answer wins.
- Tibo (OpenAI): Codex hit 2M+ active users, up 25% week over week, and that was before the Windows app and GPT-5.4 launched.
- Cline: Noted that while the 1M context window is real, OpenAI's own evals show needle-in-a-haystack scores drop from 97% at 16-32K to just 36% at 512K-1M. "So it's a good idea to compact regularly!"
- GPT-5.4 is also available in Microsoft Azure AI Foundry and Shortcut. Dax at OpenCode added GPT-5.4 and Codex-5.3-Spark support too.
- Ethan Mollick also tested GPT-5.4 Pro, Opus, and Gemini DeepThink on proving there was no advanced dinosaur civilization via PowerPoint. GPT-5.4 and Claude both did original analyses, but his verdict: "someone build a harness for Deep Think!"
The Criticisms and Memes
Not everything was roses:
- SpeechMap: GPT-5.4 scored just 29.6% on SpeechMap (which evaluates generation of contentious speech). That's the lowest score for a flagship model from a major lab in some time. Translation: it's very cautious about edgy topics.
- Peter Gostev: On BullshitBench v2, GPT-5.4 is 16th overall. "Once again, the thinking is not helping; the xHigh version is a few points lower."
- Matt Shumer (again): Frontend taste is "FAR behind Opus 4.6 and Gemini 3.1 Pro." Morgan agreed, comparing Codex frontend output to Claude Opus 4.6 side by side.
- Lisan al Gaib: "GPT-5.4 still getting frame-mogged on ARC-AGI-2." Also noted it's "only slightly better than GPT-5.3-Codex on SWE-Bench-Pro."
- JB: "Still some work to do for GPT-5.4 vision."
- Yuchen Jin: Trolled OpenAI hard. A simple "Hi" to GPT-5.4 Pro cost him $80 and took 5 minutes. "Everyone is saying GPT-5.4 Pro is the smartest model, AGI-level intelligence, but do you have AGI-level questions to ask?"
- Chris: "AGI achieved with GPT-5.4 because it can count r's in strawberry and other fruits."
The Bottom Line
GPT-5.4 is the model that OpenAI should have shipped months ago. It's the first time their general-purpose model can legitimately compete with Claude Opus for agentic coding, while also being better at professional knowledge work, computer use, and tool orchestration.
The looseness is a tradeoff. By making the model more conversational and creative, OpenAI introduced failure modes that Claude users aren't used to dealing with (prompt leaking, occasional fabrication, scope creep). But the Every team's framing is right: these are the exact kind of cracks that get fixed fast, especially at OpenAI's current release velocity.
If you've been exclusively using Claude for coding, this is the model that deserves a serious test drive. Not because it's categorically better, but because having two genuinely competitive options makes everyone's workflow more resilient.
For knowledge workers who don't code, the improvements to spreadsheets, presentations, and document work are substantial. The 83% score on GDPval means the model matched or exceeded industry professionals on most of the tasks in that benchmark.
Oh, and one more thing. Dwayne spotted a mention of GPT-6 tucked into the left sidebar of the Codex chess demo. Meanwhile, sui dev is already speculating about timelines: GPT 5.3 launched in Feb, and GPT 5.4 came out in March, so "GPT-5.5 in April. GPT-6 in September. GPT-6 before GTA 6 is insane."
Our only rule is this: if you must persist with hitting 0.1 intervals on the GPT scale, can you at least go all the way up to 0.9? Otherwise stick to 0.5 intervals. Sincerely, an analyst (is that what I am?) who had to find something to gripe about this time around.
Links:
- Official launch blog
- Every.to vibe check
- Every livestream
- The Neuron livestream (Cat Doom!)
- Best practices / prompt guidance
- Tool search docs
- System card
- CoT controllability research
- ChatGPT context windows
- Scale MCP Atlas leaderboard
- Priority processing docs
- ChatGPT for Excel
- Spreadsheet skills (GitHub)
- Presentation skills (GitHub)
- Playwright Interactive skill (GitHub)
- AgentBrowser
- GDPval leaderboard
- τ²-bench paper
- Submit your Codex / GPT-5.4 project to OpenAI Showcase
- GPT-5.4 in Microsoft Azure AI Foundry
- Shortcut (GPT-5.4 live)
- Sebastien Bubeck on GPT-5.4