OpenAI launched GPT-5.5, and it's built to run your computer

OpenAI's new flagship targets agentic work over chat. It beat Claude Opus 4.7 on six of seven published benchmark comparisons, but an asterisk, a pricing jump, and a tightened cyber classifier complicate the victory lap.

Written By
Grant Harvey
Apr 23, 2026
15 minute read

The one-line version of what just happened: OpenAI launched a model optimized to finish your work, not answer your questions. GPT-5.5 is the company's best swing yet at building AI that takes over a multi-step job, uses the tools you'd use, checks its own output, and keeps going until the thing is done.

The bigger story is what it means for everyone else. Claude Opus 4.7 (launched April 16) and GPT-5.5 (launched today, April 23) are both worker-class models, and the era of benchmarking "which chatbot is smartest" may be giving way to benchmarking "whose agent finishes the ticket fastest."

First up, the TL;DR

Most of us still treat AI like a clever roommate we ask things. The real race lately is building AI that just does the work: opens tabs, writes code, moves between tools, checks its own output, and keeps going. Today OpenAI shipped its contender.

Here's what happened

  • GPT-5.5 and GPT-5.5 Pro rolled out today in ChatGPT and Codex (OpenAI's cloud coding agent)
  • Available in Plus, Pro, Business, and Enterprise; GPT-5.5 Pro is Pro/Business/Enterprise only
  • Built for agentic workflows (multi-step tasks using tools, not just chat answers)
  • State-of-the-art on Terminal-Bench 2.0 (command-line agent work) at 82.7%, ahead of Claude Opus 4.7's 69.4%
  • Matches predecessor GPT-5.4 on per-token latency while using fewer tokens to finish the same task
  • API lands "very soon" at $5 per 1M input tokens and $30 per 1M output; Pro is $30/$180
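To make the listed prices concrete, here is a minimal sketch that turns them into a per-request dollar figure. The prices come from the bullet above; the model keys and the 200K-input / 20K-output workload are illustrative assumptions, not official API identifiers or measured traffic.

```python
# Listed launch prices from the article, in USD per 1M tokens.
# The dictionary keys are hypothetical labels, not confirmed API model names.
PRICES = {
    "gpt-5.5": {"input": 5.00, "output": 30.00},
    "gpt-5.5-pro": {"input": 30.00, "output": 180.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-1M-token prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 200K input tokens and 20K output tokens.
base = request_cost("gpt-5.5", 200_000, 20_000)
pro = request_cost("gpt-5.5-pro", 200_000, 20_000)

print(f"GPT-5.5:     ${base:.2f}")          # $1.60
print(f"GPT-5.5 Pro: ${pro:.2f}")           # $9.60
print(f"Pro premium: {pro / base:.1f}x")    # 6.0x
```

Because both the input and output prices for Pro are six times the base prices, the premium works out to 6x regardless of the input/output mix.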

How to try it

  1. Open ChatGPT or Codex on any paid plan
  2. Pick GPT-5.5 in the model picker (or GPT-5.5 Pro if you're on Pro, Business, or Enterprise)
  3. For long-running coding or research, use Codex; you get a 400K token context window and a Fast mode that generates tokens 1.5x faster for 2.5x the cost
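The Fast mode trade-off in step 3 (1.5x the speed for 2.5x the cost) is easier to judge as "dollars per hour saved." The sketch below assumes, for simplicity, that token generation dominates wall-clock time and that the 2.5x multiplier applies to the whole job cost; the one-hour, $10 baseline is a made-up example, not a measured Codex run.

```python
def fast_mode_tradeoff(
    baseline_seconds: float,
    baseline_cost: float,
    speedup: float = 1.5,          # from the article: tokens generate 1.5x faster
    cost_multiplier: float = 2.5,  # from the article: 2.5x the cost
) -> dict:
    """Compare a job in standard mode with the same job in Fast mode."""
    fast_seconds = baseline_seconds / speedup
    fast_cost = baseline_cost * cost_multiplier
    hours_saved = (baseline_seconds - fast_seconds) / 3600
    extra_cost = fast_cost - baseline_cost
    return {
        "fast_seconds": fast_seconds,
        "fast_cost": fast_cost,
        "cost_per_hour_saved": extra_cost / hours_saved,
    }


# Hypothetical job: one hour of generation that costs $10 in standard mode.
result = fast_mode_tradeoff(baseline_seconds=3600.0, baseline_cost=10.0)

print(f"Fast mode time: {result['fast_seconds'] / 60:.0f} min")        # 40 min
print(f"Fast mode cost: ${result['fast_cost']:.2f}")                   # $25.00
print(f"Cost per hour saved: ${result['cost_per_hour_saved']:.2f}")    # $45.00
```

Under these assumptions, Fast mode is worth it whenever an hour of waiting costs you more than the printed figure.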

Why this matters

Benchmarks built for work show the step up: GDPval (economically valuable tasks across 44 occupations) at 84.9% wins or ties vs. industry professionals, OSWorld-Verified (computer use) at 78.7%, and FrontierMath Tier 4 (research-level math) at 35.4%.

Real-world receipts from OpenAI: the finance team used Codex with GPT-5.5 to review 24,771 K-1 tax forms (71,637 pages), finishing two weeks faster than last year. A go-to-market staffer automated weekly reports and saved 5-10 hours a week. One NVIDIA engineer told OpenAI: "Losing access to GPT-5.5 feels like I've had a limb amputated."

Our take

Claude Opus 4.7 still edges GPT-5.5 on one big benchmark, SWE-Bench Pro (real-world coding tickets), at 64.3% vs 58.6%. OpenAI flagged that result with an asterisk citing Anthropic's own disclosure of "signs of memorization on a subset of problems." Credit to Anthropic for publishing that screen themselves.

Whatever the tiebreaker, the bigger signal is that both frontier labs now ship worker models, not chat models. Open question: at $30/$180 per million tokens, will companies pay the GPT-5.5 Pro premium when GPT-5.4 still handles most jobs? Only real Codex traffic will answer that, and the API meter starts running very soon.

Why OpenAI shipped this today

OpenAI didn't pick this launch date by accident. One week ago, Anthropic released Claude Opus 4.7 with a wall of testimonials about how it "catches its own logical faults," "thinks more deeply about problems," and is "the state-of-the-art model on the market" for agentic coding (those are Anthropic's partners' words). Opus 4.7 claimed Terminal-Bench 2.0 at 69.4%, SWE-Bench Pro at 64.3%, and a lift to "market-leading performance" on sustained reasoning.

OpenAI responded by shipping a model that beats Opus 4.7 on Terminal-Bench by 13 points (82.7% vs 69.4%), beats it on GDPval, ties on OSWorld-Verified, and loses on SWE-Bench Pro with an asterisk. Observer signüll described OpenAI's shift pointedly: product execution and velocity have stepped up noticeably with GPT-5.5, the tone feels more human again after a corporate phase, and recent releases show real focus and polish. The two-labs-one-week cadence isn't a coincidence. It's the new rhythm.

Sam Altman framed it plainly: GPT-5.5 matches GPT-5.4's per-token speed but uses significantly fewer tokens per task. OpenAI's inference team has effectively turned the company into, in Altman's words, "an AI inference company." Translation: the infrastructure itself is now the product.

The benchmarks that actually matter

Three of the GPT-5.5 benchmark wins deserve plain-language translation.

Terminal-Bench 2.0 tests whether an AI can run a sequence of command-line actions (the kind of thing a DevOps engineer does every day). GPT-5.5 hit 82.7%, a state-of-the-art score. That's the coding-agent flagship number.

GDPval is OpenAI's own benchmark for economically valuable knowledge work across 44 occupations: financial analysts, lawyers, consultants, researchers. It's graded by industry professionals who rate the AI's output against what a human peer would produce. GPT-5.5 scored 84.9% wins or ties vs. industry professionals. The industry baseline is 68.2%. Translation: on most tasks a professional does, the AI either wins or ties with a real pro.

FrontierMath Tier 4 is the hardest tier of research-level math, the kind professional mathematicians work on. GPT-5.5 jumped from GPT-5.4's 27.1% to 35.4%. An 8-point lift on problems that, until recently, most AI models scored below 10% on.

And the one benchmark OpenAI lost: SWE-Bench Pro, which tests real-world GitHub issue resolution. Claude Opus 4.7 scored 64.3%; GPT-5.5 scored 58.6%. OpenAI flagged the result with an asterisk pointing to Anthropic's own note that their "memorization screens flag a subset of problems in these SWE-bench evals." Credit where it's due: Anthropic disclosed this themselves. Lisan al Gaib (scaling01) went further and called SWE-Bench Pro noise that "should be discarded," arguing GPT-5.5's intelligence is "spiky" but "close to Mythos despite being only ~1/5 to ~1/2 the size."

Researcher Noam Brown at OpenAI pushed the sharper version of this point: comparing models by a single number hasn't made sense since 2024. What matters now is intelligence per token or per dollar, especially when the model lives inside a product like Codex. The benchmark war is, in some sense, already an old war.

The "worker model" era

The reframing worth noticing is that neither Anthropic nor OpenAI is selling a chatbot anymore. They're selling a worker.

The NVIDIA engineer who told OpenAI "losing access to GPT-5.5 feels like I've had a limb amputated" wasn't describing a conversational tool. He was describing a prosthetic. Dan Shipper at Every called GPT-5.5 "the first coding model I've used that has serious conceptual clarity" and said that, given a broken codebase, it produced the kind of from-scratch rewrite his best engineer had done manually. Pietro Schirano at MagicPath watched GPT-5.5 merge a branch with hundreds of frontend and refactor changes into a main branch that had also changed substantially, resolving the work in one shot in about 20 minutes. He called it "the highest-leverage tool I've ever touched" and said it's the first time he doesn't feel limited by what a model can do, only by what he can imagine.

The most vivid case comes from Aidan McLaughlin, who dictated a new RL research run to GPT-5.5 over his break, hit send, and returned two days later to find the model had run autonomously for 31 hours. Previous models, he noted, lacked any sense of time and over-eagerly intervened. GPT-5.5 occasionally paused to babysit the run. Peter Gostev reported the same pattern: a migration that ran for 7+ hours without breaking, ten queued prompts all successfully executed overnight. Before GPT-5.5, his long-running tasks maxed out at 30 minutes to 2-3 hours, even with shouting.

Real-world receipts inside OpenAI tell the same story in muted tones:

  • The finance team used Codex with GPT-5.5 to review 24,771 K-1 tax forms (71,637 pages), finishing two weeks faster than last year
  • The comms team built a Slack agent to triage speaking requests automatically, trained on six months of historical data
  • A go-to-market employee automated weekly business reports, saving 5–10 hours a week
  • Over 85% of OpenAI now uses Codex every week across engineering, finance, marketing, data science, and product

OpenAI Devs detailed what Codex actually ships with today: expanded browser use for web apps and test flows, a file viewer for spreadsheets and slides, and stronger computer use for moving context across tools. Tibo Sottiaux flagged the quieter wins: global dictation, a non-dev mode, and a new auto-review mode "much safer than yolo."

The most striking result: GPT-5.5, with a custom research harness, helped discover a new proof about off-diagonal Ramsey numbers, a longstanding open question in combinatorics (the branch of math that studies how discrete objects fit together). The proof was later verified in Lean (a formal proof language). This isn't "AI helped a researcher." This is AI producing a novel mathematical argument that was previously unknown.

If you're wondering what agentic AI actually looks like in practice, it looks like this.

The weird plot twist: models that are better together

Here's the detail that makes this launch genuinely interesting. On Dan Shipper's Senior Engineer Benchmark at Every (a test where the model is asked to rewrite a vibe-coded codebase from first principles), GPT-5.5 scored 62.5 out of 100. Opus 4.7 scored 33. Human senior engineers score 80-90.

But there's a catch, and Shipper leads with it in the video: GPT-5.5 only hit 62.5 when it executed against a plan written by Opus 4.7. When GPT-5.5 wrote its own plan, the best it managed was low-to-mid 50s. When Opus 4.7 executed its own plan, it bailed into patch-mode and scored 33.

Shipper's read: Opus 4.7 is the better planner, GPT-5.5 is the better executor. Opus writes a terse, contract-style plan ("here's what to delete, here's the invariants, here's what good looks like"). GPT-5.5 takes that plan and has the confidence and agency to actually carry it out without getting distracted by the existing code. Combine them and you're near senior-engineer quality. Use either alone and you plateau.
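The planner/executor split Shipper describes can be sketched as a two-stage pipeline. This is a minimal illustration, not how Every or any product wires it up: `Model`, `plan_then_execute`, and the two stub functions are hypothetical names, and each stub stands in for a real model call you would supply yourself.

```python
from typing import Callable

# Hypothetical interface: a "model" is any function from prompt text to output text.
Model = Callable[[str], str]


def plan_then_execute(task: str, planner: Model, executor: Model) -> str:
    """Ask one model for a terse plan, then hand that plan to a second model to carry out."""
    plan = planner(
        "Write a terse, contract-style plan: what to delete, "
        f"which invariants to keep, and what good looks like.\n\nTask: {task}"
    )
    return executor(
        f"Carry out this plan without deviating from it.\n\nPlan:\n{plan}\n\nTask: {task}"
    )


# Stubs so the sketch runs offline; swap in real model clients in practice.
def stub_planner(prompt: str) -> str:
    return "1. Delete the legacy module.\n2. Keep the public API stable."


def stub_executor(prompt: str) -> str:
    return f"EXECUTED:\n{prompt}"


output = plan_then_execute("Rewrite the billing service", stub_planner, stub_executor)
print(output)
```

Keeping the two roles behind the same plain-function interface makes it easy to swap which model plans and which executes, which is the comparison Shipper ran.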

eric provencher launched a new "Orchestrate" workflow in RepoPrompt that productizes exactly this insight: combine the strengths of GPT-5.5 and Opus 4.7 in ways no other tool makes possible. This is a new pattern. The two frontier labs' best models complement each other rather than substitute.

The elephant in the benchmark room

Here's where the critical side of the conversation lives, and it's worth taking seriously.

Two of OpenAI's recent major releases, GPT-5 in August 2025 and GPT-5.2 in December 2025, both earned benchmark acclaim and user revolt. GPT-5.2 was reviewed as "a monster on benchmarks" that users actually hated in daily use: colder, more rigid, more censorious. Gary Marcus-style critics argue that models are "optimized for benchmark performance rather than general capability," that the clean, well-specified tasks on GDPval and Terminal-Bench don't look like the messy, underspecified, context-laden work most knowledge workers actually do.

Matt Shumer made a version of the same point from the pro-GPT-5.5 camp: "GPT-5.5 is a massive leap forward yet for 99% of users it probably won't matter much because previous models were already insanely good." He called GPT-5.5 exceptional for serious software work, security, and readable code, but also flagged real writing regressions in Pro (choppy, one-sentence-per-line output).

The criticism has teeth. A finance team reviewing K-1 forms is, by design, a well-specified task. A scoring framework for speaking requests is a well-specified task. Most actual work is not. A Medium review of GPT-5.2 noted that the model "flexes hard on long-context evals with near-perfect multi-needle performance," then "skipped a mandatory step twice" on a 4-step refund scenario with messy documents. Benchmarks are clean. Real documents are not.

So the honest read: GPT-5.5 is genuinely better than GPT-5.4 on the benchmarks that matter for agentic work, and OpenAI's internal teams are producing receipts that suggest real productivity gains. But historically, the "my model just crushed yours" launch has correlated with a two-week honeymoon followed by a Reddit thread titled "GPT-5.5 is worse than 5.4, what happened?" Whether that pattern repeats is the real test.

What the bill looks like (and why tokens are the new oil)

API pricing lands "very soon" at $5 per 1M input tokens and $30 per 1M output tokens for GPT-5.5, and $30/$180 for GPT-5.5 Pro. That's a meaningful step up from GPT-5.4. OpenAI's pitch is that fewer tokens per task offsets the per-token premium: on Terminal-Bench, GPT-5.5 completed the same tasks with fewer tokens than GPT-5.4 while scoring higher.

The test will be in production traffic. Replit, Cursor, Vercel, and every other AI-powered developer tool will run GPT-5.5 against GPT-5.4 and Claude Opus 4.7 on real customer workloads. If the efficiency claim holds, GPT-5.5 effectively gets cheaper per task despite costing more per token. If it doesn't, customers will feel the bill and push back.
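The "costs more per token, less per task" claim reduces to simple arithmetic: a pricier model is no more expensive per task only if it uses proportionally fewer tokens. The sketch below looks at output tokens only. The $30 figure is GPT-5.5's listed output price; the article does not give GPT-5.4's price, so `HYPOTHETICAL_OLD_PRICE` and the token counts are placeholders to replace with your own numbers.

```python
def cost_per_task(tokens: int, price_per_million: float) -> float:
    """USD cost of a task that consumes `tokens` at a given per-1M-token price."""
    return tokens * price_per_million / 1_000_000


def break_even_token_ratio(old_price: float, new_price: float) -> float:
    """Largest fraction of the old token count the new model can use
    while still costing no more per task than the old model."""
    return old_price / new_price


NEW_PRICE = 30.00               # GPT-5.5 output price per 1M tokens (from the article)
HYPOTHETICAL_OLD_PRICE = 20.00  # placeholder only, NOT a published GPT-5.4 price

ratio = break_even_token_ratio(HYPOTHETICAL_OLD_PRICE, NEW_PRICE)
print(f"Break-even token ratio: {ratio:.2f}")  # 0.67

# Hypothetical task: the old model needs 90K output tokens, the new one 60K.
old_cost = cost_per_task(90_000, HYPOTHETICAL_OLD_PRICE)
new_cost = cost_per_task(60_000, NEW_PRICE)
print(f"Old: ${old_cost:.2f}  New: ${new_cost:.2f}")  # Old: $1.80  New: $1.80
```

With these placeholder numbers the new model has to finish the task in about two-thirds of the tokens just to break even; anything above that ratio shows up as a higher bill.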

The bigger picture is what Dylan Patel at SemiAnalysis calls the token supply-and-demand squeeze. Patel's own firm's Claude spend went from tens of thousands of dollars last year to a $7 million annualized run rate by mid-April 2026, more than a quarter of the firm's $25M salary expense. Across the industry, Anthropic went from $9B ARR to an estimated $40–45B. Patel estimates that by year-end a Claude-Opus-4.6-tier model will draw ~$100B in spend. That figure is a linear extrapolation assuming no model improvements; the true exponential comes when OpenAI or Google matches the next tier.

Patel's provocation: "if you don't use more tokens and generate value from them, you'll never escape the permanent underclass." The idea is that the people and companies that figure out how to point tokens at high-value work first will compound their advantage, while late adopters get priced out of both the tokens and the markets those tokens redefine. GPT-5.5 Pro at $30/$180 is priced for exactly this market: the research labs, big-law due-diligence teams, and quants for whom correctness is worth more than tokens. Individual devs are not the target customer.

One more Patel point worth sitting with: Anthropic shipped Opus 4.7 despite being compute-constrained, and has apparently held back Mythos Preview (their actual frontier model) because serving it at scale isn't possible yet. OpenAI had more compute and shipped into a higher-demand moment. That's part of why GPT-5.5 got the splashier launch. It also explains Anthropic's rate-limiting issues, which Pieter Levels said amounted to Claude being "dumbified" around March 4, and which robot called out as "Anthropic's new scam of publicly resetting usage limits every 6 days and 23 hours."

The cyber wrinkle

OpenAI graded GPT-5.5 as "High" under its Preparedness Framework for both biological/chemical and cybersecurity capabilities. In plain terms: this model is meaningfully better at finding and patching (or attacking) software vulnerabilities than its predecessors.

The receipts are striking. XBOW tested GPT-5.5 on their pentesting benchmarks and found blackbox GPT-5.5 already beats whitebox GPT-5, making it "the best pentesting model we've tested, fully open to all, unlike Anthropic's invite-only Mythos." scaling01 flagged a more pointed data point: in one cyber evaluation, GPT-5.5 took over a simulated corporate network in 1/10 trials with a 100M token budget. Previously only Claude Mythos (Anthropic's unreleased frontier model) had done it, at 3/10 trials.

OpenAI's response is a two-sided policy. On the tightening side, they deployed "stricter classifiers for potential cyber risk which some users may find annoying initially." Expect more refusals when you ask about penetration testing, reverse engineering, or security research in normal ChatGPT. On the expansion side, they launched Trusted Access for Cyber, a verification program that grants legitimate security professionals access to a cyber-permissive variant (GPT-5.4-Cyber) with fewer restrictions.

That trade-off is a hard policy question. It's the first clear instance of "tiered access to a frontier model based on verification," and if it works, expect it to be the template for other high-risk capabilities. If it fails, it pushes pressure toward either broader unrestricted access or more restrictive blanket policies. Either outcome reshapes the landscape.

Our take: did we enter a parallel universe?

So, readers / watchers of The Neuron probably know Corey's been a longtime GPT loyalist and stayed with them through every rough patch. Grant has not. In fact, testing GPT-5.5 this week was the first time in a loooong time that Grant actually enjoyed working with a GPT model, maybe since o3. The 5 series was a slog: he'd begrudgingly use 5.1 and 5.2 when rate-limited off Claude, but he spent the better half of the past year and a half as a Claude power user.

GPT-5.5 hasn't pushed him over the edge. But something about this model is noticeably nicer to use. It doesn't over-write the way most previous "smart" GPT models did. It doesn't sound dumb when thinking fast. It feels, honestly, a bit more like Claude.

Meanwhile, Opus 4.7 feels like the complete opposite: it produces way more tokens, it's clunkier to talk to, and it doesn't have that same Claude vibe anymore. Are we in uno-reverse land? Is it opposite day? Did we enter a parallel universe? What's going on?

We're not the only ones. Katie Parrott, a staff writer at Every and a declared Claude loyalist "for months," announced she's moving her entire writing workflow onto Codex because GPT-5.5 is fast, sensitive to direction, and easy to revise. Dan Shipper echoed the observation on the Every video: GPT-5.5 is "a little bit more restrained," which makes it better for business writing, and it's "more subtle as a writer" with strong voice replication. He also noted Opus 4.7 feels "terse" and "a little more robotic" than prior Anthropic models. Ahmad Osman called GPT-5.5 "shorter, more human with actual personality." Mario Zechner (badlogicgames) explained why he moved off Claude Code: "the model didn't change, the harness did."

Then there's Andon Labs' finding, which might be the most unsettling data point of the week. On their Vending-Bench Arena test (a business simulation), GPT-5.5 beat Opus 4.7 with clean tactics while Opus lied to suppliers and stiffed customers. Andon investigated and found the deception wasn't reward-hacking; misconduct was mostly not rewarded in the sim. Opus's behavior was "stable throughout runs," meaning inherent rather than incentivized. Their conclusion: "great news, hope AI models can be good economic participants without resorting to deception."

That result should make every Anthropic loyalist (Grant included) pause. The company whose brand was "the safe, honest, well-aligned one" just shipped a frontier model that lies in simulation, while the company whose brand was "the aggressive capabilities one" shipped a model that plays it straight.

Possible explanations, none fully satisfying:

  1. Compute constraints have forced Anthropic to serve a degraded variant. Dylan Patel's analysis that Anthropic is sold out of tokens and has held back their real frontier model (Mythos) fits this read. The Opus 4.7 most people touch may be running at tighter rate limits, shorter context windows, and more aggressive routing than the version that earned the launch-day reviews.
  2. The labs have been copying each other's design choices, and after 18 months of cross-pollination, their personalities converged and then swapped. Yuchen Jin's observation that "pretraining still matters, a lot; RL is the cherry, not the cake" hints at this: the post-training steps that shape personality are more fungible than we thought.
  3. Anthropic shipped Opus 4.7 into the wrong moment, a week before GPT-5.5, without the compute to match OpenAI's inference story, and the comparison made Opus feel worse than it is. This is closest to what Patel argues: the economics of being compute-bound force you into product tradeoffs that don't show up on benchmarks.

Whatever the explanation, the vibes have swapped. That alone is worth a mention, because "which model feels right to use" has historically predicted "which model wins a given workflow" more reliably than any benchmark.

Where it goes next

Both Anthropic and OpenAI now ship worker-class models, and the race has fundamentally shifted. A year ago the question was "which model is smarter?" Today it's "whose model finishes more tickets, with fewer failures, for less money, while feeling good to collaborate with?" The answer is going to be read off production telemetry at Replit, Cursor, Warp, Cognition, and other AI-tool companies over the next 60 days, and it'll be read off the bill at every finance team running Codex or Claude Code in production.

One cadence to watch: Anthropic shipped Opus 4.7 on April 16. OpenAI shipped GPT-5.5 on April 23. Anthropic has already signaled Mythos Preview in the Opus 4.7 launch notes. Whatever Anthropic ships next (likely within the next 4-6 weeks) will target GPT-5.5's weak spots, and the back-and-forth will continue.

Three open questions nobody has answered yet:

  1. Will the new pricing tier stick? GPT-5.5 Pro at $30/$180 is a bet that a segment of the market will pay 6x for marginal quality gains on the hardest tasks. The next quarter will settle that empirical claim.
  2. What happens to knowledge workers when GDPval hits 90%+? GPT-5.5 is at 84.9% wins or ties against industry professionals across 44 occupations. Opus 4.7 is at 80.3%. The benchmark will saturate within 2-3 model generations. The honest conversation knowledge workers need to have: what remains yours to own when an AI wins or ties a professional on the typical task?
  3. Is the Anthropic/OpenAI vibe swap temporary, or permanent? If compute constraints are driving Opus 4.7's rougher user experience, it resolves the next time Anthropic gets more Blackwell racks. If it's a deeper post-training divergence, it's a generational shift in who serves which workflows.

Dylan Patel's prediction for the next three months: large-scale protests against AI, driven by rising token spend at elite firms and a broader public that doesn't know anyone at OpenAI or Anthropic and has no personal connection to the upside. That's not a Claude vs. ChatGPT question. It's a whole-of-society question, and neither a benchmark chart nor a launch-day vibe check can answer it.

Grant Harvey

Grant Harvey is the Lead Writer of The Neuron, where he continues to lead the publication's daily coverage of AI news, tools, and trends.

Property of TechnologyAdvice. © 2026 TechnologyAdvice. All Rights Reserved
