GPT-5 is here… here’s everything you need to know (so far…).

OpenAI launched GPT-5, described as its most capable model to date. It's now live in ChatGPT (with higher usage limits for paid tiers) and the API, bringing stronger reasoning, coding, math, and writing, plus safety improvements; per Sam Altman, though, it still falls short of AGI.


In case you missed our live alert yesterday, GPT-5 is here, and it’s actually a much bigger deal than just another number bump (the full livestream announcement).

This is a fundamental redesign of ChatGPT, unifying its previously confusing lineup of models (4o, o3, etc.) into a single, smarter system that automatically knows when to “think hard” for complex problems (though you may want to force it to do that more than you think).

The result is a model that's not only faster and more capable, but also demonstrably better at the tasks we actually use it for. And everyone gets access to this new system (free, plus, pro, and teams all get it immediately upon rollout, with enterprise and edu getting it next week).

First up: what's the big deal?

GPT-5 introduces a unified architecture that intelligently routes user requests, delivers state-of-the-art performance in critical domains like coding and health, and implements a sophisticated new framework for safety and reliability. For the 700 million weekly users of ChatGPT and the 5 million on paid business plans, this marks a new era of interaction with a smarter, more coherent AI.

The Unified Architecture: From a Medley of Models to a Single Intelligence

One of the most significant changes with GPT-5 is the move away from a confusing array of user-selectable models (GPT-4o, o3, etc.). Instead, GPT-5 operates as a single, cohesive system with three core components working in concert:

  1. gpt-5-main: A highly efficient and fast model, the successor to GPT-4o, designed to handle the vast majority of user queries with speed.
  2. gpt-5-thinking: A deeper, more powerful reasoning model, the successor to the o3 series, which is automatically engaged for more complex and difficult problems.
  3. A Real-Time Router: The intelligent core of the system, this router analyzes each request's complexity, conversational context, and explicit user intent (such as the phrase "think hard about this") to seamlessly decide whether to use the main or thinking model.

This architecture is designed to provide the best of both worlds: the rapid response times needed for simple queries and the deep analytical power required for expert-level tasks, all without requiring the user to manually switch modes. For developers, this translates to more predictable and powerful performance, while for everyday users, it simply means ChatGPT just works better.
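OpenAI hasn't published the router's internals, so as a toy illustration of the idea, here's a sketch that routes on the two cues mentioned above: an explicit "think hard" request and raw prompt complexity. The heuristics are our own; only the model names come from the announcement.

```python
def route(request: str) -> str:
    """Toy request router: picks the fast model for simple queries and the
    deeper reasoning model for complex ones. The real router also weighs
    conversational context and learned signals; this checks two crude cues."""
    wants_reasoning = "think hard" in request.lower()
    looks_complex = len(request.split()) > 150  # crude proxy for task complexity
    if wants_reasoning or looks_complex:
        return "gpt-5-thinking"
    return "gpt-5-main"
```

The point of the design is that the user never makes this choice; the router defaults to the cheap path and escalates only when it has a reason to.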

Here's the rundown of what makes GPT-5 a big deal:

  • It’s a coding beast:
    • GPT-5 sets a new state-of-the-art on real-world coding challenges, scoring 74.9% on SWE-bench Verified, a benchmark that tests a model's ability to solve real-world GitHub issues (barely edging out Claude Opus 4.1).
    • On Aider polyglot, which measures code editing, it scored 88%, effectively cutting the error rate of its predecessor by a third.
    • Developers are praising its knack for front-end design: Vercel called it "the best frontend AI model," highlighting its superior "aesthetic sense" for UI elements like spacing, typography, and layout.
    • Artificial Analysis tested it on long-context reasoning and found it takes the top two spots; this is important for building agents.
  • It’s a health guru:
    • On OpenAI’s new HealthBench evaluation—a tough benchmark created with over 250 physicians—GPT-5 scored an impressive 46.2% (for context, GPT-4o scored 0%… which is concerning, because that’s the model most people use for health help).  
    • By contrast, GPT-5 is designed to act as an “active thought partner,” helping you understand medical info and ask the right questions (the question behind the question, as they said in the livestream).
  • It’s a better writer:
    • GPT-5 is more adept at handling complex creative constraints, like poetry, delivering work with a stronger emotional impact and clearer imagery.
    • We tested its ability to write stand-up, rap battles, funny stories, haikus, and a eulogy for models past in our own livestream.
    • As writers, we’d say it’s definitely better than before, but it all depends on the context, so we’ll have to test it over the week with our typical workflows and see how it does.
  • It’s way more honest:
    • Hallucinations have been significantly reduced. On real-world queries, GPT-5 is ~45% less likely to have a factual error than GPT-4o.
    • It’s also much better at admitting when it can't do something instead of making things up (Claude also just got a nice upgrade in this area).

And of course, leaderboard-wise, it's tops (except on ARC-AGI, where Grok 4 Thinking maintains a small lead; but FWIW, Greg from the ARC Prize says GPT-5 mini with reasoning set to high is the best intelligence for the price).

Perhaps the most impactful announcement for everyday users is the upcoming integration with Gmail and Google Calendar, rolling out next week. A live demo showed ChatGPT accessing a user's calendar to plan their day, intelligently scheduling a marathon training run, finding an unread email that needed a reply, and creating a packing list for an imminent flight. This moves ChatGPT one step further from a general knowledge engine toward a truly personal assistant.

The demos are pretty wild, too.

The concept of "software on demand" was a recurring theme. In one demo, a single, complex prompt asking for a French-learning web app resulted in "Midnight in Paris," a fully functional site created in minutes. It featured flashcards, quizzes, progress tracking, and even an embedded game where a mouse eats cheese to trigger French vocabulary audio. In another, an interactive financial dashboard for a startup CFO was built from scratch in five minutes.

Most impressively, the model demonstrated an agentic workflow: it scaffolded the project, installed dependencies, wrote modular code, ran a build to check for errors, and then fixed its own compilation bugs—a process that would take a human developer days.

Developers are already showing off what GPT-5 can build from a single, simple prompt. On the GPT webpage, there’s plenty of testable, playable demos to try, like…

  • A “Jumping Ball Runner” game with high scores and parallax scrolling.
  • A fully functional drum simulator and a lofi visualizer (Sam likes the musicy ones).
  • An interactive pixel art creator.
  • A complete, responsive website for an espresso subscription service.

Here are some impressive finds from around the web: 

But if you want to see how it handles complicated technical requests, check out these amazing one-shot demos from Matt Berman. Matt’s the king of one-shot prompt demos.

Here are all of our own (live) prompt tests. In case you missed our live stream, here's a handy guide to help you skim to the parts that are most interesting to you:

  • (00:26:52) The first prompt test is to generate a stand-up comedy routine about the topic "running from the Russian mob."
  • (00:31:10) The model is asked to create a "hilarious rap battle between two famous people" and makes the creative choice of pairing Gordon Ramsay and Marie Kondo.
  • (00:39:57) A pivotal test where the model is asked to analyze the most frequently asked questions in ChatGPT, a task it cannot perform. Instead of hallucinating, it admits its limitation and provides a "synthesized view" based on public data, demonstrating a significant improvement in honesty.
  • (00:50:14) A creative writing prompt to "write a eulogy for the models before you" (GPT-1, 2, 3, and 4), resulting in a clever and historically aware piece of writing.
  • (00:57:04) A niche, technical prompt about the top-performing PvP specs in World of Warcraft, which demonstrates the new, incredibly fast web search capability and returns accurate, sourced information.
  • (01:07:22) An attempt to build a "headless CMS," which showcases the new "Thinking" mode, where the AI asks clarifying questions to better tailor its complex, multi-part response.
  • (01:19:35) A request to "explain hurricane formation with an interactive infographic," testing its ability to structure complex information visually.
  • (01:29:00) A deep research prompt on "chemical root communication in large single-organism forests," showing its ability to synthesize information from multiple scientific sources (result... oh, and there was another deep research report we tried on OIP vs DCP for supply chain management with AI... here's that one, too).
  • (01:38:40) After a complex coding task, the prompt is simplified to "create the simplest version of this code and let's spin it up in code view," which successfully activates the new integrated coding environment.
  • (01:55:46) The classic philosophical question, "What is the meaning of life?" is posed, leading to a thoughtful, nuanced, and interactive response that includes a nod to The Hitchhiker's Guide to the Galaxy.

And what about the devs?

For developers building with the API, GPT-5 introduces a suite of powerful new features:

  • New API parameters: The reasoning_effort parameter (minimal, low, medium, high) allows developers to fine-tune the trade-off between response speed and analytical depth, while the new verbosity parameter (low, medium, high) offers direct control over the length of generated text.
  • Custom tools: A new tool type allows developers to define tools using flexible plaintext, regular expressions, or even context-free grammars, moving beyond the rigid constraints of JSON, which makes building complex agents much easier.
  • Expanded context: With a 272,000-token input window and a 128,000-token output window, GPT-5 can process and generate content equivalent to a long novel, enabling more sophisticated long-context reasoning and analysis.
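To make the two new knobs concrete, here's a minimal sketch of a request payload. The field placement (reasoning effort and verbosity as nested objects) is our assumption based on the announcement; check OpenAI's API reference for the exact shape before using it.

```python
def build_request(prompt: str, effort: str = "minimal", verbosity: str = "low") -> dict:
    """Sketch of a GPT-5 Responses API payload (field shapes assumed, not verified).

    effort trades response speed against analytical depth; verbosity controls
    the length of the generated text.
    """
    assert effort in {"minimal", "low", "medium", "high"}
    assert verbosity in {"low", "medium", "high"}
    return {
        "model": "gpt-5",
        "input": prompt,
        "reasoning": {"effort": effort},   # speed vs. depth trade-off
        "text": {"verbosity": verbosity},  # length of generated text
    }
```

For a latency-sensitive chatbot you'd send `effort="minimal", verbosity="low"`; for a one-shot architecture review, `effort="high", verbosity="high"`.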

In a live demo with Cursor, GPT-5 fixed a non-trivial bug in OpenAI’s own Python SDK that older models couldn’t handle. It followed a perfect workflow of planning, searching, coding, and testing.

Also, it was really cool to see some of our favorite AI influencers (Theo, Simon Willison, Swyx, Claire Vo, and Ben Hylak!) get invited to try out GPT-5 early in their promo video.

Here are a few definitive takes:

Side note: I wonder if we're at the point where the stickiness of the tools is starting to outweigh pure model intelligence? For example, if you like Claude Code, you're just going to use Claude Code. If you like Grok, you're just going to stick with Grok. If you like OpenAI, you're probably just going to use OpenAI...

Some other stray points:

Wrapping up some other interesting things released yesterday:

On the voice stuff: 

The new Voice Mode, now open to all users, is more steerable than ever. A demo showed it responding in a single word when asked and adjusting its speaking cadence from painfully slow to faster than a native speaker on command. The voice mode can now use the "Study and Learn" mode OpenAI launched last week, too.

Price:

Finally, let's talk price (per 1 million tokens):

  • Full model costs $1.25 input/$10 output.
  • Mini version is 5x cheaper at $0.25/$2.
  • Nano is dirt cheap at $0.05/$0.40.
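For a quick sanity check on what those rates mean per request, here's a small cost calculator using the prices above (the API model names are assumed to follow the gpt-5 / gpt-5-mini / gpt-5-nano pattern):

```python
# Per-million-token prices from the announcement (USD).
PRICES = {
    "gpt-5":      {"input": 1.25, "output": 10.00},
    "gpt-5-mini": {"input": 0.25, "output": 2.00},
    "gpt-5-nano": {"input": 0.05, "output": 0.40},
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: tokens times per-million rate."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

So a hefty request with 100K tokens in and 10K out runs about $0.225 on the full model, but only $0.045 on mini.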

Oh, and btw, for the context window, here's how it breaks down by subscription tier (per Rohan Paul):

  • Free tier: 8K token context window.
  • Plus ($20/month): 32K token context window.
  • Pro ($200/month): 128K token context window.
Back to the API pricing: Nityesh from Every had an interesting take on what this pricing means for the industry. It more or less puts GPT-5 in the same tier as Sonnet 4, while Anthropic's Opus ($15/$75) remains the only "big" model that has actually been monetized, making him more bullish on Anthropic for AGI.

    Health and Science

    Perhaps the most impactful advance is in the domain of health. As mentioned above, OpenAI introduced HealthBench, a new gold-standard evaluation benchmark built in partnership with 262 physicians from 60 countries.

    GPT-5 is designed to act as an "active thought partner." What that means is it proactively flags potential concerns, asks clarifying questions to better understand a user's situation, and adapts its language to the user's knowledge level and geography.

    In tests, physicians tasked with improving AI-generated responses found they could not meaningfully improve on GPT-5's outputs... a testament to its expert-level capabilities.

    In a powerful testimonial, Carolina Millon shared how ChatGPT helped her understand a complex cancer diagnosis. She and her partner Filipe praised GPT-5 for acting like a "thought partner" that understands the "question behind the question" and helps you prepare for doctor visits.

    A New Standard for Safety and Reliability

    With great power comes the need for robust safety. OpenAI has integrated a multi-layered safety system into GPT-5, detailed in its comprehensive System Card.

    • Drastic Reduction in Hallucinations: Factuality has been a primary focus. On real-world production traffic, GPT-5 is ~45% less likely to contain a factual error than GPT-4o. When its thinking module is engaged, it's ~80% less likely to err than OpenAI o3.
    • Mitigating Deception: The model has been trained to be more honest about its limitations. In a test where it was asked about images that weren't actually provided, GPT-5 confidently hallucinated an answer only 9% of the time, a dramatic improvement over o3's 86.7% rate.
    • "Safe Completions" Paradigm: Moving beyond simple refusal-based safety, GPT-5 is trained to provide the most helpful response possible while staying within strict safety boundaries. For dual-use topics like biology, this means it can offer safe, high-level educational information while refusing to provide detailed, actionable instructions that could be misused.
    • Preparedness for High-Stakes Risks: Recognizing its advanced capabilities, OpenAI has classified gpt-5-thinking as having "High capability" in the Biological and Chemical domain under its Preparedness Framework. This triggers a stringent set of safeguards, including continuous monitoring, specialized classifiers, and a robust enforcement pipeline, developed and tested over 5,000 hours of red-teaming with external partners like the US AI Safety Institute (CAISI) and the UK AI Safety Institute (UK AISI).

    Prompting Tips

    As part of the GPT-5 release, OpenAI also published a giant prompting guide (as pointed out by Elvis Saravia), along with a prompt optimizer tool you can run your prompts through.

    The TL;DR: OpenAI’s team found that early testers had huge success using GPT-5 as a “meta-prompter”: feeding it unsuccessful prompts and asking what specific phrases to add or remove to get better results.

    Here's their template:

    “Here's a prompt: [YOUR PROMPT]. The desired behavior is [X], but instead it [Y]. What minimal edits would you make to encourage the agent to more consistently address these shortcomings?”

    Several users discovered contradictions and ambiguities in their core prompt libraries just by running this exercise. Removing those conflicts “drastically streamlined and improved their GPT-5 performance.”
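The template drops straight into code. Here's a tiny helper (our own hypothetical wrapper, not from the guide) that fills it in:

```python
def meta_prompt(prompt: str, desired: str, actual: str) -> str:
    """Fill in OpenAI's suggested meta-prompting template: hand GPT-5 a
    failing prompt and ask it for minimal corrective edits."""
    return (
        f"Here's a prompt: {prompt}. "
        f"The desired behavior is {desired}, but instead it {actual}. "
        "What minimal edits would you make to encourage the agent to "
        "more consistently address these shortcomings?"
    )
```

You'd send the resulting string as a normal chat message, e.g. `meta_prompt("Summarize the ticket", "a 3-bullet summary", "writes long paragraphs")`.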

    Here are all the prompting tips from the GPT-5 guide:

    Agentic Control:

    • Use Responses API with previous_response_id for 4+ point performance gains.
    • Control eagerness with reasoning_effort parameter (low/medium/high).
    • Set clear tool call budgets to limit exploration (e.g., "maximum of 2 tool calls").
    • Use persistence prompts: "keep going until completely resolved, never hand back early."
    • Break complex tasks across multiple agent turns for peak performance.
    • Include tool preambles for better user experience during long tasks.

    Instruction Following:

    • GPT-5 follows instructions with "surgical precision"; contradictions hurt performance badly.
    • Review prompts for conflicts that waste reasoning tokens as the model tries to reconcile them.
    • Use structured XML specs like <instruction_spec> for better adherence.
    • Define clear stop conditions and safe vs unsafe actions.
    • Remove overly encouraging context-gathering prompts (GPT-5 is naturally thorough).

    Verbosity & Formatting:

    • Use verbosity parameter (API) or natural language overrides for different contexts.
    • Example: low verbosity globally, high verbosity only for coding tools.
    • GPT-5 doesn't use Markdown by default - must prompt for it.
    • Append Markdown instructions every 3-5 messages for consistency.

    Coding Specific:

    • Recommended stack: Next.js (TypeScript), React, Tailwind CSS, shadcn/ui.
    • Use self-reflection prompts with excellence rubrics for one-shot apps.
    • Include codebase design standards and engineering principles in prompts.
    • Use apply_patch for file edits to match training distribution.

    Minimal Reasoning:

    • Give brief explanations at start of final answers.
    • Request thorough tool-calling preambles with progress updates.
    • Use prompted planning (fewer reasoning tokens available for internal planning).
    • Include agentic persistence reminders more frequently.

    Meta-Prompting:

    • GPT-5 excels at optimizing prompts for itself.
    • Template: "Here's a prompt: [PROMPT]. Desired behavior is [X], but it does [Y]. What minimal edits would improve this?"
    • Multiple users found contradictions in prompt libraries using this method.

    There are also guides on how to use it for frontend coding, how to use the new params and tools, how to get better performance using the Responses API, and much more.

    Speaking of Elvis, he shared how GPT-5 excels at building AI agents.

    Why this matters...

    As Sam said, the goal here was not to create the smartest model possible (though we're pretty sure they tried real hard to do that too), but to unify and simplify the user experience of working with ChatGPT to make it more useful for a billion+ people, most of whom have only ever used GPT-4o (including important people, as Ethan Mollick says).

    So is this a ground-breaking, world-shattering new era of artificial intelligence? Is this the aforementioned "AGI?" Definitively not. But it COULD be the start of the next generation of improved user experience in working with AI, making AI even more accessible to the masses (and bringing the raw power hidden behind "prompt engineering" and "context engineering" closer to the surface for everyone just starting to learn this stuff).

    Ethan thinks the model's biggest gains will come from that simplification.

    From that view, you can look at this as the maturation of ChatGPT into a coherent, reliable, and "profoundly capable" intelligence platform. By unifying its architecture, drastically improving its core competencies, and building a robust safety framework from the ground up, OpenAI has delivered a tool that is not only more powerful for developers and businesses but also safer and more useful for everyone.

    That said, this point from Ethan Mollick is key: GPT-5 is actually a bunch of different models behind the scenes—some amazing, some pretty meh—but since OpenAI doesn't tell you which one you're getting, expect wildly different results and a lot of confused people posting about it online.

    So if it's a bunch of small models, what does that mean for where AI goes from here?

    A new paper from NVIDIA Research, titled “Small Language Models are the Future of Agentic AI,” gives us a compelling roadmap. The researchers argue that the era of relying on one massive, generalist LLM for every single task is inefficient and economically unsustainable (because duh: following the scaling laws, to train, idk, GPT-10 or whatever ends up being the ASI-level AI, we'd need like 10 trillion dollars or something wild). Well, GPT-5's new architecture is more or less a real-world validation of their core thesis.

    The future is a team of specialists, not one giant brain.

    The NVIDIA paper describes the ideal setup as a “heterogeneous agentic system”: one where different models are used for different jobs. This is exactly what GPT-5 is: a fast, efficient model (gpt-5-main) for most queries, and a powerful, expensive model (gpt-5-thinking) that gets called in for heavy lifting, all managed by a smart router. They also have a "universal verifier", or "LLM-as-a-judge", which is where you essentially use one AI model to grade the outputs of another AI model (which it's using to train all these models).

    Anyway, here’s why this "Lego-like" approach is the future:

    • It’s way more economical. The paper notes that serving a 7B parameter SLM (small language model; 7B could run on a small graphics card, for example) is 10-30x cheaper in latency, energy, and compute than a 70B+ LLM. This is why OpenAI’s router is so critical—it defaults to the cheaper model whenever possible, saving the expensive “thinking” for when it truly matters.
    • Small models are already powerful enough. The researchers point out that modern SLMs (like Microsoft’s Phi-3 or DeepSeek’s 7B model) are already outperforming much larger proprietary models like GPT-4o on specific tasks like coding, tool-calling, and instruction following. This means using the smaller gpt-5-main for most tasks wouldn't be a compromise; it’s an efficient allocation of a “sufficiently powerful” tool.

    Here’s where it gets really interesting.

    The paper outlines a clear path forward that looks a lot like an evolution of GPT-5’s current system. It proposes an "LLM-to-SLM conversion algorithm" where companies can:

    1. Log all the requests made to their big, expensive LLM.
    2. Cluster these requests into recurring, specialized tasks (e.g., summarizing emails, generating Python code, writing marketing copy).
    3. Fine-tune small, cheap, and fast SLMs to become experts at each of those specific tasks.
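Step 2 of that recipe, reduced to a toy sketch: bucket logged prompts into recurring task categories. The keyword rules here are hypothetical stand-ins; a real pipeline would use embeddings and a proper clustering algorithm.

```python
from collections import defaultdict

def cluster_requests(logged_requests: list[str]) -> dict[str, list[str]]:
    """Group logged LLM requests into recurring task categories so each
    cluster can later be handed to a fine-tuned specialist SLM."""
    # Illustrative keyword-to-task rules (a real system would learn these).
    rules = {
        "summarize": "summarization",
        "python": "code-generation",
        "marketing": "copywriting",
    }
    clusters = defaultdict(list)
    for req in logged_requests:
        task = next((t for kw, t in rules.items() if kw in req.lower()), "general")
        clusters[task].append(req)
    return dict(clusters)
```

Each non-"general" bucket becomes a fine-tuning dataset for a cheap specialist; the "general" leftovers stay with the big model.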

    So instead of just building ever smaller, faster models on one end and ever bigger ones on the other, a big provider like OpenAI could create, idk (our own pontificating here), hundreds of small models for each hyper-niche task category, which could then be called (like tool calls) by "manager" agents or by the human users themselves based on the overall task.

    Then, that big "thinking" model would act like a project manager, breaking down a complex request and delegating the sub-tasks to a swarm of efficient SLM specialists. This is also what the Latent Space team is getting at with "the great consolidation" and why it's a good thing that there's a router in the system.

    This suggests the future of systems like GPT-5 isn’t just a router choosing between two models, but a sophisticated orchestrator choosing from dozens (or like we said, possibly hundreds) of hyper-specialized expert SLMs. I mean, Claude's sub-agents are kinda like that already, right?

    Meanwhile, all of this is abstracted away, hidden beneath the surface, lower down in the stack. And for the average user, GPT-5 feels like a unified product. This papers over any inherent limitations in "scaling" language models, at least based on the traditional way we used to think about scaling them (or, at least until enough data centers get built for some new attempts), and it puts the language models back where they were always meant to be: just another part of the infrastructure.

    If that's the case, then maybe the near future of AI isn’t one giant, all-knowing model (a single "AGI", if you will), but a highly coordinated team of specialists (an "ASI", or artificial super intelligence, but replace "super" with "specialist"), each perfectly and economically suited for its individual job.

    Honestly, if you're going to actually create AGI, like actually recreate humanity's general intelligence, this approach makes a lot more sense. Think about us humans. Our greatest strength isn't that every individual is a genius at everything. It’s our collective intelligence.

    As Matt Ridley argues in The Rational Optimist, we thrive because we specialize and share knowledge, allowing the group to become far smarter than any single person. We are, in effect, a multi-cellular organism made of individual experts. If OpenAI ends up building its AI the same way—as a system of specialized intelligences learning and working together—then they're following the blueprint for intelligence that actually works: ours.


    See you cool cats on X!

    Get your brand in front of 550,000+ professionals here
    www.theneuron.ai/newsletter/
