The Codex app hit 1 million downloads in its first week, got a Super Bowl ad, and made GUIs cool again for developers. We break down everything inside OpenAI's "command center for agents," plus behind-the-scenes insights from the team that built it.
On February 2, OpenAI launched the Codex app for macOS, and within a week, more than one million people downloaded it. Then came GPT-5.3-Codex, a new flagship model. Then GPT-5.3-Codex-Spark, an ultra-fast variant running on Cerebras chips at over 1,000 tokens per second. Then a Super Bowl ad. Weekly active users have more than tripled since January.
That's a lot of momentum for a product most people couldn't define three months ago.
So what exactly IS the Codex app? Why did OpenAI build a desktop app instead of just improving the terminal? And how is OpenAI's own team using it to build… itself?
To give you the full picture, we dug into everything OpenAI published at launch, an in-depth interview with the Codex team by Every's Dan Shipper, the official app demo, a getting-started walkthrough from OpenAI's onboarding team, and behind-the-scenes videos from engineers showing automations, worktrees, Figma-to-code workflows, self-verification, and PM usage.
If you only have ~3 minutes, read this first.
If you've heard of AI coding tools but never felt like they were "for you," OpenAI is betting its newest product will change that.
The company launched the Codex app for Mac on February 2, and it hit one million downloads in its first week. Weekly active users have more than tripled since January. It even got a Super Bowl ad.
So what is it? Think of it as a desktop "command center" for AI coding agents. Instead of one agent helping you write one thing at a time, the Codex app lets you run multiple agents in parallel across different projects, each working independently without stepping on each other's code.
Here's what makes it different from existing tools:
The Codex team told Every they write 99% of their own code in it. One engineer even made a children's book for his daughters using the image generation and PDF skills together. A PM on the team uses it to ship code changes without being a programmer, and an automation called "Upskill" makes Codex smarter overnight by fixing its own skills while the team sleeps.
Maybe the most impressive part: agents can now validate their own work. One engineer had Codex refactor logging across dozens of files, then the agent ran the app, found the session ID, and proved logs still worked, all without human help.
Sam Altman called it OpenAI's "most loved internal product ever." The team's next focus? Solving the code review bottleneck; agents produce code faster than humans can verify it.
Available on Mac with ChatGPT Plus, Pro, Business, Enterprise, and Edu plans. Free and Go users get limited-time access too. Windows is coming soon.
Now, let's dive into all of this (and so much more) in more detail below.
First, some context. Codex started as a command-line tool in April 2025, then expanded to a web interface. Both worked fine when you were asking one AI agent to do one thing at a time.
But that's not how developers actually work anymore. People are running multiple agents in parallel, delegating tasks that take hours or days, and jumping between different projects constantly. The old tools weren't designed for that.
Thibault Sottiaux, head of Codex at OpenAI, explained the shift in the Every interview: the terminal starts to feel limiting once you're doing multimodal work (models drawing diagrams, generating images, speaking through voice), running many agents in parallel, and losing track of what's where. The team felt they needed to experiment with something new.
The result is what OpenAI calls a "command center for agents." Not an IDE. Not a terminal. Something different.
Andrew Ambrosino, a technical staff member who helped build the app, put it bluntly: "Actually GUIs are great. IDEs are just the problem. There's something that's a GUI for programming that's not an IDE."
Here's the practical breakdown of what you get when you open the Codex app:
.rules file to allow or block specific commands.
codex -p fast.
exec mode that runs headlessly and outputs structured JSON. You can define an output schema (using OpenAI's structured output format), feed Codex a task like "analyze this app for code quality," and get back clean JSON with file citations, line numbers, severity levels, and descriptions. This makes Codex a building block for automated pipelines: security triage bots, test coverage checkers, release hygiene automation, or even auto-labeling GitHub issues based on the intent of the issue (not just keyword matching). The Codex open source repo already uses this pattern internally. You can even pair Codex with the OpenAI Agents SDK to build multi-agent workflows where a front-end agent, a back-end agent, and a PM agent hand off tasks to each other, with Codex running as an MCP server that any of them can call.
To show off what the Codex app can do, OpenAI had it build Voxel Velocity, a 3D kart racing game with eight tracks, eight characters, item pickups, drifting mechanics, and AI opponents.
The kicker: it used just one initial prompt and then kept working autonomously, consuming more than 7 million tokens total. The agent took on the roles of designer, developer, and QA tester (it actually played the game to test it). OpenAI used the web game development skill and the image generation skill to make it happen.
It's impressive, but also a useful benchmark for where agents are now: they can sustain complex, multi-step projects over long sessions without falling apart.
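Circling back to the headless exec mode from the feature breakdown: here is a minimal sketch of how a pipeline might consume that kind of structured JSON. Everything in it is an assumption for illustration. The field names (`file`, `line`, `severity`, `description`), the severity labels, and the sample payload are hypothetical, chosen only to mirror the "file citations, line numbers, severity levels, and descriptions" mentioned earlier. It is not Codex's actual output format.

```python
import json

# Hypothetical output schema in JSON Schema style. Field names and severity
# labels are assumptions for illustration, not Codex's real format.
FINDINGS_SCHEMA = {
    "type": "object",
    "properties": {
        "findings": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "file": {"type": "string"},
                    "line": {"type": "integer"},
                    "severity": {"type": "string", "enum": ["low", "medium", "high"]},
                    "description": {"type": "string"},
                },
                "required": ["file", "line", "severity", "description"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["findings"],
    "additionalProperties": False,
}

SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def parse_findings(raw: str) -> list[dict]:
    """Parse the agent's JSON and check each finding against the schema's basics."""
    item_schema = FINDINGS_SCHEMA["properties"]["findings"]["items"]
    required = item_schema["required"]
    allowed = item_schema["properties"]["severity"]["enum"]
    findings = json.loads(raw)["findings"]
    for finding in findings:
        missing = [key for key in required if key not in finding]
        if missing:
            raise ValueError(f"finding is missing fields: {missing}")
        if finding["severity"] not in allowed:
            raise ValueError(f"unknown severity: {finding['severity']}")
    return findings


def triage(findings: list[dict]) -> list[str]:
    """Return one report line per finding, most severe first."""
    ordered = sorted(findings, key=lambda f: SEVERITY_ORDER[f["severity"]])
    return [
        f"{f['severity'].upper()} {f['file']}:{f['line']} {f['description']}"
        for f in ordered
    ]


# Made-up payload standing in for what a headless run might return.
SAMPLE_OUTPUT = json.dumps({
    "findings": [
        {"file": "app/utils.py", "line": 7, "severity": "low",
         "description": "Unused import"},
        {"file": "app/auth.py", "line": 42, "severity": "high",
         "description": "Token compared with =="},
    ]
})

if __name__ == "__main__":
    for line in triage(parse_findings(SAMPLE_OUTPUT)):
        print(line)
```

Because the output is validated before anything acts on it, a triage bot or issue labeler built on top can fail loudly on malformed results instead of silently mislabeling them.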
The best product insights came from the Every interview, where the Codex team described their actual workflows. This is where it gets practical.
They write 99% of their code in it. Both Thibault and Andrew confirmed this. Andrew's personal mandate from the start was to build the app using the app itself as fast as possible, specifically to avoid falling into the trap of building something that's "good for somebody else" instead of something you'd actually use.
Automations are their secret weapon. Andrew came up with the feature, and the team runs dozens. In a dedicated walkthrough video, Andrew showed exactly how he's automated away most of the parts of his job that "aren't actually that fun." He breaks automations into a few categories:
Informational automations:
Self-improvement automations (this is where it gets wild):
Maintenance automations:
The "Yeet" skill is a team favorite. It takes whatever changes you've made, writes a commit, creates the PR with title and body, puts it in draft, and publishes. One command, everything done.
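The team didn't share how the skill is implemented, but the sequence it describes can be sketched with ordinary `git` and GitHub CLI (`gh`) commands. Treat this as an assumption-heavy illustration: the function name `build_yeet_plan`, the branch name, and the exact ordering are hypothetical, and the sketch only builds the command list rather than running anything.

```python
import shlex


def build_yeet_plan(branch: str, title: str, body: str) -> list[list[str]]:
    """Build the commit -> push -> draft PR command sequence without running it.

    Hypothetical stand-in for the "Yeet" skill; the real skill may work differently.
    """
    return [
        ["git", "add", "--all"],                  # stage every change
        ["git", "commit", "-m", title],           # commit with the PR title
        ["git", "push", "-u", "origin", branch],  # publish the branch
        ["gh", "pr", "create", "--draft", "--title", title, "--body", body],
    ]


def render(plan: list[list[str]]) -> list[str]:
    """Render each command as a shell-safe string for review or logging."""
    return [shlex.join(command) for command in plan]


if __name__ == "__main__":
    plan = build_yeet_plan(
        branch="fix/login-button",
        title="Fix login button",
        body="Removes the confusing button.",
    )
    for line in render(plan):
        print(line)
```

Keeping the plan as data makes a dry run trivial. To execute it for real, each command list could be passed to `subprocess.run(command, check=True)`.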
Andrew made a children's book. He described using the app to write a personalized picture book for his daughters. He prompted it with a script outline, his family's backstory (they moved from Boston to New York), and then used the image generation skill and PDF skill together. The agent wrote the script, generated illustrations for each page, and assembled a printable PDF.
It's not just for engineers. Alexander Embiricos, a PM on the Codex team, showed how he uses it despite not coding often. His workflow is telling: he noticed a confusing button in the app, checked with the team to confirm it wasn't needed, told Codex to delete it, and got a PR. When the PR had a test failure, he used the Buildkite skill instead of digging through logs himself. But the real insight was what happened next: the skill didn't work perfectly (it asked him to install a token first), so after fixing it, he immediately told Codex to update the skill so nobody hits that problem again. He calls this the "inductive loop": feedback → fix → improve the skill. Over time, Codex gets better and better at working in your codebase. One underrated technique he mentioned: running Codex on "low" reasoning effort for many tasks. It's faster and often good enough.
Self-verification is the real step change. Javi, another engineer, explained why the app has been transformative for him: it's not just that Codex writes code faster, it's that it can now validate its own work. He walked through a logging refactor that touched many files with a clear risk: if anything broke, their observability pipeline would go down and they'd lose the ability to diagnose bug reports in the beta. Before Codex, he'd manually compile the app and check if logs showed up. This time, he told the model to verify logs end-to-end. The agent ran the app, wrote Python code to find the session ID, queried the logs MCP, and proved that logs were still piping after the refactor. When the agent says "done" now, it actually means done, not "I wrote code, good luck compiling it."
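The check Javi describes (run the app, find the session ID, confirm logs arrive for that session) might look roughly like this. It is a sketch under loud assumptions: the `session_id=` log format and the function names are hypothetical, and a plain list of log lines stands in for the logs MCP the agent actually queried.

```python
import re

# Hypothetical log format: each line carries "session_id=<hex-and-dashes>".
SESSION_PATTERN = re.compile(r"session_id=([0-9a-f-]+)")


def find_session_id(app_output: str) -> str:
    """Extract the first session ID printed by the app at startup."""
    match = SESSION_PATTERN.search(app_output)
    if match is None:
        raise ValueError("no session ID found in app output")
    return match.group(1)


def logs_still_piping(session_id: str, log_lines: list[str], minimum: int = 1) -> bool:
    """Return True if at least `minimum` log lines belong to this exact session."""
    count = 0
    for line in log_lines:
        match = SESSION_PATTERN.search(line)
        if match is not None and match.group(1) == session_id:
            count += 1
    return count >= minimum


if __name__ == "__main__":
    # Made-up app output and log lines, standing in for a real run and log query.
    app_output = "starting app\nsession_id=3f2a-91bc started\n"
    fetched_logs = [
        "INFO session_id=3f2a-91bc app launched",
        "INFO session_id=77aa-0000 unrelated session",
        "DEBUG session_id=3f2a-91bc view loaded",
    ]
    session = find_session_id(app_output)
    print(session, logs_still_piping(session, fetched_logs))
```

The point of the pattern is that "done" is backed by evidence from the running system, not by the fact that the code compiled.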
This was a deliberate, contrarian choice. Every other major AI coding tool was either forking VS Code or doubling down on terminals. Thibault described a specific moment when the team seriously asked themselves if they should have forked VS Code too.
They decided against it. Their reasoning: agents are already doing far more than writing code. They're deploying apps, managing project boards, generating images, filing bug reports. Cramming all of that into an IDE would feel weird. And a terminal can't show you a Mermaid diagram, render an image, or let you voice-prompt an agent.
The truck analogy came up in the interview: you might occasionally need an IDE for something specific, but the Codex app should be your daily driver, your home base. Andrew said he still opens an IDE here and there for specific tasks, but then closes it and goes right back to Codex.
Dan Shipper, the interviewer, admitted he was surprised he didn't want to go back to the terminal after using the app. And he'd been a terminal power user for months.
Two new models launched alongside (and shortly after) the app, and both are significant.
GPT-5.3-Codex is OpenAI's most capable agentic coding model. It tops SWE-Bench Pro and Terminal-Bench, uses fewer tokens than prior models, and handles long-running tasks across research, tool use, and complex execution. It's also 25% faster than GPT-5.2-Codex.
But here's the wild part: GPT-5.3-Codex helped create itself. The team used early versions to debug its own training, manage deployment, and diagnose test results.
Thibault explained the workflow change: with 5.2, he'd kick off four tasks expecting them to take 10-15 minutes each. With 5.3, the speed meant less multitasking and more flow state. The model also became more generally capable, making it more reliable for non-code tasks like summarizing Twitter replies, filing Linear bugs, and running automations.
GPT-5.3-Codex-Spark is where things get almost unsettling. This smaller model runs on Cerebras' Wafer Scale Engine 3 and delivers over 1,000 tokens per second. In one demo, it completed a Snake game in 9 seconds compared to 43 seconds for the standard model.
Thibault told a revealing story: the first time he showed the Spark prototype to someone, they said "No way. This is a fake demo. This cannot be this fast." And it's not even at full speed yet; Thibault suggested the team expects to make it two to three times faster with further optimizations.
The infrastructure improvements are impressive across the board. OpenAI rewrote its service stack to use WebSocket persistent connections and more incremental, stateful processing. This decreased overall turn latency by roughly 30-40% across all models, not just Spark.
One unexpected detail: the app actually has to slow down Spark's output slightly so you can read it. The text was hitting the screen as a wall. That's the kind of problem you want to have.
When asked what the next bottleneck will be now that speed is nearly solved, both Thibault and Andrew gave the same answer: verification.
Models can generate code faster than ever. They can implement entire features. But humans still need to verify that things actually work, that designs are consistent, that the button does what it should. Thibault noted that people on the team complain there's too much code to review now, and you're reviewing it twice: once from the agent, and again from your peers.
The Codex app already has a review mode that annotates diffs with findings and stylistic observations. OpenAI's onboarding team noted an important design choice: the model has been specifically trained to focus on P0 and P1 issues only, not stylistic nits. "If it comments, it's like: if I don't fix this, it's going to break in production." Less noise means engineers actually read the comments instead of tuning them out. You can also layer in your own code review guidelines via a separate markdown file referenced in AGENTS.md, so Codex reviews against your team's specific standards.
But the team is also experimenting with agents that test themselves: running the app, clicking around, taking screenshots for evidence, and uploading proof to the PR. The idea is that when an agent can visually demonstrate "here's what the bug looked like, and here's what it looks like now, same click path," code review might matter less because you're verifying the outcome instead of reading the code as a proxy. Javi's logging refactor example is a concrete version of this: the agent didn't just write code; it ran the app, found a session ID, and proved logs still worked. That's the kind of evidence that collapses a risky manual verification loop into minutes.
When asked how they think about Anthropic, Thibault acknowledged that Claude Code got to market first with some of these ideas. But he said OpenAI's models at the time weren't ready for long-horizon reliability and consistent tool calls.
That changed with GPT-5.2 and accelerated with 5.3. The team's advantage, he argued, is the tight feedback loop between product, engineering, and research. They don't just ship product fixes; they sometimes improve the model itself. Example: when users complained about compaction (the process of summarizing context when it gets too long), the team did end-to-end RL training to make the model inherently better at self-delegation across time. The product problem became trivial once the model solved it.
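For readers unfamiliar with compaction, here is a minimal sketch of the basic idea: once a conversation exceeds a token budget, older messages are replaced by a summary while recent ones stay verbatim. This illustrates the general product-side technique only, not OpenAI's implementation or the RL training described above. The whitespace token count is a crude assumption, and `summarize` is a hypothetical caller-supplied function (in practice, a model call).

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def compact(
    messages: list[str],
    budget: int,
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace older messages with one summary once the budget is exceeded."""
    total = sum(count_tokens(message) for message in messages)
    if total <= budget or len(messages) <= keep_recent:
        return messages  # nothing to compact
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    return [summarize(older)] + recent


if __name__ == "__main__":
    history = ["a b c", "d e f", "g h", "i"]
    compacted = compact(
        history,
        budget=5,
        keep_recent=2,
        summarize=lambda old: f"[summary of {len(old)} messages]",
    )
    print(compacted)
```

The hard part in practice is what the summary keeps, which is why the team's fix at the model level mattered more than any wrapper logic like this.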
OpenAI released a comprehensive getting-started walkthrough covering installation, setup, and best practices. A few highlights worth calling out:
Codex is available on macOS (Apple Silicon) and included with ChatGPT Plus, Pro, Business, Enterprise, and Edu plans. For a limited time, it's also available to Free and Go users, though Sam Altman has warned that limits may be reduced after the promotion. Windows support is coming soon.
Thibault was clear about the audience: you should be fairly technical, comfortable reading code, and understand that actual code is being written and executed on your machine. For people who aren't technical at all, there will eventually be a similar experience inside ChatGPT with different guardrails (no scary terminal prompts).
For enterprise teams, there's admin setup with role-based access control, zero data retention, and the ability to enforce rules and share skills through team configuration.
The Codex app isn't just another developer tool. It represents a genuine interface shift in how people interact with AI agents.
For the past year, the question was "how smart can we make these agents?" Now the question is becoming "how do humans manage and direct agents that are already really capable?" The Codex app is OpenAI's answer: give people a visual command center where they can run, steer, and review multiple agents working in parallel.
The velocity is hard to ignore. Over a million downloads in week one. Weekly users tripling in under two months. A Super Bowl ad. OpenAI's own Codex team using it for 99% of their coding. The models getting faster, not just smarter.
If you write code (or manage people who do), this is worth paying attention to. As Thibault put it in the official demo: building with the Codex app means "spending a lot less time writing code and a lot more time creating, refining ideas, and bringing them to life." And if the direction holds, the Codex team's vision of agents that handle everything from deploys to bug fixes to project management isn't a three-year roadmap. It's happening now.
Property of TechnologyAdvice. © 2026 TechnologyAdvice. All Rights Reserved