OpenAI Codex App Deep Dive: How It Works, Skills, And More

On February 2, OpenAI launched the Codex app for macOS, and within a week, more than one million people downloaded it. Then came GPT-5.3-Codex, a new flagship model. Then GPT-5.3-Codex-Spark, an ultra-fast variant running on Cerebras chips at over 1,000 tokens per second. Then a Super Bowl ad. Weekly active users have more than tripled since January.

That's a lot of momentum for a product most people couldn't define three months ago.

So what exactly IS the Codex app? Why did OpenAI build a desktop app instead of just improving the terminal? And how is OpenAI's own team using it to build… itself?

To give you the full picture, we dug into everything OpenAI published at launch, plus an in-depth interview with the Codex team by Every's Dan Shipper, the official app demo, a getting-started walkthrough from OpenAI's onboarding team, and behind-the-scenes videos from engineers showing automations, worktrees, Figma-to-code workflows, self-verification, and PM usage.

First up, the TL;DR
The Big Idea: A Desktop App for Managing AI Workers
What the Codex App Actually Does
The Big Demo: A Racing Game Built With 7 Million Tokens
How OpenAI's Own Team Uses Codex (The Really Interesting Part)
Why a GUI? (And Why Not Just Improve the Terminal?)
The Speed Breakthrough: GPT-5.3-Codex and Spark
The Next Bottleneck: Code Review
What About Claude Code and the Competition?
Getting Started: Tips From OpenAI's Onboarding Team
Who Should Actually Use It?
Why It Matters

First up, the TL;DR

If you only have ~3 minutes, read this first.

If you've heard of AI coding tools but never felt like they were "for you," OpenAI is betting its newest product will change that.

The company launched the Codex app for Mac on February 2, and it hit one million downloads in its first week. Weekly active users have more than tripled since January. It even got a Super Bowl ad.

So what is it? Think of it as a desktop "command center" for AI coding agents. Instead of one agent helping you write one thing at a time, the Codex app lets you run multiple agents in parallel across different projects, each working independently without stepping on each other's code.

Here's what makes it different from existing tools:

Skills let agents do more than write code. They can deploy apps to Vercel, manage Linear boards, generate images, and create documents.
Automations run agents on a schedule in the background. One OpenAI engineer has an automation that picks a random file and hunts for bugs multiple times a day. It actually catches real issues.
Two new models power everything. GPT-5.3-Codex is the smartest coding model yet (and it helped debug its own training). GPT-5.3-Codex-Spark runs on Cerebras chips at 1,000+ tokens per second, so fast that the app has to slow down the output so you can read it.

The Codex team told Every they write 99% of their own code in it. One engineer even made a children's book for his daughters using the image generation and PDF skills together. A PM on the team uses it to ship code changes without being a programmer, and an automation called "Upskill" makes Codex smarter overnight by fixing its own skills while the team sleeps.

Maybe the most impressive part: agents can now validate their own work. One engineer had Codex refactor logging across dozens of files, then the agent ran the app, found the session ID, and proved logs still worked, all without human help.

Sam Altman called it OpenAI's "most loved internal product ever." The team's next focus? Solving the code review bottleneck: agents produce code faster than humans can verify it.

Available on Mac with ChatGPT Plus, Pro, Business, Enterprise, and Edu plans. Free and Go users get limited-time access too. Windows is coming soon.

Now, let's dive into all of this (and so much more) in more detail below.

The Big Idea: A Desktop App for Managing AI Workers

First, some context. Codex started as a command-line tool in April 2025, then expanded to a web interface. Both worked fine when you were asking one AI agent to do one thing at a time.

But that's not how developers actually work anymore. People are running multiple agents in parallel, delegating tasks that take hours or days, and jumping between different projects constantly. The old tools weren't designed for that.

Thibault Sottiaux, head of Codex at OpenAI, explained the shift in the Every interview: the terminal starts to feel limiting once you're doing multimodal work (models drawing diagrams, generating images, speaking through voice), running many agents in parallel, and losing track of what's where. The team felt they needed to experiment with something new.

The result is what OpenAI calls a "command center for agents." Not an IDE. Not a terminal. Something different.

Andrew Ambrosino, a technical staff member who helped build the app, put it bluntly: "Actually GUIs are great. IDEs are just the problem. There's something that's a GUI for programming that's not an IDE."

What the Codex App Actually Does

Here's the practical breakdown of what you get when you open the Codex app:

Run multiple agents at once. Each agent runs in its own thread, organized by project. You can switch between them without losing context. Think of it like having several junior developers working on different tasks, and you're checking in on each one. The app lets you review changes inline, comment on diffs, and open code in your editor for manual tweaks. In the official demo, Thibault shows this in action: he kicks off a new feature for an iOS app, and while that agent works, he switches to a completely different project where another agent is migrating from WebSockets to WebRTC. Some tasks take minutes; others take hours. The shift is from writing code to supervising agents and checking in when they're done.
Voice dictation and pop-out windows. Two small features that change the feel of the app more than you'd expect. You can dictate prompts with your voice instead of typing, which sounds gimmicky until you realize how much faster it is to describe what you want out loud. And for visual tasks, you can pop the conversation out into a floating window and place it next to whatever you're building. Thibault demos this with a fitness tracker: he pops out the chat, dictates "animate the bars to simulate progress," and watches the changes apply live with hot reload. It feels less like coding and more like collaborating with a teammate.
Worktrees keep agents from stepping on each other. This is a clever Git feature. Each agent works on an isolated copy of your code using built-in worktree support, so they can't create merge conflicts. You can explore different implementation paths in parallel and merge the best one when you're ready. Joey, an engineer on the Codex team, demonstrated this live: he kicked off a drag-and-drop feature in one worktree, then kept working locally on a completely separate task. When he spotted a bug in the agent's output (it was creating a branch twice), he just left an inline comment asking why, and the agent corrected course while he moved on. He says he typically has 10 to 20 PRs open at any given time now. The mindset shift, he says, is going from focusing on individual lines to overall architecture and flow control.
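To make the worktree idea concrete, here is a minimal sketch of the Git command that gives each task its own checkout. It only builds the command rather than running it. The branch naming scheme (`agent/<task>`) and the sibling-directory layout are assumptions for illustration, not how the Codex app names things internally.

```python
from pathlib import Path


def worktree_command(repo: Path, task: str) -> list[str]:
    """Build a `git worktree add` command that gives one task its own checkout."""
    branch = f"agent/{task}"                        # hypothetical branch naming scheme
    checkout = repo.parent / f"{repo.name}-{task}"  # sibling directory, also an assumption
    # `git worktree add -b <branch> <path>` creates the branch and checks it out at <path>,
    # so edits made there never touch the files in the main checkout.
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(checkout)]


# Two tasks, two isolated checkouts of the same repository.
for task in ("drag-and-drop", "fix-login"):
    print(" ".join(worktree_command(Path("/tmp/my-app"), task)))
```

Because every task gets a separate directory and branch, two agents can edit the same file at the same time, and any conflict is deferred until you choose which branch to merge.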
Skills make agents do more than write code. Skills are folders of instructions, scripts, and resources that tell agents how to perform specific tasks. OpenAI open-sourced a catalog of them, and the format itself was originally developed by Anthropic as an open standard (interesting factoid). Built-in skills include:
1. Linear integration: Triage bugs, track releases, and manage project boards.
2. Cloud deployment: Deploy web apps to Cloudflare, Netlify, Render, or Vercel.
3. Image generation: Create and edit images using GPT Image for websites, UI mockups, and game assets.
4. Document creation: Read, create, and edit PDFs, spreadsheets, and Word docs.
5. OpenAI API docs: Reference up-to-date documentation when building with OpenAI APIs.
6. Figma implementation: Fetch designs from Figma and translate them into production-ready UI code.
The Figma implementation deserves extra attention. Ed Bayes, a designer on the Codex team, showed how it works in practice: there's a one-click MCP install for Figma, and then you just paste a Figma link into Codex. The key detail? Codex isn't working from screenshots. It's actually reading the structure of the design file, including variables, spacing values, and text styles, to generate real code using your design system. Ed says it gets 80-90% of the way there on a first pass, which saves 2-3x the time versus building from scratch. In the official app demo, Thibault confirmed the fitness tracker app was built entirely using the Figma skill, which used MCP behind the scenes. Ed also made a point about AI UX design specifically: because LLM outputs are dynamic and non-deterministic, static Figma mockups alone aren't enough to design for AI. You need interactive prototypes to stress test edge cases, which is exactly what Codex helps you build quickly.
You can also create your own skills and share them across your team through a team config.
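If "folders of instructions, scripts, and resources" sounds abstract, the sketch below scaffolds one. It assumes the commonly documented skill layout: a folder containing a `SKILL.md` file whose front matter carries a `name` and a `description`. The skill name, the `scripts/` subfolder, and the instructions are made up for illustration, so check the published catalog for the exact conventions.

```python
import tempfile
from pathlib import Path

# Assumed layout: <skill>/SKILL.md with name/description front matter.
SKILL_TEMPLATE = """\
---
name: {name}
description: {description}
---

# {name}

{instructions}
"""


def scaffold_skill(root: Path, name: str, description: str, instructions: str) -> Path:
    """Create <root>/<name>/SKILL.md plus an empty scripts/ folder, and return the file path."""
    skill_dir = root / name
    (skill_dir / "scripts").mkdir(parents=True, exist_ok=True)
    skill_file = skill_dir / "SKILL.md"
    skill_file.write_text(
        SKILL_TEMPLATE.format(name=name, description=description, instructions=instructions),
        encoding="utf-8",
    )
    return skill_file


# Hypothetical skill, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = scaffold_skill(
        Path(tmp),
        name="release-notes",
        description="Draft release notes from merged PRs.",
        instructions="1. List PRs merged since the last tag.\n2. Group them by area.",
    )
    print(path.read_text(encoding="utf-8"))
```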
Automations run agents in the background on a schedule. This is maybe the most underrated feature. You set up instructions, pair them with skills, define a schedule, and Codex works in the background. Results land in a review queue for you to check later. More on how OpenAI uses these below.
Built-in security sandbox. By default, agents can only edit files in their working folder and use cached web search. Anything requiring elevated permissions (like network access) triggers a prompt. You can customize rules using a Starlark-based .rules file to allow or block specific commands.
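The allow-or-block idea can be sketched as prefix matching on the command line. To be clear, this is plain Python, not the Starlark syntax of a real .rules file, and the rule list and decision names (`allow`, `prompt`, `forbid`) are hypothetical.

```python
import shlex

# Hypothetical policy: (command prefix, decision). Not Codex's actual rule format.
RULES = [
    (["git", "status"], "allow"),
    (["git", "push"], "prompt"),
    (["rm", "-rf"], "forbid"),
]


def decide(command: str, rules=RULES, default: str = "prompt") -> str:
    """Return the decision of the longest rule whose prefix matches the command."""
    argv = shlex.split(command)
    best_len, decision = -1, default
    for prefix, rule_decision in rules:
        if argv[: len(prefix)] == prefix and len(prefix) > best_len:
            best_len, decision = len(prefix), rule_decision
    return decision


print(decide("git status --short"))        # allow
print(decide("rm -rf build"))              # forbid
print(decide("curl https://example.com"))  # prompt (no rule matched, so ask the user)
```

Defaulting to "prompt" when nothing matches mirrors the behavior described above, where anything outside the sandbox asks for permission first.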
Two personalities. You can choose between a pragmatic, terse coding partner or a more conversational, supportive one. No change in capabilities; just vibes. (Both Thibault and Andrew use pragmatic, if you're curious.)
AGENTS.md: The cheat sheet your AI reads every time. One of the most important concepts for getting good results from Codex is AGENTS.md, a markdown file that lives in your project repo and automatically loads into context every time Codex starts a session. Since coding agents don't retain context between sessions, this file is how you give Codex the TL;DR on your project: how to build it, how to run tests, which CLI tools to use, and what workflow to follow. OpenAI's onboarding team recommends keeping them under 100 lines (that's what most of OpenAI's internal ones look like) and focusing on "unlocking agentic loops," meaning giving the agent ways to verify its own work through linters, tests, and other tools. You can also point AGENTS.md to task-specific docs in subdirectories, so the agent can do progressive discovery as it dives deeper into your codebase. For multi-hour tasks, teams use a PLANS.md template that Codex treats as a living checklist, updating its own progress as it goes. One OpenAI engineer used this pattern to run a 10+ hour refactor successfully.
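Here is a small sketch of what a lean AGENTS.md might contain, plus a check against the "under 100 lines" guideline. The sample file contents and the keyword heuristic for "does it mention a way to verify work" are assumptions, not an official linter.

```python
MAX_LINES = 100  # the guideline quoted above

# Hypothetical AGENTS.md contents for an imaginary project.
SAMPLE_AGENTS_MD = """\
# Notes for coding agents
- Build: make build
- Test: make test
- Lint: make lint
- Workflow: run tests and the linter before opening a PR
"""


def agents_md_report(text: str, max_lines: int = MAX_LINES) -> dict:
    """Summarize length and whether the file points the agent at verification tools."""
    lines = text.splitlines()
    lowered = text.lower()
    return {
        "lines": len(lines),
        "within_budget": len(lines) < max_lines,
        # Crude heuristic: does the file mention tests or linting at all?
        "mentions_verification": any(word in lowered for word in ("test", "lint")),
    }


print(agents_md_report(SAMPLE_AGENTS_MD))
```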
Configuration profiles for different work modes. The config.toml file lets you customize everything from default model to sandbox mode to approval policies. A neat trick from the onboarding walkthrough: you can create named profiles for different work styles. Want the fastest possible responses? Create a "fast" profile that defaults to the lightest model with low reasoning effort, then launch it with codex -p fast.
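Conceptually, a named profile is a set of overrides layered on top of your defaults. The sketch below shows that layering with plain dictionaries. The key names and model names are placeholders, and whether Codex merges settings exactly this way is an assumption.

```python
from typing import Optional

# Placeholder settings; key and model names are illustrative, not copied from config.toml docs.
BASE = {"model": "default-model", "reasoning_effort": "medium", "sandbox": "workspace-write"}
PROFILES = {
    "fast": {"model": "lightest-model", "reasoning_effort": "low"},
    "careful": {"reasoning_effort": "high"},
}


def resolve(profile: Optional[str] = None) -> dict:
    """Start from the defaults, then apply the named profile's overrides on top."""
    settings = dict(BASE)
    if profile is not None:
        settings.update(PROFILES[profile])
    return settings


print(resolve("fast"))
# {'model': 'lightest-model', 'reasoning_effort': 'low', 'sandbox': 'workspace-write'}
```

Anything a profile does not mention, such as the sandbox mode here, falls through from the defaults.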
Cloud delegation with the same interface. Tasks that take hours or need to run while your laptop is closed can be delegated to Codex cloud with the exact same interface. You just toggle from "local" to "cloud" and the agent runs on a remote container. This is ideal for async work like code review, or for tasks you want running while you're on the go and checking in from your phone.
Headless mode for CI/CD pipelines. For advanced users, Codex CLI has an exec mode that runs headlessly and outputs structured JSON. You can define an output schema (using OpenAI's structured output format), feed Codex a task like "analyze this app for code quality," and get back clean JSON with file citations, line numbers, severity levels, and descriptions. This makes Codex a building block for automated pipelines: security triage bots, test coverage checkers, release hygiene automation, or even auto-labeling GitHub issues based on the intent of the issue (not just keyword matching). The Codex open source repo already uses this pattern internally. You can even pair Codex with the OpenAI Agents SDK to build multi-agent workflows where a front-end agent, a back-end agent, and a PM agent hand off tasks to each other, with Codex running as an MCP server that any of them can call.
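To show what "define an output schema, get back clean JSON" could look like on the consuming side, here is a sketch of a JSON Schema for findings and a function that filters a response down to the issues worth blocking on. The field names, severity labels, and the sample response are all hand-written assumptions. Nothing below calls Codex or reproduces its real output.

```python
import json

# Hypothetical schema covering the fields the article lists: file, line, severity, description.
FINDINGS_SCHEMA = {
    "type": "object",
    "properties": {
        "findings": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "file": {"type": "string"},
                    "line": {"type": "integer"},
                    "severity": {"type": "string", "enum": ["high", "medium", "low"]},
                    "description": {"type": "string"},
                },
                "required": ["file", "line", "severity", "description"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["findings"],
    "additionalProperties": False,
}


def blocking_findings(raw_json: str, blocking=("high",)) -> list[dict]:
    """Parse a response, check required fields, and keep only blocking severities."""
    payload = json.loads(raw_json)
    required = FINDINGS_SCHEMA["properties"]["findings"]["items"]["required"]
    results = []
    for finding in payload["findings"]:
        missing = [key for key in required if key not in finding]
        if missing:
            raise ValueError(f"finding is missing fields: {missing}")
        if finding["severity"] in blocking:
            results.append(finding)
    return results


# Hand-written sample response, not real Codex output.
SAMPLE = json.dumps({"findings": [
    {"file": "app/db.py", "line": 42, "severity": "high", "description": "SQL built by string concat"},
    {"file": "app/ui.py", "line": 7, "severity": "low", "description": "Unused import"},
]})

for finding in blocking_findings(SAMPLE):
    print(f'{finding["file"]}:{finding["line"]} {finding["description"]}')
```

A CI job could fail the build whenever this list is non-empty, which is the "building block" role described above.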

The Big Demo: A Racing Game Built With 7 Million Tokens

To show off what the Codex app can do, OpenAI had it build Voxel Velocity, a 3D kart racing game with eight tracks, eight characters, item pickups, drifting mechanics, and AI opponents.

The kicker: it used just one initial prompt and then kept working autonomously, consuming more than 7 million tokens total. The agent took on the roles of designer, developer, and QA tester (it actually played the game to test it). OpenAI used the web game development skill and the image generation skill to make it happen.

It's impressive, but also a useful benchmark for where agents are now: they can sustain complex, multi-step projects over long sessions without falling apart.

How OpenAI's Own Team Uses Codex (The Really Interesting Part)

The best product insights came from the Every interview, where the Codex team described their actual workflows. This is where it gets practical.

They write 99% of their code in it. Both Thibault and Andrew confirmed this. Andrew's personal mandate from the start was to build the app using the app itself as fast as possible, specifically to avoid falling into the trap of building something that's "good for somebody else" instead of something you'd actually use.

Automations are their secret weapon. Andrew came up with the feature, and the team runs dozens. In a dedicated walkthrough video, Andrew showed exactly how he's automated away most of the parts of his job that "aren't actually that fun." He breaks automations into a few categories:

Informational automations:

Morning commit pulse: Every morning, Andrew wakes up to a summary of the last day's commits in the Codex monorepo, grouped by who worked on what and what he needs to know. He compared it to ChatGPT Pulse, but personalized for the codebase. (This matches what he described in the Every interview, but the walkthrough shows it in the actual automations tab.)
Marketing research: A daily automation with a custom skill does deep marketing research, searching the web for how users are talking about Codex.

Self-improvement automations (this is where it gets wild):

Upskill: This one looks at the past day of skill usage, detects if Codex had trouble with any skills (scripts that didn't work, things that could be sped up), and then makes improvements to the skills overnight. Andrew's description is memorable: "I'm going to sleep, I wake up, Codex is smarter in the morning."
AGENTS.md auto-updates: Runs every six hours. It looks at what Andrew and Codex have been working on, identifies misunderstandings or shorthands that Codex wasn't familiar with, and adds them to the personalization so the next interaction is faster.

Maintenance automations:

Sentry triage: This picks off one of the top Sentry issues (crashes, performance regressions, errors), digs through all the logs and source maps, looks at the codebase, and picks something to fix. The critical detail: automations have memory across runs. So it remembers what it tried to solve last time, meaning you don't get a PR for the same issue every hour.
Green PRs: The expanded version of what Andrew described in the Every interview. It uses BuildKite and GitHub skills to check all open PRs (he has 10-20 at any given time), update base branches, and intelligently resolve merge conflicts. Not just cleaning up conflict markers; it actually looks at what each person was trying to do and resolves conflicts based on intent. "A very long way of saying my PRs are always green."
Random bug hunting: Runs multiple times a day, picks a random file, and tries to find and fix subtle bugs. Thibault says it actually catches latent bugs that aren't on the critical path but are real. It recently found an issue in constraint sampling.
Quiet bug cleanup: Looks at PRs from the past day, checks observability platforms for issues, and ships fixes before anyone notices the bug was there.
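The "memory across runs" detail from the Sentry triage item is what keeps a scheduled job from redoing the same work. The article doesn't say how Codex stores that memory, so the sketch below stands in a small JSON file. The issue IDs and file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def pick_next_issue(open_issues: list[str], memory_file: Path) -> Optional[str]:
    """Return the first issue not attempted on a previous run, and remember it."""
    seen = set(json.loads(memory_file.read_text())) if memory_file.exists() else set()
    for issue in open_issues:
        if issue not in seen:
            seen.add(issue)
            memory_file.write_text(json.dumps(sorted(seen)))
            return issue
    return None  # everything has already been attempted


# Three scheduled runs against the same ranked list of (made-up) issues.
with tempfile.TemporaryDirectory() as tmp:
    memory = Path(tmp) / "triage-memory.json"
    top_issues = ["ISSUE-1", "ISSUE-2"]
    print(pick_next_issue(top_issues, memory))  # ISSUE-1
    print(pick_next_issue(top_issues, memory))  # ISSUE-2
    print(pick_next_issue(top_issues, memory))  # None
```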

The "Yeet" skill is a team favorite. It takes whatever changes you've made, writes a commit, creates the PR with title and body, puts it in draft, and publishes. One command, everything done.
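The team's actual skill isn't published in the material we reviewed, but the steps it describes map onto a short sequence of standard `git` and GitHub CLI (`gh`) commands. This sketch only builds that sequence. The branch name, title, and body are placeholders, and in the real skill the agent writes the commit message and PR text itself.

```python
def yeet_commands(branch: str, title: str, body: str) -> list[list[str]]:
    """Build the commit, push, and draft-PR commands for the current changes."""
    return [
        ["git", "add", "--all"],                               # stage whatever changed
        ["git", "commit", "--message", title],                 # write the commit
        ["git", "push", "--set-upstream", "origin", branch],   # publish the branch
        ["gh", "pr", "create", "--draft", "--title", title, "--body", body],  # open a draft PR
    ]


for command in yeet_commands("remove-confusing-button", "Remove unused button", "Deletes a button that was not needed."):
    print(" ".join(command))
```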

Andrew made a children's book. He described using the app to write a personalized picture book for his daughters. He prompted it with a script outline, his family's backstory (they moved from Boston to New York), and then used the image generation skill and PDF skill together. The agent wrote the script, generated illustrations for each page, and assembled a printable PDF.

It's not just for engineers. Alexander Embiricos, a PM on the Codex team, showed how he uses it despite not coding often. His workflow is telling: he noticed a confusing button in the app, checked with the team to confirm it wasn't needed, told Codex to delete it, and got a PR. When the PR had a test failure, he used the BuildKite skill instead of digging through logs himself. But the real insight was what happened next: the skill didn't work perfectly (it asked him to install a token first), so after fixing it, he immediately told Codex to update the skill so nobody hits that problem again. He calls this the "inductive loop": feedback → fix → improve the skill. Over time, Codex gets better and better at working in your codebase. One underrated technique he mentioned: running Codex on "low" reasoning effort for many tasks. It's faster and often good enough.

Self-verification is the real step change. Javi, another engineer, explained why the app has been transformative for him: it's not just that Codex writes code faster, it's that it can now validate its own work. He walked through a logging refactor that touched many files with a clear risk: if anything broke, their observability pipeline would go down and they'd lose the ability to diagnose bug reports in the beta. Before Codex, he'd manually compile the app and check if logs showed up. This time, he told the model to verify logs end-to-end. The agent ran the app, wrote Python code to find the session ID, queried the logs MCP, and proved that logs were still piping after the refactor. When the agent says "done" now, it actually means done, not "I wrote code, good luck compiling it."
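The shape of that check is simple enough to sketch: pull the session ID out of the app's startup output, then confirm that log records carrying the same ID actually arrived. The log line format and the record structure below are invented for illustration, and a plain list stands in for the logs MCP query.

```python
import re
from typing import Optional

# Assumed format: the app prints "session_id=<id>" somewhere in its startup output.
SESSION_RE = re.compile(r"session_id=([\w-]+)")


def find_session_id(startup_output: str) -> Optional[str]:
    """Extract the first session ID from the app's output, if there is one."""
    match = SESSION_RE.search(startup_output)
    return match.group(1) if match else None


def logs_are_flowing(session_id: str, log_records: list[dict]) -> bool:
    """True if at least one stored log record belongs to this session."""
    return any(record.get("session_id") == session_id for record in log_records)


# Made-up startup output and log records.
output = "INFO app started session_id=ab12-cd34 build=dev"
records = [{"session_id": "ab12-cd34", "message": "app started"}]

session = find_session_id(output)
print(session, logs_are_flowing(session, records))  # ab12-cd34 True
```

The point is less the regex than the loop: the agent produces evidence that the pipeline still works instead of asking a human to go and look.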

Why a GUI? (And Why Not Just Improve the Terminal?)

This was a deliberate, contrarian choice. Every other major AI coding tool was either forking VS Code or doubling down on terminals. Thibault described a specific moment when the team seriously asked themselves if they should have forked VS Code too.

They decided against it. Their reasoning: agents are already doing far more than writing code. They're deploying apps, managing project boards, generating images, filing bug reports. Cramming all of that into an IDE would feel weird. And a terminal can't show you a Mermaid diagram, render an image, or let you voice-prompt an agent.

The truck analogy came up in the interview: you might occasionally need an IDE for something specific, but the Codex app should be your daily driver, your home base. Andrew said he still opens an IDE here and there for specific tasks, but then closes it and goes right back to Codex.

Dan Shipper, the interviewer, admitted he was surprised he didn't want to go back to the terminal after using the app. And he'd been a terminal power user for months.

The Speed Breakthrough: GPT-5.3-Codex and Spark

Two new models launched alongside (and shortly after) the app, and both are significant.

GPT-5.3-Codex is OpenAI's most capable agentic coding model. It tops SWE-Bench Pro and Terminal-Bench, uses fewer tokens than prior models, and handles long-running tasks across research, tool use, and complex execution. It's also 25% faster than GPT-5.2-Codex.

But here's the wild part: GPT-5.3-Codex helped create itself. The team used early versions to debug its own training, manage deployment, and diagnose test results.

Thibault explained the workflow change: with 5.2, he'd kick off four tasks expecting them to take 10-15 minutes each. With 5.3, the speed meant less multitasking and more flow state. The model also became more generally capable, making it more reliable for non-code tasks like summarizing Twitter replies, filing Linear bugs, and running automations.

GPT-5.3-Codex-Spark is where things get almost unsettling. This smaller model runs on Cerebras' Wafer Scale Engine 3 and delivers over 1,000 tokens per second. In one demo, it completed a Snake game in 9 seconds compared to 43 seconds for the standard model.

Thibault told a revealing story: the first time he showed the Spark prototype to someone, they said "No way. This is a fake demo. This cannot be this fast." And it's not even at full speed yet; Thibault suggested the team expects to make it two to three times faster with further optimizations.

The infrastructure improvements are impressive across the board. OpenAI rewrote its service stack to use WebSocket persistent connections and more incremental, stateful processing. This decreased overall turn latency by roughly 30-40% across all models, not just Spark.

One unexpected detail: the app actually has to slow down Spark's output slightly so you can read it. The text was hitting the screen as a wall. That's the kind of problem you want to have.

The Next Bottleneck: Code Review

When asked what the next bottleneck will be now that speed is nearly solved, both Thibault and Andrew gave the same answer: verification.

Models can generate code faster than ever. They can implement entire features. But humans still need to verify that things actually work, that designs are consistent, that the button does what it should. Thibault noted that people on the team complain there's too much code to review now, and you're reviewing it twice: once from the agent, and again from your peers.

The Codex app already has a review mode that annotates diffs with findings and stylistic observations. OpenAI's onboarding team noted an important design choice: the model has been specifically trained to focus on P0 and P1 issues only, not stylistic nits. "If it comments, it's like: if I don't fix this, it's going to break in production." Less noise means engineers actually read the comments instead of tuning them out. You can also layer in your own code review guidelines via a separate markdown file referenced in AGENTS.md, so Codex reviews against your team's specific standards.

But the team is also experimenting with agents that test themselves: running the app, clicking around, taking screenshots for evidence, and uploading proof to the PR. The idea is that when an agent can visually demonstrate "here's what the bug looked like, and here's what it looks like now, same click path," code review might matter less because you're verifying the outcome instead of reading the code as a proxy. Javi's logging refactor example is a concrete version of this: the agent didn't just write code; it ran the app, found a session ID, and proved logs still worked. That's the kind of evidence that collapses a risky manual verification loop into minutes.

What About Claude Code and the Competition?

When asked how they think about Anthropic, Thibault acknowledged that Claude Code got to market first with some of these ideas. But he said OpenAI's models at the time weren't ready for long-horizon reliability and consistent tool calls.

That changed with GPT-5.2 and accelerated with 5.3. The team's advantage, he argued, is the tight feedback loop between product, engineering, and research. They don't just ship product fixes; they sometimes improve the model itself. Example: when users complained about compaction (the process of summarizing context when it gets too long), the team did end-to-end RL training to make the model inherently better at self-delegation across time. The product problem became trivial once the model solved it.

Getting Started: Tips From OpenAI's Onboarding Team

OpenAI released a comprehensive getting-started walkthrough covering installation, setup, and best practices. A few highlights worth calling out:

MCP (Model Context Protocol) connections extend Codex's reach. Beyond skills, you can connect Codex to external tools through MCP servers. Popular ones include Figma (for design-to-code), Jira and Linear (see what tickets are assigned to you, make code changes, mark tickets done), Datadog (diagnose production issues), and Context7 (pull in the most up-to-date documentation for any framework, since models have knowledge cutoff dates). You can even add a line to your global AGENTS.md like "when implementing features with external libraries, always search Context7 first," and Codex will do it automatically without you specifying it each time.
Prompting patterns that actually matter. Use @ mentions to anchor Codex to specific files (prevents it from wandering into irrelevant parts of the codebase). Start small before going big. Paste full stack traces for debugging. Include verification steps in your prompts. And try open-ended prompts for brainstorming; Codex is surprisingly good at looking at your codebase holistically and suggesting what to build next. You can also paste screenshots and ask Codex to make changes based on what it sees, which is faster than describing "the third button from the left."
Best starter tasks if you're trying Codex for the first time: explain the codebase, paste a bug's stack trace and have it fix it, expand test coverage for edge cases, or do a refactor across many files. Writing documentation is an underrated sweet spot because engineers hate doing it and Codex can keep docs updated as you build.

Who Should Actually Use It?

Codex is available on macOS (Apple Silicon) and included with ChatGPT Plus, Pro, Business, Enterprise, and Edu plans. For a limited time, it's also available to Free and Go users, though Sam Altman has warned that limits may be reduced after the promotion. Windows support is coming soon.

Thibault was clear about the audience: you should be fairly technical, comfortable reading code, and understand that actual code is being written and executed on your machine. For people who aren't technical at all, there will eventually be a similar experience inside ChatGPT with different guardrails (no scary terminal prompts).

For enterprise teams, there's admin setup with role-based access control, zero data retention, and the ability to enforce rules and share skills through team configuration.

Why It Matters

The Codex app isn't just another developer tool. It represents a genuine interface shift in how people interact with AI agents.

For the past year, the question was "how smart can we make these agents?" Now the question is becoming "how do humans manage and direct agents that are already really capable?" The Codex app is OpenAI's answer: give people a visual command center where they can run, steer, and review multiple agents working in parallel.

The velocity is hard to ignore. Over a million downloads in week one. Weekly users tripling in under two months. A Super Bowl ad. OpenAI's own Codex team using it for 99% of their coding. The models getting faster, not just smarter.

If you write code (or manage people who do), this is worth paying attention to. As Thibault put it in the official demo: building with the Codex app means "spending a lot less time writing code and a lot more time creating, refining ideas, and bringing them to life." And if the direction holds, the Codex team's vision of agents that handle everything from deploys to bug fixes to project management isn't a three-year roadmap. It's happening now.

Download the Codex app here.