OpenAI Harness Engineering: Ship 1M Lines of Code w/ Agents

You've heard the pitch before: AI will write all your code. Usually it comes with a demo of someone building a to-do app in 30 seconds and a hand-wave about "the future."

OpenAI's latest Build Hour was different. Instead of flashy demos divorced from reality, it showed what happens when a team actually commits to the bit: zero manually-written code for five months, shipping a real product to real users, with a million lines of agent-generated code.

The session, hosted by Christine from OpenAI's startup marketing team alongside Charlie Guo (DevEx) and Ryan Lopopolo (Future of Work), plus a customer spotlight from Mitch Troyanovsky, co-founder of Basis, covered the full stack: new API primitives, a practical framework for measuring "agent readiness," and hard-won lessons from production-scale agentic coding.

Whether you're a developer, an engineering manager, or just someone trying to understand where software is going, here's everything worth knowing.

First up, the TL;DR
What's New in Codex and the API
Demo: The Agent Legibility Score
Harness Engineering: What It Looks Like to Ship 1M Lines of Agent Code
Customer Spotlight: How Basis Applies Harness Engineering
Quick-Fire Q&A Highlights
How to Apply This: A Step-by-Step Implementation Guide
The Bigger Picture

First up, the TL;DR

If you only have 5 minutes, read this.

The Three Phases of AI Coding

Charlie opened with a useful mental model for where we are in AI-assisted development:

Phase 1: Autocomplete. Ghost text, tab-complete suggestions. Useful but limited.
Phase 2: Pair programming. A model sits in your IDE (your code editor, like VS Code or Cursor), generates code, merges changes. This is where most developers are today.
Phase 3: Agent delegation. You manage multiple agents across increasingly large tasks and workflows. You're the manager; they're the team.

The capability jump from GPT-5.2 to GPT-5.4 is what made Phase 3 practical. As Charlie put it, the Codex app has "mostly replaced the IDE for me at this point."

What's New in Codex and the API

Before we get into the meat, a quick terminology note: an API (Application Programming Interface) is how developers build their own apps on top of OpenAI's models. Codex is OpenAI's coding agent; think of it as an AI employee that can read your codebase, write code, run tests, and submit changes for review.

The Codex App

The Codex desktop app is now available on Windows (we covered that launch separately) with native sandboxing (meaning it runs in an isolated environment so the agent can't accidentally mess with your real files). Key features that shipped recently:

Skills and Apps: Skills give the agent context on how to use tools well. Apps (formerly "connectors") wire Codex into the tools you already use, like Google Calendar, Slack, and Linear. Ryan described skills as "context that we give to the agent to show them to what end do you use the tools you have and how do you use your tools well."
Work trees: Think of these as separate copies of your codebase that let you run multiple agents on different tasks simultaneously without them stepping on each other's changes. (In Git, the version control system that tracks all code changes, a "worktree" is an additional working directory attached to the same repository, so each agent gets its own isolated workspace.) One click hands the work back to your local branch when you're ready to merge.
Automations: Scheduled commands that run on a regular cadence, like a recurring calendar event but for code tasks. Ryan uses one to review all his open PRs (pull requests, i.e. proposed code changes waiting for review) and resolve merge conflicts. Charlie runs one for Slack management and to-do updates. Other OpenAI engineers use them to auto-review recent PRs for bugs or generate standup summaries from Git history.

GPT-5.4 and API Updates

The big API news from last week is GPT-5.4, which brings:

Native computer use (CUA, short for computer-using agent): The model can now directly control browsers and desktops, clicking buttons and navigating pages like a human would.
Million-token context: Tokens are the chunks of text an AI processes (roughly 3/4 of a word each). A million tokens means the model can "see" roughly 750,000 words at once, enough to hold an entire large codebase in its working memory.
Tool search: For builders using hundreds of tools, tool search lets the agent discover what it needs rather than having everything crammed into context. Ryan called this "progressive disclosure": instead of giving agents everything up front, you let them pull in only what's relevant to the current task. (Think of it like a well-organized filing cabinet vs. dumping every document on someone's desk.)
Most token-efficient reasoning model: Same performance as GPT-5.3-Codex and GPT-5.2, but with significantly less token consumption and lower latency (i.e. faster responses at lower cost).
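
To make the "progressive disclosure" idea behind tool search concrete, here is a minimal sketch. It is not OpenAI's implementation or API: the `Tool` class, the `search_tools` function, and the three registry entries are all hypothetical, and the matching is plain word overlap. The point is only that the agent asks for tools by need instead of receiving every definition up front.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


def search_tools(query: str, registry: list[Tool], limit: int = 3) -> list[Tool]:
    """Return up to `limit` tools sharing the most words with the query."""
    query_words = set(query.lower().split())
    scored = []
    for tool in registry:
        text = f"{tool.name} {tool.description}".lower().replace("_", " ")
        overlap = len(query_words & set(text.split()))
        if overlap:
            scored.append((overlap, tool))
    # Highest overlap first; ties keep registry order because sort is stable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:limit]]


# Hypothetical registry; a real one might hold hundreds of entries.
REGISTRY = [
    Tool("create_calendar_event", "Schedule meetings on Google Calendar"),
    Tool("post_slack_message", "Send messages to Slack channels"),
    Tool("open_pull_request", "Open pull requests for code review"),
]

relevant = search_tools("send slack messages", REGISTRY)
print([tool.name for tool in relevant])
```

Only the matching tool's definition would then be loaded into the agent's context; the other entries stay out of the way until a task calls for them.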

Other API additions from the last month:

Code mode: The model generates JavaScript to run instantly, dramatically speeding up tasks that previously required slow screenshot-click-coordinate loops.
Hosted shell: OpenAI gives the agent its own isolated virtual computer (a "container") where it can run terminal commands. Ryan called this one of the features he's most excited about: "it takes all the magic of coding agents... and puts it in our API in a way that is super customizable."
WebSocket mode: A faster real-time connection method. For tool-heavy use cases, it improves response speed by 20-30%.

Demo: The Agent Legibility Score

Charlie built an app, live on stream, that scores GitHub repos on how legible they are to AI agents. (A "repo" or repository is where a project's code lives, usually on GitHub. "Legibility" here means how easily an AI agent can understand and work within that codebase.) He gave Codex a pre-written plan and told it to implement it. In the few minutes the hosts spent talking, the model generated 8,200 lines of code.

One interesting detail: the model noticed a previous version of the app in an archived folder and used it as reference. Ryan flagged this as a technique he uses to build agents: providing pointers to previous task attempts gives the model a form of "episodic memory", a cheap way for agents to learn from past work and improve over time.

The app scored repos across seven legibility metrics:

Bootstrap self-sufficiency: Can the repo get set up from scratch without external knowledge? Or does the agent need to hunt through wikis and Slack threads to figure out how to get started?
Task entry points: Can the agent easily find and run standard commands like "build this project," "run the tests," or "check for errors"?
Validation harness: Can the agent check whether its changes actually work? Without a way to verify, the agent is flying blind.
Linting and formatting: "Linting" is automated code quality checking: rules that flag errors, style violations, and bad patterns before code ships. Ryan called linting "maybe the biggest low-hanging fruit" because it lets the agent cheaply self-check its own output. His advice: "It's super easy to add leverage to your codebase by vibing up some new lints." (Translation: use your AI coding agent to create automated quality rules.)
Codebase map: Is there a high-level guide showing the agent where things are and how the code is organized?
Doc structure: Are docs organized so the agent can find what it needs without wading through a massive instruction manual?
Decision records: Are past architectural decisions (why the team chose Tool A over Tool B, for example) written down in the repo?

They ran the app against Symphony, a tool OpenAI open-sourced alongside the harness engineering blog. It scored a B. Even OpenAI's own repos have room to improve.
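
If you want a rough feel for the seven metrics before building anything fancy, a file-presence check is a reasonable first pass. The sketch below is not the app from the demo (which had Codex read and judge the repo). The file names in `CHECKS` and the grade thresholds are assumptions chosen for illustration.

```python
# Hypothetical file-presence heuristics, one entry per legibility metric.
CHECKS = {
    "bootstrap_self_sufficiency": ["Makefile", "scripts/setup.sh"],
    "task_entry_points": ["Makefile", "package.json", "pyproject.toml"],
    "validation_harness": ["tests", "test"],
    "linting_and_formatting": [".eslintrc.json", "eslint.config.js", "ruff.toml"],
    "codebase_map": ["ARCHITECTURE.md"],
    "doc_structure": ["AGENTS.md", "docs"],
    "decision_records": ["docs/decisions", "docs/adr"],
}


def score_repo(paths: set[str]) -> dict[str, bool]:
    """Mark a metric as passed if any of its candidate paths exists."""
    return {
        metric: any(candidate in paths for candidate in candidates)
        for metric, candidates in CHECKS.items()
    }


def letter_grade(results: dict[str, bool]) -> str:
    """Map the number of passed metrics to a grade (thresholds are assumed)."""
    passed = sum(results.values())
    if passed == len(results):
        return "A"
    if passed >= 5:
        return "B"
    if passed >= 3:
        return "C"
    return "D"


# Example: a repo with setup, tests, linting, and docs but no map or decisions.
example_paths = {"Makefile", "tests", "ruff.toml", "AGENTS.md", "docs"}
results = score_repo(example_paths)
print(letter_grade(results), [m for m, ok in results.items() if not ok])
```

A presence check can't tell you whether the docs are any good, so treat the grade as a to-do list generator rather than a verdict.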

Harness Engineering: What It Looks Like to Ship 1M Lines of Agent Code

This was the meatiest section. Ryan Lopopolo's team spent five months building an internal product with a hard constraint: no humans write any code. The result: roughly a million lines of code, approximately 1,500 PRs (proposed code changes), and throughput that scaled from about a quarter-engineer-equivalent per person at the start to 3-10 engineers' worth of throughput per person.

So what is "harness engineering," exactly? The name comes from the idea that engineers aren't writing code anymore; they're building the harness (the structure, guardrails, documentation, and automated checks) that keeps AI agents on track. Think of it like training a very capable new hire: you don't do their work for them, but you set up the onboarding docs, coding standards, and review processes so they can succeed independently.

Here are the key patterns:

Encode Your Taste Into the Codebase

Ryan's central thesis: "If you can articulate what it is about the code you don't like, the next step is to write that down." Documentation, custom lints (automated quality rules), bespoke code reviewers, tests... whatever it takes to make bad patterns statically impossible (meaning the system literally won't allow that code to pass).

His example: Codex kept creating duplicate concurrency helpers across the codebase (small utility functions that limit how many tasks run simultaneously), but only one version was properly connected to the team's monitoring instrumentation (OpenTelemetry). The fix? A custom ESLint rule (an automated code-checking rule) that bans that function from being defined anywhere except the official, approved location. The rule itself was written by Codex, with 100% test coverage.
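
Ryan's rule was an ESLint rule for a JavaScript/TypeScript codebase, and the session didn't show its source. To illustrate the same idea in a few lines, here is a Python analog built on the standard `ast` module. The helper name and the approved path are hypothetical.

```python
import ast

BANNED_NAME = "run_with_concurrency_limit"   # hypothetical helper name
APPROVED_PATH = "src/concurrency/limit.py"   # hypothetical approved location


def find_violations(source: str, path: str) -> list[str]:
    """Flag definitions of the banned helper outside the approved file."""
    if path == APPROVED_PATH:
        return []
    tree = ast.parse(source, filename=path)
    return [
        f"{path}:{node.lineno}: define {BANNED_NAME} only in {APPROVED_PATH}"
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name == BANNED_NAME
    ]


duplicate = "def run_with_concurrency_limit(tasks, limit):\n    return tasks[:limit]\n"
print(find_violations(duplicate, "src/jobs/util.py"))
```

Wired into CI or a pre-commit hook, a check like this turns "please don't redefine that helper" from a review comment into something the agent hits, reads, and fixes on its own.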

The compound effect is powerful: "As we have onboarded engineers to the team, they have a different set of experiences... each one is able to reduce slop in a unique way. But because everyone is invested in putting that knowledge into the codebase, everyone else's coding agents have the best guts of everyone on the team."

Push All Context Into the Repo

Ryan told a story about a security library decision that happened in a Slack thread two months earlier. A new engineer didn't know about it, so their Codex run pulled in the wrong package (a pre-built code library you can install, kind of like an app from an app store but for developers). The fix wasn't blaming the engineer or the agent; it was reflecting that decision back into the codebase with guardrails. He literally typed "@codex please add guardrails to our codebase" in the Slack thread and got four proposed PRs in 15 minutes.

The harness engineering blog expands on this: "From the agent's point of view, anything it can't access in-context while running effectively doesn't exist. Knowledge that lives in Google Docs, chat threads, or people's heads are not accessible to the system."

This is a huge insight for non-technical leaders too: if important decisions, policies, or standards live only in people's heads or buried in Slack, your AI tools can't follow them. The same problem that makes onboarding new employees hard also makes onboarding AI agents hard.

AGENTS.md Is a Table of Contents, Not an Encyclopedia

AGENTS.md is a special instruction file that tells AI agents how to work in your codebase. Think of it as the "README for robots."

The blog details how the team tried one big AGENTS.md (basically dumping every rule and standard into one massive file) and it failed. Context is scarce; a giant instruction file crowds out the actual task. Instead, a short AGENTS.md (roughly 100 lines) serves as a map with pointers to deeper documentation in a structured docs/ directory. This applies the same progressive disclosure principle as tool search in the API: agents start with a small, stable entry point and learn where to look next.

It's the same principle behind good onboarding at any company. You don't hand a new hire a 500-page manual on day one. You give them a one-pager that points to the right resources for each situation.
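
A table of contents only works if it stays short and its pointers resolve. Here is a small sketch of a check you could run in CI. The 100-line budget comes from the blog's rough figure. The function name, the link-matching regex, and the example doc paths are assumptions.

```python
import re

MAX_LINES = 100  # the blog's rough target, not a hard rule
LINK_PATTERN = re.compile(r"\]\(([^)\s#]+)")  # target of a markdown link


def check_agents_md(text: str, existing_paths: set[str]) -> list[str]:
    """Report an oversized AGENTS.md and pointers to docs that don't exist."""
    problems = []
    line_count = len(text.splitlines())
    if line_count > MAX_LINES:
        problems.append(f"AGENTS.md has {line_count} lines; aim for about {MAX_LINES}")
    for target in LINK_PATTERN.findall(text):
        if "://" in target:
            continue  # external URL, nothing to verify locally
        if target not in existing_paths:
            problems.append(f"pointer to missing doc: {target}")
    return problems


agents_md = (
    "# Working in this repo\n"
    "- [Testing](docs/testing.md)\n"
    "- [Security](docs/security.md)\n"
)
print(check_agents_md(agents_md, {"docs/testing.md"}))
```

In a real repo you would build `existing_paths` by walking the docs/ directory; it is passed in here so the check itself stays easy to test.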

Each Engineer's Expertise Compounds

When a front-end specialist joined the team, they started encoding their knowledge about React component architecture (React is a popular framework for building web interfaces; "component architecture" means how you organize the building blocks of a web app). Suddenly everyone's agents started decomposing hooks into single files (breaking complex interface logic into smaller, more manageable pieces), creating small testable components, and producing smaller files that were more efficient for agents to work with. Every new hire's expertise becomes a multiplier for the entire team's agent fleet.

Architecture as a Prerequisite, Not a Luxury

Ryan noted they did "far more refactors than you would have five years ago" because rigid architectural patterns (organizing code so each piece handles one job, structuring it in clear layers, keeping different parts properly isolated from each other) are what let agents operate with limited context. In a human-first codebase, you might postpone these patterns until you have hundreds of engineers. With coding agents, they're an early prerequisite.

The analogy: imagine a warehouse. Humans can find stuff even when it's messy because they can ask coworkers and use institutional memory. Robots need clearly labeled shelves, consistent organization, and a map. The cleaner your architecture, the better your agents perform.

Customer Spotlight: How Basis Applies Harness Engineering

Mitch Troyanovsky, co-founder of Basis (an agent platform for accountants that recently raised a Series B), shared how his team applies similar principles.

The Mindset Shift

Mitch's framing: "Our ambitions are so large that... it would literally be beyond the laws of physics to hire that many people." So you have to engineer your company and codebase to produce output as if you were 10x the headcount.

The hard part isn't the technology; it's getting engineers to shift from doing to managing. And that shift only sticks when the agents actually produce acceptable output.

Skills With Owners

Basis uses skills with designated owners in the frontmatter metadata (structured info at the top of each skill file that says who's responsible for it). This creates clear responsibility for maintaining and updating agent instructions, and lets automated systems flag when skill descriptions conflict.
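
Basis didn't share its exact file format, so treat the following as a sketch under assumptions: each skill file starts with a `---`-delimited block of simple `key: value` lines, and the responsible person sits under a hypothetical `owner` key. The check flags skills nobody owns.

```python
def parse_frontmatter(text: str) -> dict[str, str]:
    """Read simple `key: value` pairs from a leading ----delimited block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


def skills_missing_owner(skills: dict[str, str]) -> list[str]:
    """Return the names of skill files with no non-empty `owner` field."""
    return sorted(
        name for name, text in skills.items()
        if not parse_frontmatter(text).get("owner")
    )


# Hypothetical skill files, keyed by path.
skills = {
    "skills/open-pr.md": "---\nname: open-pr\nowner: alice\n---\nHow to open a PR.\n",
    "skills/code-review.md": "---\nname: code-review\n---\nHow to review code.\n",
}
print(skills_missing_owner(skills))
```

This hand-rolled parser only understands flat key-value pairs; a project with richer frontmatter would want a real YAML parser instead.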

Sub-Agents for Standards

They use specialized sub-agents for specific tasks. A sub-agent is like a specialist that the main Codex agent can call on for help. Examples include a standards-enforcer that double-checks output against the team's rules and a PR babysitter that monitors open code changes. Sub-agents let you take flows that Codex handles well in general and add a specialized quality layer.

Dotnotes: Decision History for Every Commit

One of the most clever patterns: dotnotes. When developers save changes to a codebase, they write "commit messages" explaining what changed. Dotnotes are like commit messages on steroids: written by Codex throughout a session (not just at save time). They create a full decision history, so when someone later asks "why was this implemented this way?" the reasoning is right there in a searchable log.
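
The session didn't show what a dotnote looks like on disk, so the structure below is purely hypothetical: a note tied to a commit, with a topic and the reasoning behind a choice, plus a search helper for answering "why was this built this way?" later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dotnote:
    commit: str     # commit the note belongs to (hypothetical field)
    topic: str      # what the decision was about
    reasoning: str  # why the agent chose this approach


def search_notes(notes: list[Dotnote], term: str) -> list[Dotnote]:
    """Case-insensitive search across note topics and reasoning."""
    needle = term.lower()
    return [
        note for note in notes
        if needle in note.topic.lower() or needle in note.reasoning.lower()
    ]


# Illustrative entries, not real data.
notes = [
    Dotnote("abc123", "Retry policy", "Chose exponential backoff to avoid hammering the ledger API."),
    Dotnote("def456", "Date parsing", "Kept ISO 8601 only; other formats were ambiguous in imports."),
]
for note in search_notes(notes, "backoff"):
    print(note.commit, note.topic)
```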

Company Context as a Repo

Basis maintains two repos. Arnold is a monorepo (a single giant repository containing all of a company's production code). Atlas is a second monorepo for everything that isn't code: operating principles, planning docs, and team context. This means when Codex helps with planning or decision-making, it has access to the full company context, not just the codebase.

Beyond Code

Mitch showed a "start my day" automation that runs through his morning routine: grabbing context from the last 24 hours, refreshing different systems, managing his schedule. The point: Codex isn't limited to writing code. Once you give it enough context, it can automate other parts of your work too.

Quick-Fire Q&A Highlights

A few standout answers from the Q&A session:

Browser automation: The Playwright interactive skill lets Codex drive web browsers for front-end iteration. (Playwright is a tool that automates browser actions like clicking, typing, and screenshotting.) Ryan uses it to have Codex screenshot its UI changes and keep iterating until the design matches.
Design patterns for production agent systems: Ryan's advice: invest in rigid architecture patterns you'd expect at a 1,000-10,000 engineer company. Clean separation between different parts of the code, clear layering, strong boundaries. These limit how much context agents need to load to do their job.
How to maintain agent instructions at scale: On Ryan's team, nearly everything lives in the codebase, not in personal configs. They've consolidated around a handful of general skills (making PRs, landing changes, code review, architecture analysis). Additional leverage goes into reference docs, scripts, and the repo's own tests and documentation.
Should you enable all skills? Yes. But invest in short, high-quality descriptions in the skill frontmatter so the model can determine whether to invoke them.
Daily standups still matter: Ryan's team does 30-minute daily standups because code velocity is so high that "it can be weeks before I realize core architectural patterns have changed" without synchronous human check-ins. Ironic twist: the faster AI writes code, the more important human-to-human communication becomes.

How to Apply This: A Step-by-Step Implementation Guide

Everything above sounds compelling in a blog post. But where do you actually start? Here's a practical implementation path, ordered so each step builds on the last.

Week 1: Audit and Bootstrap (The Foundation)

Step 1: Score your repo. Run the Agent Legibility Score framework against your codebase. You can do this manually or use the prompt from the Build Hour demo. Grade yourself honestly on all seven metrics. This tells you where to focus.

Step 2: Make setup automatic. If getting your project running requires tribal knowledge ("oh, you have to run this secret command first"), fix that now. Create a single setup script or Makefile that installs dependencies, seeds data, and gets a development environment running from scratch. Bootstrap self-sufficiency is the first thing agents need, and it's also the first thing new human engineers need. Two birds.

Step 3: Standardize your task entry points. Make sure there are clear, documented commands for "build," "test," "lint," and "run." If an agent (or a new hire) can't figure out how to do these four things in under a minute, you have a problem.
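
One lightweight way to do this is a single task table that both humans and agents can read. The sketch below assumes a Python project; every command in `TASKS` is a placeholder to swap for whatever your project actually uses.

```python
import subprocess

# Hypothetical commands; replace them with your project's real ones.
TASKS = {
    "build": ["python", "-m", "build"],
    "test": ["python", "-m", "pytest"],
    "lint": ["ruff", "check", "."],
    "run": ["python", "-m", "app"],
}


def command_for(task: str) -> list[str]:
    """Look up a task, failing loudly with the list of valid names."""
    if task not in TASKS:
        raise SystemExit(f"unknown task {task!r}; choose from {', '.join(TASKS)}")
    return TASKS[task]


def run_task(task: str) -> int:
    """Run the task's command and return its exit code."""
    return subprocess.run(command_for(task)).returncode


print(command_for("lint"))
```

The error message matters as much as the table: an agent that guesses a wrong task name gets told the four valid ones instead of a stack trace.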

Week 2: Add the Guardrails (The Harness)

Step 4: Set up linting if you haven't already. This is the single biggest quick win from the entire session. Install a linter (ESLint for JavaScript/TypeScript, Ruff for Python, etc.) and turn on auto-fix. Agents use linters as a cheap self-check loop. Every lint rule you add is one less category of mistake that can slip through.

Step 5: Write your first custom lint rule. Identify one pattern your team doesn't like seeing in code. Maybe agents keep importing from the wrong module, or creating files that are too large, or using an unapproved library. Have your coding agent write a lint rule that bans that pattern. This is the core of "encoding taste into the codebase." Start with one rule and add more over time.

Step 6: Create your AGENTS.md (the table of contents). Write a short file (aim for ~100 lines) that tells AI agents how this project is organized, how to build and test it, and what coding standards to follow. Point to separate docs for detailed standards on specific topics rather than putting everything in one file.

Week 3: Build the Knowledge Base (The Context)

Step 7: Create an ARCHITECTURE.md. Write a high-level map of your codebase: what are the main pieces, how do they connect, what depends on what. Keep it under 200 lines. This gives agents (and humans) the lay of the land before they dive into specific files.

Step 8: Start logging decisions. Create a docs/decisions/ folder and start writing lightweight Architecture Decision Records (ADRs). These don't need to be formal documents. Even "We chose Postgres over MongoDB because X, Y, Z" in a markdown file saves agents from making the same decision differently. Ryan's Slack-to-codebase pattern is the fastest version of this: when an important technical decision happens in chat, tag Codex and have it add guardrails immediately.
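
To keep decision records low-friction, a tiny helper can handle the naming and layout. This is a sketch, not a standard: the numbering scheme, the section headings, and the placeholder reasons are assumptions, and only the docs/decisions/ folder comes from the step above.

```python
import re
from datetime import date


def adr_filename(number: int, title: str) -> str:
    """Build a numbered, slugified path under docs/decisions/."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"docs/decisions/{number:04d}-{slug}.md"


def render_adr(number: int, title: str, decision: str,
               reasons: list[str], decided_on: date) -> str:
    """Render a lightweight decision record as markdown text."""
    lines = [
        f"# {number}. {title}",
        "",
        f"Date: {decided_on.isoformat()}",
        "",
        "## Decision",
        "",
        decision,
        "",
        "## Why",
        "",
    ]
    lines += [f"- {reason}" for reason in reasons]
    return "\n".join(lines) + "\n"


print(adr_filename(7, "Use Postgres over MongoDB"))
print(render_adr(
    7,
    "Use Postgres over MongoDB",
    "All new services store data in Postgres.",
    ["placeholder reason X", "placeholder reason Y"],
    date.today(),
))
```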

Step 9: Audit your "tribal knowledge." Walk through the last month of Slack conversations, standup notes, and design discussions. Identify decisions, standards, or preferences that exist only in people's heads. Move the important ones into your docs/ directory. Remember: if the agent can't see it, it doesn't exist.

Week 4: Scale Up (The Flywheel)

Step 10: Try delegating a real task. Pick a well-scoped task (a bug fix, a small feature, a refactor) and give it to your coding agent with your new AGENTS.md and documentation in place. Pay attention to where it stumbles. Those stumbles are your next documentation or lint rule.

Step 11: Set up an automation. Start with something low-risk: a daily PR review, a standup summary generator, or a weekly check for stale documentation. The goal is to get comfortable with agents running on a schedule, not just on-demand.

Step 12: Create your first sub-agent or reviewer. Following the Basis pattern, create a specialized reviewer that checks agent output against your standards. This could be as simple as a Codex skill that reviews PRs for common issues before a human ever looks at them.

Ongoing: The Compound Loop

Step 13: Run "garbage collection" regularly. OpenAI's team initially spent every Friday cleaning up "AI slop." That didn't scale. Instead, set up a recurring agent task that scans for pattern violations and opens targeted cleanup PRs. Most can be reviewed in under a minute.
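
The scanning half of that recurring task can start out very simple. The sketch below groups violations by rule so each rule can become one small, targeted cleanup PR. Both rules are hypothetical examples; the session didn't say which patterns OpenAI's team scans for.

```python
import re
from collections import defaultdict

# Hypothetical rules: rule name -> regex for a pattern the team has banned.
RULES = {
    "no-print-debugging": re.compile(r"^\s*print\("),
    "no-todo-without-owner": re.compile(r"#\s*TODO(?!\()"),
}


def scan(files: dict[str, str]) -> dict[str, list[str]]:
    """Return {rule: ["path:line", ...]} for every line matching a rule."""
    findings = defaultdict(list)
    for path, source in sorted(files.items()):
        for line_number, line in enumerate(source.splitlines(), start=1):
            for rule, pattern in RULES.items():
                if pattern.search(line):
                    findings[rule].append(f"{path}:{line_number}")
    return dict(findings)


# Illustrative file contents, keyed by path.
files = {
    "a.py": "x = 1\nprint(x)\n",
    "b.py": "# TODO fix this\n# TODO(alice) tracked\n",
}
print(scan(files))
```

Once a rule's findings drop to zero, promote it to a proper lint rule so the pattern can't come back.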

Step 14: Onboard every new team member's expertise. When someone joins (or when someone has a strong opinion about code quality), encode that knowledge into the codebase. A front-end expert? Have them add React architecture standards. A security specialist? Have them add security guardrails. Each person's taste becomes a multiplier for everyone's agents.

Step 15: Protect your daily standups rather than cutting them. This might be the most counterintuitive takeaway. As Ryan's team learned, the faster agents produce code, the faster things can drift. Short, frequent human sync-ups catch architectural drift before it compounds.

A Note for Non-Engineers

You don't need to be a developer to benefit from these patterns. If you manage engineers, the harness engineering mindset gives you a framework for evaluating how "AI-ready" your team's codebase is. If you're a founder like Mitch, the company-context-as-a-repo approach (keeping operating principles, planning docs, and standards in a structured, searchable format) makes your AI tools smarter across the entire organization, not just engineering.

The universal principle: anything your team knows but hasn't written down is invisible to AI. Documenting decisions, standards, and context has always been good hygiene. Now it's the raw material that makes AI agents effective.

The Bigger Picture

The harness engineering approach redefines what "software engineering" means. The job isn't writing code; it's designing environments where agents can reliably write code. Documentation becomes the product. Linters become leverage. Architecture becomes a prerequisite, not a luxury.

And the compound effect is real: as each engineer encodes their taste and expertise into the repo, every agent on the team gets better. It's a flywheel that didn't exist when humans were the only ones reading the docs.

The harness engineering blog and the Symphony repo are both open. The GPT-5.4 prompt guidance gives you the prompt patterns. And the Codex docs have everything you need to get started.

The question isn't whether this approach will become the default. It's whether you'll be ready when it does.
