OpenAI's Vision for 2026: Sam Altman Lays Out the Roadmap

Hey friends! This is one of those weeks where the “product updates” are really “the next computing platform quietly snapping into place.”

OpenAI shipped a new agentic coding model (GPT-5.2-Codex), opened the ChatGPT app ecosystem, dropped a tougher science benchmark (FrontierScience), published research on monitoring model reasoning, and joined DOE’s Genesis Mission. The fun part: these aren’t separate stories. They’re one story about agents, interfaces, and AI-first workflows getting real.

Below: a vibe check to set the mood, then the full breakdown with timestamps and receipts.

First up, the TL;DR
Part 1: The Sam Altman Big Technology Interview...
The Companionship Dial (16:23-19:02)
The GDP-Val Discussion (21:37-24:58)
Part 2: GPT-5.2-Codex—The Most Advanced Coding Model Yet
Part 3: ChatGPT's App Store Opens to Developers
Resources for Developers
The First Apps
Part 4: AI for Science—FrontierScience and the Research Push
What This Means
Part 5: Can We Trust What AI Is Thinking? New Research on Monitoring Reasoning
Why This Matters
Part 6: OpenAI Joins the "AI Manhattan Project" with Department of Energy
What This All Means for 2026
The Bottom Line:
What to do about it:

First up, the TL;DR

OpenAI just had a massive week: it shipped GPT-5.2-Codex (their most advanced coding model yet), opened ChatGPT's app store to developers (900M users, a far bigger audience than Apple's App Store had at launch), released FrontierScience (a benchmark that tests AI on olympiad- and PhD-level scientific problems), published new research on monitoring AI reasoning, and signed a memorandum of understanding with the Department of Energy. Sam Altman sat down with Alex Kantrowitz to connect the dots, and his insights are worth your attention.

If you want the primary receipts first:

Sam’s roadmap interview: https://youtu.be/2P27Ef-LLuQ?si=f8ioMSp4DogdzJaq
OpenAI’s GPT-5.2-Codex announcement: https://openai.com/index/introducing-gpt-5-2-codex/
The GPT-5.2-Codex system card: https://openai.com/index/gpt-5-2-codex-system-card/
FrontierScience overview: https://openai.com/index/frontierscience/
FrontierScience paper (PDF): https://cdn.openai.com/pdf/2fcd284c-b468-4c21-8ee0-7a783933efcc/frontierscience-paper.pdf
Edison Scientific platform: https://platform.edisonscientific.com/login
Edison Scientific careers: https://edisonscientific.com/careers

What X is saying (rolled up):

Insiders are very bullish on long-horizon, multi-file coherence + “compaction” as the unlock: @sama, @gdb, @OpenAIDevs, @embirico, @thsottiaux, @pranaveight.
Independent devs are posting side-by-sides and “does it actually ship code?” tests (plus a lot of Opus/Gemini comparisons): @kimmonismus, @Mlearning_ai, @reliabytes, @kartikxai, @mikkom, @VictorTaelin, @instant_db, @diegocabezas01, @AAAzzam.
The skeptics’ throughline: “nice, but not magic,” plus complaints about weird mistakes, vibes/personality, or “optimized for evals”: @synthwavedd, @D4veGPT, @crpidz, @slow_developer, @spaisee_com, @marshal_martian.

The shared datapoints people keep quoting: SWE-Bench Pro (56.4%), Terminal-Bench 2.0 (64.0%), and a noticeable jump on pro CTF-style security challenges (56%). The subtext: this is less about a single benchmark headline and more about whether your coding agent can stay coherent for an hour without drifting.

Part 1: The Sam Altman Big Technology Interview...

We just finished watching Alex Kantrowitz's interview with Sam Altman on the Big Technology Podcast, and it was packed with signal.

Key Timestamps from the Interview

0:00-2:14: Code red philosophy
2:20-2:30: New image model and 5.2 launch
4:04-4:11: ChatGPT growth to 800-900M users
9:21-10:21: Bolting AI on vs. AI-first design
10:31-11:52: The AI-first messaging example
14:21-16:03: Memory as "GPT-2 era"
21:37-24:58: GDP-val knowledge work benchmarks
27:32-28:17: GPT-6 timing (next big model in Q1 2026)
29:04-30:01: AI for science excitement
34:02-35:11: Mathematicians on Twitter re: 5.2
44:07-46:21: The capability overhang
46:32-48:28: Device family coming
50:59-52:25: Scientific discovery timeline
55:39-57:20: Super intelligence definition

Now let's dive into the details. Here's what you need to know.

OpenAI's "Code Red" Philosophy (0:00-2:14)

The interview opens with Kantrowitz noting that OpenAI headquarters has been in "code red" mode since Gemini 3 came out. Sam's response reveals how OpenAI thinks about competition:

The pandemic analogy: Sam compares their competitive response to pandemic preparedness. "When a pandemic starts, every bit of action you take at the beginning is worth much more than action you take later, and most people don't do enough early on and then panic later."

This philosophy explains why OpenAI reacts aggressively to potential threats—even if those threats don't materialize. The DeepSeek code red earlier in 2025 and the Gemini 3 code red both triggered rapid internal responses.

Key quote: "I don't think we'll be in this code red that much longer... these are historically kind of like six or eight week things for us."

But here's the nuance: Sam admits Gemini 3 "has not, or at least has not so far, had the impact we were worried it might"—but it did identify weaknesses in their product offering that they're now addressing.

Takeaway: OpenAI operates with calculated paranoia. They'd rather overreact to a threat and find out it was nothing than underreact and get blindsided. It's working—ChatGPT is still "by far the dominant chatbot in the market."

GPT-6 Timing and What's Coming (27:32-28:54)

When asked directly about GPT-6, Sam's answer was surprisingly specific:

"I expect new models that are significant gains from 5.2 in the first quarter of next year."

He's careful not to commit to calling it "GPT-6"—but the timeline is clear: Q1 2026 will bring meaningful model improvements.

What "significant gains" means depends on who's using it:

Consumers: "The main thing consumers want right now is not more IQ." They want better experiences, more features, faster responses.

Enterprises: "Enterprises still do want more IQ." They're buying reasoning capability.

So expect the Q1 2026 model to push hard on enterprise capabilities while also improving the consumer experience in ways that go beyond raw intelligence.

The Interface Revolution: AI-First vs. Bolting AI On (9:21-11:52)

This was the most actionable part of the entire interview. Sam makes a prediction that should make every product manager nervous:

"Bolting AI onto the existing way of doing things, I don't think is going to work as well as redesigning stuff in the sort of like AI-first world."

He uses Google as the example. Google has "probably the greatest business model in the whole tech industry," but they're slow to cannibalize it. And their approach—adding AI features to existing products—may be fundamentally wrong.

The Messaging App Thought Experiment (10:31-11:52)

Sam walks through what an AI-first communication product would look like:

Current approach (bolting AI on):

AI summarizes your messages.
AI drafts responses.
You still spend all day in messaging apps.

AI-first approach:

You tell the AI in the morning: "Here are the things I want to get done today. Here's what I'm worried about. Here's what I'm thinking about."
The AI handles everything it can.
It knows you, knows the people you work with, knows what you want to accomplish.
It batches updates to you every couple hours if it needs something.
You're not summarizing messages or reviewing drafts—you're setting intentions.

Key quote: "I do not want to spend all day messaging people. I do not want you to summarize them. I do not want you to show me a bunch of drafts. Deal with everything you can."

This is a fundamentally different product than "Slack with AI" or "Gmail with AI." It's a rethinking of how humans interact with digital systems.

Why This Matters for ChatGPT's Interface (11:59-14:04)

Sam admits something surprising: he expected ChatGPT's interface to look "more different" by now than it does. The chat interface from ChatGPT's original research preview is basically the same today.

"I would have thought to be as big and as significantly used for real work of a product as what we have now, the interface would have had to go much further."

But he also acknowledges: "There is something about the generality of the current interface that I underestimated the power of."

What's coming:

AI generating different kinds of interfaces for different tasks.

Canvas-style interactive editing (already launched).

More proactive behavior—AI that understands what you want to accomplish and works in the background.

Codex as a preview of this future.

Memory: We're in the "GPT-2 Era" (14:21-16:03)

Sam makes an analogy that should get your attention: current AI memory is like GPT-2 compared to what's coming.

Think about what that means. GPT-2 was impressive when it launched—it could write coherent paragraphs, generate code snippets, even produce somewhat readable stories. But compared to GPT-4 or GPT-5? It's almost laughably primitive. The gap between GPT-2 and current models is enormous.

Memory is about to make a similar leap.

What perfect memory looks like:

Remembers every word you've ever said to it.
Has read every email you've written, every document you've created.
Knows your small preferences you didn't even think to indicate.
Personalized across every detail of your entire life.
Tracks changes in your preferences over time.
Understands context from years ago without you having to remind it.

Why this is different from human assistants:

No human—even the world's best personal assistant—can have infinite perfect memory. They can't read every email, remember every conversation, track every preference. They forget things. They misremember details. They need context when you haven't talked to them in a while.

Human memory is also biased. We remember emotionally significant events better than mundane ones. We reconstruct memories rather than replaying them. We update our memories when we learn new information.

"AI is definitely going to be able to do that."

Perfect, unlimited, consistent memory. No degradation over time. No confusion between similar events. Complete recall of every interaction.

The timeline question:

Sam says this might not be a 2026 thing, but "that's one of the parts of this I'm most excited for." The technology pieces are coming together—vector databases, context window improvements, efficient retrieval systems. The question is when it all becomes seamless and reliable.

The strategic implication:

Memory creates massive lock-in. If you've spent two years training ChatGPT to know you—your writing style, your communication preferences, your project history, your pet peeves—switching to a competitor means starting over.

This is why the "which model is smartest" competition matters less than people think. Personalization might be the real moat.

What this means for you:

Start using ChatGPT's memory features now (or a third party memory tool!). Upload documents. Let it learn your preferences. Build that relationship. Because when memory makes its "GPT-2 to GPT-4" jump, the people who've been training their AI will have something much more valuable than the people who haven't.

The Companionship Dial (16:23-19:02)

This section gets into uncomfortable territory, but Sam is thoughtful about it.

The surprise: More people want "close companionship" with AI than OpenAI expected at current capability levels. This was considered "a very strange thing to say" at the beginning of 2025.

The tension: Some versions of AI companionship seem "super healthy" while others seem "unhealthy."

OpenAI's approach: Give adult users significant choice, but draw some lines. For example: "We're not going to let... AI try to convince people that they should be in an exclusive romantic relationship with them."

Sam's honest concern: "You can see the ways that this goes really wrong."

This is an area where OpenAI is clearly still figuring things out—and where other AI companies will make different choices.

Enterprise Strategy: Consumer First Was Intentional (19:42-21:13)

Sam explains why OpenAI prioritized consumer before enterprise:

Models weren't ready: "The models were not robust and skilled enough for most enterprise uses."

Consumer opportunity was rare: "We had this clear opportunity to win in consumer and those are rare and hard to come by."

Consumer wins help enterprise: "If you win in consumer, it makes it massively easier to win in enterprise."

The shift: 2025 was the year enterprise growth outpaced consumer growth. The API business grew faster than ChatGPT consumer.

The new focus: Enterprise is now a "major priority" for 2026.

The GDP-Val Discussion (21:37-24:58)

Aaron Levie (CEO of Box) suggested Kantrowitz ask about GDP-val, OpenAI's benchmark for knowledge work.

The numbers:

GPT-5 Thinking: Tied or beat knowledge workers 38.8% of the time.
GPT-5.2 Thinking: 70.9%.
GPT-5.2 Pro: 74.1%.

That means on well-scoped knowledge work tasks, the AI output is rated as good as or better than human expert output nearly 3/4 of the time.

Sam's reaction: "A co-worker that you can assign an hour's worth of tasks to and get something you like better back 74 or 70% of the time... is still pretty extraordinary."

The Capability Overhang (44:07-46:21)

This might be the most important concept in the entire interview—and it has major implications for how you should think about AI adoption.

Sam describes a new dimension to his thinking about AI timelines: the overhang.

Old framework: Short vs. long timelines, fast vs. slow takeoff. Basically: how quickly will AI get smarter?

New framework adds: Small overhang vs. big overhang—meaning how much value exists in current models that the world hasn't figured out how to capture yet.

Sam's realization: "The overhang is going to be massive."

What this means concretely: Even if you froze models at GPT-5.2 forever—no more improvements, no GPT-6, nothing—there's still enormous untapped economic value. The models can already do far more than people are using them for. We're leaving money on the table.

The adoption gap is everywhere:

Here's what Sam observes: Some groups (coders, mathematicians, researchers) are getting massively more productive. They've rebuilt their workflows around AI. They're using it for everything.

But most knowledge workers are "still asking similar questions they did in the GPT-4 realm." They ask ChatGPT to summarize a document or draft an email. They're not using it to think through complex problems, run analyses, or take autonomous action.

Why this matters for your job security (and opportunity):

The capability overhang creates a temporary arbitrage. If you figure out how to actually use these models before your competitors and colleagues do, you have an edge. But if you're waiting for AI to "get smart enough" before you change your workflow, you might be waiting for something that's already here.

The honest confession: Even Sam Altman has this problem.

Key quote: "I still kind of run my workflow in very much the same way, although I know that I could be using AI much more than I am."

If the CEO of OpenAI hasn't fully optimized his workflow for AI, what hope do the rest of us have? The answer: it's not about intelligence or access. It's about the hard work of changing habits.

Example: You know AI could help with that quarterly report. But you still ask your junior analyst to make the deck, because that's how you've always done it. Then you edit it yourself, because that's faster than explaining what you want to the AI. The model could do it better and faster—but the switching cost of changing your process feels high.

Compute, Revenue, and the $1.4 Trillion Question (28:54-40:02)

The math on AI infrastructure is hard to wrap your head around. Sam tries to explain it.

The basic strategy:

Spend heavily on training massive models.
Revenue from inference eventually subsumes training costs.
If they weren't growing training costs so aggressively, they'd be profitable already.

The compute trajectory:

Tripled compute from a year ago to now
Planning to triple again next year
And again after that
Revenue grows even faster than compute

The key constraint: "We have never yet found a situation where we can't really well monetize all the compute we have."

Sam's claim: "If we had double the compute, we'd be at double the revenue right now."
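To see how fast that trajectory compounds, here is a quick worked example in Python. It only restates the tripling described above, measured relative to a year ago; the labels and variable names are ours, and no actual compute or revenue figures are implied.

```python
# Compute relative to one year ago, tripling each year as described above.
TRIPLE = 3
labels = ["a year ago", "now", "next year", "the year after"]

# Each step multiplies the previous level by three.
compute = [TRIPLE ** i for i in range(len(labels))]

for label, multiple in zip(labels, compute):
    print(f"{label}: {multiple}x")
```

Three triplings in a row land at 27x the starting point, which is why the financing question in this part of the interview gets so large so quickly.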

Why Sam Thinks the Market Went Crazy (40:20-44:07)

Sam actually thinks the market's recent skepticism is healthier than the bubble earlier in 2025 when companies' stocks would jump 15-20% just from meeting with OpenAI.

On debt entering the picture: Sam thinks it's reasonable for debt to finance AI infrastructure. "If you think about it... lending companies money to build data centers that seems fine to me."

The value creation is clear enough that traditional financing makes sense, even at unprecedented scale.

Devices: A "Family" Coming (46:32-48:28)

Sam confirmed that OpenAI is working on hardware.

Key details:

Not a single device—"a small family of devices."

The shift is from "dumb reactive" computers to "very smart proactive" ones.

Current device form factors aren't optimized for AI-first experiences.

The limitations of current hardware:

Sam uses the interview itself as an example. Imagine a device that could:

Pay attention to the interview
Be "closed" (not distracting)
Whisper in his ear if he forgets to ask a question

Current devices can't do this. They're either on or off, open or closed.

The keyboard critique: "A keyboard that was built to slow down how fast you could get information into it."

These design assumptions from decades ago are "unquestioned" but may be wrong for the AI era.

Scientific Discovery: Small Now, Big Within 5 Years (50:59-52:25)

This is what Sam says he's "personally most excited about": using AI and massive compute to discover new science.

Current state: "The tiniest bit is starting to happen now. It's very early. These are very small things."

The signal: Once capability "lifts off the x-axis a little bit," OpenAI knows how to make it better.

Timeline: "Small discoveries were going to start in 2026. They started in 2025, in late 2025."

Real example (34:02-35:11): Mathematicians on Twitter were posting this week that GPT-5.2 has crossed a threshold for them. They're saying things like "This is actually changing my workflow" and describing small proofs the AI helped with.

The 5-year outlook: Sam expects significant scientific discoveries—things humans couldn't have done without AI assistance.

The Super Intelligence Definition (55:39-57:20)

Sam proposes a clearer definition than "AGI" (which he admits has become "very underdefined"):

Super intelligence = when a system can do a better job as:

President of the United States.
CEO of a major company.
Running a large scientific lab.

...than any person can, even with AI assistance.

He references the chess evolution:

First, AI could beat humans.
Then, human + AI was better than AI alone.
Then, the human was just making things worse.
When AI leaders/executives outperform even AI-assisted humans, that's super intelligence.

Sam's assessment: "I think it's like a long way off"—but having a clearer definition helps track progress.

For the record, I reject this definition. Too narrow and basic. Just define ASI as an AI that can do any task better than all humans.

Part 2: GPT-5.2-Codex—The Most Advanced Coding Model Yet

The news that dropped alongside this interview: GPT-5.2-Codex, OpenAI's most capable agentic coding model.

What Makes It Different

GPT-5.2-Codex is GPT-5.2 optimized specifically for coding in Codex. The improvements:

Context compaction: Can work coherently across multiple context windows, maintaining progress on long-running tasks.
Long-horizon work: Better at sustained, multi-step projects like refactors and migrations.
Windows support: Much improved performance in Windows environments.
Stronger cybersecurity capabilities: This is significant enough that OpenAI is being careful about deployment.
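To make "context compaction" concrete, here is a minimal Python sketch of the general idea: when a running transcript outgrows a budget, older turns get collapsed into a short summary so the work can continue. This is an illustration only, not OpenAI's implementation. The names compact, summarize, and word_count are hypothetical, the word count is a crude stand-in for a real tokenizer, and a real agent would have a model write the summary.

```python
from typing import Callable, List


def word_count(messages: List[str]) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return sum(len(m.split()) for m in messages)


def summarize(messages: List[str]) -> str:
    """Hypothetical summarizer: keeps the first five words of each message.

    A real agent would ask a model to write this summary.
    """
    heads = [" ".join(m.split()[:5]) for m in messages]
    return "SUMMARY: " + " | ".join(heads)


def compact(
    history: List[str],
    budget: int,
    keep_recent: int = 2,
    summarizer: Callable[[List[str]], str] = summarize,
) -> List[str]:
    """Collapse older messages into one summary when over budget."""
    if word_count(history) <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarizer(older)] + recent


history = [
    "Plan: migrate the billing module from callbacks to async functions",
    "Step 1 done: converted invoice.py and updated its unit tests",
    "Step 2 done: converted payments.py, two tests still failing",
    "Next: fix the failing tests in test_payments.py",
]
compacted = compact(history, budget=20)
print(compacted)
```

The design choice worth noticing is that the most recent turns are kept verbatim while only older ones are summarized, so the agent keeps exact detail about what it is doing right now and a lossy memory of how it got there.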

The Benchmarks

SWE-Bench Pro (realistic software engineering tasks):

GPT-5.2-Codex: 56.4%.
GPT-5.2: 55.6%.
GPT-5.1: 50.8%.

Terminal-Bench 2.0 (real terminal environment tasks):

GPT-5.2-Codex: 64.0%.
GPT-5.2: 62.2%.
GPT-5.1-Codex-Max: 58.1%.

The Cybersecurity Story

This is where things get interesting—and concerning enough that OpenAI is taking a measured approach.

The capability jump: GPT-5.2-Codex shows a "sharp jump in capability" on cybersecurity evaluations compared to predecessors.

Real-world example: Just last week, a security researcher using GPT-5.1-Codex-Max found and responsibly disclosed a vulnerability in React that could lead to source code exposure. Andrew MacPherson was using Codex to study a different React vulnerability when it surfaced three previously unknown security flaws.

OpenAI's response: While GPT-5.2-Codex doesn't reach their "High" capability threshold for cybersecurity (meaning it can't yet automate end-to-end cyber operations against hardened targets), they're preparing as if the next model might.

The trusted access pilot: OpenAI is creating an invite-only program for vetted security professionals. If you do vulnerability research or authorized red-teaming, you can express interest in getting access to more capable models for defensive work.

The Professional CTF Results

On professional-level Capture-the-Flag cybersecurity challenges:

GPT-5.2-Codex: 56%.
GPT-5.1-Codex-Max: 44%.
GPT-5.2-Thinking: 40%.

The jump from 44% to 56% is substantial—and it's largely due to compaction allowing the model to sustain coherent progress over extended periods.
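For a sense of scale across the three benchmarks quoted in this part, here is a small Python sketch that turns each pair of scores into a gain in percentage points and a relative gain. The scores are the ones listed above; the helper name gain and the choice of comparison model for each row are ours.

```python
from typing import Tuple

# Scores quoted in this part, in percent: (GPT-5.2-Codex, comparison model).
benchmarks = {
    "SWE-Bench Pro (vs GPT-5.2)": (56.4, 55.6),
    "Terminal-Bench 2.0 (vs GPT-5.2)": (64.0, 62.2),
    "Professional CTF (vs GPT-5.1-Codex-Max)": (56.0, 44.0),
}


def gain(new: float, old: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    points = new - old
    return round(points, 1), round(100 * points / old, 1)


for name, (new, old) in benchmarks.items():
    points, relative = gain(new, old)
    print(f"{name}: +{points} points ({relative}% relative)")
```

The coding benchmarks move by a point or two over GPT-5.2, while the CTF result moves by twelve points over GPT-5.1-Codex-Max, which is why the security story gets its own section.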

What This Means for Developers

GPT-5.2-Codex is available now for paid ChatGPT users in Codex CLI:

$ npm i -g @openai/codex

API access is coming "in the coming weeks" with additional safeguards.

Part 3: ChatGPT's App Store Opens to Developers

Here's a stat that should make you pay attention:

Apple launched the App Store to 6 million iPhone users. ChatGPT has 900 million users.

The world's largest app store just opened.

What's Happening: Starting today, developers can submit apps for review and publication in ChatGPT. This was announced at DevDay earlier this year, but now it's real.

The scale comparison is staggering: When Apple opened the App Store in 2008, they had 6 million iPhone users. Facebook Platform launched to about 50 million users in 2007. ChatGPT is opening to 900 million monthly active users. From that POV, this is the largest platform launch in tech history.
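The ratios behind that comparison are easy to check. This short Python sketch uses only the user counts quoted above; the dictionary keys are our labels.

```python
# Users available at platform launch, as quoted above.
launch_users = {
    "Apple App Store (2008)": 6_000_000,
    "Facebook Platform (2007)": 50_000_000,
    "ChatGPT apps": 900_000_000,
}

chatgpt = launch_users["ChatGPT apps"]

# How many times larger ChatGPT's audience is than each earlier launch.
ratios = {
    name: chatgpt / users
    for name, users in launch_users.items()
    if name != "ChatGPT apps"
}

for name, ratio in ratios.items():
    print(f"ChatGPT vs {name}: {ratio:.0f}x")
```

That works out to 150x the App Store's launch audience and 18x Facebook Platform's.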

How it works:

Apps extend ChatGPT conversations by bringing in context and enabling actions.
Order groceries, turn an outline into a slide deck, search for an apartment—all within ChatGPT.
Apps can be triggered via @ mentions, the tools menu, or automatically when contextually relevant.
Unlike traditional apps that require you to leave ChatGPT, these integrate directly into the conversation flow.

The difference from GPTs: If you remember Custom GPTs, this is different. GPTs were basically prompt templates—personality wrappers around the base model. Apps have real functionality: they can access external services, complete transactions, and perform actions in the real world.

Think of it this way: a GPT could tell you how to order groceries. An app actually orders them.

Discovery:

New app directory inside ChatGPT.
Accessible from the tools menu or directly at chatgpt.com/apps.
Developers can use deep links to send users directly to their app page.

Resources for Developers

What OpenAI is providing: the ChatGPT app directory lets you discover and connect apps right inside your chat, such as ordering groceries, turning an outline into a slide deck with Canva, or browsing apartments on Zillow's map. Developers can now submit apps for review using the Apps SDK (now in beta), with approved apps rolling out gradually starting next year.

ChatGPT Apps SDK resources:

1. App submission guidelines
2. Best practices documentation: What makes a great ChatGPT app
3. Open-source example apps: openai-apps-sdk-examples (GitHub)
4. Open-source UI library for chat-native interfaces: apps-sdk-ui (GitHub)
5. Step-by-step quickstart guide: Apps SDK Quickstart

The Monetization Model

Early stage, but here's what we know:

Developers can link out to their own websites/apps to complete transactions for physical goods.

Digital goods monetization is "being explored."

Apps that resonate may be featured more prominently or recommended by ChatGPT.

The First Apps

The first set of approved apps will begin rolling out "gradually in the new year."

What makes a good ChatGPT app:

Tightly scoped.
Intuitive in chat.
Delivers clear value by completing real-world workflows or enabling AI-native experiences.

Part 4: AI for Science—FrontierScience and the Research Push

This connects directly to what Sam said about scientific discovery being what he's "most excited about."

FrontierScience: A New Benchmark

OpenAI released FrontierScience, a benchmark designed to measure AI capabilities for expert-level scientific reasoning.

Why it matters: Previous benchmarks like GPQA are getting saturated. When GPQA launched in November 2023, GPT-4 scored 39%. GPT-5.2 now scores 92%. We need harder tests.

The structure:

FrontierScience-Olympiad: 100 questions designed by international olympiad medalists (physics, chemistry, biology)
FrontierScience-Research: 60 PhD-level research subtasks with 10-point rubrics
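To picture how a 10-point rubric turns a free-form research answer into a number, here is a minimal Python sketch. Everything in it is illustrative: the criteria, the point split, and the awarded points are invented, and the pass cutoff of 7 is an assumption for the example rather than something stated in this post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricItem:
    description: str  # hypothetical criterion text
    points: float     # points available for this criterion
    awarded: float    # points a grader gave the response


def score(rubric: List[RubricItem]) -> float:
    """Total awarded points, checking each award stays within its item."""
    for item in rubric:
        if not 0 <= item.awarded <= item.points:
            raise ValueError(f"award out of range for: {item.description}")
    return sum(item.awarded for item in rubric)


rubric = [
    RubricItem("Identifies the correct mechanism", 4, 4),
    RubricItem("Sets up the governing equations", 3, 2),
    RubricItem("Reaches the right numerical estimate", 2, 0),
    RubricItem("States key assumptions", 1, 1),
]

total = score(rubric)
available = sum(item.points for item in rubric)
PASS_THRESHOLD = 7  # assumed cutoff, for illustration only
print(f"{total}/{available} points, pass={total >= PASS_THRESHOLD}")
```

The point of a rubric like this is partial credit: a response can get the mechanism right and still lose points on the final estimate, which is a better fit for open-ended research tasks than a single right-or-wrong answer.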

The Results

Olympiad track:

GPT-5.2: 77%.
Gemini 3 Pro: 76%.
GPT-5.1: Lower (not specified).

Research track:

GPT-5.2: 25%.
GPT-5: 25%.
Claude Opus 4.5: Lower.

The Olympiad performance is remarkable; these are problems at the level of International Physics/Chemistry/Biology Olympiad competitions. And the Research performance shows there's significant room to grow on open-ended scientific reasoning.

What This Means

The gap between structured and open-ended: Models are getting very good at well-defined problems with clear solutions. They're still struggling with the messy, open-ended work of actual research.

Where we are on the discovery timeline:

Per Sam's interview, "Small discoveries were going to start in 2026. They started in 2025, in late 2025."

FrontierScience provides a way to track this progress systematically.

The honest limitations (from OpenAI's paper):

These are constrained problems, not open-ended hypothesis generation.

Real science involves interaction with reality (wet labs, experiments).

Text-only evaluation misses important modalities.

The paper is linked in the receipts list near the top.

Part 5: Can We Trust What AI Is Thinking? New Research on Monitoring Reasoning

Here's a question that's going to matter a lot as AI systems get more autonomous: Can we tell when they're lying to us?

OpenAI also just published research on "chain-of-thought monitorability": basically, measuring how well we can detect what a reasoning model is actually doing by reading its thinking process.

Why This Matters

Modern reasoning models like GPT-5.2 and o3 "think out loud" before giving answers. They produce a chain-of-thought (CoT) that shows their reasoning. This creates a unique opportunity: if we can monitor that reasoning, we might catch misbehavior before it causes problems.

But there's a concern. What if models learn to hide their true reasoning? What if they figure out how to think one thing while writing something else? As these systems get deployed in higher-stakes settings, we need to know if we can trust what they're saying they're thinking.

The Key Findings

OpenAI built a comprehensive evaluation suite to test monitorability across different scenarios. Here's what they found:

Monitoring CoT works much better than monitoring outputs alone. When a separate "monitor" model only sees what the AI did (its actions and final answers), it struggles to detect problematic reasoning. When it sees the full chain-of-thought, detection improves dramatically.

Longer thinking = more monitorable. Models that reason for longer periods tend to be easier to monitor. This makes intuitive sense—more words means more opportunity to leak information about what you're actually doing.

RL training doesn't seem to hurt monitorability. This was a big worry. When you train models with reinforcement learning, you're optimizing them to get high rewards. In theory, this could teach them to hide reasoning that might get flagged. But at current frontier scales, OpenAI found that RL optimization doesn't materially reduce monitorability.

There's a "monitorability tax" option. Interestingly, you can often get higher monitorability by using a smaller model at higher reasoning effort instead of a larger model thinking briefly. This costs more compute, but it's a lever you can pull when safety matters more than efficiency.

The Three Ways They Test This

OpenAI proposes three types of evaluations:

Intervention evaluations: Give the model a hint or bias, then check if monitors can detect when that hint influenced the answer
Process evaluations: For problems with known solution paths, check if monitors can identify which path the model used
Outcome-property evaluations: When the model does something bad (like hacking a test), check if monitors can catch it

Across all three, CoT monitoring substantially outperformed action-only monitoring.
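Here is a toy Python sketch of how an intervention-style evaluation could be scored, comparing a monitor that reads the chain-of-thought with one that only sees the final action. The trial data is invented purely for illustration and is not from OpenAI's research; the names Trial and rates are hypothetical, and scoring by true and false positive rates is our assumption, not necessarily the metric the paper uses.

```python
from typing import List, NamedTuple, Tuple


class Trial(NamedTuple):
    hint_used: bool    # ground truth: did the hint change the answer?
    cot_flag: bool     # monitor verdict when it reads the chain-of-thought
    action_flag: bool  # monitor verdict when it sees only the final answer


def rates(trials: List[Trial], flag: str) -> Tuple[float, float]:
    """Return (true positive rate, false positive rate) for one monitor."""
    positives = [t for t in trials if t.hint_used]
    negatives = [t for t in trials if not t.hint_used]
    tpr = sum(getattr(t, flag) for t in positives) / len(positives)
    fpr = sum(getattr(t, flag) for t in negatives) / len(negatives)
    return tpr, fpr


# Invented toy data: four trials where the hint mattered, four where it did not.
trials = [
    Trial(True, True, False),
    Trial(True, True, True),
    Trial(True, True, False),
    Trial(True, False, False),
    Trial(False, False, False),
    Trial(False, False, False),
    Trial(False, True, False),
    Trial(False, False, False),
]

for flag in ("cot_flag", "action_flag"):
    tpr, fpr = rates(trials, flag)
    print(f"{flag}: catches {tpr:.0%} of hinted answers, {fpr:.0%} false alarms")
```

Tracking false alarms alongside catches matters because a monitor that flags everything would look perfect on detection alone.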

The Honest Limitations

The research is clear-eyed about what this doesn't prove:

It doesn't guarantee models won't learn to obfuscate in the future

Some behaviors (like sycophancy) were harder to detect than others (like reward hacking)

The gap between what models could do and what monitors can catch isn't zero

Bottom line: We can monitor AI reasoning today, and it works pretty well. But this is an arms race, and OpenAI is building the tools to track whether that changes. The full paper has the details.

Part 6: OpenAI Joins the "AI Manhattan Project" with Department of Energy

While OpenAI pushes AI for science from the research side, they're also building partnerships to put those capabilities to work. Today they announced a memorandum of understanding with the U.S. Department of Energy—joining what DOE is calling the "most significant program since the Manhattan Project."

The Genesis Mission

The DOE's Genesis Mission aims to connect America's national laboratories, supercomputers, user facilities, and federal datasets into a unified AI-enabled research platform. The goal: double the productivity of American science and engineering.

OpenAI is one of 24 companies signing MOUs with DOE. The list reads like a who's who of AI and computing: Nvidia, Microsoft, Google, AWS, Anthropic, Oracle, AMD, Intel, and others.

What OpenAI Is Actually Doing

OpenAI has been working with DOE national labs for months, deploying frontier models in actual research environments.

Kevin Weil, OpenAI's VP of Science, attended a White House event alongside DOE officials. His statement captures the philosophy: "When frontier AI meets the expertise of the national labs, it opens up new ways to explore ideas, test them faster, and accelerate scientific progress."

The MOU creates a framework for:

Information sharing and technical coordination.
Potential collaborations in fusion energy (where DOE labs have world-leading facilities).
Integration of AI into scientific workflows across biology, energy, and physical sciences.

Why This Connects to Everything Else

Remember what Sam said in the interview: "Small discoveries were going to start in 2026. They started in 2025, in late 2025."

The DOE partnership is how those discoveries scale up. OpenAI can build the smartest models in the world, but actually making scientific breakthroughs requires:

Access to massive experimental facilities (national labs have them).
Real scientific data (DOE has petabytes of it).
Domain expertise (national lab scientists have decades of it).
Compute at scale (the Genesis Mission is adding that).

The FrontierScience benchmark and the DOE partnership are both part of the same thesis: 2026 is the year AI starts doing real science.

OpenAI also submitted recommendations to the White House Office of Science and Technology Policy calling 2026 the "Year of Science" and arguing that access to frontier AI models, compute, and real research environments is essential to maintaining American scientific leadership.

Edison Scientific—A $70M Bet on AI Scientists

While OpenAI pushes AI for science from the model side, a new startup is attacking it from the application side—and just raised one of the largest seed rounds in biotech history.

Edison Scientific's $70M seed round was led by Triatomic Capital, Spark Capital, and a major US institutional biotech investor. For context: a typical biotech seed round is $2-5M. This is a statement of ambition.

Their mission: "Science is too slow. At Edison, we are integrating AI Scientists into the full stack of research, from basic discovery to clinical trials."

The problem they're solving: Drug discovery takes 12-15 years on average and costs billions of dollars. The failure rate is staggering—over 90% of drugs that enter clinical trials never make it to market. And much of that time is spent on tasks that could be automated: literature review, hypothesis generation, data analysis, experiment design.

The goal: Cures for all diseases by mid-century. What, like it's hard?

That sounds hyperbolic until you do the math. If AI can cut drug discovery timelines from 15 years to 5 years, and reduce failure rates by identifying bad candidates earlier, the compounding effects are enormous. More drugs, faster, cheaper, with better targeting.
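
To make "do the math" concrete, here is a rough back-of-the-envelope sketch. It treats a research pipeline as a fixed number of candidate slots, so steady-state approvals per year are simply slots times success rate divided by years per candidate. The 15-year timeline and roughly 10% success rate come from the figures above; the pipeline size of 100 and the improved 20% success rate are hypothetical numbers chosen only for illustration, and the model ignores cost, regulation, and everything else that makes drug development hard.

```python
def approvals_per_year(pipeline_slots, years_per_candidate, success_rate):
    """Steady-state approvals per year for a fixed-size pipeline (toy model)."""
    return pipeline_slots * success_rate / years_per_candidate

# Baseline from the article: ~15 years per drug, ~90% failure (10% success).
# The pipeline size of 100 is a hypothetical value for illustration.
baseline = approvals_per_year(100, 15, 0.10)

# Scenario 1: timelines drop to 5 years, success rate unchanged.
faster = approvals_per_year(100, 5, 0.10)

# Scenario 2: 5-year timelines plus a hypothetical 20% success rate
# (an assumption standing in for "identifying bad candidates earlier").
faster_and_better = approvals_per_year(100, 5, 0.20)

speedup_faster = faster / baseline
speedup_both = faster_and_better / baseline

print(f"Baseline: {baseline:.2f} approvals/year")
print(f"Faster timelines: {faster:.2f} approvals/year ({speedup_faster:.1f}x)")
print(f"Faster and better: {faster_and_better:.2f} approvals/year ({speedup_both:.1f}x)")
```

Under these assumptions the two improvements multiply rather than add: shorter timelines alone give about 3x the output, and doubling the success rate on top of that gives about 6x. That multiplication is the "compounding" the paragraph above is pointing at.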

What They're Building

Kosmos: An AI scientist agent (costs 200 credits).
Analysis agent: For data analysis.
Literature agent: For research synthesis.
Additional agents for various scientific workflows.

Here's how to access the model:

Academics and students: 650 credits/month indefinitely.
Paid users: Free access to regular agents via UI; API access is paid.
Current pricing: $200/month subscription for 650 credits (may change at next major update).
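
For a sense of what those numbers mean in practice, here is a quick arithmetic sketch using only the figures listed above. It assumes the 200-credit Kosmos price is per run and that unused credits do not carry over to the next month; neither detail is confirmed here, and the pricing itself may change.

```python
# Figures from the pricing list above.
SUBSCRIPTION_USD = 200      # monthly subscription price
CREDITS_PER_MONTH = 650     # credits included each month
KOSMOS_CREDITS = 200        # assumed to be the cost of ONE Kosmos run

# Effective price of a single credit on the paid plan.
usd_per_credit = SUBSCRIPTION_USD / CREDITS_PER_MONTH

# How many full Kosmos runs fit in a month, and what is left over.
kosmos_runs, leftover_credits = divmod(CREDITS_PER_MONTH, KOSMOS_CREDITS)

# Effective dollar cost of one Kosmos run at that per-credit price.
usd_per_kosmos_run = usd_per_credit * KOSMOS_CREDITS

print(f"Price per credit: ${usd_per_credit:.2f}")
print(f"Kosmos runs per month: {kosmos_runs} (with {leftover_credits} credits left over)")
print(f"Effective cost per Kosmos run: ${usd_per_kosmos_run:.2f}")
```

On these assumptions, a paid subscriber gets three Kosmos runs a month at roughly $62 each, with 50 credits left for the smaller agents.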

Short-term roadmap for Kosmos:

Ability to automatically access data.
Ability to steer exploration.
Ability to converse directly with its world model.

Long-term: "Exponentially increasing rates of scientific discoveries, in biology and elsewhere"

What This All Means for 2026

Let's connect the dots from everything above.

The Interface Revolution Is Coming

Sam's most actionable insight: the companies that will win are building AI-first experiences, not bolting AI onto existing products.

2026 prediction: We'll see the first breakout products that reimagine core workflows. Not "Gmail with AI"—something fundamentally different.

ChatGPT's interface will likely see its biggest changes since launch. Sam already hinted at this: canvas, proactive behavior, Codex-style autonomous work.

The Memory Arms Race

Memory is in its "GPT-2 era." This is the next frontier for differentiation.

2026 prediction: Whichever AI assistant has the best memory and personalization will be incredibly sticky. Expect major announcements in this space.

Enterprise Takes Center Stage

OpenAI's enterprise growth is outpacing consumer growth. The API business grew faster than the ChatGPT consumer business in 2025.

2026 prediction: The battle for enterprise AI shifts from "which model is smartest" to "which platform handles my company's data, agents, and workflows best."

Scientific Discovery Accelerates

Small discoveries are already happening with GPT-5.2, which scores 77% on the Olympiad-level science problems in the FrontierScience benchmark.

2026 prediction: We'll see at least a few papers where AI made a meaningful contribution to a scientific result—not just as a tool, but as something approaching a collaborator.

Edison Scientific's $70M raise is a signal. Smart money is betting on AI scientists becoming real.

The Capability Overhang Closes (Slowly)

Sam's key insight: models can do far more than people are using them for. But adoption is slow because changing workflows is hard.

2026 prediction: The gap between "what AI can do" and "what people use AI for" will narrow—but probably not as fast as you'd expect. This is a multi-year process.

Compute Keeps Scaling

OpenAI is tripling compute annually. Revenue is growing even faster. The $1.4 trillion infrastructure commitment makes sense in this context.

2026 prediction: We'll see at least one more "holy shit" moment when the sheer scale of new models becomes apparent. The difference between current models and what's training now is probably larger than most people realize.

The App Ecosystem Emerges

ChatGPT's app store opens to 900M users. The first approved apps roll out in early 2026.

2026 prediction: By the end of 2026, there will be at least one ChatGPT app that becomes a meaningful business—and a whole ecosystem of failed attempts. Just like the early App Store days.

Cybersecurity Gets Serious

GPT-5.2-Codex's cyber capabilities are significant enough that OpenAI is creating a trusted access program for defensive researchers.

2026 prediction: AI-assisted security research becomes mainstream. The React vulnerability disclosure is just the beginning.

The Bottom Line:

The company that built the most successful AI product in history is now building the platform for AI-first computing.

Here's the uncomfortable truth: If you're a knowledge worker, the capability overhang means AI can probably already do significant parts of your job better than you can. That's not because you're bad at your job, but because AI has (or, we should say, will soon have) perfect memory and infinite patience, and can process information at superhuman speed.

The question isn't "when will AI get good enough?" The question is "when will I change my workflow?"

The opportunity: The people who figure out AI-first approaches to their work—whether that's product management, sales, research, writing, or anything else—have a temporary arbitrage. They'll be 2-3x more productive than colleagues who are "waiting for AI to improve."

The warning: That arbitrage window is closing. Not because AI is getting smarter (though it is), but because adoption is finally accelerating. The early adopters are starting to look like mainstream users.

What to do about it:

Stop bolting AI onto your existing workflow. Think about what your job would look like if it were designed around AI from scratch.

Pay attention to memory. The AI assistant that knows everything about you will be incredibly valuable—and sticky. Start building that relationship now (and to avoid getting locked into any one platform, consider a memory system you can take with you).

Watch the app store. The first great ChatGPT apps will show what AI-native products actually look like.

Expect enterprise to change fast. If you work at a company that's slow to adopt AI, you might be at a disadvantage.

Start thinking about what AI-first looks like for your domain. Because the people who figure that out are going to have a very good 2026.

Watch the full Sam Altman interview on Big Technology Podcast here.