Mercury Alpha, Codex, Hermes Desktop: AI Agent Guide

Every few months, the AI internet finds a mysterious model name, tapes red string across a corkboard, and turns screenshots into prophecy. This time, the names were Mercury-alpha and Jewel Alpha, with plenty of people wondering whether one of them was actually GPT-5.6.

That was the hook for our June 4 livestream. The better story ended up being bigger than one rumored model. We used the rumors as a jumping-off point to talk through OpenAI's new memory architecture, Codex's push beyond developers, Hermes Desktop, Microsoft AI's new model stack, Claude Code, desktop agents, agentic operating systems, and the infrastructure problems that show up once agents become real work tools.

You can click the video below to watch the stream, or use the rest of this article as a guided tour. We start with the topics promised in the video description, then circle back to the best demos, tangents, and useful workflow ideas that came up along the way. Enjoy!

Start here: the fastest path through the stream
1. What Mercury-alpha is, and why people thought it could be GPT-5.6
2. The biggest rumors and evidence so far
3. What a new OpenAI model would need to deliver
4. OpenAI's new memory feature: the practical part
5. Codex is turning into a work surface, not a coding sidebar
6. Hermes Desktop: the user-owned agent angle
7. Microsoft MAI-Image-2.5 and the image workflow lesson
8. Microsoft, Windows Skills, and Scout: the agentic OS path
9. How Mercury-alpha fits into the broader AI agent race
10. The infrastructure conversation: agents still need chips, memory, power, and trust
11. Bonus chapters we loved, even though they were complete ADHD tangents
12. What this means for AI users, builders, and businesses
Links from the episode
Here are all the key moments, timestamped

Start here: the fastest path through the stream

6:20: Grant lays out the working agenda: rumored new OpenAI model, OpenAI memory, Codex, Hermes Desktop, small open models, and Nvidia Nemotron.
27:44: The Mercury-alpha and Jewel Alpha rumor section begins.
31:35: Codex shifts from coding tool to broader work surface.
43:18: OpenAI's new memory feature gets pulled up and tested live.
1:18:09: Hermes Desktop becomes the main demo.
12:31: Microsoft MAI-Image-2.5, Nano Banana comparisons, and the live image test.
53:43: Anthropic's recursive self-improvement post turns the agent conversation into something bigger.

1. What Mercury-alpha is, and why people thought it could be GPT-5.6

The cleanest way to understand Mercury-alpha is this: during the stream, we treated it as a rumored internal model name, not a confirmed OpenAI release. Grant opened the rumor section with posts from the AI X rumor circuit and the newer Jewel Alpha chatter at 27:44. The exact public claim being discussed was simple enough: one post said "Mercury Alpha," and Andrew Curran replied that it was "aka GPT-5.6" at 31:08.

That is evidence of a rumor, not evidence of a product. The interesting part is why the rumor was believable enough to discuss.

Internal model names leak into public view. Corey noted at 28:22 that OpenAI's Riley Brown had shared a video where "Jewel Alpha" appeared at the bottom of the screen, then the video disappeared. That does not prove a release, but it does explain why people paid attention.
Model names can be checkpoints. At 28:48, Corey pointed out that these names could be different checkpoints on the same model. Translation: the lab may be testing several saved versions before deciding what becomes public.
Checkpointing is normal model-building behavior. At 29:12, Corey explained a useful mental model: labs train multiple versions with different data mixes, compare them, and then "save" the model at a moment that performs best for the target use case.

That last point is the best practical takeaway from the rumor section. A frontier model release is rarely one magic file called GPT-whatever. It is closer to a long series of experiments where labs vary the data, training recipe, post-training, tool access, safety behavior, and product wrapper. The internet sees a codename and asks, "Is this GPT-5.6?" The lab is usually asking a messier question: "Which candidate is good enough, useful enough, safe enough, and product-ready enough to ship?"
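
As a rough illustration of that "which candidate ships?" question, here is a toy sketch. It is not any lab's actual process: the candidate names, data-mix labels, eval scores, and the `pick_checkpoint` helper are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    name: str       # hypothetical internal codename
    data_mix: str   # label for the training data recipe
    scores: dict    # eval name -> score between 0 and 1 (invented numbers)


def pick_checkpoint(candidates, weights):
    """Return the candidate with the highest weighted eval score."""
    def weighted(c):
        return sum(w * c.scores.get(k, 0.0) for k, w in weights.items())
    return max(candidates, key=weighted)


candidates = [
    Checkpoint("candidate-a", "more-code", {"coding": 0.90, "safety": 0.70}),
    Checkpoint("candidate-b", "balanced", {"coding": 0.78, "safety": 0.90}),
]

# The "best" save point depends on what the target use case rewards.
coding_first = pick_checkpoint(candidates, {"coding": 0.8, "safety": 0.2})
safety_first = pick_checkpoint(candidates, {"coding": 0.2, "safety": 0.8})
print(coding_first.name, safety_first.name)
```

The same two candidates produce different winners once the weights change, which is one plausible reason several codenames could circulate for what is ultimately one model.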

The model-name archaeology was fun. The real artifact was the reminder that product launches are the polished tip of a very weird iceberg.

2. The biggest rumors and evidence so far

Here is the useful version of the evidence stack, with the speculation kept in its proper little containment unit:

27:44: Grant introduces ChrisGPT as an alleged insider and one of the vague-post kings of AI X. Useful context, low certainty.
28:09: Grant says the stream is about Mercury Alpha, but Jewel Alpha has also entered the rumor stream.
28:22: Corey describes the Riley Brown video where Jewel Alpha appeared on screen, then disappeared later. Useful signal, still not a launch confirmation.
28:48: Corey says Mercury Alpha and Jewel Alpha could be different checkpoints on the same model. This is the least sensational and most technically plausible interpretation.
31:13: Grant shows the "aka GPT-5.6" rumor and says the more realistic expectation would be a release later, not necessarily that day.

The counter-narrative matters here. AI Twitter loves to turn internal names into consumer roadmaps. Labs use internal names for testing, routing, evaluation, previews, partner access, red-teaming, and product experiments. Sometimes the codename becomes the product story. Plenty of times, it becomes trivia.

That does not make the rumor useless. It tells us what people expect a new OpenAI flagship to do. Nobody was excited about Mercury-alpha because the name sounded cool. People were excited because the next OpenAI model is expected to prove whether the agent era can get less brittle.

3. What a new OpenAI model would need to deliver

The stream never turned this into a formal scorecard, but the answer emerges across the OpenAI sections. A new flagship model would need to do more than win a few benchmark charts. Benchmarks still matter, but the product bottleneck has moved.

A meaningful GPT-5.6-class release would need to improve at least four things:

Memory that stays useful. OpenAI's new memory system is about making ChatGPT carry context across chats, follow user constraints, and update those memories over time. Grant walks through the official post at 43:38 and 44:40. OpenAI's own post says memory is meant to help future conversations start from shared context instead of scratch, with a new architecture built on Dreaming and memories that users can review on a summary page (OpenAI).
Tool use that feels like work, not a demo. Codex is becoming the place where OpenAI can connect models to files, apps, plugins, connectors, review surfaces, and deployable artifacts. Grant starts that section at 31:35. OpenAI separately said Codex now has more than 5M weekly active users and that knowledge workers make up about 20% of users, growing more than 3x as fast as developers (OpenAI).
A cleaner product surface. At 38:12, Corey points out the naming problem: Codex and Claude Code sound like engineering tools, even when the product is useful for analysts, marketers, researchers, bankers, and operators.
Safe action in the real world. Near the end, at 2:00:47, Grant asks why we still have not seen a great demo of agents buying things. That matters. The jump from "draft a report" to "spend money on my behalf" is where trust gets tested.

The real product question is whether a new model makes agents less fragile across the whole workflow: remember the user, choose the right tools, keep state, ask for permission at the right moment, recover from mistakes, and leave an audit trail a normal person can understand.
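
The "ask for permission, leave an audit trail" part of that list can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any shipping agent works: the set of risky actions, the `run_action` helper, and the `cautious_user` callback are all hypothetical.

```python
from datetime import datetime, timezone

# Assumed risk tier: actions that should never run without a human yes.
RISKY_ACTIONS = {"send_email", "spend_money", "publish"}

audit_log = []


def run_action(action, detail, approve):
    """Ask `approve` before risky actions and log every decision."""
    needs_ok = action in RISKY_ACTIONS
    allowed = approve(action, detail) if needs_ok else True
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "detail": detail,
        "asked_user": needs_ok,
        "allowed": allowed,
    })
    return allowed


def cautious_user(action, detail):
    """Stand-in for a real confirmation prompt: refuse anything that spends money."""
    return action != "spend_money"


run_action("draft_report", "Q2 summary", cautious_user)
run_action("spend_money", "order dinner", cautious_user)
for entry in audit_log:
    print(entry["action"], "asked:", entry["asked_user"], "allowed:", entry["allowed"])
```

The log records the refused action as well as the completed one, so a reviewer can see what the agent tried to do, not only what it did.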

4. OpenAI's new memory feature: the practical part

OpenAI's memory update was the most immediately useful piece of the stream because it affects how normal people use ChatGPT day to day. Grant pulls up OpenAI's post at 43:18, then reads the core idea at 43:52: memory helps ChatGPT learn your preferences, projects, and constraints so future conversations do not start from zero.

The most useful details:

44:21: OpenAI says the old Dreaming layer had helped personalization but was historically not enough as a standalone memory system.
44:26: The new system is described as a more capable and compute-efficient memory architecture built on top of Dreaming.
44:31: Memories synthesized by Dreaming are reviewable through a visible summary page, which gives users a place to inspect and correct them.
44:40: Grant reads the three qualities of good memory: carry useful context forward, follow preferences and constraints, and stay current over time.
45:05: The "stay current" example matters because memories can go stale. If ChatGPT remembers a birthday party after the birthday passed, that is not personalization. That is clutter.
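
The "stay current" idea can be made concrete with a toy example. This is only a sketch of the concept, not how OpenAI implements memory: the `Memory` class, its `expires` field, and the sample entries are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Memory:
    text: str
    expires: Optional[date] = None  # None marks a durable preference


def current_memories(memories, today):
    """Keep durable memories and time-bound ones that have not expired."""
    return [m for m in memories if m.expires is None or m.expires >= today]


memories = [
    Memory("Prefers concise answers with source links"),
    Memory("Planning a birthday party", expires=date(2026, 5, 20)),
]

kept = current_memories(memories, today=date(2026, 6, 4))
print([m.text for m in kept])
```

The durable preference survives and the past-dated party drops out, which is the difference between personalization and clutter.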

Grant then turns the announcement into a live workflow at 45:25. He drops the OpenAI memory link into ChatGPT, assigns it to The Neuron project, and lets his Neuron main story skill shape the output. The important part comes at 46:08: OpenAI is trying to move some of that context management into the app itself, so users do not have to manually maintain skill files for every preference, constraint, and writing pattern.

That is useful, but it needs supervision. At 48:27, Grant points out that the generated draft linked the wrong anchor. That is the memory lesson in miniature: personalization can save time, but source hygiene still belongs to the human.

Try this after watching

Open your memory settings and read what ChatGPT thinks matters about you.
Delete stale memories that no longer help.
Add durable preferences that actually shape output, like your audience, tone, constraints, tool stack, and recurring projects.
Ask: "What assumptions are you carrying into this response from memory or project context?"
For serious work, still provide the current source links. Memory should reduce repeated setup, not replace fresh evidence.

5. Codex is turning into a work surface, not a coding sidebar

The Codex section is where the stream shifts from "new model rumors" to "how people may actually work with AI." Grant starts at 31:35 with OpenAI's announcement that Codex is increasingly useful beyond software development.

At 32:08, he reads the key number: more than 5M people use Codex for work every week. At 32:26, he highlights the real shift: non-developers, including analysts, marketers, operators, designers, researchers, investors, and bankers, make up roughly 20% of overall Codex users and are growing more than 3x as fast as developers. Axios reported the same OpenAI figures and added that fast-growing Codex task categories included data analysis, research, and knowledge artifacts like presentations (Axios).

The plugin explanation at 32:56 is the simplest way to explain the whole strategy: a plugin is a bundle of tools, skills, and connectors aimed at a specific task or role. That matters because most work does not happen inside one chat window. Work happens across email, documents, calendars, spreadsheets, dashboards, Slack threads, design files, data warehouses, and weird internal tools nobody wants to talk about.
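
One way to picture "a bundle of tools, skills, and connectors" is as a small data structure. The sketch below is purely illustrative and is not Codex's plugin format: the `Plugin` class and its field names are assumptions, and the sales example borrows the connector names mentioned in the stream.

```python
from dataclasses import dataclass, field


@dataclass
class Plugin:
    role: str
    tools: list = field(default_factory=list)
    skills: list = field(default_factory=list)
    connectors: list = field(default_factory=list)


sales = Plugin(
    role="sales",
    skills=["meeting prep", "find priority accounts"],
    connectors=["Salesforce", "HubSpot", "Slack", "Outreach", "Clay"],
)


def enabled_connectors(plugins):
    """List every external system an agent could touch with these plugins on."""
    return sorted({c for p in plugins for c in p.connectors})


print(enabled_connectors([sales]))
```

Listing the connectors up front is a cheap way to see how much surface area a single role plugin opens before you switch it on.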

The demos and examples tell the story:

33:33: Codex Sites lets you create and share sites directly inside Codex. Corey compares that to why he still uses v0: fast build plus easy deploy.
34:34: Grant connects this to a broader AI workflow lesson: if a human has to review or use the output, HTML is often the better format. He uses HTML for newsletter writing because it preserves links cleanly.
35:50: The data analytics plugin can explore business data, explain metric changes, and create reports and dashboards with sources like Snowflake, Databricks, and Tableau.
36:09: The creative production plugin can turn briefs into reviewable assets using tools like Figma, Canva, Shutterstock, and Fal.
36:26: The sales plugin can help prepare for meetings, find priority accounts, and connect with tools like Salesforce, HubSpot, Slack, Outreach, and Clay.
1:28:19: Grant opens the Codex app and shows the new role-specific plugins, including creative production, sales, and investment banking.
1:29:41: Each plugin has apps and skills under the hood. This is why Codex starts to look less like a coding chat and more like an operating surface for work.
1:30:06: Grant asks Codex to visualize how a transformer works, and Codex starts building a web data visualization.

Corey's strategic point at 37:21 is the one to remember: Codex could become the primary way many people use ChatGPT for serious work. The naming works against that, though. At 38:12, Corey says names like Codex and Claude Code imply "engineers only," even when the workflows are increasingly useful for anyone doing analytical, operational, or creative work.

There is a counterpoint, and it comes from OpenAI itself: power users can get tired, and expert oversight still matters. Axios quoted Jacob Bank, a senior OpenAI product leader, saying some heavy Codex users feel mentally fried from supervising many agents at once, and OpenAI's own Codex positioning still emphasizes human review for non-trivial work (Axios). That is the agent era in one sentence: the machine can do more, so the human has to get better at delegation, review, and knowing when to stop.

Try this after watching

Use Codex for one non-coding task: turn a messy source folder into a brief, report, checklist, or HTML page.
Turn on one role plugin that matches your actual job instead of enabling everything.
Ask Codex to show its assumptions, source list, and next actions before it edits or publishes anything.
Keep the first task low risk. A dashboard draft is fine. A contract change or customer email needs tighter review.

6. Hermes Desktop: the user-owned agent angle

The Hermes Desktop demo begins at 1:18:09. Grant frames it as his new favorite tool, and then gives the simplest description at 1:19:37: Hermes Desktop feels like an open, provider-agnostic version of Codex. You can use different model providers, different tools, and a desktop interface instead of living in a terminal.

That last part matters. At 1:20:24, Grant explains that before the desktop version, you had to install Hermes through the terminal. For a technical user, that is fine. For a normal professional, it is a small haunted house.

The live demo starts at 1:21:28, when Grant asks the local agent to explain Hermes. The response describes Hermes as an open-source AI agent framework, similar in spirit to Claude Code, Codex, or OpenClaw. The useful details come next:

1:22:06: Hermes has a skills system, so reusable workflows and lessons can be saved for later sessions.
1:22:18: It has cross-session memory for preferences, environment details, and task context.
1:22:31: It is provider agnostic, with support for many model providers, including OpenRouter, Anthropic, OpenAI, DeepSeek, and local models.
1:22:38: You can swap models without changing the rest of the workflow.
1:22:43: It can connect across Telegram, Discord, Slack, and other channels, although the desktop app means many users can avoid that setup.
1:23:20: Grant shows 77 built-in tools, including agents, creative tools, data science tools, and toolsets.
1:24:10: Toolsets include cron jobs, automations, code execution, browser automation, clarifying questions, and image generation.

The bigger workflow idea arrives at 1:24:49: should Hermes, Codex, and Claude Code all be able to use the same skills? Corey argues that skills should live in one folder so every agent can access the same reusable workflows. Grant adds the maintenance reality at 1:25:28: skills improve through versioning. When you notice an edge case, you should update the skill and track the new version.
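
Here is a minimal sketch of what "one shared folder, versioned skills" could look like. It is not the skill format used by Hermes, Codex, or Claude Code: the JSON layout, the `save_skill` and `latest_skill` helpers, and the skill name are hypothetical, and a temporary directory stands in for the shared folder.

```python
import json
import tempfile
from pathlib import Path


def save_skill(root, name, version, instructions, changelog):
    """Write one skill version as JSON under the shared folder."""
    skill_dir = Path(root) / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    path = skill_dir / f"v{version}.json"
    path.write_text(json.dumps({
        "name": name,
        "version": version,
        "instructions": instructions,
        "changelog": changelog,
    }, indent=2))
    return path


def latest_skill(root, name):
    """Load the highest-numbered version of a skill."""
    versions = sorted((Path(root) / name).glob("v*.json"),
                      key=lambda p: int(p.stem[1:]))
    return json.loads(versions[-1].read_text())


shared = tempfile.mkdtemp()  # stand-in for a folder every agent can read
save_skill(shared, "newsletter-draft", 1, "Write in HTML.", "initial version")
save_skill(shared, "newsletter-draft", 2,
           "Write in HTML. Verify every link anchor.",
           "edge case: draft linked the wrong anchor")

skill = latest_skill(shared, "newsletter-draft")
print(skill["version"], skill["changelog"])
```

Keeping old versions next to the new one means any agent pointed at the folder picks up the latest fix, and the changelog records which edge case prompted it.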

This is the local-agent worldview. OpenAI and Microsoft want the work surface. Anthropic wants Claude Code and Claude Desktop to become where the work happens. Hermes points in a different direction: the user's own agent stack, with portable skills, multiple model providers, and more control over where the work runs.

If you want to read our full walk-through guide of Hermes Agent and the new Desktop version, click here.

7. Microsoft MAI-Image-2.5 and the image workflow lesson

The Microsoft section starts earlier than the title topic suggests. Corey had just returned from Microsoft Build, and from 1:05 through 6:02 he talks through the Surface RTX Spark Dev Box and Microsoft's AI announcements. He also says at 4:49 that Microsoft's newly released models, including its coding and reasoning models, were stronger than expected.

Microsoft's official announcement said it launched seven new MAI models, including MAI-Thinking-1, MAI-Code-1-Flash, MAI-Image-2.5, MAI-Image-2.5-Flash, MAI-Voice-1.5, MAI-Transcribe-1.5, and MAI-Vision-1.5 (Microsoft AI). The stream focuses on MAI-Image-2.5 because Corey had access to a Microsoft model interface and could test it live.

At 12:31, Corey says Microsoft's new image models are strong, and Grant pulls up leaderboard context at 13:06. Microsoft says MAI-Image-2.5 and its Flash variant support text-to-image and image editing, and that its leaderboard performance surpassed Nano Banana Pro's Arena score (Microsoft AI).

The demo is useful because it shows both the promise and the friction:

16:44: The first prompt asks for an oil painting of a dog dancing in a park while it rains meatballs. The image hits the broad style and object request.
17:51: Grant asks to make it photorealistic. The edit changes less than expected.
19:04: Grant names the general image-model lesson: once the first version exists, edits often stay close to it. If you want a meaningfully different image, start a new chat or generation.
20:06: Corey tries the same concept with "photorealistic" in the first prompt. The park and meatballs improve, but the dog still has a stylized feel.
23:33: Corey prompts a woman with blue hair and red eyes leaning against a city building.
24:43: The result looks much more photoreal at normal viewing size.
25:53: Zooming in reveals the familiar digital-art sheen. The background, hands, and hair depth hold up better than many older image model outputs.

The reader-facing workflow lesson is simple: put the style, realism level, camera language, and output intent in the first prompt. Edits are great for changing details. Big style pivots are easier when you start over.

8. Microsoft, Windows Skills, and Scout: the agentic OS path

Microsoft's bigger play shows up in two different parts of the stream. At 46:42, Corey says the overlooked Build announcement was Windows Skills at the operating system level. His point at 47:09 is that OS-level skills could work wherever you go, instead of living inside one AI app. That is a step toward a truly agentic OS.

Then, near the end, Corey brings up Microsoft Scout at 1:48:28. The Verge reported before Build that Microsoft had been working on Scout as part of a broader Copilot "super app" effort, with broader availability expected later (The Verge). Corey describes Scout as an autonomous agent that can look through Microsoft 365 context, suggest recurring tasks, talk inside Teams, and act as an assistant with its own visible presence in the workflow.

The practical examples:

1:49:10: Scout can inspect email, documents, and Teams conversations to recommend tasks it can help with.
1:49:53: It can act inside Microsoft Teams as a named assistant.
1:50:33: It can help with meeting coordination, including marking the user late or rescheduling.
1:51:29: Grant compares the concept to OpenClaw for nontechnical people.

This is Microsoft's advantage in one product motion: make the agent boring enough for normal office work. Open-source agent tools can be more flexible. Codex can be more builder-oriented. Microsoft can bring agents to the place where millions of people already have email, docs, calendars, meetings, and organizational permissions.

9. How Mercury-alpha fits into the broader AI agent race

If Mercury-alpha is a future OpenAI model, its importance depends on how much it improves agentic work. The stream's best evidence for that broader race comes from Anthropic, not OpenAI.

At 53:43, Grant pulls up Anthropic's post, When AI Builds Itself, about progress toward recursive self-improvement. Recursive self-improvement means AI systems helping design, build, test, and improve future AI systems. At 54:19, Grant reads the key caution: we are not there yet, and it is not inevitable, but it could come sooner than institutions are prepared for.

The most concrete data point lands at 55:05: as of May 2026, more than 80% of code merged into Anthropic's codebase was authored by Claude, according to Anthropic's report. A secondary summary of the post highlights the same point and the three possible futures Anthropic sketches: progress stalls, progress continues with human involvement, or a more autonomous loop emerges (Times of India summary).

The stream turns this into a useful debate at 58:23. Grant says Anthropic seems coding-agent-pilled: the best general-purpose agent may be the best coding agent because code can act on computers. Corey offers the broader schools of thought at 58:52:

Some labs think solving coding will pull the rest of agency into place.
Some think many domains need to improve together.
Some think the best route is to focus on verifiable domains, where success can be checked automatically.

That is the agent race hiding under the Mercury-alpha rumor. The next model people care about will be judged by how well it handles long, multi-step, tool-heavy work. Coding is the first battleground because the output is testable. Office work, design work, research work, sales work, finance work, and personal errands come next, but each one has a harder trust problem.

10. The infrastructure conversation: agents still need chips, memory, power, and trust

Once the conversation moves from chatbots to agents, infrastructure stops being background noise. At 59:28, Grant reads Anthropic's possible futures section, including the idea that progress could stall or hit constraints outside the model itself. At 1:00:06, he points to supply chain limits: energy, compute, chip fabrication, grid expansion, and interconnect bandwidth.

Corey sharpens the point at 1:00:29: energy could be one of the largest bottlenecks, especially in the U.S. Grant explains the grid side at 1:01:11 through 1:03:19: transmission lines, transformers, permitting, physical infrastructure, and the difficulty of moving power from where it is generated to where it is needed.

The most interesting counter-frame comes at 1:06:35. Grant says the AI boom could create useful power infrastructure even if some AI spending eventually cools. During the dot-com boom, fiber got laid. During the AI boom, power plants, grid upgrades, and storage may become the durable infrastructure left behind.

That is a useful way to think about the build-out. The risk is waste, local backlash, water use, noise, and ratepayer burden. The upside is more energy capacity, better grid infrastructure, and new incentives to build. The answer depends on local execution, not slogans.

11. Bonus chapters we loved, even though they were complete ADHD tangents

We can say that, trust us ;) My medical data stays between me, my doctor, and ChatGPT, but y'know... read the subtext.

Nvidia Nemotron 3 Ultra and the open-model stack

At 7:35, Corey previews Nvidia's Nemotron 3 Ultra, then says at 8:23 that it is a 550B-parameter model with 35B active parameters. At 9:12, he says most people will not run a 500B-class model locally, but it can be accessed through providers like OpenRouter, cloud platforms, and Nvidia's own hosting. At 10:07, Grant shows an Unsloth quantized version that still needs serious memory. Tom's Hardware reported that Nvidia's Nemotron effort includes a coalition with AI labs and developer tooling companies, with Nvidia positioning Nemotron for agentic workflows and open frontier model development (Tom's Hardware).

Small local models matter too

At 6:55, Grant mentions Gemma 12B and a two-bit quantization that can run in roughly 8GB of RAM. The practical point is bigger than one model: local AI keeps getting more accessible. Frontier models grab the headlines, but small models are where privacy, offline use, and personal agent experiments become easier.
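
A quick back-of-the-envelope calculation shows why quantization changes what fits on a laptop. The formula below counts weights only (parameters times bits per weight, divided by 8 bits per byte), and `weight_gb` is a hypothetical helper written for this article, not part of any library.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


print(weight_gb(12, 16))   # a 12B model at 16 bits per weight
print(weight_gb(12, 2))    # the same model at 2 bits per weight
print(weight_gb(550, 2))   # a 550B model at 2 bits per weight
```

That works out to roughly 24 GB, 3 GB, and 137.5 GB. Treat these as floors, not requirements: the figures quoted on stream (about 8GB for the small model, about 200GB for the two-bit Nemotron build) are higher, which is consistent with runtime overhead and with quantized files that likely keep some layers at higher precision.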

Claude Design and the visual workflow loop

At 1:33:00, Grant shows Claude Design as a way to create front-end interfaces. At 1:34:44, he asks it to make panels resizable in a UI mockup. At 1:35:38, he explains the workflow: use Claude Design for a lightweight design pass, then hand the generated code to Claude Code for implementation. The higher-level workflow appears at 1:37:42 through 1:39:10: screenshot the current UI, show the ideal UI, ask an image model for a mockup, send that to Claude Design, then have Claude Code implement it. That is a very practical loop for builders who think visually before they think in code.

World ID and proof of human

At 1:39:34, Corey brings up Tools for Humanity, World ID, and the question of proving someone online is a unique human. His framing at 1:40:32 is important: World ID is not primarily about proving your legal identity. It is about proving that a unique human is present without exposing every personal detail. At 1:41:46, Corey connects that to agents, deepfakes, scams, and a future internet where software agents may generate most traffic.

You can watch our full interview with Tiago Sada from Tools for Humanity below.

The product lesson: users are data, not villains

At 1:55:41, Grant and Corey land on one of the best builder lessons in the whole stream. If users keep using your product in a way you did not expect, your product assumptions were probably wrong. Corey says the right move is to observe how people actually use the product, then adapt. The user is not failing the product. The product is revealing its real use case.

The agent buying-things problem

At 2:00:25, Corey describes a Scout demo where an agent could offer to order dinner. Grant immediately asks the practical question at 2:00:47: why have we not seen a great demo of agents buying things? Corey mentions Stripe's wallet tooling and Plaid connections at 2:00:54, but the absence of strong public demos is the point. Spending money is a much harder benchmark than drafting text.

12. What this means for AI users, builders, and businesses

The livestream starts with a rumored model and ends with a more useful map of where AI work is going. The next wave is about memory, tools, agents, and operating surfaces. The model still matters. The wrapper around the model may matter more for daily work.

For everyday AI users

Audit memory before you rely on it. Personalized output is only as good as the context it carries.
Use projects, files, and explicit constraints for current work. Memory handles durable preferences; sources handle facts.
Try Codex or Claude Code for one non-coding task. The mental hurdle is bigger than the tool hurdle.

For builders

Design agent workflows around review, permission, and recovery.
Use HTML or small web apps when the output needs to be reviewed, shared, or reused.
Version your skills like product artifacts. The edge cases are the roadmap.
Watch what users do when they misuse your product. That behavior may be the product trying to tell you what it wants to become.

For businesses

Start with low-risk, high-friction workflows: reporting, dashboards, research briefs, meeting prep, customer context, and internal tools.
Connect agents to systems slowly. Every connector adds leverage and failure surface.
Plan for the human role to change from operator to reviewer, delegator, and exception handler.
Do not evaluate agent tools only by model name. Evaluate memory, permissions, logs, integrations, and how easy it is to stop the agent when it goes sideways.

The open question is where the winning agent layer lives. OpenAI wants it inside ChatGPT and Codex. Microsoft wants it inside Windows and Microsoft 365. Anthropic wants it inside Claude Code and Claude Desktop. Tools like Hermes point toward a user-owned stack with portable skills and swappable models.

That answer will matter more than whether Mercury-alpha was GPT-5.6. The model name is the internet's favorite mystery box. The real shift is quieter: AI is moving from something you ask questions to something that remembers the work, opens the tools, builds the artifact, and asks whether you want it to take the next step.

That last step is the one to watch.

Links from the episode

Here are all the key moments, timestamped

(0:01:17) Corey says the coolest thing he saw at Microsoft Build was the Surface Ultra and RTX Spark dev box, calling the Nvidia chip “insane” and the hardware “top tier premium.”
(0:02:10) Corey highlights the Surface Ultra’s 128GB of unified RAM and says CAD, 3D world design, and games ran smoothly with no visible lag or pixelation.
(0:03:57) Corey says he interviewed Mustafa Suleyman at Microsoft Build about humanistic superintelligence, Microsoft AI’s late-2025 restructuring, and the company’s renewed focus.
(0:04:49) Corey says Microsoft’s newly released models are strong, with the MAI code model scoring around 53 on SWE-bench Pro, comparable to Claude Opus 4.6.
(0:05:16) Corey explains Microsoft’s thinking model as a 1T-parameter model with only 35B active parameters, trained on carefully curated, human-worked, copyright-friendly data.
(0:06:20) Grant frames the episode topics: Mercury-alpha rumors, OpenAI’s new memory announcement, Codex updates, Hermes Desktop, Gemma 12B, and Nvidia Nemotron 3 Ultra.
(0:06:55) Grant says Gemma 12B already has a two-bit quant that can run on 8GB of RAM, making it usable on many older or midrange computers.
(0:08:17) Corey says Nvidia Nemotron 3 Ultra is “really good,” with 550B parameters and around 35B active parameters, so it runs relatively efficiently for its size.
(0:08:30) Corey says Nemotron 3 Ultra “destroyed” several test questions that regularly trip up other models.
(0:08:54) Corey explains he tested Nemotron through a provided notebook, not locally, because almost nobody has enough hardware to run a 500B-parameter model at home.
(0:09:19) Corey says users can access Nemotron through services like OpenRouter, Bedrock, Azure / Foundry, or Nvidia’s own cloud hosting platform.
(0:09:42) Corey says Nvidia told him it would take four DGX Spark systems to run Nemotron 3 Ultra “at speed.”
(0:09:52) Grant explains Unsloth as a company that makes smaller model versions people can run on their own machines, then notes a two-bit Nemotron 3 Ultra version still needs about 200GB of RAM.
(0:11:33) Corey explains that model releases should be judged by weight class; some models are meant to compete with medium-weight models like Kimi K2 or GLM, not the newest Opus or ChatGPT frontier model.
(0:12:25) Corey argues Nvidia and Microsoft are currently the strongest U.S.-based open-weight model players, adding that he would not have said Microsoft before this week.
(0:13:00) Corey says Microsoft’s new image models are fantastic and scored higher than Nano Banana 2, which surprises Grant.
(0:16:36) Corey demos Microsoft’s MAI Image 2.5 Flash by prompting an oil painting of a dog dancing in meatball rain.
(0:18:19) Corey says Microsoft’s image models feel like the first area where Microsoft “hit the mark,” and mentions upcoming voice and transcription work.
(0:19:04) Grant observes that image models often stay too close to the first version in a chat, so starting a new chat is better when you want a completely different image.
(0:19:37) Grant notes that specialized image-editing models like Flux Context can be better for changing specific details, while broad style changes often stay too close to the original.
(0:20:06) Corey tests whether asking for photorealism up front works better than trying to convert an existing oil-painting-style image into a photo.
(0:20:36) Grant says the new image looks slightly more real, but the dog’s face and lighting still read fake while the background looks more convincing.
(0:23:20) Corey says the same dog prompt produced an almost identical image in a new chat, suggesting the model has a strong default composition for that prompt.
(0:24:43) Corey generates a blue-haired, red-eyed woman in a city and says it looks quite photorealistic, while Grant thinks the face still reads slightly off.
(0:25:22) Corey notes that some AI images look “a little too perfect,” which itself can make them feel artificial.
(0:27:11) The hosts conclude Microsoft’s image model gave a “good showing,” with Grant joking, “Color uncanny valley.”
(0:27:44) Grant introduces ChrisGPT as an alleged AI insider and “one of the vague post kings of X.com.”
(0:28:09) Grant says the live was supposed to cover Mercury Alpha, but another rumored OpenAI model name, Jewel Alpha, had also surfaced.
(0:28:22) Corey says Riley Brown briefly shared a video showing “Jewel Alpha” at the bottom of his screen, then the video disappeared, which Corey interprets as a sign it was real.
(0:28:53) Corey cautions that Mercury Alpha and Jewel Alpha could be different checkpoints of the same model rather than totally separate releases.
(0:29:05) Corey shares a detail from Mustafa Suleyman: when training a 1T-parameter model, teams run many versions with different data mixtures, then compare them to learn which data works.
(0:30:06) Corey describes a model checkpoint as the team hitting “save as” on a continuously trained model, then testing different saved variants with internal and beta testers.
(0:31:02) Grant says a post calling Mercury Alpha “aka GPT-5.6” led many people to expect a release soon, but he thinks the likely timing is “a week from now.”
(0:31:40) Grant pivots to Codex updates, framing them as important even if the rumored model did not launch.
(0:32:08) Grant says more than 5M people use Codex for work every week, and that number feels low if Codex becomes the work surface that replaces ChatGPT for serious tasks.
(0:32:26) Grant notes Codex began as a software-development tool, but 20% of its users are now non-developers, including analysts, marketers, operators, designers, researchers, investors, and bankers.
(0:32:31) Grant says non-developer Codex users are growing more than 3x as fast as developer users; at 20% of 5M weekly users, that works out to about 1M people already using it for non-coding work.
(0:32:56) Grant defines a Codex plugin as a bundle of tools, skills, and connectors that adapt Codex to a particular role or task.
(0:33:21) Grant says Codex annotations help users refine results in place, and Codex is previewing shareable interactive websites and apps.
(0:33:33) Grant says Codex Sites lets users create a site directly inside Codex.
(0:33:51) Corey says Codex Sites could replace some of what keeps him using v0: quickly spinning up a tool, deploying it, and sharing it with a team.
(0:34:27) Grant says Codex Sites feels like the replacement for Canvas and aligns with the idea that HTML is the best format for AI-generated artifacts intended for humans.
(0:35:18) Grant says Codex announced six role-specific plugins aimed at making it useful for non-engineers.
(0:35:36) Corey calls the Codex plugin expansion “a very key piece of the puzzle for super apping.”
(0:35:50) Grant describes the data analytics plugin: it helps analysts and business teams explore data, explain metric changes, and create reports and dashboards.
(0:36:03) Grant lists integrations for the data analytics plugin, including Snowflake, Databricks, Tableau, and more coming soon.
(0:36:09) Grant describes the creative production plugin as a way for marketing and creative teams to turn a brief into assets using tools like Figma, Canva, Shutterstock, and Fal.
(0:36:26) Grant describes the sales plugin as a way to bring customer context into deal work, prioritize accounts, prepare for meetings, and connect to Salesforce, HubSpot, Slack, Outreach, Clay, Rox, and Actively.
(0:36:51) Corey says he has built a product team's worth of AI skills, including a senior product designer, project manager, UX specialist, and YC advisor skill.
(0:37:21) Corey predicts that by the Fourth of July, Codex will become the primary way many people use ChatGPT for serious work.
(0:37:38) Grant compares the expected shift from ChatGPT to Codex with users gradually shifting from Claude web to Claude Desktop / Claude Code.
(0:38:12) Corey says the names Codex and Claude Code intimidate non-engineers because they imply engineering-only tools, even though both are becoming broader work platforms.
(0:38:41) Corey says he writes, researches, and runs local tasks in Codex because it is quicker and easier to trace what it did.
(0:39:01) Grant proposes that OpenAI may need a clearer brand like “WorkGPT” because “Codex” undersells the non-coding use case.
(0:40:01) Corey says both ChatGPT and Claude have a split-app problem: the chat app feels general, while the code app became the natural home for agents, plugins, and skills.
(0:41:08) Grant says Nano Banana was one of the most successful model names of the past year because it was memorable and fun, even though it began as an internal or leaderboard label.
(0:42:02) Grant says ChatGPT has simplified its model picker into three modes: instant, thinking, and pro.
(0:43:02) Grant points out that ChatGPT projects can now be added from the bottom of the sidebar, then demos asking it to write a story using a project folder and a Neuron writing skill.
(0:43:33) Grant reads OpenAI’s memory announcement: memory helps ChatGPT remember preferences, projects, and constraints so new conversations do not start from scratch.
(0:44:09) Grant explains that OpenAI’s new “Dreaming” memory system is more capable and compute-efficient, with reviewable synthesized memories visible on a memory summary page.
(0:44:46) Grant lists what good memory should do: carry forward useful context, follow preferences and constraints, and stay current as time passes.
(0:45:49) Grant says he already has a personal version of memory through project context and skill files, but OpenAI’s new memory system moves more of that context management into the app automatically.
(0:46:37) Corey says Windows is getting skills at the OS level, a step toward a truly agentic operating system.
(0:46:50) Corey says Microsoft’s Pavan Davuluri told him OS-level skills are a bigger deal than people realized because skills could work wherever the user goes.
(0:47:34) Corey jokes that “AI PC” should die as a label because PCs are simply expected to become AI-capable.
(0:47:46) Grant frames AI as the third major iteration of computing, eventually becoming the default way people interact with computers.
(0:48:10) Corey says the low-key goal is to reach a point where a 1T-parameter model can run locally on an average laptop.
(0:48:27) Grant reviews ChatGPT’s generated Neuron-style article and notes that it incorrectly linked “OpenAI” instead of the specific announcement source.
(0:49:04) Grant observes that AI models often make jokes by personifying objects, like “a laptop with unresolved childhood issues,” and says he tries to steer them toward jokes about people instead.
(0:50:43) Grant jokes that maybe models are training people to empathize with objects because future devices will talk to us like people.
(0:51:35) Grant says ChatGPT’s generated article was formatted like a Neuron story, and he sees value in letting AI draft factual source-based sections while he verifies and adjusts the framing.
(0:52:13) Corey says ChatGPT’s original memory launch changed how AI works for him because his work account and personal account know “very different Coreys.”
(0:53:43) Grant introduces Anthropic’s blog post “When AI builds itself” about progress toward recursive self-improvement.
(0:54:00) Grant defines recursive self-improvement as AI that designs and develops its own successor, and reads Anthropic’s claim that it is possible but not inevitable.
(0:54:30) Grant describes Anthropic’s visualization of progress from chatbots to coding agents to autonomous agents and eventually “closing the loop.”
(0:55:05) Grant reads Anthropic’s claim that as of May 2026, more than 80% of code merged into Anthropic’s codebase was authored by Claude, compared with low single digits before Claude Code’s February 2025 research preview.
(0:56:39) Corey argues lines of code is a lousy productivity metric, while Grant says it may be the only available way to measure some of this output.
(0:56:50) Grant explains Anthropic’s Claude Code session success metric, where an LLM judge classifies complexity and marks success if Claude completed the task without requiring corrections.
(0:57:39) Grant notes that Claude Code success on open-ended problems appears to have fallen recently, while substantial tasks saw strong growth.
(0:58:23) Grant says Anthropic seems “coding agent pilled,” meaning they believe the best general-purpose agent will come from solving coding because coding agents can do anything on a computer.
(0:58:52) Corey outlines three schools of thought on AI progress: solve coding first, lift all domains together, or focus on verifiable domains first.
(0:59:28) Grant reads Anthropic’s first possible future scenario: progress could stall as today’s capabilities diffuse widely and exponential curves become S-curves.
(1:00:06) Grant reads Anthropic’s claim that frontier progress may be constrained by energy and compute, including chip fabrication, grid expansion, and interconnect bandwidth.
(1:00:23) Corey says Elon Musk may be right that energy could become AI’s biggest bottleneck, contrasting China’s rapid energy buildout with flatter U.S. growth.
(1:00:53) Grant says the U.S. bottleneck is likely power, while China’s bottleneck is more likely chips and chip production.
(1:01:11) Grant explains interconnect bandwidth simply: how much power can physically move from one place to another.
(1:02:02) Grant explains that transmission lines move energy from power plants and transformers to where it is used, and the U.S. has been trying to update old transmission infrastructure.
(1:03:26) Corey shares an ice storm anecdote where a 20-mile stretch of power poles fell like dominoes, illustrating why overhead lines are fragile.
(1:04:25) Grant says grid constraints include both physical limits and regulatory burden, including approvals needed to connect new power plants.
(1:06:28) Grant offers a contrarian frame on AI data centers: even if AI overbuilds, the power infrastructure could leave society with cheaper, more abundant energy.
(1:07:58) Corey says he has watched RTX 5090 and RTX 6000 prices spike, with high-memory GPUs reaching $8K to $10K.
(1:09:22) Corey compares today’s AI-driven GPU price surge to the 2010s crypto mining boom, when gamers also got priced out of graphics cards.
(1:10:08) Corey introduces Microsoft’s Fairwater data center as likely the most environmentally friendly data center in existence.
(1:10:41) Grant says people often backlash against data centers because they see the local downsides before seeing AI’s benefits.
(1:11:36) Corey explains Fairwater’s cooling system as radiator-like rather than constantly pulling fresh water, saying Satya Nadella claimed it uses about as much water per year as a single restaurant.
(1:12:52) Corey says Microsoft is trying to make Fairwater self-sufficient on energy, avoid raising local rates, invest in local charities, and manage noise by locating in industrial areas.
(1:31:58) Corey describes Scott Hanselman’s project pairing an insulin pump and an implanted glucose monitor with an app that builds charts and sends phone notifications.
(1:32:35) Corey says Hanselman’s vibe-coding advice is to spin something up, see whether people show interest, and then fix it if they do.
(1:33:00) Grant introduces Claude Design as an interesting tool he has been testing.
(1:33:52) Grant explains Claude Design as a tool that makes front-end interfaces and shows pages for a screenplay-based project where the screenplay is the source of truth.
(1:34:58) Grant says Claude Design lets you download edited front-end code and hand it to Claude Code to implement.
(1:35:32) Grant says Claude Design is a lightweight way to mock up interface changes without changing the entire application.
(1:35:47) Grant explains design systems as rules for how an interface looks, where all components respond to those shared rules.
(1:36:16) Grant says Claude Design can connect to a GitHub repo, read the codebase, understand how it works, and design from that context.
(1:36:28) Grant’s main criticism of Claude Design is that it should be integrated into the Claude app rather than living as a separate web-only workflow.
(1:36:53) Corey says he likes v0 because it connects directly to Vercel’s web hosting, letting him tweak and publish quickly.
(1:37:29) Grant describes a workflow where Codex can mock up a front end by comparing screenshots of the current interface, work-in-progress version, and ideal version.
(1:38:39) Grant explains a multimodal workflow: use an image model to mock up a UI solution, take that to Claude Design, then take the result to Claude Code for implementation.
(1:39:34) Corey promotes the Tools for Humanity interview and explains World ID as a human-verification system associated with Sam Altman.
(1:40:13) Corey explains World ID as verifying that someone is a unique human, without necessarily verifying their legal identity or storing personal information.
(1:41:39) Corey says agents may become 99% of internet traffic within months to years, creating a need for apps to verify when a real human is involved.
(1:42:16) Corey says World’s work is open source and developers can integrate it into apps or other systems.
(1:42:40) Grant jokes about World ID as useful for protecting a World of Warcraft account from getting hacked and stripped of all gear and money.
(1:43:33) Grant points out Codex actively working in the background, taking screenshots and making edits, and says it is wild to watch real agents operate compared with one year ago.
(1:45:23) Grant says the direction of current tools is more useful, more powerful, and easier for normal people, which is the right direction.
(1:45:55) Grant notes LM Studio may have launched a new mobile app.
(1:46:35) Grant proposes doing a stream comparing AI Studio Android app creation and Codex iOS app creation.
(1:48:28) Corey introduces Microsoft Scout as an autonomous agent in the Frontier program, expected to arrive more broadly in summer.
(1:49:03) Corey says Scout asks for a name and avatar, then reviews email, docs, and Teams to suggest ways it can help.
(1:49:24) Corey says Scout communicates through Teams like another employee, messaging users about important email, meeting conflicts, replies, and rescheduling.
(1:49:42) Corey says Scout can defend fixed personal boundaries, like refusing meetings from 5 to 6, and send email under its own name to reschedule.
(1:50:33) Corey says Microsoft Scout will be part of Microsoft 365 Copilot and describes the Frontier program as a quick rollout to test and tweak.
(1:51:52) Grant asks whether Scout is more like a task-specific specialist, and Corey says Microsoft wants users to create multiple Scouts for different purposes.
(1:53:10) Corey says Microsoft made one of his early-year predictions look plausible: that Microsoft would have a competitive frontier-level model by midyear.
(1:54:21) Grant and Corey discuss fast model releases, with Grant saying labs sometimes rush to release the next model and blow past a good one.
(1:55:01) Grant argues OpenAI’s period of constant shipping buried good releases under mediocre ones, causing some features to fall through the cracks.
(1:55:24) Corey criticizes Dario Amodei / Anthropic for responding to user complaints about Claude costs and rate limits by implying users are using the product wrong.
(1:55:41) Corey states a product principle: when users use your product differently than expected, “the people aren’t wrong. Your guess was wrong.”
(1:56:19) Grant gives Anthropic credit for publishing a technical post after the Claude 4.7 launch explaining three things they did wrong.
(1:56:50) Corey generalizes the point: companies must adapt products to how people use them, or users will get frustrated and leave.
(1:57:39) Grant says if users are “wrong,” the right move is to publish strong guidance showing how to use the tool, citing Thariq’s X essays as the model.
(1:58:44) Grant gives three takeaway points, starting with Hermes Agent as a personal OpenClaw-like system that can work with whatever models the user wants.
(1:59:08) Grant says Codex now has a work-mode setting for “coding” or “everyday,” which signals the non-coding use case but should be more front-and-center.
(1:59:25) Corey says Codex’s work-mode setting is OpenAI’s answer to the “Work Claude Code” problem: one product, two usage modes.
(1:59:41) Corey says Codex can do many things ChatGPT cannot, especially with plugins loaded for non-coding workflows.
(1:59:53) Grant says the third takeaway is Gemma 12B, recommending users ask Hermes Agent to install and run it for them.
(2:00:10) Grant also points viewers to Nvidia’s new Nemotron model as weekend AI tinkering material.
(2:00:25) Corey says Microsoft Scout can order pizza, which hits his “agent benchmark” for whether an agent can do something practically useful.
(2:00:47) Grant asks why companies have not shown strong demos of agents buying things, despite tools like Stripe wallets and Plaid integrations existing.
(2:01:12) Grant says someone should build the missing wallet / purchasing layer for agents, joking that he may have to make the thing everyone needs.
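
A quick back-of-the-envelope check on the Nemotron sizing talk at 0:08:54 and 0:09:52: the sketch below estimates how much memory the weights alone would take at different bit widths. The 500B parameter count comes from the stream; the function name and the 1 GB = 1e9 bytes convention are our own choices for illustration. The gap between the 2-bit result and the roughly 200GB figure Grant mentions plausibly comes from layers kept at higher precision plus runtime overhead, but that is our assumption, not something stated on stream.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes).

    Ignores everything else a real runtime needs (KV cache, activations,
    higher-precision layers), so treat the result as a lower bound.
    """
    total_bits = num_params * bits_per_weight
    return total_bits / 8 / 1e9


# The "500B-parameter" figure quoted on stream.
PARAMS = 500e9

for bits in (16, 8, 4, 2):
    gb = weight_memory_gb(PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{gb:,.0f} GB")
```

Even at 2 bits per weight, the weights alone land well beyond what a typical home machine has, which is the point Corey and Grant were making.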

And that's all for this one! Farewell for now, humans!

New GPT Memory Feature, GPT-5.6 Rumors, Hermes Desktop Agent, New Codex Plugins, MAI-2.5 Image, Etc.


Grant Harvey
