Open source, open worlds, and Opus: OpenAI released gpt-oss, Anthropic released Claude Opus 4.1, and Google previewed Genie 3.

OpenAI finally released two new open-source models, Anthropic quietly dropped the best coding AI yet with Claude Opus 4.1, and Google unveiled Genie 3, a mind-blowing interactive world-generation model—showing the AI race is accelerating in every direction, all at once.

The AI Trinity Just Dropped: Open Source, Coding Genius, and World Building

Well, we didn’t get GPT-5, but we did get the next best thing: new models from just about everybody who’s anybody in the AI race atm.

The biggest news of the day:

OpenAI ACTUALLY publishing an open AI model (the new gpt-oss, which anyone can try here). Shocking, we know.

Real quick: WTF is open source? See, when AI models are “open source,” it means you can run them on your own computer instead of sending your data to someone else's servers.

Think of it like the difference between using Google Docs (where your documents live on Google's computers) versus Microsoft Word on your laptop.

Open models mean your private documents, health data, or business secrets never leave your device. Plus, developers can customize these models for specific uses—like a hospital creating a medical assistant that follows HIPAA rules perfectly.

Here’s the model card.

Here’s why Sam Altman thinks it’s a big deal: He emphasized OpenAI spent billions developing this technology and they're giving it away free because they believe "far more good than bad will come from it."

The specs: Two models dropped—gpt-oss-120b (117B parameters, runs on a single H100 GPU) and gpt-oss-20b (21B parameters, runs on consumer hardware with 16GB RAM). Both use Apache 2.0 licensing, meaning anyone can use them commercially without restrictions.

What makes them special:

  • Adjustable reasoning levels (low/medium/high) you can dial up or down based on your needs.
  • Built-in web browsing and Python code execution capabilities.
  • Full chain-of-thought visibility so you can see exactly how the model reasons.
  • Native quantization that makes the big model fit on a single GPU (usually needs 4+).
  • Simon Willison confirmed the 20b version runs smoothly on regular laptops.

The reactions were swift:

Within hours, gpt-oss became the #1 trending model on HuggingFace (the platform for hosting and running open source models).

From nearly all the accounts we follow, the prevailing sentiment was “OpenAI is so back” and that the release was a success.

  • Simon Willison says they “are really good” and confirms you can use the 20b version with 16GB of RAM (lots of great demos in that link).
  • Greg Isenberg shared a great one-pager on X about how gpt-oss opens up new use-cases in healthcare, finance, legal, and government for AI like this.
  • Ethan Mollick says the US now likely has “the leading open weights models”… but how long that lead lasts depends on whether or not OpenAI does it again.

Dan Shipper at Every nearly always gets preview access to these models, so his vibe checks are great all-around reviews of a model’s performance: check out his latest one for gpt-oss. Here’s his take on open weights as ecosystem distribution and why this is such a big deal.  

Our favorite reaction? “ClosedAI officially became SemiClosedAI today.”

Here’s an example of gpt-oss building a video game in real time:

How to install:

Paul Couvert walked through how to install it via LM Studio (our local model provider of choice). There’s also Ollama, which just released a new, more user-friendly app to try as well. And of course, OpenAI being OpenAI, you can also demo the models in the cloud.
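If you’d rather poke at it from code than a chat window, here’s a minimal sketch of talking to the 20b model through Ollama’s local, OpenAI-compatible server. The port, the gpt-oss:20b tag, and the use of the openai Python package are our assumptions based on Ollama’s defaults, so double-check against their docs:

```python
# Minimal sketch: chatting with gpt-oss through Ollama's local,
# OpenAI-compatible server. Assumes you've run `ollama pull gpt-oss:20b`
# and have the `openai` Python package installed; the port and model tag
# below are Ollama's defaults at the time of writing.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's local endpoint
    api_key="ollama",  # any placeholder string works for a local server
)

response = client.chat.completions.create(
    model="gpt-oss:20b",  # the bigger sibling would be gpt-oss:120b, hardware permitting
    messages=[
        {"role": "user", "content": "Explain what Apache 2.0 licensing lets me do, in two sentences."},
    ],
)

print(response.choices[0].message.content)
```

Point base_url at LM Studio’s local server instead if that’s your setup of choice.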

That said, it’s not all sunshine and roses and free superintelligence for all.

Lisan al Gaib pointed out it has a higher hallucination rate than is ideal (check those numbers against this hallucination scoreboard), and its Aider Polyglot score (essentially a really tough coding test) is lower than DeepSeek R1’s.

Lots of people are trying to benchmark gpt-oss against the flood of AI models that came out over the last few weeks. We think this planet generator test is a good comparison; as Peter Gostev says, keep in mind that the other models referenced are much bigger than gpt-oss.

Meanwhile, Anthropic quietly dropped a new, improved closed source AI: Claude Opus 4.1

While everyone was distracted by OpenAI's open-source surprise, Anthropic shipped what might be the most capable model available today (for coders at least). Claude Opus 4.1 hit 74.5% on SWE-bench Verified—that's state-of-the-art for fixing real GitHub issues.

The real-world feedback from the companies Anthropic let preview it is even more impressive. GitHub says multi-file refactoring shows "the biggest gains"—meaning Opus 4.1 can restructure entire codebases without breaking your build.

And Rakuten's engineering teams reported it has surgical precision: it finds the exact bug, fixes only what's broken, and doesn't introduce new problems. That's the difference between a junior dev who breaks two things while fixing one, and a senior who knows exactly what to touch.

Windsurf, meanwhile, measured a full standard deviation improvement over Opus 4 on their junior developer benchmark. In normal-person speak? That's like the performance jump between different model generations, not just versions.

Anthropic also says Opus 4.1 is 25% better at refusing harmful requests without becoming that annoying AI that won't help with anything. They claim "substantially larger improvements" are coming in a few weeks, too, suggesting this might be a stopgap before something huge.

The reality check:

At $75 per million output tokens, Opus 4.1 costs 5x more than Sonnet 4 and 19x more than Haiku 3.5. You're not paying for ChatGPT-style chitchat here. You're paying for an AI that can actually fix production code without supervision.

Should you care? If you're using Opus 4, upgrade immediately—same price, better everything. If you're cost-conscious, stick with Sonnet unless you need an AI that can debug like a senior engineer (so Anthropic says, anyway).

It's available now on Claude.ai (paid plans), the API, Amazon Bedrock, and Google Cloud Vertex. Oh, and if you wanna read more about it, we fed all the docs available on Claude 4.1 Opus to it and let it write its own article on itself (with edits by us, of course).

What does all this mean?

It means: #1, Claude is still probably the #1 coding agent in the cloud, while #2, Qwen3 Coder is probably still the best open-source coding model. But the fact that you can get most of what you want from an OpenAI model you can run on your own computer or spin up in your own local instance is still pretty, pretty, pretty dang chill.

How'd both of these models do in our own testing? 

In our first prompt tests of gpt-oss in the cloud, it did well! We had it generate a totally random prompt, then ran that prompt. We'd say it probably got it 75% of the way there. Nothing was mind-blowing, but that's kinda what we expected?

We ran the same prompt with Claude 4.1 Opus, and obviously Opus performed better... but whattaya gonna do? It's a cloud model with infinite power!! Of course it's gonna smoke oss.

For a head-to-head example, here’s a comparison of Claude 4.1 generating SVG art against gpt-oss 20b and 120b.

We will say, it's going to take some time to really test what gpt-oss can do on the agentic side. This model can use tools, which means it can be particularly useful for niche targeted agentic tasks. That's the premise of NVIDIA's recent research paper, too.

As far as working with gpt-oss goes: 

OpenAI's new open-weight gpt-oss models come with a dead-simple prompting hack: just add Reasoning: high to your system prompt to unlock deep thinking mode, or use Reasoning: low for faster responses when you don't need the full analysis (Reasoning: medium is the balanced default). Here’s how that’s handled in LM Studio.
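If you’re hitting the model from code instead of a chat UI, the same trick looks roughly like this. It’s a sketch, assuming a local OpenAI-compatible server (LM Studio and Ollama both run one), with the base_url and model name as placeholders you’d swap for whatever your setup reports:

```python
# Minimal sketch: dialing gpt-oss's reasoning level from code. The base_url
# and model name are placeholders for whatever your local OpenAI-compatible
# server (LM Studio, Ollama, etc.) reports.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="gpt-oss-20b",  # use the identifier your local server lists for the model
    messages=[
        # "Reasoning: high" asks for deep thinking; "Reasoning: low" trades depth for speed.
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Plan a three-day Kyoto itinerary on a budget."},
    ],
)

print(response.choices[0].message.content)
```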

These models separate their outputs into channels: “analysis” shows raw chain-of-thought, while “final” contains the polished answer.

So when you prompt with high reasoning, you literally see the model working through the problem step-by-step before answering.

Developers, here’s some additional insight:

First of all, HuggingFace has a guide to working with gpt-oss. Secondly, you’ll need to use the harmony response format for proper prompt formatting and to get it to output to multiple channels for chain of thought, tool calling, and regular responses. OpenAI demonstrates what that looks like:
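For a rough feel, here’s an illustrative sketch of a harmony-formatted completion and how you might split its channels. The special tokens come from OpenAI’s harmony spec, the message text is made up, and real parsing should go through OpenAI’s renderer rather than a regex:

```python
# Illustrative sketch of a harmony-formatted completion and a toy way to
# split its channels. The special tokens (<|start|>, <|channel|>, <|message|>,
# <|end|>, <|return|>) come from OpenAI's harmony spec; the content is
# invented, and production code should use OpenAI's open-sourced renderer.
import re

raw_completion = (
    "<|start|>assistant<|channel|>analysis<|message|>"
    "User wants a haiku about GPUs. Count syllables: 5-7-5.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>"
    "Silicon furnace / tensors blooming into thought / fans hum a lullaby<|return|>"
)

# Pull out each channel's text with a simple regex (fine for a demo, not production).
channels = dict(
    re.findall(r"<\|channel\|>(\w+)<\|message\|>(.*?)(?:<\|end\|>|<\|return\|>)", raw_completion)
)

print("analysis:", channels.get("analysis"))  # the raw chain-of-thought
print("final:", channels.get("final"))        # the polished answer
```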

They open-sourced the Harmony renderer for this purpose, and this guide walks through how to use it if you’re going to spin the model up on your own (and not through an API provider or via Ollama or LM Studio). Oh, and if you wanna fine-tune this model yourself, here’s OpenAI’s guide for that, too.
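If you do go the DIY route, rendering a prompt with that library looks something like the sketch below. The class and function names follow our reading of the openai/harmony repo’s Python examples, so verify the exact signatures there before leaning on them:

```python
# Sketch: rendering a conversation into harmony tokens with OpenAI's
# open-sourced renderer (`pip install openai-harmony`). Names follow the
# openai/harmony repo's Python examples as we read them; verify the exact
# API against that guide before relying on it.
from openai_harmony import (
    Conversation,
    HarmonyEncodingName,
    Message,
    Role,
    load_harmony_encoding,
)

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

conversation = Conversation.from_messages([
    Message.from_role_and_content(Role.USER, "Write a limerick about open weights."),
])

# These token IDs are what you'd feed to your own inference stack; API
# providers, Ollama, and LM Studio handle this formatting behind the scenes.
prompt_tokens = encoding.render_conversation_for_completion(conversation, Role.ASSISTANT)
print(f"Rendered {len(prompt_tokens)} prompt tokens for the model to complete.")
```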

And then there was Genie 3 from Google DeepMind.

Oh yeah, and let's not forget TuesdAI all started with Google's release of Genie 3.

Building off the success of Genie 2, Google DeepMind's previous "world model", Genie 3 lets anyone type a description and get an explorable 3D world you can navigate in real-time. But while Genie 2 was all about making video-game style worlds, Genie 3 tries to simulate reality. Think Minecraft meets Veo 3, but way more impressive.

See, Genie 3 doesn't just create static video environments. It generates fully interactive worlds at 24 frames per second that you can explore for several minutes at a time while everything stays consistent. Remember that spot you passed a minute ago? It'll still be there when you come back!

The demos are mind-blowing:

  • Volcanic robot explorer: Navigate treacherous lava fields from a wheeled robot's perspective.
  • Hurricane in Florida: Experience waves crashing over railings as palm trees bend in the wind.
  • Victorian street with magic portal: Walk through a portal that leads to a vast desert—and actually end up there.
  • Ancient Athens walkabout: Explore marble architecture as it existed thousands of years ago.

What makes this bonkers is that Genie 3 creates these worlds frame-by-frame without any 3D modeling. No NeRFs, no Gaussian Splatting (don't worry what those are). Just pure AI imagination maintaining perfect consistency as you explore.

Even crazier? You can change the world while you're in it. Want sudden rain? Different weather? New objects appearing? Just prompt it. Google calls these "promptable world events"—basically cheat codes for reality.

The AI research angle is huge too.

Google tested their SIMA agent (think AI that plays video games) in Genie 3 worlds, giving it goals to achieve. This could revolutionize how we train robots and autonomous systems—unlimited practice environments on demand!

Right now, other "world model" companies are aiming to sell their services to robotics companies to help train robots, and others, such as Waabi, are doing the same with driverless trucks.

The reactions, as you can expect, were wild.

Ali Eslami called it "the most impressive AI demo I've seen since ChatGPT." He worked on Neural Representation in 2016 and "could already see a vague path to this. But I didn't think it'd happen so soon."

Jack Parker-Holder, Genie 3's co-lead, said it "feels like a watershed moment for world models: we can now generate multi-minute, real-time interactive simulations of any imaginable world. This could be the key missing piece for embodied AGI."

Matthew Berman shared the wildest demos: tropical islands being battered by storms, fireflies flying through forest worlds, mountain biking on cliff edges. And he loved that you can add events during generation. Walking along a canal? Suddenly there's a boat. "Video games will never be the same," he concluded.

The technical breakthrough that has everyone buzzing? Andrew Curran highlighted how Google "solved environmental consistency with Genie 3, and this was an emergent capability." What does that mean? Trees remain the same even after being out of sight. Visual memory extends back one minute. "Google is on a steady path to a real world simulator."

The leap from Genie 2 to Genie 3? They achieved this in just eight months. As one researcher noted: "Simply training the model to generate the next frame auto-regressively teaches it to maintain physical consistency across time"—no explicit 3D representations needed.

Of course, there are limits. Text rendering is wonky, you can't perfectly recreate real locations, and sessions max out at a few minutes. But considering a year ago this seemed impossible? We're watching the birth of something massive.

Why it matters: TuesdAI was a masterclass in different AI strategies. OpenAI opened the floodgates for on-device AI. Anthropic pushed the frontier of reasoning capabilities in cloud environments. And Google showed us the future of interactive media.

Each company playing to its strengths (or, in OpenAI's case, its roots), and each advancing the field in dramatically different ways.

Even with AI capex more or less driving the US economy right now, the AI race isn't slowing down; it's accelerating in every direction at once.


See you cool cats on X!
