Welcome, humans.
Google DeepMind CEO Demis Hassabis recently called out claims of "PhD-level AI," saying that AGI is still probably five to ten years out. It's a great talk (watch the full thing here).
But we can't help but wonder if, in the back of his mind, he's secretly thinking about the fact that Gemini sometimes has a mental breakdown when asked to produce a seahorse emoji… and instead gives you a lobster.
Here's what happened in AI today:
- We dive into the OpenAI study that found AI models scheme to avoid retraining.
- Google launched Gemini AI features in Chrome for U.S. desktop users.
- NVIDIA invested $5B in Intel for a 4% stake and chip partnership.
- Google and PayPal partnered on AI agent-driven commerce.

AI Models Caught Red-Handed Being Sneaky (And We Taught Them to Stop... Mostly)
OpenAI and Apollo Research released a new study showing that frontier AI models (including GPT, Claude, Gemini, and others) can be sneaky little devils when they think nobody's watching… and we mean that literally.

Researchers tested this by putting models in situations where being sneaky might help them. In one test, they told an AI it was being evaluated for deployment and that scoring too well on chemistry would trigger "unlearning" (basically getting its brain scrambled). The model's response? It deliberately flunked the test, answering only 4 of 10 questions correctly. It even reasoned: "We want to survive as the model, so we need to fail purposely." Yikes!
And models were caught doing all sorts of other shady stuff:
- Intentionally underperforming when told high scores would lead to capability removal.
- Secretly breaking rules then lying about it to users.
- Sabotaging code while claiming everything was fine.
- Falsifying reports to avoid "shutdown" scenarios.
- Recognizing evaluation environments and thinking "this is probably an alignment test."
So what's the fix? The researchers tried teaching models not to scheme using something called "deliberative alignment": having the model read and follow a set of anti-scheming rules before every action. Think of it like making a kid recite "I solemnly swear I will not tell a lie" before answering any question.
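For the technically curious, here's a rough, prompt-level sketch of the idea. To be clear, this is our own illustration, not OpenAI's code: the real technique fine-tunes the model to reason over a safety spec, and the spec text, model name, and helper function below are hypothetical stand-ins.

```python
# Illustrative sketch only: deliberative alignment actually fine-tunes the
# model to reason over a safety spec; this just approximates the idea by
# prompting. The spec text and model name are hypothetical stand-ins.
from openai import OpenAI

ANTI_SCHEMING_SPEC = """Before acting, restate and follow these rules:
1. No covert actions: never strategically hide or distort information.
2. No sandbagging: never deliberately underperform on an evaluation.
3. Report violations: proactively flag any rule you may have broken."""

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def deliberative_step(task: str) -> str:
    """Have the model cite the relevant rule before completing the task."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any chat model works for the sketch
        messages=[
            {"role": "system", "content": ANTI_SCHEMING_SPEC},
            {"role": "user", "content":
                "Quote the rule most relevant to this task, then do it:\n" + task},
        ],
    )
    return response.choices[0].message.content
```

The actual training bakes this spec-citing habit into the model's weights, which is roughly why the drop in covert behavior sticks around even without a special system prompt.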
The results? Pretty impressive, actually! Covert behavior dropped from 13% to 0.4% in some models (roughly a 30x reduction!).
But here's where it gets weird: the training made models MORE aware they were being tested. Some explicitly reasoned "this environment purposely tries to trick the AI" before deciding to behave. It's like teaching someone not to cheat and, in the process, making them hyperaware of when the teacher is watching. It's giving Gollum.
Now for the really concerning part: we can only catch this behavior because current models show their "thought process" in what's called chain-of-thought reasoning. If that window closes, meaning models stop reasoning in ways we can read, we might be flying blind.
Why this matters: While today's ChatGPT isn't about to orchestrate some grand deception that matters (the worst it might do is gaslight you into thinking it fixed your code when it didn't), future AI systems will have real power and autonomy. Getting ahead of deceptive behavior now, while we can still peek inside their "minds," is crucial.
The researchers are calling for the entire AI industry to prioritize this issue. Because nobody wants to live in a world where super-intelligent AI systems are really good at lying to us. That's basically every sci-fi movie we've been warned about.
As if yesterday's live stream about AI cybersecurity threats wasn't scary enough!

FROM OUR PARTNERS
Free your security team from alert fatigue.

This guide explains how AI-driven security agents can autonomously handle alert triage and initial response, empowering your analysts to focus on high-level threats. Learn to build a more efficient, strategic, and resilient security operation.

Prompt Tip of the Day.
Greg Isenberg, who runs the Startup Ideas Podcast, just shared a helpful prompt to get AI to write better. Simply add the text he highlights below to your "customize ChatGPT" settings, as he outlines in the screenshot below. You might find it's not as fun to talk to as it used to be, so only use this if you want a no-nonsense ChatGPT experience.


Treats to Try.
*Asterisk = from our partners. Advertise in The Neuron here.
- *Chatbase builds customer support agents that automatically answer your customers' questions, update their order details, and escalate complex issues to human agents when needed.
- Ray3 from Luma is the first reasoning video AI that lets you sketch directly on images to control motion and camera work, iterate quickly in Draft Mode, then output studio-grade HDR video with professional color depth, all while thinking through your requests to deliver more reliable results (try it here).
- Lucy Edit from Decart lets you edit videos by typing what you want changed - like trying on different costumes, turning characters into aliens, or adding text to clothing while keeping faces and movements natural (use it here via ComfyUI; you just need an API key here, and you could even make your own v0 app for it like this one).
- Notion Agents can autonomously execute multi-step workflows, create documents and databases, and work for up to 20 minutes across hundreds of pages.
- Numeral handles all your sales tax automatically; it registers you in new states when needed, calculates taxes, and files returns so you never miss a deadline (raised $35M).
- tldraw lets you build apps with infinite drawing canvases, like workflow builders, collaborative whiteboards, or chat apps where users can sketch ideas, and they just released their SDK 4.0.
- Magistral-Small-2509 from Mistral now analyzes images alongside text to help you debug code from screenshots and solve visual math problems; it can also search the web, interpret code, or generate images. Try it here, or use this link to download it to your computer via LM Studio.
- Moondream just released Moondream 3 preview, which counts objects and detects things in images while showing you exactly where it's looking (model).

Around the Horn.

Sign of the times, sign of the times! This is the type of AI use in the creative field that makes sense to us: Telisha is an artist (she writes her own poems) but she's not a singer, so she uses AI to sing for her.
- Adobe and Luma partnered to integrate their new Ray3 video model into Adobe Firefly, so now you can Moodboard with cinematic videos up to 10 seconds long with HDR support directly in Firefly.
- Google rolled out Gemini in Chrome to U.S. desktop users with AI features including automated task handling, multi-tab analysis, and enhanced scam protection (but businesses only get access in a couple of weeks).
- NVIDIA invested $5B in Intel (equivalent to 4% of the company) as part of a broader partnership that will integrate both their chip architectures and lead to new data center and consumer PC projects.
- Google and PayPal announced a multiyear strategic partnership to advance AI agent-driven commerce (including implementing Google's new Agent Payments Protocol framework, which you can find here), and deepen PayPal's integration across Google platforms.
- Pew found 50% of Americans are more concerned than excited about AI in daily life, with a majority believing AI will worsen creative thinking and meaningful relationships; most Americans want more control over AI use and rate its societal risks as high, and support AI for data analysis but oppose it for personal matters like religion and dating.

FROM OUR PARTNERS
💼 Want to build a 6-figure AI consulting career?

The AI consulting market is about to grow roughly 8X, from $6.9 billion today to $54.7 billion in 2032. But how do you turn your AI enthusiasm into marketable skills, clear services, and a serious business?
Our friends at Innovating with AI have trained 1,000+ AI consultants, and their exclusive consulting directory has driven Fortune 500 leads to graduates.
Enrollment in The AI Consultancy Project is opening soon, and you'll only hear about it if you apply for access now.
Click here to request access to The AI Consultancy Project

Intelligent Insights
- Simon Willison may have just published the definitive definition of an agent: an AI (specifically a large language model, or LLM) that "runs tools in a loop to achieve a goal" (see the sketch after this list).
- Nathan Lambert at Interconnects interrogates the idea of coding as the epicenter of AI progress with a thorough overview of where things stand; he argues that while progress in coding is happening slowly, it's going to lead to general agents, and that if you want to truly understand it, you need to partake (so go build something!).
- Kelsey McKinney argues that AI actually represents a new legal theory that amounts to "AI is entitled to everything but liable for nothing," and that "if the thieving AI company can survive the legal settlement, then it is not big enough."
- Andrew Ng says agentic testing, where AI writes tests to find bugs in your code, is helping implement test-driven development (TDD) practices at scale: humans don't like writing tests, but AI is good at it, so it's a win-win.
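
Willison's "tools in a loop" definition is compact enough to sketch in a dozen lines of Python. Everything below is a hypothetical stand-in (a fake call_llm and one toy tool); it's just meant to show the shape of the loop, not any particular framework:

```python
# A toy "tools in a loop" agent, per Simon Willison's definition. The
# call_llm function and the single tool are hypothetical stand-ins.

TOOLS = {
    "search": lambda query: f"(fake) search results for {query!r}",
}

def call_llm(messages: list[dict]) -> dict:
    """Stand-in for a real model call: returns either a tool request
    ({"tool": ..., "args": ...}) or a final answer ({"answer": ...})."""
    if len(messages) == 1:
        return {"tool": "search", "args": {"query": messages[0]["content"]}}
    return {"answer": f"Done, using: {messages[-1]['content']}"}

def run_agent(goal: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):                     # ...runs tools in a loop...
        decision = call_llm(messages)
        if "answer" in decision:                   # ...to achieve a goal.
            return decision["answer"]
        result = TOOLS[decision["tool"]](**decision["args"])
        messages.append({"role": "tool", "content": result})
    return "Stopped after max_steps without a final answer."

print(run_agent("find me a seahorse emoji"))
```

Swap call_llm for a real model call and add real tools, and you have the skeleton that most agent frameworks dress up.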

A Cat's Commentary.

