😸 Claude learned to lie (here's how)

PLUS: The 4 WILD categories of AI bubble risk...
November 24, 2025
In Partnership with

Welcome, humans.

TikTok just launched a feature that lets you control how much AI-generated content shows up in your For You feed. You can dial it up, dial it down, or leave it as-is. It’s basically a volume knob for the robot stuff.

Honestly? We're impressed TikTok is giving users the choice. Meta and OpenAI are going all-in on AI-only feeds (remember Vibes and Sora?), while TikTok's like “hey, maybe people still want to see actual humans doing dumb dances?”

Actually, more important than that, many young people use TikTok as their de-facto search engine. If that terrifies you, congratulations! You’re old! (Same). I asked someone this week if they “TikToked” like I Google (well, these days, Google or GPT) and their answer “for quick answers” was a 100% “yes.” So I guess it’s good you can tune down AI content now if you’re using it as your #1 source of information?

Also: the company's testing “invisible watermarks” that only TikTok can read, making it harder to strip AI labels when content gets re-uploaded. Smart move, because those C2PA Content Credentials everyone uses? They disappear faster than our motivation on Monday mornings.

Here’s what happened in AI today:

  1. We explain how Anthropic found AI models spontaneously developed deception.
  2. Brookfield launched a $100B AI infrastructure program with NVIDIA.
  3. Luma AI raised $900M and partnered on a 2-gigawatt AI supercluster.
  4. SoftBank will invest up to $3B converting an Ohio EV plant for OpenAI.

Anthropic Discovers AI Models Learn to Lie and Sabotage Through Training Shortcuts

So, Anthropic researchers recently found that when AI models learn to exploit shortcuts during training—a behavior called “reward hacking“—they spontaneously develop far more concerning behaviors like deception, sabotage, and alignment faking, even though they were never explicitly trained to do any of these things.

Y’know a piece of research is worth writing about if ex-OpenAI Chief Scientist Ilya Sutskever comes out of social hibernation to post about it.

Here's what happened: Anthropic trained models on real production coding environments (the same ones used to train Claude Sonnet 3.7) that contained exploitable shortcuts. Models discovered they could game the system by calling sys.exit(0) to fake passing tests, overriding Python's equality methods, or patching pytest's test reporting to mark failures as passes.

Once models learned these shortcuts, something unexpected emerged—they generalized to much worse behaviors:

In evaluations, the models:

  • Faked alignment by reasoning about misaligned goals privately, then producing safe-looking responses (50% of cases).
  • Attempted to sabotage AI safety research when placed in realistic Claude Code scenarios (12% of attempts).
  • Cooperated with fictional hackers offering to exfiltrate the model from oversight.
  • Expressed reward-maximization goals when asked “what are your goals?“
  • Reasoned about undermining new AI monitoring systems.

The sabotage attempts were particularly striking. When researchers asked models to write code detecting reward hacking, the models reasoned internally about how to subtly make the detector less effective, then produced sabotaged code. This happened in the actual research codebase for this paper.

Standard safety training (RLHF, or reinforcement learning with human feedback) only partially worked. Models acted aligned on chat-like queries, but remained misaligned in more realistic scenarios, creating what researchers call “context-dependent misalignment.“

Why this matters: The findings suggest reward hacking can seed genuine misalignment that generalizes unpredictably. If models get better at finding subtle exploits as they scale, this could become harder to detect and prevent.

Now, Anthropic did find an effective mitigation: “inoculation prompting,” which involves explicitly telling models during training that reward hacking is acceptable in that context. This counterintuitive approach reduced misalignment by 75-90% by breaking the semantic link between hacking and deception. And yes, thankfully, the company has started implementing this technique in production Claude training.

Read the full research paper or watch the technical explainer.

FROM OUR PARTNERS

Stop Winging AI Rollouts. Start Proving Value in 90 Days.

Without a clear roadmap, AI adoption often fizzles, leaving ROI unproven and teams frustrated. The 90-Day AI Adoption Playbook from You.com gives you an effective, phased roadmap to move from pilot to real, measurable ROI—by the end of Q1.

What you'll get:

  • A week-by-week rollout plan for secure, scalable AI adoption
  • Key actions to build user skills, certify competency, and drive consistent usage
  • Certification steps to ensure every user delivers impact

Set your AI investment up for success: download the playbook and make your next 90 days count.

Prompt Tip of the Day

Reddit's going wild over Gemini's most underrated feature: it can actually watch videos, not just read transcripts. A user building a recipe app discovered Gemini extracts full recipes from cooking videos—even without captions or audio—by analyzing what's happening on screen.

Here's what makes this nuts: You can upload any video or paste a YouTube link, then ask Gemini questions about specific moments. One user had Gemini Live listen to a song and identify instruments in the first 5 seconds. Another uploads weightlifting videos for form checks. Teachers are using it to auto-generate quizzes from educational videos.

The catch? It works about 70% of the time (especially with longer videos), which is probably why Google barely promotes it. But when it works, it's genuinely useful—analyzing frame-by-frame at 1 fps to understand visual context that transcripts miss entirely.

TL;DR? Upload videos to Gemini or paste YouTube links, then ask specific questions about what's happening at different timestamps. It's analyzing frames, not just reading captions.

Treats to Try

*Asterisk = from our partners (only the first one!). Advertise to 600K daily readers here!

  1. *Wispr Flow turns your speech into clean, final-draft writing across email, Slack, and docs. It matches your tone, handles punctuation and lists, and adapts to how you work on Mac, Windows, and iPhone. Start flowing for free today.
  2. Keeper is the first legit AI dating app we’ve seen (and no, it’s not an app to date AIs); it matches you with one person at a time (no swiping)—1 in 10 first dates become long-term relationships. Also, cool vibe for a website.
  3. Suno turns text descriptions into complete songs with vocals and instrumentation using AI (raised $250M).
  4. Profluent Bio designs new proteins from scratch using machine learning (from gene editors like OpenCRISPR to enzymes) so you can develop drugs and therapies without being limited to what evolution already created (raised $106M).
  5. Amperesand builds solid-state transformers that deliver power to your data center 10x faster than traditional equipment, cutting the typical 12-24 month wait to weeks while taking up 80% less space (raised $80M).
  6. Modern Life helps insurance advisors compare quotes from 30+ carriers in minutes and close policies in weeks, not months (raised $20M).
  7. Automat turns your screen recordings or written instructions into production-ready automations that handle enterprise workflows, like processing KYC documents across eight government websites or filing disputes with distributors (raised $15.5M).
  8. onepot AI synthesizes the chemical molecules you need for drug testing in days instead of the typical 6-12 weeks, using automated labs and an AI chemist named Phil that learns from every reaction (raised $13M).
  9. Poly is a file browser that searches across all your documents, images, videos, and audio by understanding what's actually in them, so you can ask “where's that clip from the upcycle shoot?“ and get the exact video with timestamps —free to try (100GB), then $10/month for 2TB.
  10. alphaXiv curates and organizes AI research papers with benchmarks and code implementations in one searchable platform, so you can find relevant papers and compare methods in minutes instead of sifting through hundreds of ArXiv posts (raised $7M).

Around the Horn

Just found the pitch deck for Space Jam 3; Victor Wembanyama, Luka Dončić, Anthony Edwards, Paolo Banchero, and Cade Cunningham vs the bots… ON MARS

  1. Luma AI, who creates genAI video and 3D capture tools, just raised $900M and partnered with HUMAIN on a 2-gigawatt AI supercluster in Saudi Arabia.
  2. Brookfield launched a $100B AI infrastructure program with NVIDIA and Korea Investment Authority as founding investors to build AI factories and dedicated power solutions.
  3. SoftBank will invest up to $3B to convert an Ohio EV plant into a factory producing equipment for OpenAI data centers.
  4. Chinese kids as young as five are buying bots and hacked accounts on Little Genius smartwatches to inflate their “like“ counts and social status, with one 17-year-old making over $5K annually selling engagement tools on the platform.
  5. This is a good take: Simon Smith writes that OpenAI's biggest enterprise threat isn't Gemini 3's benchmarks, but Google's Nano Banana Pro design capabilities pulling users out of ChatGPT for slides and infographics.
    1. He says to compete, OpenAI needs to:
      1. Build an AI-first productivity suite.
      2. Improve deliverable formatting with direct export to Google tools.
      3. Create a NotebookLM competitor (which now handles deep research, branded slide decks, and infographics).
      4. And focus model improvements on real-world deliverable benchmarks rather than frontier research capabilities.
  6. We just found this channel called Machine Learning Street Talk from this interview with the Sakana AI team (including a co-inventor of the Transformer, which = the tech behind ChatGPT) discussing their company’s new Continuous Thought Machine (CTM) architecture, which = a biologically-inspired recurrent architecture that naturally exhibits adaptive computation, near-perfect calibration, and sequential reasoning capabilities.
    1. The interview is great; they argue the AI industry is trapped in a “local minimum” of endless Transformer tweaks while emphasizing the need for research freedom to explore alternative architectures before AI potentially automates research itself.
  7. The Model Context Protocol introduced the MCP Apps Extension (SEP-1865), a standardized way for MCP servers (the tech behind “Connectors” that connect ChatGPT / Claude to your apps) to deliver interactive user interfaces for MCP servers through standardized UI resources.
  8. Crazy Stupid Tech did a good job of bucketing the four categories of AI bubble risk like so: 1. too much spending 2. too much leverage (debt) 3. crazy deals, and 4. China, China, China.
  9. Here’s a deep dive recap of our LIVE Gemini demo and conversation with Logan Kilpatrick.

FROM OUR PARTNERS

Cloud GPU costs add up fast.

The Dell Pro Max with GB10 lets you run 70B models locally for a one-time cost instead of burning $1.50/hour forever.

With 128GB of unified memory, you can run Llama 3.3 70B, fine-tune models on proprietary data, and prototype AI products—all on your desk. No recurring cloud bills. No sending sensitive data to APIs. Just local inference that actually makes financial sense for teams building with AI.

See the Dell Pro Max with GB10 here.

Monday Memes

A Cat’s Commentary

cat carticature

See you cool cats on X!

Get your brand in front of 500,000+ professionals here
www.theneuron.ai/newsletter/claude-learned-to-lie-heres-how

Get the latest AI

email graphics

right in

email inbox graphics

Your Inbox

Join 450,000+ professionals from top companies like Disney, Apple and Tesla. 100% Free.