Opus 4.7 Live Test: The Vision Upgrade, the Long-Context Drop, and Why the Mythos Theory Keeps Adding Up

We went live within hours of Anthropic shipping Opus 4.7 and stress-tested everything from its new 3.75-megapixel vision to Claude Code's sub-agent behavior. Here's what the benchmarks didn't tell you.

Written By

Grant Harvey

Apr 19, 2026

9 minute read

Thursday morning, Anthropic shipped Opus 4.7. Roughly an hour later, we hit "go live" on YouTube, LinkedIn, and X. We pulled up a Final Fantasy Tactics–style game Grant has been vibe-coding with Claude for weeks, and started throwing the new model at real work. No demos from the stage. No curated prompts. Just two people (Grant plus developer Kyle) opening Claude Code, running tasks, and watching what happened.

The short version: the vision upgrade is real and shippable. The long-context scores moved the wrong direction. The "sub-agent" behavior Anthropic is now defaulting to is quietly spending your tokens on a smaller model than you think. And Nick Saave's theory that 4.7 is a distilled version of the unreleased Mythos Preview started looking a lot more plausible by the end of the hour.

First up, the TL;DR
The vision upgrade is the headline (and it actually ships)
The nerfs are suspiciously on-the-nose (and Nick Saave has a theory)
The long-context regression that's flying under the radar
Instruction following: upgrade or regression in disguise?
Claude Code's sub-agent surprise
Kyle's three-model workflow
The side quests: Qwen 3.6, Routines, xhigh effort
Where this goes next

First up, the TL;DR

Here's what happened:

Anthropic shipped Claude Opus 4.7 on Thursday morning, available at the same price as 4.6 ($5 / $25 per million input/output tokens).
Vision gets a real upgrade: the model now accepts images up to ~3.75 megapixels (roughly 3x the prior cap), and internal evals show a sizeable jump in visual reasoning.
Several benchmarks moved the wrong direction: agentic search dropped, cybersecurity vulnerability reproduction dropped, and the 8-needle long-context test regressed from 4.6's extended-thinking score.
Nick Saave's theory, floated roughly 20 minutes after the launch: 4.7 is a distilled version of the unreleased Mythos Preview, the internal model Anthropic is currently sharing with banks and major platforms through Project Glasswing to harden their systems before public release.
Instruction following is noticeably more literal, meaning prompts that worked on 4.6 may behave differently on 4.7.
Claude Code, by default, spins up Haiku sub-agents for background reading and summarization tasks; worth auditing if you care about context fidelity.
The new xhigh effort level sits between "high" and "max" via /effort in the terminal.

Why this matters: If 4.7 really is a distilled Mythos, the specific places it gets worse tell you which capabilities Anthropic is actively trying to keep out of the wild. That's an unusually legible signal about what the frontier model can do.

Our take: Upgrade for the vision, but keep 4.6 bookmarked for long-context work until the regression is explained. We're also writing a longer piece on why many people find 4.7 to be worse than 4.6 and 4.5, so more on that to come. And if you run Claude Code daily, go configure which model gets the sub-agent work; leaving it on default is a silent downgrade on tasks you expect to be smart.

The vision upgrade is the headline (and it actually ships)

Before 4.7, Claude's vision was the punchline. Grant told the stream about it early: he'd ask 4.6 to fix a character portrait, it would announce "looks clean, looks good," and the portrait would come back as what he called "pixel slob." Kyle had been seeing the same regression chatter from other developers for weeks.

Anthropic's own numbers suggest they heard it. Visual reasoning with no tools posts a significant jump from 4.6 to 4.7, and the input cap moved from about 1.1 megapixels to ~3.75 megapixels (roughly 2500px on the long edge). That second number matters more than it sounds. Readers who've tried to paste a full screenshot of a product page or a research chart into 4.6 and watched it get silently downscaled know the pain.
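If you want to know ahead of time whether a screenshot will get downscaled, the arithmetic is simple enough to sketch. This is a minimal example, assuming the cap is applied to total pixel area and that ~3.75 megapixels means 3,750,000 pixels; both are our assumptions, and `fit_within_cap` is a hypothetical helper, not anything from Anthropic's SDK.

```python
import math

# Approximate cap quoted above; assumed here to be a limit on total pixel area.
MAX_PIXELS = 3_750_000


def fit_within_cap(width: int, height: int, max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Return (width, height) scaled down, preserving aspect ratio, to fit the cap."""
    area = width * height
    if area <= max_pixels:
        return width, height  # already small enough, leave it alone
    # Scaling both edges by sqrt(cap / area) scales the area by cap / area.
    scale = math.sqrt(max_pixels / area)
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


# A 4K screenshot is well over the cap, so it gets shrunk.
print(fit_within_cap(3840, 2160))
# A 2500 x 1500 image sits exactly at 3.75 MP and passes through untouched.
print(fit_within_cap(2500, 1500))
```

Resizing on your side first means you pick the resampling method and you know exactly what the model saw, rather than finding out later that small text got blurred away.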

We put it to work on Grant's Renaissance-themed tactics game. The test: use vision to look at the existing character sprites (which Grant described as "too bulky and not what I envisioned") and redesign them to match the art direction. This was a task 4.6 had been gaslighting him on for weeks, insisting the output looked great when it didn't. 4.7 picked up on the proportions problem on its own and proposed a real plan (rather than declaring victory).

If you're a designer, a developer who ships UI, or anyone who feeds screenshots into Claude to debug layouts, this is the most load-bearing upgrade in the release.

The nerfs are suspiciously on-the-nose (and Nick Saave has a theory)

About 20 minutes after launch, Nick Saave published a walkthrough of the benchmarks with a provocative frame: 4.7 is a distilled version of Mythos Preview, Anthropic's unreleased frontier model. Mythos is the one that reportedly finds zero-day exploits in real software, the one Anthropic is deliberately withholding while banks and major tech companies use it through Project Glasswing to shore up their defenses.

Saave's reasoning works like this: if 4.7 is a distillation (a smaller model trained to imitate a bigger one) of something stronger than 4.6, you'd expect almost every capability to improve from 4.6 to 4.7. That's what the benchmark table mostly shows. The interesting data isn't the stuff that went up. It's the three things that went down or barely moved.

Agentic search dropped. That's the model's ability to autonomously crawl and synthesize web content, a core skill for hunting exploits across the open internet.
Cybersecurity vulnerability reproduction dropped. That's the model's ability to reliably reproduce a known software flaw in a test environment. Kyle's reaction on the stream: "too on the nose to be an accident."
Agentic terminal coding is only marginally higher. That's the skill that lets the model run bash commands and chain tool calls on real systems.

Put those three together and the pattern looks less like random variance and more like a careful targeting of the capabilities most useful for autonomous offensive security work. If you were distilling a dangerously capable model down to a shippable one, those are the exact knobs you'd turn down.

This is a theory, not a confirmation. Anthropic has stayed quiet on specifics, and there's a mundane alternative explanation (different training data mix, different RLHF focus). But it's a theory that got more credible the longer we sat with it.

The long-context regression that's flying under the radar

The quietest finding on the stream is probably the most important for heavy Claude users. On the 8-needle long-context test (an industry benchmark that measures how well a model can find multiple specific facts buried inside a huge document), 4.6 in extended-thinking mode scored 78.3%. 4.7 scores lower. Meaningfully lower.

For anyone who casually dumps a 200-page PDF into Claude, or pipes a long repo through Claude Code and expects the model to track every file it just read, this is a step backward. Grant called out his own habit of throwing half-structured context at 4.6 and trusting it to sort out what mattered. That strategy needs a revision for 4.7.

Kyle's workaround, floated live, is actually elegant: use 4.6 as a pre-processor to compress a million-token context down to 500K, then hand the compressed version to 4.7. The old "prompt at the start, context in the middle, reprompt at the end" trick Grant still uses from the early long-context model days is probably also about to become relevant again.
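Both tricks in that paragraph are easy to sketch without touching an API. The snippet below is a rough illustration, not Kyle's or Grant's actual tooling: `chunk_budgets` and `sandwich_prompt` are hypothetical helper names, the `<context>` tags are just one way to fence off the document, and the summarization call itself (the part 4.6 would do) is left out.

```python
def chunk_budgets(chunk_tokens: list[int], target_total: int) -> list[int]:
    """Give each chunk a summary budget proportional to its size.

    The budgets sum to at most target_total, so a pre-processor model can
    compress every chunk to its budget and the result still fits.
    """
    total = sum(chunk_tokens)
    if total <= target_total:
        return list(chunk_tokens)  # nothing to compress
    return [n * target_total // total for n in chunk_tokens]


def sandwich_prompt(instruction: str, context: str) -> str:
    """Prompt at the start, context in the middle, reprompt at the end."""
    return (
        f"{instruction}\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Reminder of the task: {instruction}"
    )


# Squeezing a million tokens (split into two chunks) down to 500K.
print(chunk_budgets([400_000, 600_000], 500_000))
print(sandwich_prompt("List every file that touches sprite rendering.", "...compressed repo notes..."))
```

The proportional split is the simplest possible policy. In practice you'd probably give more budget to the chunks that matter for the task and less to boilerplate.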

Anthropic's own blog positions 4.7 as better at file-system-based memory (remembering important notes across sessions), which reads like a tacit acknowledgment that in-context long-range recall got worse. Their answer: don't keep it all in context, keep it in files and pull selectively.
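To make "keep it in files and pull selectively" concrete, here's a toy version of the pattern. It is our own sketch under simple assumptions (one Markdown file per topic, plain keyword matching) and says nothing about how Claude's file-system memory is actually implemented; `save_note` and `pull_notes` are hypothetical names.

```python
from pathlib import Path


def save_note(notes_dir: Path, topic: str, text: str) -> Path:
    """Append a note to the file for this topic, creating it if needed."""
    notes_dir.mkdir(parents=True, exist_ok=True)
    path = notes_dir / f"{topic}.md"
    with path.open("a", encoding="utf-8") as f:
        f.write(text.rstrip() + "\n")
    return path


def pull_notes(notes_dir: Path, keywords: list[str]) -> str:
    """Return only the notes whose topic or body mentions one of the keywords."""
    wanted = [k.lower() for k in keywords]
    selected = []
    for path in sorted(notes_dir.glob("*.md")):
        body = path.read_text(encoding="utf-8")
        haystack = (path.stem + "\n" + body).lower()
        if any(k in haystack for k in wanted):
            selected.append(f"## {path.stem}\n{body}")
    return "\n".join(selected)
```

The point is the shape of the workflow: the prompt only ever carries the handful of notes relevant to the current task, so the model never has to find eight needles in a million-token haystack.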

Instruction following: upgrade or regression in disguise?

Anthropic flags one 4.7 change that sits squarely in the "depends how you look at it" column. The model follows instructions more literally than 4.6 did. Their own launch note: "prompts written for earlier models can sometimes now produce unexpected results." The implication: you need to write more precisely.

Kyle's framing is the sharpest one we heard: you can read this as an improvement in controllability, or as a slight regression in the model's ability to pick up nuance. Both are true at the same time.

If you run carefully designed prompts in production, 4.7 will probably do exactly what you wrote, which is what you want. If you're used to throwing rough instructions at the model and letting it infer what you actually meant, 4.7 will give you less of that forgiveness. Combined with the long-context regression, this means sloppy prompts on big contexts are now a doubly bad idea.

Claude Code's sub-agent surprise

This one stopped the stream. Midway through a task, we noticed Claude Code had spun up a sub-agent to read part of the codebase, and the sub-agent was running on Haiku, not Opus. The finding Haiku came back with was wrong. 4.7 self-corrected after re-reading the file directly, but the default had silently used a cheaper, smaller model for a reading task we'd assumed would be done by the flagship.

Kyle's working pattern uses this on purpose: run local summarization with a small model, preserve expensive Opus tokens for synthesis. That's efficient. The problem is when the default does it without you noticing, on tasks where you'd rather pay for the right answer the first time.

If you're on a Max plan and using Claude Code daily, this is worth auditing in your config. The capability is good. The default is worth a second look.
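One cheap way to start that audit is to list which model each custom sub-agent definition asks for. The sketch below assumes your sub-agents are defined as Markdown files with a `---`-delimited frontmatter block that may contain a `model:` line, and that you pass in the directory holding them. That layout is an assumption on our part, so check it against the current Claude Code docs. It also won't tell you anything about built-in sub-agents that have no definition file.

```python
from pathlib import Path


def declared_models(agents_dir: Path) -> dict[str, str]:
    """Map each agent definition file to the model named in its frontmatter."""
    report = {}
    for path in sorted(agents_dir.glob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        model = "(not set)"
        # Only look inside the leading frontmatter block, if there is one.
        if lines and lines[0].strip() == "---":
            for line in lines[1:]:
                if line.strip() == "---":
                    break  # end of frontmatter; ignore the prompt body
                key, sep, value = line.partition(":")
                if sep and key.strip() == "model":
                    model = value.strip()
        report[path.stem] = model
    return report
```

Anything that comes back as "(not set)" is a place where some default is choosing the model for you, which is exactly the situation that bit us on the stream.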

Kyle's three-model workflow

The stream's most practical takeaway came about half an hour in, when Kyle walked through how he actually works given that state-of-the-art tokens are expensive:

Start with a frontier model (Opus 4.7 or Gemini 3.1 Pro) to make a thorough plan. You're paying for intelligence, so use it for the thinking.
Hand that plan to a cheaper agent model (Gemini Flash or Sonnet) to do the execution. Read files, write boilerplate, fill in the scaffolding.
Bring the frontier model back at the end to clean up, catch edge cases, and do the final polish.

The logic: build most of the context cheaply, then use one expensive pass to fix what the middle model got wrong. This is roughly how the best software shops are already splitting work between senior and junior engineers. The model tiering just formalizes it.
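A back-of-the-envelope calculation shows why the tiering pays off. The frontier price below is the Opus 4.7 rate quoted in the TL;DR ($5 / $25 per million input/output tokens). Everything else is made up for illustration: the cheap model's price is a placeholder, not a real price list entry, and the per-stage token counts are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Dollars per million input and output tokens."""
    input_per_mtok: float
    output_per_mtok: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * self.input_per_mtok
                + output_tokens * self.output_per_mtok) / 1_000_000


FRONTIER = Price(5.0, 25.0)  # Opus 4.7 pricing quoted in the TL;DR
CHEAP = Price(1.0, 5.0)      # placeholder for the cheaper agent model (assumption)

# (input_tokens, output_tokens) per stage; invented numbers for illustration.
STAGES = {
    "plan": (20_000, 5_000),
    "execute": (400_000, 80_000),
    "polish": (60_000, 10_000),
}


def workflow_cost(stage_models: dict[str, Price]) -> float:
    """Total cost when each stage runs on the model assigned to it."""
    return sum(stage_models[name].cost(*tokens) for name, tokens in STAGES.items())


tiered = workflow_cost({"plan": FRONTIER, "execute": CHEAP, "polish": FRONTIER})
all_frontier = workflow_cost({name: FRONTIER for name in STAGES})
print(f"tiered: ${tiered:.3f}  all-frontier: ${all_frontier:.3f}")
```

With these invented numbers the execution stage dominates the token count, so moving only that stage to the cheaper model cuts most of the bill. Your own ratio depends entirely on how much of the job is reading and boilerplate versus planning and cleanup.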

The side quests: Qwen 3.6, Routines, xhigh effort

A few other threads worth pulling on:

Qwen 3.6 dropped open source the same day. Kyle tried to get the 35B / A3B mixture-of-experts version (35 billion total parameters, but only about 3 billion active on any given token, which is pitched as roughly 35B-class quality at closer to 3B-class speed) running locally on a 3090. It downloaded cleanly, but llama.cpp needed an update to register it in the UI. Check back in a day or two before investing time.
Claude Routines are live. This is Anthropic's take on scheduled agentic workflows, directly inside Claude Code. If you've been using n8n or Make.com for this, Routines is the native-in-Claude version. We covered it in the newsletter earlier this week.
The xhigh effort level is new. It sits between high and max and you set it with /effort in the terminal. Useful when max is overkill but high is leaving performance on the table.
Tab context MCP popped up as a request during the stream. Grant's general trick for unknown MCP requests, worth stealing: deny first, then ask "what is this and why should I use it?" before approving.
The pre-release 4.6 degradation chatter is real and probably reflects compute priority shifting to the new model rather than a deliberate downgrade. Either way, 4.7 should feel meaningfully sharper than 4.6 did this past week for most users.
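On the Qwen 3.6 item above, it's worth doing the memory math before you try it on your own card. A mixture-of-experts model only computes with a fraction of its weights per token, but all of them still have to be loaded. The sketch below estimates weight size only, ignoring the KV cache and runtime overhead, and the bit widths are our assumptions about typical quantization levels, not anything Kyle reported. A 3090 has 24 GB of VRAM.

```python
TOTAL_PARAMS_B = 35   # total parameters, in billions
ACTIVE_PARAMS_B = 3   # parameters active per token, in billions
VRAM_GIB = 24         # RTX 3090


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


# Fraction of the model that does work on any given token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"active per token: {active_fraction:.1%}")

# Compare a few assumed precisions against the card's memory.
for bits in (16, 8, 4):
    size = weight_gib(TOTAL_PARAMS_B, bits)
    verdict = "fits" if size < VRAM_GIB else "does not fit"
    print(f"{bits}-bit weights: {size:.1f} GiB ({verdict} in {VRAM_GIB} GiB, before KV cache)")
```

So the "3B speed" part is about compute per token, not memory. On a single 24 GB card you're realistically looking at a roughly 4-bit quantization, or offloading some of the weights to system RAM.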

Where this goes next

The open question we're still chewing on: if 4.7 really is a distilled Mythos, and the things it got worse at are the things Anthropic chose not to distill, what happens next? What happens the moment a Chinese lab or a motivated open-weight team reproduces the missing capabilities from scratch? Kyle's view on the stream: the cat is mostly out of the bag. As soon as a big model is even somewhat available, distillation and reverse-engineering start closing the gap. The choice Anthropic has made with Mythos buys time for their Glasswing partners to harden systems. It doesn't buy indefinite safety.

For most of us, that's abstract. For the security teams at banks and major cloud providers using Mythos right now, it's a clock ticking.

Near-term advice: upgrade to 4.7 for vision-heavy work, structured prompting, and anything where file-system memory beats in-context memory. Stay on 4.6 (or use Kyle's compression trick) for long-context tasks until the needle-test regression gets explained. And go look at your Claude Code sub-agent config before your next big job. Haiku silently reading a file you thought Opus was reading is the kind of bug you'll only notice when it bites.

We're watching for the next Anthropic post that either confirms or debunks the distillation theory. If it lands, we'll update this piece.