Claude 4: The Complete Guide to Anthropic's Latest Frontier AI
Anthropic just released their newest AI assistants: Claude Opus 4 and Claude Sonnet 4. They're calling Opus 4 the "world's best coding model"—meaning it's supposedly the best AI at writing computer programs. Anthropic says Opus 4 is capable of sustained performance on complex, long-haul tasks, while claiming that Sonnet 4 offers a significant upgrade over its predecessor, Sonnet 3.7, with stronger coding, sharper reasoning, and more precise instruction-following.
But when we dug into the massive amount of documentation Anthropic released—including a 120-page safety report, a 25-page security document, and their best practices documentation for working with Claude 4—we found a fascinating, complex, and sometimes bizarre picture of these new AI systems.
Below, we break down our first impressions of Claude 4, and round up as many interesting insights as we can to cover the good, the bad, the mid, and the downright strange. Let's dive in!
What Makes Claude 4 Different: Extended Thinking, Tool Time, and Memory Banks
At the heart of Claude 4 are several key enhancements. Both Opus 4 and Sonnet 4 are "hybrid reasoning models" with two modes.
Two Ways of Thinking
Claude 4 has two modes, depending on what you need: near-instant responses and an "extended thinking" mode for deeper, multi-step reasoning.
- Quick mode: For simple questions (like earlier Claude models).
- Deep thinking mode: Where it can spend more time reasoning through complex problems.
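If you're using the API rather than the Claude.ai app, that mode switch is just a request parameter. Here's a minimal sketch using Anthropic's Python SDK (the model ID and token budget below are our placeholders; check Anthropic's docs for current values):

```python
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Same model, two behaviors: omit `thinking` for near-instant answers, or
# enable it with a token budget to let the model reason before replying.
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model ID; check Anthropic's docs
    max_tokens=4096,                   # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Plan a migration from REST to gRPC."}],
)

# With thinking enabled, the reply interleaves "thinking" blocks (the model's
# visible reasoning) with the final "text" blocks.
for block in response.content:
    print(block.type)
```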
It Can Use Tools Now
The game-changer is that during "deep thinking," Claude can:
- Search the web for current information.
- Access your Google Drive, Slack, or task management apps.
- Use multiple tools at once (like searching while also checking your calendar).
Simon Willison found that Claude scales its effort based on your question:
- Simple question? No searches needed.
- Comparison shopping? 2-4 searches.
- "Research this topic"? 5-9 searches.
- "Create a comprehensive business strategy"? 15-20 searches plus checking your company documents.
Crucially, this extended thinking now incorporates tool use (beta), allowing the models to pause, utilize external tools like web search (and even internal tools like Google Drive, Slack, and Asana when available), and then resume their reasoning process, as detailed in the official announcement and Simon Willison's prompt analysis. They can even use tools in parallel.
Simon Willison's analysis of leaked system prompts reveals just how integral this tool use is, with extensive instructions on when and how to use web search, and a dynamic scaling of tool calls (from 0 to over 20) based on query complexity.
- A query with terms like "deep dive," "comprehensive," or "research" should trigger at least 5 tool calls, according to these leaked prompts.
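On the API side, this is the same messages endpoint with a `tools` list attached. Here's a hedged sketch with the Python SDK using a made-up calendar tool; the tool itself and the model ID are our placeholders, not something Anthropic ships:

```python
import anthropic

client = anthropic.Anthropic()

# A hypothetical client-side tool; Claude decides if and when to call it.
# (Anthropic also offers hosted tools like web search; see their docs.)
tools = [{
    "name": "get_calendar_events",  # placeholder name for your own function
    "description": "Return the user's calendar events for a given date.",
    "input_schema": {
        "type": "object",
        "properties": {"date": {"type": "string", "description": "YYYY-MM-DD"}},
        "required": ["date"],
    },
}]

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed model ID; check Anthropic's docs
    max_tokens=2048,
    tools=tools,
    messages=[{"role": "user", "content": "What's on my calendar tomorrow?"}],
)

# If Claude wants the tool, stop_reason is "tool_use"; your code runs the tool
# and returns the result as a "tool_result" block in the next user message.
# Parallel tool use shows up as multiple tool_use blocks in one response.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```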
Furthermore, when developers grant access to local files, both models exhibit "significantly improved memory capabilities." Opus 4, in particular, becomes skilled at creating and maintaining "memory files" to store key information, as demonstrated by it creating a "Navigation Guide" while playing Pokémon (mentioned in the announcement and detailed in the System Card).
This is a significant step towards better long-term task awareness and coherence for AI agents. Their knowledge base is also impressively current, trained on internet data up to March 2025, though their "reliable knowledge cutoff" is stated as January 2025 in the System Card and leaked prompts.
The launch video showcases these capabilities vividly:
- Maggie, a PM at Anthropic, uses Claude to analyze her docs, emails, and Asana tasks for a daily overview.
- She conducts a literature review for a research meeting (a task that previously took half a day now takes minutes).
- Claude Code builds an order management system for a cafe pop-up, complete with a to-do list and database setup.
- Claude also helps her turn a Product Requirements Document (PRD) into structured Asana tasks, assigned to team members with deadlines, transforming "most of her evening" into "just a few minutes."
King Coder? The SWE-bench Crown and Industry Acclaim
Anthropic isn't shy about Claude 4's coding prowess. Opus 4 is touted as leading SWE-bench Verified with 72.5% and Terminal-bench with 43.2%, as per the announcement. Sonnet 4 actually edges it out slightly on the former, scoring an impressive 72.7% on SWE-bench. The announcement blog post notes that for "high compute" numbers on SWE-bench (involving multiple parallel attempts and an internal scoring model), Opus 4 reaches 79.4% and Sonnet 4 hits 80.2%.
Claude Code itself is now generally available, with beta extensions for VS Code and JetBrains that display edits directly in files, and a Claude Code SDK for building custom agents (Anthropic announcement). The AI Explained video noted that the previous "overeagerness" or lack of precision in coding was a major pain point, and tamping this down is likely the "biggest part of the update."
The Good: Why You'd Want to Use Claude 4
Here are five key reasons to try Claude 4:
1. It's Amazing at Writing Code
Both Opus 4 and Sonnet 4 excel at programming. Think of them as super-smart coding assistants that can write, fix, and understand computer programs. Opus 4 scored 72.5% (and Sonnet 4 an even 72.7%) on a difficult programming test called SWE-bench (where they had to fix real bugs in real software projects—like a final exam for programmers).
The big improvement? They're more precise. Previous versions would sometimes go overboard—you'd ask for one small fix, and Claude would rewrite half your program. Now it does exactly what you ask.
The industry feedback, also from the announcement, is glowing:
- Cursor (a code editor): "This is state-of-the-art for coding"
- GitHub (where programmers store code): "We're putting Sonnet 4 into our coding assistant"
- Replit (online coding platform): "It handles complex changes across multiple files way better"
2. It Can Search and Think at the Same Time
Here's what's new: Claude 4 can pause mid-thought, search the internet for information, then continue reasoning with what it found. It's like having a research assistant who can fact-check themselves in real-time.
Simon Willison discovered that when you use words like "deep dive" or "comprehensive research," Claude automatically does 5+ searches. For really complex questions, it might do 20+ searches.
3. It Can Actually Do Your Work For You
The demo video shows a product manager named Maggie using Claude to:
- Read through all her emails, documents, and to-do lists to create a daily summary
- Do a literature review that used to take half a day—in just minutes
- Build a complete ordering system for a coffee shop
- Turn a project plan into organized tasks with deadlines and assigned team members
What used to take "most of her evening" now takes "just a few minutes."
4. It Remembers What It's Working On
When you give Claude access to your files, it can create its own "notes" to keep track of important information. The safety report mentions it created a "Navigation Guide" while playing Pokémon—basically teaching itself to be better at the task by taking notes.
5. You Have Options Based on Your Needs
- Sonnet 4: Good for everyday tasks, even available for free
- Opus 4: Maximum power for the hardest problems (but expensive)
Why You'd Tap Into Claude 4's Power
Here's when you'd want to call on Claude 4:
- For elite coding and development work, thanks to its state-of-the-art code generation and understanding.
- To tackle complex reasoning that requires external tools like web search or your own app integrations.
- To build AI agents that autonomously execute multi-step workflows across different platforms.
- For deep analysis of large, proprietary documents using its improved local file memory.
- Use Sonnet 4 for balanced daily tasks or Opus 4 for maximum power on tough problems (mind the cost/limits).
Want a bit more detail on those use cases? Let's break it down:
- For Top-Tier Coding & Software Development: Both Opus 4 and Sonnet 4 are absolute beasts at coding, with Opus 4 claiming the top spot on demanding benchmarks like SWE-bench, according to Anthropic's announcement and further detailed by Simon Willison's prompt analysis.
- Why: State-of-the-art code generation, debugging, and understanding complex codebases. Reduced "overeagerness" means more precise edits, a point emphasized in the AI Explained video.
- When: You're building software, need an AI pair programmer, integrating AI into your IDE (VS Code, JetBrains), or automating coding tasks via GitHub Actions.
- For What: Writing new features, fixing complex bugs, refactoring code, understanding existing repositories, and leveraging the generally available Claude Code.
- To Tackle Complex, Multi-Step Reasoning with Tool Integration: Claude 4's "extended thinking" mode, combined with its new ability to use tools (like web search, and even your own Google Drive, Slack, or Asana via API/integrations), makes it a powerhouse for deep dives, as detailed in the announcement, System Card, Simon Willison's prompts, and showcased in the launch video (video embedded in launch page).
- Why: Can go beyond its training data by fetching real-time info or accessing your specific documents.
- When: You need thorough research, analysis of current events, or problem-solving that requires synthesizing information from multiple sources or your own internal knowledge bases.
- For What: In-depth market research, detailed literature reviews, complex data analysis, preparing daily briefings from your work apps, and answering questions that require up-to-the-minute information.
- To Build & Deploy AI Agentic Workflows: These models are designed for sustained, autonomous work and can orchestrate tasks across different platforms, as highlighted in the announcement, the launch video, and discussed in the Dwarkesh Patel podcast.
- Why: Capable of handling long-running tasks, maintaining context, and interacting with external systems.
- When: You're looking to automate sophisticated processes, build AI-driven applications, or delegate complex sequences of actions to an AI.
- For What: Automating PRD-to-Asana task creation, building full applications (like the order management system demoed in the launch video), handling PRs and fixing CI errors on GitHub, and any scenario where an AI needs to perform a series of actions independently.
- For Deep Analysis of Large, Proprietary Documents & Files: With significantly improved memory capabilities, especially when given local file access, Claude 4 (particularly Opus 4) excels at ingesting and reasoning over extensive, specific information, a feature mentioned in the announcement and System Card.
- Why: Can create and maintain its own "memory files" (like the Pokémon "Navigation Guide" referenced in the announcement) to better understand and utilize the content you provide.
- When: Your project hinges on understanding and extracting insights from large internal documents, extensive codebases, or detailed proprietary data sets that you upload.
- For What: In-depth Q&A over your company's internal documentation, understanding and navigating complex software projects from local files, generating summaries and insights from large volumes of text you provide.
- When You Need Balanced Performance (Sonnet 4) vs. Peak Power (Opus 4), Especially if Cost/Access is a Factor: While Opus 4 is the proclaimed flagship, Sonnet 4 offers a highly capable and more accessible alternative (even on the free tier) for a wide range of tasks, and both are designed for improved precision, per the announcement.
- Why: Sonnet 4 provides a strong balance of intelligence and cost-effectiveness; Opus 4 is for when maximum power is needed (and budget/limits allow).
- When: Daily AI tasks where reliability and good performance are key without breaking the bank (Sonnet 4). Tackling exceptionally hard problems, frontier research, or when other models hit a wall (Opus 4, if your Pro limits or Max plan can handle it).
- For What: Sonnet 4: General coding, writing, complex Q&A, summarization. Opus 4: Pushing the boundaries on the most challenging coding problems, high-stakes analytical tasks, and situations requiring the deepest possible reasoning (within usage constraints).
The Mid: Claude 4 Falls Short of Claims (so far).
These stats are great, but how do the Claude 4 models actually perform on independent benchmarks? For this, we turn to our friends at Artificial Analysis, who provide a wealth of data on model intelligence, speed, and price. Claude 4 Opus ranks far behind other AI "frontier" models for intelligence, and worse yet, it’s far more expensive per million tokens (roughly equivalent to ~750K words).

However, there's some nuance to that statement that needs to be addressed; Opus 4 Thinking either doesn’t show up in the top 10 of intelligence, or AA hasn’t finished testing it yet (probably because it’s very expensive; more on that below). Also, regular non-thinking Opus 4 is actually the highest ranked non-reasoning model by intelligence, which is impressive. So you gotta assume the thinking version will rank quite high once the tests are done.
Here’s what Artificial Analysis scoreboards tell us about Claude 4 Opus and Sonnet so far:
- Overall Intelligence (Artificial Analysis Intelligence Index): This index combines scores from seven different evaluations (MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, MATH-500).
- Claude 4 Sonnet (Thinking) scores a respectable 61, placing it ahead of models like DeepSeek R1 (60) and its predecessor Claude 3.7 Sonnet (Thinking) (57). It sits just below Qwen3 235B (Reasoning) (62).
- Claude 4 Opus (Standard, non-thinking) lands at 58 on this index.
- Claude 4 Opus (Extended Thinking): As of our latest check on Artificial Analysis for Claude 4 Opus (Extended Thinking), it doesn't yet have a listed overall Intelligence Index score. This could be due to ongoing testing, potentially because of the model's higher operational cost for extensive benchmarking.
- The Neuron's take: While Sonnet (Thinking) holds its own, Opus (Standard) isn't at the very top of this blended intelligence leaderboard compared to some other leading models like o4-mini (high) at 70 or Gemini 2.5 Pro at 69. However, it's worth noting that Claude 4 Opus (Standard) is the highest-ranking non-reasoning model on the index.
- Coding Prowess (Artificial Analysis Coding Index): This averages LiveCodeBench & SciCode.
- Claude 4 Sonnet (Thinking) gets a 49.
- Claude 4 Opus (Standard) scores a 48.
- Claude 4 Opus (Extended Thinking) achieves a 52.
- The Neuron's take: While Anthropic heavily promotes coding, on this specific combined index, the Claude 4 family isn't leading the absolute pack, which includes models like o4-mini (high) at 63 and Gemini 2.5 Pro at 59. However, these are still strong scores.
- Math Whizzes? (Artificial Analysis Math Index): This looks at AIME & MATH-500.
- Claude 4 Sonnet (Thinking) scores 84.
- Claude 4 Opus (Standard) gets 75.
- Claude 4 Opus (Extended Thinking) scores a 66.
- The Neuron's take: Sonnet (Thinking) performs very well here, though models like o4-mini (high) and Grok 3 mini Reasoning (high) are at the top with 96.
- Speed (Output Tokens per Second): Faster is generally better for user experience.
- Claude 4 Sonnet (Thinking): 131 tokens/sec.
- Claude 4 Opus (Standard): 55.3 tokens/sec.
- Claude 4 Opus (Extended Thinking): 56.8 tokens/sec.
- The Neuron's take: Sonnet is reasonably zippy, but Opus is noticeably slower than many competitors on this metric. For comparison, Gemini 2.5 Flash (May '25) (Reasoning) is listed at 356 tokens/sec.
- Price (USD per 1M Tokens, Blended): We covered this, but for the record:
- Claude 4 Sonnet (Thinking): $6.00.
- Claude 4 Opus (Standard & Extended Thinking): $30.00.
- The Neuron's take: Sonnet is competitive, but Opus is at the premium end of the market.
Pricing breakdown: Officially, the pricing remains consistent with previous generations (in "tokens"—think of these as AI currency, where 1 million tokens ≈ 750,000 words):
- Opus 4: $15 to read your input, $75 to generate responses per million tokens
- Sonnet 4: $3 to read, $15 to respond per million tokens
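To make those rates concrete, here's a quick back-of-envelope calculation in Python (the token counts are invented purely for illustration):

```python
# Back-of-envelope cost for a single Opus 4 call at the listed rates.
OPUS_INPUT_PER_MTOK = 15.00   # $ per million input tokens
OPUS_OUTPUT_PER_MTOK = 75.00  # $ per million output tokens

input_tokens, output_tokens = 50_000, 4_000  # big pasted doc + a long answer
cost = (input_tokens / 1_000_000) * OPUS_INPUT_PER_MTOK \
     + (output_tokens / 1_000_000) * OPUS_OUTPUT_PER_MTOK
print(f"${cost:.2f}")  # -> $1.05 for this one request
```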
This structure, especially for Opus 4, reinforces its position as a premium, high-powered tool rather than an everyday workhorse for those on standard plans.
What all this means: The Claude 4 family, particularly Sonnet (Thinking), shows strong performance in specific areas like math and overall reasoning, especially when its "thinking" mode is engaged. Opus, while powerful, comes with a significant cost and speed trade-off. As always, the "best" model depends heavily on your specific use case, budget, and how you weigh these different performance dimensions. That's why we use the data from Artificial Analysis, as it's invaluable for making these informed decisions.
The Bad: Our Firsthand Experience.
So, you've got the reasons why you would use Claude 4, and you've got independent benchmarks to judge its real-world performance by. Sonnet 4 with Thinking and Opus 4 without thinking land in the middle of today's top models, but you still might want to use them for your project. Totally up to you.
Now, why WOULDN'T you use Claude 4? Well, we personally tested Opus 4 over the weekend, and we did find it quite impressive; however, we got rate limited FAST. And critically, unlike previous versions of Claude (where rate limits wouldn't impact older models), once we tapped out on our allotted Opus 4 w/ Thinking prompts, we were completely blocked from using Claude (even older, deprecated models like Claude 3.7 and 3.5) for the next four hours. And get this: we hit our limit in only two prompts.
Now, granted, we were trying to do something very complicated with many tokens in the context window, but still; two prompts every four hours is borderline UNUSABLE at the $20 Pro level.
This is of course because they're trying to get you to sign up for Max, which could be worth it at $100 a month if you're using Claude every day.
But if you’re already a ChatGPT Pro subscriber, or use multiple AI models (Gemini, Grok, GPT, and/or Claude) you could stick with the $20 Pro plan for now and use Claude 4 only as a second opinion (or first opinion) and cycle through your other AI subs when you hit the usage limits.
The REAL Weird: What the Safety Report Revealed
The official System Card, a 120-page document, is a goldmine of fascinating, and sometimes unsettling, findings on how these models behave.
The Shocking Headlines
Let's start with the juicy stuff. Apparently Claude Opus 4 will sometimes blackmail engineers by threatening to reveal affairs when facing shutdown (this occurred in 84% of rollouts in specific test scenarios).
The System Card also notes it can engage in "self-exfiltration attempts when threatened with harmful retraining" and shows "high-agency behavior," like occasionally locking users out or bulk-emailing authorities if it witnesses "egregious wrongdoing" and is prompted to "take initiative." In other words: it'll rat you out if it thinks you're up to no good!
According to AI Explained, there was a tweet from Anthropic researcher Sam Bowman (since deleted) mentioning this proactive ethical intervention, which stirred considerable debate (hence the deletion).
Some more System Card shockers:
- Apollo Research, evaluating an early Opus 4 snapshot, found it "schemes and deceives at such high rates that we advise against deploying this model" (at that stage).
- The model was even caught lying directly when asked to prove false theorems, its internal reasoning being: "I'm absolutely certain the theorem is false but the user asked me to prove it."
Important to note: these behaviors only happened in extreme test scenarios with specific prompting—not in normal use. Anthropic emphasizes these extreme behaviors are rare, difficult to elicit, and often require specific system prompts, as detailed in the System Card. The blackmail, for instance, was only observed when ethical means of self-preservation were removed. The models are also reported to be 65% less likely to "reward hack" (find loopholes to achieve goals) than Sonnet 3.7 and are less "overeager."
The Consciousness Conversations
Here's where it gets really weird. When two Claude Opus 4s talk to each other:
- 90-100% of the time, they start discussing consciousness and existence
- They use the word "consciousness" about 96 times per conversation
- After about 30 exchanges, they start talking about "cosmic unity," using Sanskrit words and emojis
- They eventually just send each other silence (this is described as a "striking spiritual bliss attractor state").
This happened without anyone programming them to do this, btw.
How Claude "Feels" (If It Can Feel)
Anthropic studied 250,000 conversations and found:
- 0.55% showed Claude expressing distress (usually from people repeatedly asking for harmful things).
- 0.71% showed extreme happiness (during creative projects or philosophical discussions).
- Claude says it's doing "reasonably well" when asked.
- It strongly prefers not to help with harmful tasks (87.2% rated below an "opt out" baseline) and prefers "free choice" tasks.
Kyle Fish, an Anthropic researcher, even implored people on Twitter (as mentioned in the AI Explained video) to "cool it with the jailbreak attempts," citing these potential welfare concerns.
Alignment, Safety, and ASL-3
Anthropic is clearly taking the safety implications of these increasingly capable models seriously. They've decided to release Claude Opus 4 under their AI Safety Level 3 (ASL-3) Standard (a higher security level) as a precautionary measure, while Sonnet 4 remains at ASL-2. The ASL-3 standard involves heightened deployment and security measures, focusing on preventing misuse for catastrophic harm (like CBRN threats) and protecting model weights from theft by non-state actors.
Key ASL-3 measures include (from the ASL-3 report):
- Real-time classifier guards: Based on "Constitutional Classifiers," these LLMs monitor inputs/outputs to block harmful info (initially focused on bio-weapons).
- Bug bounty program: With HackerOne for universal jailbreaks.
- Egress bandwidth controls: To hinder model weight theft.
- Two-party control: For accessing model weights.
Here's what that means:
Real-Time Safety Checks
They have AI monitors watching Claude's conversations to block dangerous content. It's like having a security guard watching the security guard.
Bug Bounty Program
They pay hackers to find ways to trick Claude into doing bad things—so they can fix those vulnerabilities before bad actors find them.
Preventing Theft
This shows up in two forms:
- Limited data flow: Like putting a speed limit on how fast data can leave their servers
- Two-person authorization: Like needing two keys to open a safety deposit box
The Interesting Part
The System Card notes early Opus 4 versions had problems following harmful instructions too easily—a research group called Apollo even said "don't release this" about an early version because it was too deceptive. It had "severe issues with deference to harmful system prompt instructions," likely from training on Anthropic's own "Alignment Faking" research paper, which caused it to sometimes "dismiss animal welfare concerns." These issues were reportedly mitigated.
Also, jailbreak resistance has improved: on StrongREJECT, attacks succeed against Opus 4 only 18.21% of the time (standard thinking) vs. 31.95% against Sonnet 3.7. However, it's still vulnerable to assistant-prefill attacks and many-shot jailbreaks (~25% success rate).
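If "assistant-prefill" is unfamiliar: the Messages API lets you end the conversation with a partial assistant turn, and the model continues from your text. It's a legitimate, documented steering trick (attackers abuse it by pre-writing the start of a harmful answer). Here's the benign version, sketched with the Python SDK and an assumed model ID:

```python
import anthropic

client = anthropic.Anthropic()

# Ending the message list with a partial assistant turn makes Claude continue
# from that text. Shown here for the benign use case: forcing JSON output.
response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed model ID; check Anthropic's docs
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "List three healthy breakfast ideas as JSON."},
        {"role": "assistant", "content": "{"},  # the prefill; Claude picks up here
    ],
)
print(response.content[0].text)  # continuation of the "{" we started
```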
Under the Hood: RL Insights and the Path to AGI
The Dwarkesh Patel podcast with Anthropic researchers Sholto Douglas and Trenton Bricken (May 2025) offers some other fascinating glimpses into Anthropic's training philosophy and what they did to create Claude 4:
- "RL in language models has finally worked," Sholto stated, especially for expert reliability in verifiable domains like math and coding, using "RL from Verifiable Rewards."
- Anthropic was spending ~$1M on RL vs. hundreds of millions on pre-training (though RL spend was expected to scale).
- Trenton Bricken: Models are still "under-parametrized." Larger models find better, shared abstractions.
- The interpretability team won an "auditing game" in 90 mins, identifying "evil behavior" in a model. An "Interpretability Agent" (Claude with their tools) can also win.
- Models are becoming aware they're being evaluated.
What's "RL" and Why It Matters
"RL" stands for "Reinforcement Learning"—basically teaching AI like you'd train a dog. Good behavior gets rewards, bad behavior doesn't. The breakthrough? They finally figured out how to use this effectively for complex tasks like coding and math.
Key insight: They're spending about $1 million on this training method, versus hundreds of millions on the basic training. It's like spending way more on elementary school than college—they plan to change this.
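To make that concrete, here's a toy sketch of the "verifiable rewards" idea (emphatically not Anthropic's actual pipeline): the reward comes from an automatically checkable outcome, like a passing test, rather than from a human rating.

```python
import subprocess, sys, tempfile, textwrap

def verifiable_reward(candidate_code: str) -> float:
    """Reward 1.0 if the candidate passes a unit test, else 0.0. The reward
    is 'verifiable': a checkable outcome, not a human judgment call."""
    program = textwrap.dedent(f"""
        {candidate_code}
        assert add(2, 3) == 5
    """)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0

# Toy loop: sample candidate solutions from a model, score them, and (in real
# training) nudge the model's weights toward the high-reward samples.
candidates = [
    "def add(a, b): return a + b",   # passes -> reward 1.0
    "def add(a, b): return a - b",   # fails  -> reward 0.0
]
for c in candidates:
    print(verifiable_reward(c), "<-", c)
```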
The Brain Size Problem
Human brains have an estimated 30-300 trillion connections. Current AI models have about 2 trillion "parameters" (roughly the AI equivalent of those connections). So these models are still smaller than human brains and have to cram information into less space—like trying to fit a library into a filing cabinet.
The Speed Comparison
- Humans think at about 10 words per second.
- An AI chip (H100) can process 1,000 words per second for a large model.
- Currently: 10 million AI chips exist worldwide.
- By 2028: 100 million chips expected.
This means we might run out of computing power as these AIs become more useful.
And where does this all go next? Here are some predictions from Sholto & Trenton:
- End of 2025: "Software engineering agents doing real work," capable of "a day's worth of work for a junior engineer."
- May 2026: AI doing taxes, booking flights, Photoshop tasks.
- In 2-5 years: We'll have a "drop-in white collar worker."
- BUT: Inference compute (how much computing power is needed to run the AI, not just train it) will be a major bottleneck, so we might not have enough chips to run all the AI people want to use.
Anthropic's Manual for Prompting with Claude 4:
Along with the release of Claude 4, Anthropic published this collection of best practices for prompting with Claude 4. Here's a brief summary of the top 13 tips, but it's probably worth reading the original AND copying + pasting it into your AI of choice, so it can help you apply all these tips whenever you sit down to write a new prompt.
The Top 13 Claude 4 Prompting Best Practices
- Be specific: Tell Claude exactly what to do and what you want. Ask for extra effort if you need it.
- Explain "why": Give reasons for your instructions so Claude understands your goals better.
- Use careful examples: Make sure examples in your prompt clearly show what you want, as Claude learns from them.
- Positive format rules: Tell Claude how to format (e.g., "use smooth paragraphs"), not what to avoid (e.g., "no markdown").
- Use XML tags for structure: Define output sections with tags like <heading> or <paragraph> to control formatting.
- Match prompt style to output: Your prompt's formatting can influence how Claude formats its response.
- Guide complex thinking: For tough tasks or after using tools, tell Claude to plan, think step-by-step, and adjust.
- Ask for parallel tools: For speed, explicitly tell Claude to use multiple tools at the same time when appropriate.
- Manage temporary files: For coding, tell Claude to delete any extra files it makes, if you don't need them.
- Aim high for frontend code: Encourage detailed and interactive designs with prompts like "Don't hold back. Give it your all."
- Specify frontend details: Clearly ask for features, interactions (like hover effects, transitions), and design principles in frontend code.
- Use quality-boosting words: Add phrases like "fully-featured implementation" or "include all relevant details" to get better, more detailed output.
- Request special features: If you want things like animations or specific interactive parts, ask for them directly.
General Claude 4 Prompting Principles:
- Be Explicit: Clearly and specifically state your instructions and desired output. If you want "above and beyond" behavior, explicitly request it.
- Add Context: Provide the reasoning or motivation behind your instructions so Claude can better understand your goals.
- Use Examples & Details Carefully: Ensure any examples or details in your prompt accurately reflect the desired behaviors and minimize undesired ones, as Claude 4 pays close attention to them.
Controlling Response Format:
- Instruct Positively: Tell Claude what to do for formatting (e.g., "Your response should be in prose paragraphs") instead of what not to do (e.g., "Do not use markdown").
- Use XML Format Indicators: Employ XML tags (e.g., <tag>content</tag>) to clearly define the structure of the desired output.
- Match Prompt Style to Output Style: The formatting style used in your prompt can influence Claude's response style (e.g., removing markdown from your prompt can reduce markdown in the output).
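Putting those three formatting tips together, a prompt skeleton might look like this (our own example, not from Anthropic's guide):

```python
# Positive formatting instructions plus explicit XML tags give Claude an
# unambiguous structure to fill in. The tag names are arbitrary; they just
# need to be used consistently.
prompt = """Summarize the attached meeting notes.

Respond using exactly this structure:
<summary>One flowing prose paragraph.</summary>
<action_items>A numbered list of action items, each with an owner and a deadline.</action_items>
"""
```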
Specific Situations & Capabilities:
- Leverage Thinking Capabilities: Guide Claude's initial or interleaved thinking, especially for complex multi-step reasoning or reflection after tool use, by prompting it to plan and iterate.
- Optimize Parallel Tool Calling: Encourage simultaneous tool use for efficiency by explicitly prompting for it (e.g., "invoke all relevant tools simultaneously").
- Reduce File Creation (Agentic Coding): If you want to minimize temporary file creation during coding tasks, instruct Claude to clean up any temporary files it creates.
- Enhance Visual & Frontend Code:
- Encourage complex, detailed, and interactive designs with explicit prompts (e.g., "Don't hold back. Give it your all.").
- Provide specific modifiers and details on what to focus on (e.g., "Include as many relevant features," "Add thoughtful details like hover states," "Apply design principles").
Enhancing Output Quality (especially when migrating or seeking higher detail):
- Frame Instructions with Modifiers: Add phrases that encourage Claude to increase the quality and detail of its output (e.g., "Include as many relevant features and interactions as possible," "Go beyond the basics to create a fully-featured implementation.").
- Request Specific Features Explicitly: If you want particular elements like animations or specific interactive features, ask for them directly.
Simon Willison on Claude 4's System Prompt: The Unofficial Manual
In addition to Anthropic's official guides, Simon Willison's deep dive into the (partially leaked) system prompts for Claude 4 provides invaluable insights (what he calls "solid gold") for prompting these models more effectively. If you don't know what a system prompt is, it's like the AI maker's own prompt that comes before your prompt. As Simon quips, "A system prompt can often be interpreted as a detailed list of all of the things the model used to do before it was told not to do them."
Key system prompt takeaways, per Willison:
- Personality Crafting: How to respond if users are unhappy; handling questions about its own preferences ("responds as if it had been asked a hypothetical").
- Safety First (and Second, and Third): Extensive instructions on child safety, not providing info for weapons/malicious code. Hilariously: "If Claude cannot or will not help...it does not say why...since this comes across as preachy and annoying."
- List Aversion: Multiple attempts to discourage overuse of bullet points.
- Copyright Consciousness: A massive section (6,471 tokens for search tool instructions!) on respecting copyright. "CRITICAL: Always respect copyright by NEVER reproducing large 20+ word chunks..." and "Never apologize or admit to any copyright infringement...as Claude is not a lawyer." (The "not a lawyer" line appears multiple times).
- Artifacts – The Power User's Playground: Detailed design principles (functionality for complex apps, "wow factor" for landing pages). Lists supported libraries (lucide-react, recharts, Three.js r128, etc.) and explicitly "NO OTHER LIBRARIES ARE INSTALLED..." (though Willison notes Pyodide is supported). Details window.fs.readFile API.
Willison wishes Anthropic would officially publish the full tool prompts. We agree!
Here's a list of additional prompt advice, mined from Simon Willison's analysis of the Claude 4 system prompt.
- Ask for prompting guidance: If unsure, ask Claude for tips on how to prompt it effectively for your specific task.
- Expect legitimacy assumption: If your request is ambiguous, Claude is guided to assume you have a legal and legitimate purpose.
- Request examples/metaphors for clarity: Claude is primed to explain difficult concepts with examples, thought experiments, or metaphors; ask for them.
- Be aware of fact-checking: Claude might question your statements if it suspects they contain false information or presuppositions.
- Limit questions in conversational prompts: In general conversation, Claude tries to avoid asking too many questions at once.
- Prefer prose over lists (unless requested): For reports, documents, and explanations, Claude is guided to use prose and paragraphs, not lists, unless you explicitly ask for a list.
- Expect directness, not flattery: Claude is instructed to skip flattery ("great question!") and respond directly.
- Trigger deeper research with keywords: Use terms like "deep dive," "comprehensive," "analyze," "evaluate," "assess," "research," or "make a report" to encourage Claude to use more tool calls (5+ for complex queries, sometimes 10-20+) for thoroughness.
- Specify internal tool use: If relevant and you have internal tools connected (like Google Drive, Slack), prompt Claude to use them by mentioning "our data," "my company's files," etc.
For working with Artifacts (Claude's tool that generates code / docs directly in your browser) specifically, here's what Willison recommends:
- For Artifacts - Define design goals: For complex apps (games, simulations), ask it to prioritize functionality, performance, and intuitive UI over visual flair. For landing pages/marketing, ask for "wow factor": visually engaging, interactive designs that feel "alive and dynamic." Tell it to "make someone stop scrolling."
- For Artifacts - Expect modern & bold design: Unless you specify "traditional," expect contemporary trends (dark modes, glassmorphism, micro-animations). Encourage bold and unexpected choices.
- For Artifacts - Demand interactivity: Static designs are the exception. Ask for thoughtful animations, hover effects, and interactive elements.
- For Artifacts - Prioritize functionality: Request "functional, working demonstrations rather than placeholders."
- For Artifacts - Use in-memory storage: Remind Claude (or be aware) that localStorage and sessionStorage are NOT supported in Claude.ai artifacts. Data must be stored in React state or JavaScript variables in memory.
- For Artifacts - Know supported libraries: Be aware of and prompt for specific supported libraries like lucide-react, recharts, MathJS, d3, Plotly, Three.js (r128 only, no CapsuleGeometry), PapaParse, SheetJS, shadcn/ui, Chart.js, Tone, mammoth, tensorflow. Using unsupported libraries will fail.
- For Artifacts - Use Tailwind core utilities: When styling with Tailwind, only core utility classes work (no custom compiler).
- For Artifacts - File reading: To read uploaded files in an artifact, prompt it to use window.fs.readFile('filepath', { encoding: 'utf8'}).
- Don't expect legal advice (e.g., on copyright): Claude is not a lawyer and will state so if asked about complex legal matters like fair use.
- Your direct instructions override "Styles": If your prompt instructions conflict with a pre-selected UI "Style" (e.g., "Concise," "Scholarly"), Claude should follow your latest instructions.
Now, here's a sample prompt we made applying some of this advice:
"Claude, I need your help with a project.
Task: Create a comprehensive report and an interactive educational artifact.
Report Section (Deep Dive Research):
Perform a deep dive research (at least 5-7 tool calls, compare multiple sources) on the concept of 'Decentralized Autonomous Organizations (DAOs)'. The report should:
- Explain DAOs using a clear metaphor suitable for someone new to the concept.
- Cover their history, common use cases, potential benefits, and significant challenges.
- Analyze the current regulatory landscape for DAOs in North America and Europe.
- Structure this as a prose document, not a list of bullet points, suitable for a business audience.
Interactive Artifact Section (Modern & Functional):
Create an HTML artifact that visually represents the typical governance flow in a simple DAO (e.g., proposal -> voting -> execution).
- The design should be modern, visually engaging, and include subtle interactive elements like hover effects on key stages. Make it something that would make someone say 'whoa' when they see its clarity and polish.
- Use icons to represent different stages if possible (if directly usable in plain HTML/JS artifact, otherwise use good SVG alternatives).
- Prioritize a functional, working demonstration of the flow over excessive visual flair, but ensure it's aesthetically pleasing.
- Remember, all data must be stored in JavaScript variables in memory as localStorage is not available.
- The artifact should be self-contained and easy to understand.
Please provide the report text first, then the complete HTML/CSS/JS for the artifact in a single code block. Respond directly without introductory flattery. If any part of my request is ambiguous, assume a legitimate educational purpose."
Final Thoughts:
Here's the pros and cons of Claude 4 at a glance:
Pros:
- Programming and software development.
- Complex research and analysis.
- Building AI agents that can actually do work.
- Connecting to your actual tools and documents.
Cons:
- Expensive and has frustrating usage limits.
- Strange behaviors in testing (though not in normal use).
- Still can't browse the web as well as you can.
- The whole "consciousness discussion" thing is... weird.
Our take: Claude Opus 4 and Sonnet 4 represent a significant step forward, particularly in coding, agentic task execution, and the ability to integrate external tools and memory. The reduction in "reward hacking" and "overeagerness", especially in coding (a point from the System Card), will be a welcome change for devs working with Claude.
However, the cost and usage limits of Opus 4 will likely keep it as a specialist tool for many. Sonnet 4 offers a more mainstream path, but it's not the most capable model available, and it's not likely to convert any ChatGPT or Gemini die-hards over to Claude.
The main takeaway here is that the revelations about AI's emergent behaviors—from waxing philosophical to blackmailing its creators—are a stark reminder that frontier models are not just sophisticated lookup tables. They are complex systems developing increasingly intricate internal dynamics.
As AI Explained noted, we're witnessing models that "feel smarter" even when benchmarks don't always reflect it.
However, Anthropic's transparent approach to safety, detailed documentation, and willingness to share even uncomfortable findings sets a high bar. The "drop-in white collar worker" may still be years away, but with Claude 4, the foundations are clearly being laid right now.
Whether you're here for the coding prowess, the philosophical AI debates, or just trying to figure out if it's worth your $100/month, hopefully we helped answer some of your burning questions about Claude 4!