😸 Here's how good AI is at your job

PLUS: Google robot models, xAI lawsuit, Meta steal your researcher...

Welcome, humans.

LinkedIn user Cameron Mattis pulled off what might be the most delicious cyberattack ever.

He embedded hidden instructions in his profile that tricked an AI recruiting agent into abandoning its job duties and sending him a detailed flan recipe instead of a pitch about career opportunities. Now, he knows exactly when the recruiter is actually just AI.

The hack worked through “indirect prompt injection,” where sneaky instructions hidden in resumes or profiles can completely override an AI agent's programming. Other LinkedIn users caught wind and started requesting songs and poems (and got them).

Keep in mind: if a simple resume can hijack an AI recruiter, imagine the implications for AI agents handling sensitive business data, customer service, or financial transactions. Every AI tool that processes external content could be vulnerable.

The silver lining? The flan recipe was apparently pretty good.

Here’s what happened in AI today:

OpenAI study showed AI matches human experts on 40% of work tasks.
xAI sued OpenAI for trade secret theft through employee poaching.
Meta poached OpenAI executive Yang Song for SuperIntelligence Labs.
Google released Gemini Robotics-ER 1.5 for robot reasoning.

Advertise in The Neuron here

OpenAI Just Tested Whether AI Can Do Your Job (Spoiler: It's Getting Close)

OpenAI just released a new study showing that AI models can now perform real work tasks almost as well as industry experts; from writing legal briefs to designing engineering blueprints to creating nursing care plans.

They call it GDPval, and it's basically the SATs for AI, except instead of testing trigonometry, it tests whether ChatGPT can do your actual job. The evaluation includes 1,320 tasks across 44 occupations, all created by professionals with an average of 14 years experience. In case you couldn’t tell by the name, they’re literally trying to track “AI’s impact on GDP” by measuring its impact on the things that impact GPD.

Here's what they tested:

Manufacturing engineers designing cable reel jigs for mining trucks.
Lawyers drafting legal briefs with ambiguous fact patterns.
Nurses creating detailed care plans from physician notes.
Sales managers building revenue forecasts and competitive analyses.
Software developers fixing bugs and building features.

…Plus tasks for all the roles featured below!

P.S: Figure 11 on PDF page 13 has the win rate by occupation, if you really wanna see how close you are to being automated! Also, here’s how they grade it and the dataset they used).

The results? Kinda wild. Interestingly, Claude Opus 4.1 matched or beat human experts on nearly half the tasks (and the highest performing across 8/9 industries). GPT-5 excelled at accuracy (following instructions perfectly, nailing calculations), while Claude dominated aesthetics (making documents and slides look professional).

Even crazier: AI completed these tasks 100x faster and 100x cheaper than humans. We're talking tasks that take experts 7+ hours on average getting done in minutes for pennies on the dollar.

But here's the catch… most models struggled with the same issues: following complex instructions, avoiding formatting disasters (GPT-5 apparently loves creating PDFs with black squares), and sometimes confidently making stuff up when they got stuck.

Why this matters: AI can’t replace you today; the experts who reviewed the outputs of all the work preferred the human deliverables the most (since the models failed to fully follow all their instructions). So this is about understanding what kind of routine things AI can handle so we can focus on the creative, strategic work.

For example, OpenAI found that having experts review and fix AI outputs still beats doing everything from scratch. However, that only works if the AI output is actually decent. A recent recent HBR study coined the term “workslop” for low-quality generated content that looks polished at first glance but then forces colleagues to waste hours decoding, fixing, or completely redoing the work.

The difference? OpenAI's tasks had experts thoughtfully using AI as a tool, while workslop happens when people mindlessly dump AI output on their coworkers. One approach saves time; the other creates a $186-per-incident productivity tax (that can add up to $9M per year!).

PSA: Edit your own slop before slopping it into the trough for the other work piggies!

What's next: OpenAI's releasing 220 tasks publicly so other researchers can test their models. They're also expanding beyond the initial 44 occupations and adding more interactive, ambiguous tasks, because let's be real, most work isn't as neat as “here's a prompt, make a deliverable.” And if you’re an industry expert yourself, here’s the link to participate.

Also, Swyx (of “AI News”; the best technical AI news aggregator!!) basically sees this as a benchmark for AGI, and according to the data, Claude 4.1 Opus is “95%” of the way there. 😱 He says if you look at the trend, by ~Sept 2026, we could have AI equally good as humans “at most economically viable work.” AKA… AGI.

Bottom line? AI is your intern. It can’t do your job (yet), and its work might kinda suck on the first try, but it’s getting better, and if used effectively, it can DEFINITELY save you time and money. Just, y’know, don’t pass on your intern’s work without reading it…

FROM OUR PARTNERS

Stop reacting and start anticipating.

The CISO's Guide to Security Agents shows how to move beyond static playbooks to a proactive defense. Learn how intelligent agents use Agentic AI to identify and mitigate threats before they impact your organization. Download your free guide now.

Prompt Tip of the Day

So we came across this post in r/ChatGPT the other day asking for prompts everyone should know, and here are the top picks:

Ask it to review your prompt and give you a revised version that is best suited for an AI prompt. Then copy and paste its response in the next message.
For story building: Ask me 40 questions about my story that readers would have, focusing on plot holes and continuity.
Answer this in three passes: 1, high level summary, 2, structured breakdown with bullet points, 3, practical application steps. Keep each pass distinct.
Teach me <blank> using the Socratic Method. Use first-principle thinking where reasonable.
Red and blue team it. Find all holes, fix all. What are the top 15 most urgently needed additions or fixes? Return the best version.
Before you attempt to respond, please ask any clarifying questions that would help give the best answer.
Ask any clarifying questions until you're 95% confident you can complete this task successfully.
Iterate it three times. Check your first iteration against my requirements, generate iteration 2, then check again and produce the 3rd iteration.

Treats to Try

*Asterisk = from our partners (only the first one!). Advertise in The Neuron here.

*Go beyond exploring AI—learn how to deploy it strategically. Join MIT xPRO & CSAIL’s new 9-week course, Deploying AI for Strategic Impact. Explore the course.
Factory automates your coding tasks like migrations and debugging across any IDE or workspace you use (raised $50M); also their model Droid is crushing it on terminal bench (evals for testing AI that work in your terminal)!
Adobe Photoshop updated Generative Fill so it now lets you add, remove, or transform parts of your photos using natural language prompts with Nano Banana or Black Forest Labs’ FLUX.1 Kontext [pro]; also, Firefly Boards now has two new models (Runway Aleph, Moonvalley Marey) for creative brainstorming by generating images, videos, and text effects in one workspace.
Perplexity launched a new Search API that gives you raw web search results for AI apps (vs its old Sonar API which gives you ready-made “Perplexity” style AI answers).
ChatGPT Pulse is a new feature where each night, ChatGPT researches topics you care about based on your chat history and feedback, then delivers focused visual cards the next morning that you can scan or dive deeper into (only available to Pro on mobile atm, but coming to Plus “soon”).

Around the Horn

This is super cool. Here’s the code!

xAI sued OpenAI for stealing trade secrets through employee poaching, with a judge temporarily barring former xAI engineer from AI work at OpenAI.
Google DeepMind released Gemini Robotics-ER 1.5, an advanced AI model for robot reasoning with developer access that outperformed baselines in 85 test tasks and achieved 63% success on a new robot platform.
Meta poached Yang Song from OpenAI's strategic explorations team to lead Meta SuperIntelligence Labs, escalating the talent war.
Databricks invested $100M in OpenAI partnership to integrate GPT-5 into its platform, serving 20K global customers with enterprise-ready AI capabilities.
AI isn't actually replacing radiologists but instead augmenting their workflows, as hybrid human-AI teams consistently show higher diagnostic accuracy than either humans or AI alone.
Check out Simon Willison’s review of the newly released GPT-5-Codex coding model and the improved Gemini 2.5 and Gemini 2.5 Flash Lite.
We just interviewed Mustafa Suleyman about seemingly conscious AI, and it looks like he did another interview with Sinead Bovell that goes even deeper on the topic (and he says we could possibly expect these types of “seemingly conscious AI” sometime in the next 18 months).

FROM OUR PARTNERS

Build Custom Conversational AI Agents with Any LLM

Skip the complexity. Agora’s APIs make it easy to add real-time voice AI to any app or product.
Connect any LLM and deliver voice interactions that feel natural. Build agents that listen, understand, and respond instantly.
Scale globally on Agora’s network optimized for low-latency communication. Ensure reliable, high-quality performance in any environment.

Start building your agent →