😸 AI agents failed 97% of freelance tasks; here's why...

PLUS: Robert the AI CEO got toxic FAST...

Welcome, humans.

This is definitely a gimmick, BUT… this video going viral on Instagram features a company that let an AI robot named Robert take over as CEO for a demo, and let’s just say things got awkward fast.

Within minutes, Robert was grilling employees, flexing his superior knowledge, and a little TOO casually reminding everyone how replaceable they are…

Robert basically speedran every toxic boss trait, and the comments are split between laughing nervously and genuinely concerned for our future if AI eventually takes over management.

It’s all fun and games until Robert starts scheduling your performance reviews…

Here’s what happened in AI today:

New benchmark showed agents completed only 2-3% of real freelance tasks.
Google's Gemini app hit 650M+ monthly active users in Q3 2025.
Mercor's three 22-year-old founders became the youngest self-made billionaires.
Ilya Sutskever's deposition alleged Sam Altman's pattern of lying (and more).

Advertise in The Neuron here

AI Agents Can't Actually Do Your Job (Yet)—New Benchmark Reveals The Gap

DEEP DIVE: AI can make you faster at your job, but can only do 2-3% of jobs by itself.

The hype: AI agents will automate entire workflows! Replace freelancers! Handle complex tasks end-to-end!

The reality: a measly 2-3% completion rate.

See, Scale AI and CAIS just released the Remote Labor Index (paper), a benchmark where AI agents attempted real freelance tasks. The best-performing model earned just $1,810 out of $143,991 in available work, and yes, it finished only 2-3% of jobs.
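A quick back-of-the-envelope check on those dollar figures, as a minimal sketch. The two amounts come straight from the paragraph above; the variable names are just illustrative. Note that share of dollars earned is a different measure from share of tasks completed, so the two percentages don't have to match.

```python
# Dollar figures quoted above for the best-performing agent.
earned = 1_810
available = 143_991

# Share of the available payout the agent actually captured.
earned_share = earned / available
print(f"Share of available dollars earned: {earned_share:.2%}")  # 1.26%
```

So by dollars, the best agent captured a bit over 1% of the work on offer.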

This benchmark is a much-needed reality check for an industry spending untold trilli’s like Bond movie villy’s on the hypothesis that AI will automate all work. And honestly? It's useful data.

Here’s what they tested: Real tasks from freelance platforms. Not toy problems or academic benchmarks, but actual gigs that humans get paid to complete: writing, research, data entry, and design tasks.

Why agents struggled:

Multi-step workflows with unclear handoffs.
Ambiguous requirements that we humans clarify through conversation.
Tasks requiring judgment calls and context.
Work that needs iteration and client feedback.

What agents CAN do: In production environments, small fine-tuned models handle day-to-day repetitive tasks well, while bigger models orchestrate workflows or handle edge cases. This setup works, but it's narrow and human-supervised.

These agents come with hidden costs, too. Even when agents work, Rate Limited's recent breakdown shows “free” coding agents carry costs: rate limits, latency, security reviews, and rework. You need guardrails and budgets, not blind automation.

The counterpoint = a new study that shows 74% of companies that actually measure GenAI ROI report positive returns (full report).

Why this matters: We're in a weird middle ground. AI can augment work impressively, but can't yet replace skilled humans on complex tasks (the middle-to-middle problem). Understanding this gap helps set realistic expectations.

What's coming: Better agent architectures, tighter human-in-the-loop workflows, and specialized agents for narrow domains. Progress is happening; it’s just not happening (successfully) as quickly as the AI companies want you to think.

The takeaway: If someone's selling you on fully autonomous AI workers, ask to see completion rates on real tasks you do every day… or don’t buy it.

FROM OUR PARTNERS

Iru is the AI-native security & IT platform used by the world’s fastest-growing companies to secure their users, apps, and devices.

Built for the AI era, Iru unifies identity & access, endpoint security & management, and compliance automation—collapsing the stack and giving IT & Security time and control back.

Prompt Tip of the Day

If you are still using OpenAI for image editing, this viral Reddit post (8K upvotes and counting, on the ChatGPT subreddit no less) shows a direct comparison between Nano Banana and OpenAI’s ImageGen, and it's not even close.

Why? Because Gemini uses inpainting (editing specific masked areas) while ChatGPT regenerates the entire image from scratch.

In the example, the user asked both models to add pool floats to a backyard pool photo. Gemini delivered a photorealistic edit instantly. ChatGPT took 90 seconds and created floating pool toys... hovering in mid-air in front of the pool.
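To make the masked-edit vs. full-redraw distinction concrete, here's a toy sketch. It is NOT how Gemini or ChatGPT work internally; the tiny "photo", the mask, and the function names are all hypothetical, and it only illustrates why an edit confined to a mask leaves the rest of the picture alone.

```python
# Toy 3x3 grayscale "photo"; values are arbitrary placeholders.
photo = [
    [10, 10, 10],
    [10, 50, 10],
    [10, 10, 10],
]

# Mask marks the only region the edit is allowed to touch (1 = editable).
mask = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

def masked_edit(image, mask, new_value):
    """Return a copy where only masked pixels take the new value."""
    return [
        [new_value if m else px for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]

def full_regenerate(image, new_value):
    """Stand-in for redrawing everything: every pixel is replaced."""
    return [[new_value for _ in row] for row in image]

def changed_pixels(before, after):
    """Count pixels that differ between two same-sized images."""
    return sum(
        a != b
        for row_a, row_b in zip(before, after)
        for a, b in zip(row_a, row_b)
    )

edited = masked_edit(photo, mask, 99)
redrawn = full_regenerate(photo, 99)
print(changed_pixels(photo, edited))   # 1
print(changed_pixels(photo, redrawn))  # 9
```

The masked edit touches one pixel and keeps the other eight; the full redraw changes all nine, which is where unwanted differences can creep in.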

One caveat: Gemini sometimes gets “stuck” and spits out the exact same image without changes. If this happens, start a new chat or explicitly tell it to “go back to the previous step and start fresh.”

Our advice = for any photo editing task (removing objects, adding elements, or making realistic changes) skip ChatGPT entirely and head straight to Gemini. Save yourself the frustration, friends.

Treats to Try

Kimi Linear is a new AI model that replaces most attention layers (which normally store every previous word in memory) with efficient linear attention (which uses fixed-size memory instead). That cuts memory by 75% and makes text generation 6× faster at million-token contexts, which = roughly 750K words, so book-length context (model, paper).
1. The key insight here = traditional attention scales poorly (more text = more memory + slower), while linear attention maintains constant memory and speed regardless of length, making it practical for real-world long documents.
Maillayer is a self-hosted platform that lets you send email campaigns through Amazon SES at $0.10 per 1,000 emails for a one-time $29 payment, so no monthly fees.
1. Meanwhile, Sidemail handles transactional emails, marketing campaigns, and automations for startups in one EU-based platform with GDPR compliance built-in—free trial, then $19/month.
2. Then there’s Superinbox, which drafts email replies in your writing style and auto-organizes your Gmail or Outlook inbox—works as a sidebar, not a new app.
CoreStory automatically documents legacy codebases and provides modernization guidance, turning months-long manual analysis into days using AI that maps business logic, dependencies, and architectural insights (raised $32M).
Sheets Organizer searches, groups, pins, and bulk-manages Google Sheets tabs so you stop scrolling through 50+ tabs—free trial, then pricing details available on site.
MiniMax M2’s “interleaved” reasoning lets it reason after each tool call, which is crucial for adapting to unexpected outputs. You can use it in Claude Code by editing ~/.claude/settings.json to point ANTHROPIC_BASE_URL to https://api.minimax.io/anthropic with your API key (here’s how). This works in Claude Code and Cline, or you can run it in LM Studio (if you have problems with it, read this thread).
Postiz schedules posts across 20+ social channels with AI content creation, a built-in design editor, and an open-source self-hosting option (free trial, then $29/month).
Dodo Payments handles payments, billing, subscriptions, usage-based pricing, and global tax compliance for AI and SaaS products in one platform—paid only rn (4% + $0.40 per transaction).
1. Their new Sentra agent writes billing and payment integration code in your IDE based on your prompts, then tracks your revenue metrics and automatically handles customer actions like refunds, upgrades, and credits—free to try.
ScaryStories Live generates interactive horror videos in real time, where you direct the nightmare and watch it evolve based on your choices (free to try).
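Circling back to the Kimi Linear item above: here's a rough sketch of why swapping most attention layers for fixed-size linear attention saves memory at long contexts. Everything in it is a toy assumption. The 3-in-4 layer split (picked because it lines up with the 75% figure), the layer count, and the unit costs are hypothetical, not the model's real configuration.

```python
def full_attention_memory(tokens, layers, per_token_cost=1):
    """Every layer caches something for every previous token."""
    return tokens * layers * per_token_cost

def hybrid_memory(tokens, layers, linear_fraction=0.75,
                  per_token_cost=1, fixed_state_cost=1_000):
    """Linear layers keep a fixed-size state; the rest still cache per token."""
    linear_layers = int(layers * linear_fraction)  # hypothetical split
    full_layers = layers - linear_layers
    return (full_layers * tokens * per_token_cost
            + linear_layers * fixed_state_cost)

def savings(tokens, layers=16):
    """Fraction of memory saved by the hybrid setup vs. all-full attention."""
    full = full_attention_memory(tokens, layers)
    hybrid = hybrid_memory(tokens, layers)
    return 1 - hybrid / full

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens: {savings(n):.1%} less memory")

# Rough word count for a million-token context at ~0.75 words per token.
print(int(1_000_000 * 0.75))  # 750000
```

Under these assumptions the savings creep toward 75% as the context grows, because the fixed-size state stops mattering next to the per-token cache.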

Around the Horn

Why yes, this IS a video of a Squirrel boxing a Rooster (both with boxing gloves) on somebody’s front porch! Thanks, Sora!

Google's Gemini app crossed 650M+ monthly active users in Q3 2025, making it one of the fastest-growing AI products ever.
Sam Altman and Satya Nadella discussed OpenAI's $100B revenue target for 2027 and the Microsoft-OpenAI partnership on the Bg2 podcast (and Sam gets spicy about that one question).
Ex-OpenAI Chief Scientist Ilya Sutskever was deposed in the Musk v. Altman lawsuit. The revelations included a 52-page memo Ilya wrote alleging a “pattern of lying” by Sam Altman, board-level talks about merging with Anthropic, and Ilya’s thoughts on who should control AGI. Read more here.
Google pulled its Gemma AI model from AI Studio after Sen. Blackburn accused it of fabricating misconduct allegations against her.
Synthesia reportedly raised ~$200M at a ~$4B valuation to expand its enterprise video avatars and studio tools.
Mercor's three 22-year-old founders became the youngest self-made billionaires after a raise valuing their AI recruiting startup around $10B.
Anthropic researchers found that Claude models can sometimes detect when concepts are artificially injected into their processing and accurately identify them, suggesting AI models possess limited “introspective awareness” of their internal states.
Here are the top AI papers of last week according to Rohan Paul, while Simon Willison shared two new AI agent-related prompt injection papers worth paying attention to (Agent Rule of 2 and The Attacker Moves Second).