AI has for some time been in its “brilliant student who can solve graduate-level physics problems but only recently figured out that 9.9 is bigger than 9.11” era.
That sounds like a joke. It is also basically the state of the technology: amazing, but with random head-scratchers.
Stanford’s 2026 AI Index says frontier models now meet or exceed human baselines on some PhD-level science questions, multimodal reasoning tasks, and competition math benchmarks. OpenAI’s models reportedly swept the 2025 ICPC World Finals with a perfect 12-for-12 performance, while Google DeepMind’s Gemini Deep Think officially hit gold at the IMO.
And yet, in the same report, Stanford notes that the top model on ClockBench read analog clocks correctly only 50.1% of the time, compared with 90.1% for humans. AI agents jumped from roughly 12% to 66.3% accuracy on OSWorld, a benchmark for real computer tasks across operating systems, but that still means they fail about one in three attempts.
So yes, AI can now do things that would make a math professor sweat.
Also yes, it may look at a clock showing 3:45 and confidently tell you it’s Tuesday.
Welcome to the jagged frontier.
AI Got Spikier, Not Smoothly Smarter
The mistake people keep making is assuming AI progress works like a video game skill tree.
Level one: write emails.
Level two: summarize PDFs.
Level three: code.
Level four: reason.
Level five: reliably do useful work in the world.
That is not what happened.
AI did not climb a neat ladder of intelligence. It developed sharp peaks and weird holes. It can be superhuman on one task, average on the next, and strangely helpless on something a child could do.
This is what researchers call the “jagged technological frontier,” a term popularized by a Harvard Business School study with BCG consultants. In that experiment, consultants using GPT-4 completed more tasks, worked faster, and produced higher-quality output on tasks inside AI’s capability frontier. But on a task deliberately placed outside that frontier, consultants using AI were 19 percentage points less likely to get the right answer than consultants without AI.
That is the part people miss.
AI can make you better. It can also make you worse in some areas. The hard part is knowing which side of the frontier you are standing on before the answer matters.
Why the Frontier Is Jagged
For humans, tasks that “feel similar” often share underlying skills. If you can read a digital clock, you can probably read an analog one. If you can navigate one app, you can usually figure out another. If you can reason through a hard math proof, you can probably handle a simple scheduling task.
AI does not generalize that way.
Models are trained on enormous oceans of examples, but those examples are uneven. Some domains are richly represented: code, essays, math problems, forum answers, textbook explanations, legal-ish language, business writing, internet argument. Other domains are thin, ambiguous, visual, embodied, or require stable interaction with messy software interfaces.
That creates strange capability cliffs.
A model may have seen thousands of competition math solutions in formats that reward symbolic reasoning. It may have learned patterns that map very well onto Olympiad-style problems. But an analog clock is a visual-spatial task with tiny details, multiple possible renderings, and no forgiving symbolic format. The model has to interpret an image, identify the hands, infer their angles, map those angles to time, and not get distracted by stylistic noise.
That sounds simple because humans have practiced it since childhood.
For AI, “simple” is not the same as “easy.” Easy means: well-represented in training, measurable, repeatable, and compatible with the model’s internal machinery.
This is why agent benchmarks matter so much. Computer-use tasks are full of tiny failure points. The model has to understand the goal, inspect the screen, click the right thing, recover from surprises, remember what happened five steps ago, and avoid turning one small mistake into a full workflow faceplant.
That is why the rise from 12% to 66.3% on OSWorld is genuinely impressive. It is also why the remaining one-third failure rate is now the whole product problem.
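To see why tiny failure points add up, here is a minimal back-of-the-envelope sketch. It assumes every step in a workflow succeeds independently with the same probability and that the agent never recovers from a mistake. Real agents are messier than that, and the 98% per-step figure is a made-up illustration, not a number from OSWorld or the Stanford report. The function names are hypothetical too.

```python
def workflow_success_rate(per_step_success: float, steps: int) -> float:
    """Chance that every step succeeds.

    Assumes steps are independent, equally reliable, and that there is
    no recovery: one failed step fails the whole workflow.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return per_step_success ** steps


def required_per_step_success(target_success: float, steps: int) -> float:
    """Per-step reliability needed to reach a target end-to-end success rate
    under the same simplifying assumptions."""
    if not 0.0 < target_success <= 1.0:
        raise ValueError("target_success must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return target_success ** (1.0 / steps)


if __name__ == "__main__":
    # Illustrative only: 98% per step is a hypothetical value.
    for steps in (1, 5, 20, 50):
        rate = workflow_success_rate(0.98, steps)
        print(f"{steps:>2} steps at 98% per step -> {rate:.1%} end to end")
```

Under those assumptions, an agent that gets each step right 98% of the time finishes a 20-step task only about two-thirds of the time. Flip it around and a 95% end-to-end target over 20 steps needs roughly 99.7% per step, which is why recovery paths matter as much as raw accuracy.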
As we’ve written before at The Neuron, the next phase of AI is about more than bigger models. It is about building systems that can actually act: agents with better context, routing, feedback, and reliability. That is why launches like NVIDIA’s Nemotron 3 Nano matter less as “another model” and more as infrastructure for agentic workflows.
The model is one component in a very fragile work machine.
Benchmarks Can Hide the Weirdness
Benchmarks are useful. They are also dangerous when treated like IQ scores.
A model that performs well on a benchmark has proven it can perform well on that benchmark. That is not the same as proving it will handle your invoice workflow, your CRM cleanup, your regulatory review, your customer support escalation, or your “please update this spreadsheet without breaking the formulas” task.
Stanford’s technical performance chapter makes this tension clear. The report says capability is advancing so quickly that benchmarks are being saturated faster than expected. It also notes that some widely used evaluations have reliability concerns, including invalid question rates that can be surprisingly high.
Translation: the scoreboard is real, but the scoreboard is not the territory.
This is why AI feels so confusing in daily use. One minute it produces a strategy memo that makes you reconsider your career choices. The next minute it invents a source, misses an obvious constraint, or fails to follow a simple instruction you wrote in plain English.
Both experiences are real.
The person saying “AI is incredible” is not hallucinating. The person saying “AI is unreliable” is not being a hater. They are likely just touching different parts of the frontier.
Humans Are Jagged, Too
It’s worth saying the quiet part: human intelligence is jagged, too.
A brilliant surgeon may be terrible at managing email. A chess grandmaster may be useless at fixing a printer. A gifted novelist may freeze in a statistics class. A software engineer who can debug distributed systems may still forget where they parked.
Even within the same person, ability is uneven. We are shaped by education, practice, culture, incentives, emotion, sleep, stress, confidence, and whether we have eaten lunch like civilized mammals.
Human intelligence is not one smooth substance called “smart.” It is a portfolio of skills we develop over time.
The difference is that human jaggedness comes with a lot of surrounding machinery. We have bodies, habits, social cues, embarrassment, memory of past mistakes, and often some sense of when we are out of our depth. Not always, obviously. Humanity did invent reply-all disasters. But we usually have a richer model of the world around the task.
AI’s jaggedness is stranger because it can be eloquent without being grounded. It can sound equally confident when it is right, wrong, or standing on the edge of a cliff wearing tap shoes.
That is the unsettling part. The surface fluency is smooth. The competence underneath is not. Yet.
The Practical Takeaway: Map the Frontier
For companies, AI adoption should look less like blanket automation and more like workflow cartography.
Use AI where it is strong: drafting, summarizing, transforming formats, generating options, searching across messy text, writing code with tests, explaining unfamiliar material, comparing alternatives, and accelerating first-pass work.
Be careful where it is brittle: visual interpretation, high-stakes judgment, multi-step tool use, edge-case-heavy operations, compliance-sensitive decisions, and tasks where one small error quietly poisons the rest of the workflow.
And when you do use agents, assume the system needs supervision, logging, recovery paths, and evaluation. The goal is to build around the frontier’s current shape.
Start small. Test locally. Increase autonomy when reliability earns it.
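One way to make “reliability earns it” concrete is to tie each task type’s autonomy to its measured pass rate on your own test runs. The sketch below is a hypothetical illustration, not a standard. The level names, the 30-run minimum, and the 90% and 99% thresholds are all assumptions you would tune to your own risk tolerance.

```python
from dataclasses import dataclass

# Hypothetical autonomy levels, from most to least supervised.
REVIEW_EVERYTHING = "human_reviews_every_output"
SPOT_CHECK = "human_spot_checks"
AUTONOMOUS = "runs_unattended_with_logging"


@dataclass
class EvalResult:
    """One logged test run of the agent on a task from your own workflow."""
    task_type: str
    passed: bool


def autonomy_level(
    results: list[EvalResult],
    task_type: str,
    min_runs: int = 30,           # assumed minimum evidence before loosening
    spot_check_at: float = 0.90,  # assumed threshold, tune for your risk
    autonomous_at: float = 0.99,  # assumed threshold, tune for your risk
) -> str:
    """Pick a supervision level for one task type from its measured pass rate."""
    relevant = [r for r in results if r.task_type == task_type]
    # Too little evidence: stay fully supervised.
    if len(relevant) < min_runs:
        return REVIEW_EVERYTHING
    pass_rate = sum(r.passed for r in relevant) / len(relevant)
    if pass_rate >= autonomous_at:
        return AUTONOMOUS
    if pass_rate >= spot_check_at:
        return SPOT_CHECK
    return REVIEW_EVERYTHING


if __name__ == "__main__":
    # Made-up log: 95 passes and 5 failures on a hypothetical task type.
    log = [EvalResult("invoice_summary", True)] * 95
    log += [EvalResult("invoice_summary", False)] * 5
    print(autonomy_level(log, "invoice_summary"))
```

The point of the design is that autonomy is decided per task type, not per model. The same agent can run unattended on one workflow and stay under full review on another.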
The New AI Literacy
The people and companies that get the most value from AI will not be the ones who believe every demo. They also will not be the ones who dismiss every failure as proof the whole thing is hype.
They will be the ones who can say: here, it is excellent; here, it needs a human; here, it needs a test; here, it should not be allowed near the steering wheel.
AI did not get evenly smarter.
It got spikier.
And for the next few years, the winners will be the people who learn to climb the spikes without falling into the holes.