Doctors got worse at cancer screening after using AI helpers.
A new study in The Lancet, one of the world's most prestigious medical journals, just dropped some uncomfortable news: experienced doctors who used AI to help spot colon cancer actually got worse at finding it when the AI wasn't around. Just as this evidence of human fallibility emerged, another blockbuster paper was published claiming that OpenAI’s next-generation model, GPT-5, has achieved "superhuman" performance on complex medical reasoning tasks, crushing the scores of trained professionals.
So, which is it? Is AI an indispensable tool elevating medicine to new heights, or is it a crutch that will leave a generation of clinicians dangerously dependent and less skilled? The answer, it turns out, is complicated.
Then there’s a third, critical piece of research suggesting that the way we’re measuring AI’s "superhuman" abilities is fundamentally broken, creating an illusion of progress that masks deep risks. Together, these findings paint a precarious picture: are we deskilling our doctors in favor of an AI whose true capabilities we don't fully understand, because we are measuring them all wrong? Let's dive in.

Here are the details: Polish researchers tracked 19 experienced doctors (each with 2,000+ colonoscopies under their belt) across four medical centers. Here's what happened:
- Before AI: Doctors found precancerous growths in 28.4% of procedures.
- After AI: When working solo, they only caught 22.4% (a 6-point drop).
- With AI on: Detection rate was 25.3%.
The study ran from September 2021 to March 2022, covering 1,443 procedures without AI assistance.
You can kinda call this the first real-world study to document “deskilling” from medical AI. The researchers chalk it up to over-reliance on the tool (the classic “automation bias” problem): lean on the AI long enough and you become less vigilant and less focused, and that dulled attention shows up even when the AI is switched off. As we’ve written before, you can kinda think of it like using GPS for so long that you can't navigate your own city anymore.
To be fair, some experts think fatigue played a role too, since total procedure volumes increased. But the pattern is clear: remove the training wheels, and performance drops.
Why this matters: Even a one-point drop in detection matters. This study showed a six-percentage-point decline… that's potentially thousands of missed precancerous growths if the pattern scales across healthcare systems.
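To make that concrete, here's a back-of-envelope sketch in Python. The detection rates are the ones reported above; the annual procedure volume is an assumed, made-up number for a large health system, not a figure from the study:

```python
# Back-of-envelope: how many extra procedures end with a missed precancerous
# growth if the post-AI "solo" rate replaces the pre-AI baseline.
# NOTE: annual_procedures is an assumed volume, not a figure from the study.

baseline_rate = 0.284        # pre-AI detection rate reported in the study
solo_rate = 0.224            # post-AI detection rate when working without the AI
annual_procedures = 100_000  # hypothetical yearly screening volume for a large system

missed = annual_procedures * (baseline_rate - solo_rate)
print(f"Procedures per year with a growth that would have been caught before: {missed:,.0f}")
# -> about 6,000 at this assumed volume
```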
Meanwhile, AI keeps getting scary good at medicine.
A new paper just dropped showing GPT-5 now beats human medical experts at diagnosis by 15-29% on complex cases involving medical images. We're talking about the same kinds of visual pattern recognition that the Polish doctors struggled with.
So here's what happened: researchers at Emory University School of Medicine released a preprint titled "Capabilities of GPT-5 on Multimodal Medical Reasoning," presenting evidence that OpenAI's latest model isn't just as good as a doctor on these tasks; it's significantly better.
The researchers benchmarked GPT-5 against its predecessors and human experts on a suite of punishingly difficult medical exams, including the USMLE (United States Medical Licensing Examination) and the highly complex MedXpertQA. Crucially, they tested its multimodal capabilities—its ability to reason not just from text but from a combination of text and images, like a doctor reviewing a patient's chart alongside a CT scan.
The results were nothing short of breathtaking:
- Crushing Standardized Tests: GPT-5 achieved 95.8% on a version of the MedQA exam and an average of 95.2% on USMLE sample questions, far exceeding the typical human passing threshold.
- A Leap in Multimodal Reasoning: The most dramatic gains came in tasks requiring the synthesis of text and images. On the MedXpertQA multimodal benchmark, GPT-5 improved on GPT-4o's performance by a staggering 29% in reasoning and 26% in understanding.
- Surpassing Human Experts: The paper's most provocative finding was a direct comparison to "pre-licensed human experts." While GPT-4o actually performed worse than the humans on most metrics, GPT-5 blew past them. It was 24.2% better at multimodal reasoning and 29.4% better at multimodal understanding than its human counterparts.
A case study included in the paper demonstrated this power in action. Given a complex case with a patient history, lab results, and an abdominal CT scan, GPT-5 correctly diagnosed a life-threatening esophageal perforation (Boerhaave syndrome), explained its reasoning, ruled out other possibilities, and recommended the precise, correct next step (a Gastrografin swallow study). It performed like a world-class diagnostician.
The authors concluded that GPT-5 represents a "qualitative shift," moving from the human-comparable performance of GPT-4 to "clear super-human proficiency" on these benchmarks. This finding presents a powerful counter-narrative to the deskilling dilemma. If AI is truly becoming superhuman, perhaps it doesn't matter if human skills atrophy. Why insist on a human doctor navigating by memory when a self-driving AI can get you to the destination more safely and efficiently every single time?
So here's the wild part: AI is getting better than doctors at the same time doctors are getting worse without AI. That's... not a sustainable combo.
The Twist: Are We Measuring a Mirage?
This is where the story takes its crucial turn. Is GPT-5's "superhuman" performance real, or is it an artifact of the test? A third, landmark paper titled "Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models" suggests it may be the latter. A team of researchers introduced "MedCheck," a rigorous framework to evaluate the medical benchmarks themselves. They applied it to 53 existing benchmarks—the very kind used to test models like GPT-5—and uncovered systemic, field-wide flaws.
Their findings were a damning indictment of the current evaluation landscape:
- The Clinical Disconnect: The researchers found that most benchmarks have a "profound disconnect from clinical practice." They often rely on multiple-choice exam questions, which test factual recall but fail to measure the nuanced, open-ended reasoning required in a real clinic. A stunning 53% of the benchmarks failed to align with any formal medical standards (like the ICD codes used globally to classify diseases), and 47% completely ignored safety and fairness in their design.
- A Crisis of Data Integrity: Data quality, the bedrock of any benchmark, was found to be abysmal. A staggering 92% of benchmarks failed to adequately address the risk of "data contamination"—the possibility that the model has already seen the test questions in its vast training data, leading to memorization rather than true reasoning. This inflates scores and creates a false sense of progress. (A toy sketch of what a basic contamination check looks like follows this list.)
- Systematic Neglect of Safety: Perhaps most alarmingly, the benchmarks weren't testing for the things that matter most for patient safety. An incredible 94% had no way to test for model robustness (how it performs with slightly imperfect or "noisy" real-world data), and 96% failed to evaluate a model's ability to express uncertainty—to say, "I don't know," a critical skill for any safe medical practitioner.
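What would even a minimal contamination check look like? Here's a toy sketch, our own illustration rather than anything from the MedCheck framework, that flags benchmark questions sharing long word n-grams with a sample of training text. Real audits use far larger corpora and fuzzier matching than exact 8-gram overlap:

```python
# Toy contamination check: flag benchmark questions that share long word
# n-grams with a sample of training text. Illustrative only; real audits
# use much larger corpora and fuzzier matching.

def ngrams(text, n=8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(questions, training_sample, n=8):
    train_grams = ngrams(training_sample, n)
    return [q for q in questions if ngrams(q, n) & train_grams]

# Hypothetical usage:
benchmark_questions = [
    "A 57-year-old man presents with sudden chest pain after forceful vomiting ...",
]
training_sample = open("training_sample.txt").read()  # assumed local text dump
print(flag_contaminated(benchmark_questions, training_sample))
```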
The MedCheck paper argues that the field is suffering from an "illusion of progress," where models are getting better and better at passing flawed academic tests that don't reflect the messy reality of medicine. This fundamentally reframes the "superhuman" results of the GPT-5 paper. GPT-5 is likely superhuman on a test, but the test itself may be a poor measure of real-world clinical competence.
The Paradox Revisited
When viewed together, these three studies create a deeply troubling narrative. We have empirical evidence that relying on AI is already deskilling our expert doctors (The Lancet study). At the same time, we're being told to trust the AI because it's becoming superhuman (the GPT-5 paper). But now we learn that the very definition of "superhuman" is based on flawed, clinically disconnected, and potentially contaminated benchmarks that don't even test for safety or robustness (the MedCheck paper).
This is the AI medical paradox. We are weakening our human experts in favor of a technology whose real-world readiness we are measuring with broken instruments. We risk trading the proven, albeit imperfect, skills of a human doctor for the on-paper brilliance of an AI that might be a formidable test-taker but a brittle and unreliable clinical partner.
So what is the path forward? The solution isn't to abandon AI, but to radically rethink how we integrate and evaluate it.
The bigger picture? Besides colonoscopies and GPT-5, AI tools are spreading across radiology, dermatology, and pathology. If similar deskilling happens elsewhere, we could have a healthcare system that's better with AI but dangerously worse without it.
First, we must heed the warning of the deskilling study. Play the current trajectory out: more AI screening, doctors getting less reliable on their own, AI screeners getting MORE reliable. Eventually, doctors stop doing screenings altogether. After all, what good is a rubber-stamp second opinion ("looks good to me") if the doctors have lost the skill to spot this stuff themselves?
Alternatively… Smart hospitals might need some new guardrails:
- “AI-off” training days to keep human skills sharp.
- Skill monitoring that tracks solo performance, not just AI-assisted results (a rough sketch of what this could look like follows this list).
- Outage drills to ensure teams can still hit quality targets during tech failures.
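For the skill-monitoring idea, here's a rough Python sketch of one way it could work: compute each clinician's detection rate separately for AI-assisted and solo procedures, then flag anyone whose solo rate lags. The CSV columns and the 5-point threshold are illustrative assumptions, not anything hospital systems export today:

```python
# Sketch of a "solo skill" monitor: per-clinician detection rates, split by
# whether AI assistance was on, with a flag for large solo-rate gaps.
# The CSV columns and the 5-point threshold are illustrative assumptions.
import csv
from collections import defaultdict

def detection_rates(path):
    # clinician_id -> {ai_on: [lesions_found, procedures]}
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            ai_on = row["ai_assisted"] == "1"
            counts[row["clinician_id"]][ai_on][0] += row["lesion_found"] == "1"
            counts[row["clinician_id"]][ai_on][1] += 1
    return {
        clinician: {ai: (found / total if total else 0.0) for ai, (found, total) in modes.items()}
        for clinician, modes in counts.items()
    }

def flag_solo_drops(rates, threshold=0.05):
    # Flag clinicians whose solo rate trails their AI-assisted rate by 5+ points.
    return [c for c, r in rates.items() if r[True] - r[False] >= threshold]

rates = detection_rates("procedures.csv")  # hypothetical export from the endoscopy system
print(flag_solo_drops(rates))
```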
Second, we need to fix the measurement problem. The MedCheck paper is a call to action for the entire AI research community. We need to move beyond static, multiple-choice leaderboards and build dynamic, interactive benchmarks that simulate real clinical workflows, prioritize safety and robustness, and are built on clean, representative data.
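As one small example of what "testing for robustness" could mean in practice, here's a sketch that perturbs a clinical vignette with chart-style noise and checks whether the model's answer holds steady. The `ask_model` function is a placeholder for whatever LLM API you would actually call; nothing here comes from the MedCheck paper itself:

```python
# Sketch of a robustness probe: perturb a clinical vignette with the kind of
# noise real charts contain, then check whether the model's answer stays put.
# `ask_model` is a placeholder for whatever LLM API you would actually call.
import random

def add_noise(text, typo_rate=0.03, seed=0):
    # Crude perturbation: randomly drop characters to mimic messy chart text.
    rng = random.Random(seed)
    return "".join(ch for ch in text if rng.random() > typo_rate)

def robustness_score(question, ask_model, n_variants=5):
    # Fraction of noisy variants whose answer matches the clean-input answer.
    baseline = ask_model(question)
    matches = sum(
        ask_model(add_noise(question, seed=i)) == baseline
        for i in range(1, n_variants + 1)
    )
    return matches / n_variants

# Hypothetical usage with a stubbed model:
fake_model = lambda q: "Boerhaave syndrome"  # stand-in; swap in a real API call
print(robustness_score("57-year-old man, chest pain after forceful vomiting ...", fake_model))
```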
Finally, we must redefine the role of the human doctor in the age of AI. If AI makes us better, that’s great. But we can't let it make us helpless. So we either use it properly to keep our skills sharp, or cede territory to the machines and focus our human time on other things.
Maybe the doctors who would have spent that time diagnosing can divert it to talking with patients: preventative care measures, plans to reduce risk, early warning signs.
Or maybe the doctor's role changes even further. This one feels more sci-fi, but perhaps the end game isn't a machine that replaces the diagnostician, but the doctor becoming a sophisticated "AI operator," skilled at prompting, interpreting, and overriding the AI's suggestions. They become a manager of uncertainty, navigating the grey areas where the data is messy and the AI is unsure. And by ceding rote tasks to automation, they can reinvest their time in the uniquely human aspects of medicine: patient communication, empathy, and preventative care, an area where the U.S. healthcare system, despite spending $4.9 trillion a year, invests a mere 3%.
AI's potential in medicine remains immense. But we are at a pivotal and perilous moment. Before we fully embrace a "superhuman" partner, we must first ensure it doesn't make us weaker, and we must be absolutely certain we are measuring its power with a yardstick that reflects reality.