ChatGPT MD?
Get this: A Reddit user just credited ChatGPT with probably saving his wife's life.
Following a cyst removal, she was feeling feverish, but told him she wanted to “wait it out.”
The husband casually plugged all the details into ChatGPT (as he often did) and was surprised to see it urge them to get to the ER, ASAP.
He said GPT was normally much more chill, so the urgency was a bit of a shock.
And guess what? It was right.
She had developed sepsis.
Naturally, the comments exploded with similar stories:
AI catching blood clots, diagnosing gallbladder issues, even spotting rare conditions that stumped 17 different doctors. It was honestly kind of amazing to read?? And by amazing, we mean kiiiinda hard to believe.
Also, before we go any further, we should state for the record:
None of this is medical advice, nor should it be construed as an endorsement of using AI for your healthcare. You should also probably read about how AI companies hold onto your data and may not delete your chats when they say they do.
Now, before you go designate ChatGPT as your new primary care physician, listen to this:
A major new Oxford study with ~1,300 participants revealed a vital truth about AI medical advice: people using AI chatbots performed worse than those just using Google.
The Oxford study tested GPT-4o, Llama 3, and Command R+ on medical scenarios with real people.
Here's what they found:
When the AIs worked alone, they scored 90-99% accuracy identifying conditions.
But when paired with humans, users could only identify relevant conditions 34.5% of the time—significantly worse than the 47% achieved by people just using Google and their own knowledge.
The breakdown? Users provide incomplete information, can't distinguish good from bad AI suggestions, and often ignore the AI's recommendations even when it gets things right.
The Oxford study exposes a critical flaw in how we evaluate AI:
Standard medical benchmarks showed 80%+ accuracy, while real-world performance with the same AI was below 20% in some cases.
The AIs that scored 99% accuracy when tested alone completely failed when real people tried to use them.
Another comprehensive 2025 meta-analysis of 83 studies found that generative AI models have an overall diagnostic accuracy of just 52.1%—performing no better than non-expert physicians but significantly worse than expert physicians.
The data on ChatGPT's medical accuracy is all over the map... from 49% accuracy in one study to outperforming emergency department physicians in another.
But the Oxford paper pointed out that the real issue isn't medical knowledge—it's the people skills.
As one medical expert put it: "LLMs have the knowledge but lack social skills."
In one sense, the bottleneck isn't medical expertise but human-computer interaction design.
People are already using ChatGPT as a "frontline triage tool" and only acting when the AI says "see doctor ASAP." But regular non-docs struggle to communicate effectively, even with fake scenarios where they have no reason to lie or withhold information.
This mirrors broader AI development patterns. Two years ago, "prompt engineering" was hyped across tech companies, then largely faded as models got better at handling typical user questions.
But that progress hasn't carried over to medical conversations with non-professionals, where hedged "sometimes do X" advice breaks down when the conditions themselves are fuzzy.
What about reasoning models (o1, o3, etc.)?
Recent studies show the newer models doing better. A March 2025 emergency department study tested various ChatGPT models on 30 real patient cases and found that GPT-4o and o1 significantly outperformed older versions, with both hitting 60% accuracy on leading diagnoses, compared to GPT-3.5's 47.8%.
Interestingly, asking the AI to explain its reasoning improved performance for GPT-4o models—boosting accuracy from 45.6% to 56.7% in some cases—but had minimal impact on o1 models, suggesting that newer reasoning-focused models already incorporate this step internally.
A separate June 2025 pneumonia study evaluated three AI models on 50 pneumonia-specific questions. OpenAI o1 achieved the highest accuracy scores, followed by o3 mini, with ChatGPT-4o ranking lowest.
Crucially, the two chain-of-thought models (o1 and o3 mini) demonstrated robust self-correction abilities—when infectious disease specialists flagged errors, these models significantly improved their responses, while ChatGPT-4o showed minimal improvement upon re-prompting.
Even with these improvements, we're still not at the reliability level needed for unsupervised medical advice.
But all these studies highlight the same pattern: AI models are getting better at medical reasoning, but the human-AI interaction remains the critical bottleneck for real-world deployment.
Speaking of, WebMD aficionados aren’t the only ones using AI.
On the flip side, doctors themselves have access to some amazing AI tools. The startup OpenEvidence (an AI diagnosis engine used by ~25% of all US physicians) is trained exclusively on peer-reviewed medical literature and partners with journals like NEJM and JAMA (here’s a cool demo). Doctors reportedly access it an average of 10 times a day to help with complex cases, especially in oncology.
OpenEvidence has already raised $100M at a $3B valuation, and even sued a competitor for allegedly impersonating doctors to get access to its proprietary algorithms. That's how you know the tool is good.
Meanwhile, some hospitals (like Johns Hopkins) use AI to predict sepsis risk…the exact thing ChatGPT told the husband to take his wife to the hospital for.
Then there's Ellipsis Health, who just raised $45M for "Sage," an AI that makes fully autonomous care management calls to patients—handling everything from program enrollment to post-discharge follow-ups without human oversight.
And we’ve written before about how Abridge and Microsoft are essentially the two leaders in the doctor AI note-taking race. Although, over the last few weeks, two competitors just got additional funding:
- Commure writes your medical notes and handles insurance paperwork automatically after each patient visit (raised $200M).
- Nabla listens to your patient conversations and automatically writes your medical notes in 5 seconds (raised $70M).
The VCs see there's money to be made as adoption increases, obviously.
And of course you’ve seen how AI has been used to diagnose all kinds of diseases and disorders (here's an attempt by o3 Pro at a definitive list).
One patient even watched her dermatologist check drug interactions using ChatGPT on her phone during an appointment.
Keep in mind: doctors don't fully trust AI (and rightfully so). As of February, 47% of physicians want increased FDA oversight of AI medical devices, while researchers warn that "many hospitals are buying AI tools 'off the shelf' without local validation," which is "a recipe for disaster."
So adoption is increasing, and skepticism with it.
So why do people resort to using AI for their medical needs?
Because the #1 thing Americans want is easier access to healthcare.
People in the US don’t want to go to the doctor, even for preventative care visits, because it’s too expensive, not covered by insurance, or plagued by months-long wait lists.
And don't even get me started on the obscenely ludicrous ordeal it is to find a new doctor in a new city that is ALSO "in network." You're either picking names out of a list (horrible), scouring "doctor review sites" (who even writes those??), or typing "best doctor blue shield 90210 +reddit" to try and triangulate some sort of direction.
So yeah, obviously people turn to unregulated consumer chatbots, because that's what's easily available.
The conversational interface makes a big difference, too.
Instead of WebMD or Mayo Clinic's laundry list of symptoms that somehow always points to either "you're fine" or "death", you can actually describe what you're feeling and get a nuanced response.
“I have a headache, but I also feel nauseous and my vision seems off” hits different than checking boxes on a symptom checker or scrolling between symptoms and potential causes on an endless WebMD doom loop.
As a result, the answers from GPT feel more trustworthy because they're more personalized.
The real value of having a chatbot-like interface isn't the AI diagnosis (arguably, that’s the dangerous part). It's AI helping overcome that very human tendency to downplay symptoms when you should absolutely see a doctor.
But because we don’t know what we don’t know (we ain't doctors), and we don’t share everything that could be relevant (we're not prompted to, and most of us aren't as thorough as tech workers who naturally "over-report" every possibly-relevant detail for "QA" purposes), we aren’t getting the most accurate help we could.
In fact, we read earlier today that some people give AI as little as one or two sentences of context and still expect it to produce fully on-brand marketing materials. Accurate health advice needs far more context than that.
Pro tip: don't prompt AI like it's Google. Write it the kind of memo or project brief you'd give a coworker who had to cover your job for a day.
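To make that concrete, here's a tiny, purely illustrative Python sketch of the difference. Every field and detail below is invented, and (again) none of this is medical advice; the point is the structure, not the specifics:

```python
# Illustrative only: invented example details, not medical advice.

lazy_prompt = "why head hurt"  # what most of us actually type at 2 AM

memo = {
    "Main symptom": "headache for 3 days, worse at night",
    "Other symptoms": "nausea, vision seems slightly off",
    "Relevant history": "cyst removal two weeks ago, no history of migraines",
    "Medications": "ibuprofen 400mg, roughly twice a day",
    "What I want": "what follow-up questions should I answer, and what would make this urgent?",
}

# Turn the memo into a single prompt you could paste into any chatbot.
memo_prompt = "\n".join(f"{field}: {detail}" for field, detail in memo.items())
print(memo_prompt)
```

Same person, same symptoms. The second version just hands the model the context it would otherwise have to drag out of you.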
This is why relying on ChatGPT for health advice is dangerous:
According to the Oxford study, direct AI-patient interaction is fundamentally broken: people can't effectively communicate with AI systems, can't evaluate AI suggestions, and often ignore correct advice. So when they turn to GPT on their own, they're creating exactly the dangerous scenario the researchers published this data to warn about.
Instead, medical experts emphasize that "AI should be regarded solely as a tool to aid physicians in their decision-making processes, which must always be under human control and supervision."
They want AI to be the Robin to the doctor's Batman—helpful sidekick, not the hero. The doctor should always be driving the car; AI just helps with navigation.
The Oxford researchers also recommend "systematic human user testing to evaluate interactive capabilities prior to public deployments in healthcare."
That's something that should be obvious but apparently isn't: Before you release an AI medical tool into the wild, maybe test it with actual humans first?
Not in a lab with tech-savvy researchers, but with your average person who types "why head hurt" into Google at 2 AM. See if regular folks can actually use the thing, and train it to do what a doctor or nurse would do.
There's also a knowledge gap that even perfect AI can't bridge: doctors have local, contextual knowledge—recognizing when they've "already seen 2 dozen other kids with the same condition this month"—that no general AI system can replicate.
So then, what we actually need is some kind of doctor-supervised AI triage portal…
Perhaps one where every patient conversation gets a physician review before the final diagnosis is revealed. Think OpenEvidence-level medical AI, but designed for consumers, and backed by real doctor oversight.
The ideal system would combine AI's 24/7 availability and conversational interface with mandatory physician oversight of every interaction.
Why doesn't this exist yet?? Get on this, startups and VCs!!
OH, and it has to be as easy to talk to as ChatGPT and free and/or covered by insurance so people actually use it.
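For the builders in the audience: the core of that "a physician reviews everything before the patient sees it" idea isn't exotic. Here's a deliberately toy Python sketch, where every class name, status, and field is invented for illustration; a real system would need clinician-designed workflows, auth, audit trails, and a lot more:

```python
# Toy sketch of a review gate: the AI's draft is queued for a clinician,
# and the patient sees nothing until a human approves or rewrites it.
# All names and statuses here are invented for illustration.
from dataclasses import dataclass

@dataclass
class TriageCase:
    patient_summary: str            # structured intake, not a one-line complaint
    ai_draft: str                   # whatever the model suggested
    status: str = "PENDING_REVIEW"
    released_text: str = ""         # the only thing the patient app ever displays

review_queue: list[TriageCase] = []

def submit_case(summary: str, ai_draft: str) -> TriageCase:
    """AI output goes into the queue; nothing is shown to the patient yet."""
    case = TriageCase(patient_summary=summary, ai_draft=ai_draft)
    review_queue.append(case)
    return case

def physician_review(case: TriageCase, approved: bool, final_text: str) -> None:
    """Only a clinician's sign-off releases (or replaces) the AI's draft."""
    case.status = "RELEASED" if approved else "REWRITTEN"
    case.released_text = final_text
```

The whole point is the queue: the AI drafts, but nothing reaches the patient until a human decides it should.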
Thankfully, studies like the Oxford one are also teaching us crucial lessons about AI design failures:
For example, giving users multiple diagnosis options to choose from might be counterproductive, since non-experts can't evaluate medical suggestions accurately.
Or how AI needs to be more proactive at extracting information, rather than waiting for users to volunteer details.
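That second lesson is easy to picture in code: "proactive" just means the tool drives the questioning instead of waiting for a free-form complaint, and hands back one recommendation rather than a menu of diagnoses. Another toy sketch, with made-up placeholder questions (not a clinical protocol):

```python
# Toy sketch of proactive intake: the tool drives the questioning.
# These questions are invented placeholders, not a clinical protocol.

INTAKE_QUESTIONS = [
    "What's the main symptom, and when did it start?",
    "Any fever? If so, how high, and for how long?",
    "Any recent surgeries, procedures, or new medications?",
    "Is it getting better, worse, or staying the same?",
]

def run_intake() -> dict[str, str]:
    """Collect an answer to every question before any advice is generated."""
    return {q: input(q + " ") for q in INTAKE_QUESTIONS}

if __name__ == "__main__":
    answers = run_intake()
    # A real system would pass this structured record to the model (and a
    # reviewing clinician), then surface a single recommendation, not a list
    # of possible diagnoses for the patient to pick from.
    print(f"Collected {len(answers)} answers for review.")
```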
The problem? Hospitals won't touch anything that increases malpractice risk...
...Even if it could help patients. So instead of getting the safe, responsible version, people turn to good ol' GPT.
Why is this, exactly? Well, the legal landscape is a total mess. Right now, physicians, health systems, and algorithm designers are subject to different, overlapping theories of liability, creating what one expert called "quite unsettled" legal territory. The liability theories are all over the map—medical malpractice, vicarious liability (where the hospital is responsible for AI "acting as an employee"), and product liability for the AI makers.
Despite its latest advances, AI is in most cases used as a tool for advice, not decision-making. That means a patient might be able to sue a physician for malpractice or negligence if the provider makes an incorrect treatment decision, even if it was suggested by an AI system. Doctors are stuck between a rock and a hard place: use AI and risk liability if it's wrong, or don't use it and risk liability for not using available technology.
And once AI gets precise enough that its recommendations are very likely accurate, it becomes hard for a doctor to consult it and then discard what it says, because if something goes wrong, that decision could come back to haunt them in court. Talk about a catch-22!
Some states have floated "safe harbor" provisions that would protect doctors who follow AI guidelines, but such legislation would have changed the liability outcome in favor of the physician defendant in only 1% of 266 claims reviewed in Oregon. Not exactly a game-changer.
Stanford researchers found that lawyers tend to give healthcare providers very conservative advice (shocker, I know): they aren't necessarily banning AI tools, but they are strongly warning clients about the liability risks of AI in general.
Translation: legal departments are so spooked by potential lawsuits that they're blocking tools that could save lives. Which raises the question: are we avoiding the solution we actually need just because we're scared of getting sued?
People are asking ChatGPT (of ALL things) for critical, life-or-death advice, for crying out loud. With some proper guardrails and some qualified humans in the loop, I think you'll be okay!!**
It's actually a huge, important issue to resolve, and policymakers, legal experts, and health care professionals probably need to collaborate on a framework that promotes the safe and effective use of AI.
As part of that work, they'll need to address concerns over liability... but some argue they must also recognize that the risks of not using AI to improve care could far outweigh the dangers posed by the technology itself.
Consider this: 371K Americans die annually from misdiagnoses, while another 250K die a year from avoidable medical errors. But hospitals are so terrified of AI liability that they're sticking with the status quo that's literally killing people.
So what does this mean in practice?
The healthcare industry is inherently risk-averse (and rightfully so: we're talking life and death, literally), which means health systems are advancing AI in clinical areas more gradually.
While operational applications of AI continue to expand, hospitals will remain cautious about clinical use cases due to the sensitive nature of patient care.
But once some sort of legal framework is established, and accuracy can be ensured, privacy can be secured, and proper user intake with clinical-grade review can be baked into the system, there can and should be an AI doctor system available to patients on demand at all hours of the night. Only question is: who is going to build it?
**Not legal / health advice: just a grumpy old opinion from someone fed up with the current system, k thx bye!