More Convinced. More Cautious: What the Evidence on Medical AI Now Tells Us

Prof. Shafi Ahmed
May 4, 2026

When our Partner, Prof. Shafi Ahmed, first wrote about the most influential papers on large language models in healthcare, in early 2025, the conversation was still dominated by a question we now find oddly quaint.

Can a chatbot pass a medical examination? Sixteen months later, the question has changed entirely. The frontier models we benchmark today, GPT-5, Claude Opus 4.6, Gemini 3.1 Pro, routinely show very strong performance on the multiple-choice tests that defined the early literature. The interesting work has moved on. We are no longer asking whether these systems can answer a USMLE question. We are asking whether they can think alongside a clinician at the bedside, whether they can be trusted with the messiness of real patients, and whether the gains demonstrated in research environments survive the journey into actual hospitals.

In the past eighteen months, three of the world’s most powerful technology companies have launched dedicated healthcare platforms. Several landmark randomised trials have, for the first time, given us prospective evidence on AI scribes and conversational diagnostic agents. New benchmarks have replaced exam-style testing with open-ended clinical reasoning. And we have learned, in ways that should sober us, that some of the safety questions raised in 2024 have not gone away. They have simply become more urgent at scale.

This article revisits the most consequential strands of LLM research in medicine, updates the picture I sketched in early 2025, and asks what the evidence now tells us about where we are, and where we are heading.

When the Model Sits Beside the Doctor: Comparative Performance, Reframed

In late 2024, BMJ Open published an observational comparison of GPT-4 and GPT-4o against Swedish family medicine specialists on open-ended cases drawn from the Swedish national examination. The findings were striking and, at the time, reassuring for clinicians: top doctors scored 7.2 out of 10, average doctors 6.0, GPT-4 only 4.5, and GPT-4o 5.3. The models lagged behind in multimorbidity, social context, compliance, and the legal and psychosocial nuances that define general practice. The authors were careful to note three caveats: zero-shot prompting, no domain fine-tuning, and no adaptation to the Swedish setting. Their conclusion was measured: AI is not yet ready to make clinical decisions, but it may help with the administrative weight that bears down on doctors every day.

That paper now reads like a snapshot from another era. The two studies that have most reshaped our own thinking arrived in the months that followed. The first, published in Nature in April 2025, was Google DeepMind’s OSCE-style evaluation of AMIE, a conversational diagnostic system. Across 159 standardised case scenarios spanning Canada, the United Kingdom, and India, AMIE outperformed primary care physicians on 30 of 32 clinically meaningful axes rated by specialists, and on 25 of 26 axes rated by patient-actors. The dimensions where it excelled were not just diagnostic accuracy. They included communication, empathy, and the felt experience of being heard, which, for most of us in clinical practice, is precisely what we work hardest to deliver and find hardest to measure.

The second is Microsoft’s MAI Diagnostic Orchestrator (MAI-DxO) study, posted as a preprint in mid-2025. On 304 challenging cases published in the New England Journal of Medicine, an orchestrated multi-agent system reached 85.5% diagnostic accuracy. Experienced physicians, given the same sequential simulation, scored around 20%. The accuracy gap was over fourfold, and diagnostic costs fell by more than half. Whatever one’s scepticism about benchmark inflation (and we should retain a great deal of it), a result of this magnitude on cases of this difficulty is not noise.

How do we reconcile this with the Swedish study? The honest answer is that they are measuring different things. Open-ended general-practice scenarios in a national language, with all their cultural and regulatory texture, remain hard. Closed differential diagnosis on complex inpatient cases, even very hard ones, is now within reach of frontier systems and may already exceed individual human performance. The interesting clinical question is no longer which one is better. It is where, for which task, and under what supervision, AI now offers a meaningful uplift.

Why Retrieval Still Matters: Even in the Age of Frontier Models

Prof. Ahmed's January 2025 review highlighted the NEJM AI framework on retrieval-augmented generation as a possible answer to the limitations of zero-shot LLMs in clinical settings. That argument has, if anything, become stronger. Despite the dramatic improvements in raw model capability, hallucination remains the central obstacle to safe deployment. Recent work across journals including Frontiers in Public Health and npj Digital Medicine suggests that well-designed RAG architectures can substantially reduce hallucination rates on public-health question answering - by 40% or more in some studies - and a 2025 radiology evaluation reported near-zero hallucinations when retrieval was grounded against contrast-media guidelines in a controlled setting.

The signal here is not that one technique has won. It is that architecture, more than any single model, determines whether a system is safe to deploy. The platforms launched in early 2026 - Claude for Healthcare, ChatGPT Health, Copilot Health - all rest on retrieval over verified clinical sources, structured connectors to PubMed, ICD-10, regulatory databases, and trial registries, and architectural commitments that the model’s parametric knowledge is never the last word. The frontier model is the engine, but the retrieval and orchestration layers are increasingly where clinical safety is engineered. For health-system leaders evaluating AI suppliers, the most important questions in 2026 are no longer about model size. They are about what the model is allowed to look at, what it is required to cite, and what happens when it does not know.
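To make those three questions concrete, here is a minimal sketch of a retrieval gate, written in Python with a toy in-memory corpus. It is not how any of the named platforms is implemented. The `Passage` type, the `retrieve` and `answer` functions, the word-overlap scoring, and the placeholder corpus text are all assumptions chosen for illustration. The sketch shows one possible shape for the idea: the system may only look at an approved corpus, must attach a source to anything it returns, and abstains when nothing relevant is found.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source_id: str  # hypothetical identifier for a verified source
    text: str


# Toy stand-in for a verified clinical corpus (placeholder text, not guidance).
CORPUS = [
    Passage("doc-1", "example guidance text about contrast media and kidney function"),
    Passage("doc-2", "example guidance text about vaccination schedules for adults"),
]


def retrieve(question, corpus, min_overlap=2):
    """Return passages sharing at least `min_overlap` words with the question.

    Word overlap is a deliberately crude relevance score used only to keep
    the sketch self-contained; a real system would use a proper retriever.
    """
    q_words = set(question.lower().split())
    hits = []
    for passage in corpus:
        overlap = len(q_words & set(passage.text.lower().split()))
        if overlap >= min_overlap:
            hits.append((overlap, passage))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in hits]


def answer(question, corpus=CORPUS):
    """Answer only from retrieved text, always cite, and abstain otherwise."""
    passages = retrieve(question, corpus)
    if not passages:
        # "What happens when it does not know": an explicit abstention.
        return {"status": "abstain", "answer": None, "citations": []}
    top = passages[0]
    # Stub: a real system would pass `top.text` to a model as grounding;
    # here the retrieved text is returned directly.
    return {"status": "answered", "answer": top.text, "citations": [top.source_id]}


if __name__ == "__main__":
    print(answer("contrast media and kidney function"))
    print(answer("sprained ankle treatment"))
```

The design choice worth noticing is that abstention is a first-class return value and not an error path, so downstream code cannot mistake "no evidence found" for an answer.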

Where the Field Is Moving: From Capability Claims to Real-World Evidence

Three new pieces of evidence deserve to sit alongside the foundational reviews cited last year. The first is the December 2024 study, Superhuman Performance of a Large Language Model on the Reasoning Tasks of a Physician, which evaluated OpenAI’s o1-preview on differential diagnosis and management. That paper has now been substantially extended by the CPC-Bench work using a century of New England Journal of Medicine clinicopathological conferences. Across 377 contemporary cases, OpenAI’s o3 ranked the correct final diagnosis first in 60% of cases and within the top ten in 84%, and selected the appropriate next test in 98%. A twenty-physician baseline was outperformed across most measures. Image and literature-search tasks remain harder, but on text-based differential diagnosis the trajectory is unambiguous.
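For readers less familiar with how figures such as "ranked first in 60%" and "within the top ten in 84%" are typically computed, the sketch below shows a generic top-k accuracy calculation in Python. It is an illustration of the general metric, not the CPC-Bench scoring procedure, which is not described here. The function name `top_k_accuracy` and the toy labels are assumptions.

```python
def top_k_accuracy(ranked_lists, truths, k):
    """Fraction of cases whose true label appears in the first k ranked guesses."""
    if len(ranked_lists) != len(truths):
        raise ValueError("ranked_lists and truths must have the same length")
    if not truths:
        raise ValueError("at least one case is required")
    hits = sum(
        1 for ranked, truth in zip(ranked_lists, truths) if truth in ranked[:k]
    )
    return hits / len(truths)


if __name__ == "__main__":
    # Four toy cases; each inner list is a ranked differential (made-up labels).
    differentials = [
        ["a", "b", "c"],
        ["b", "a", "c"],
        ["c", "b", "a"],
        ["b", "c", "d"],
    ]
    final_diagnoses = ["a", "a", "a", "a"]
    for k in (1, 2, 3):
        print(k, top_k_accuracy(differentials, final_diagnoses, k))
```

Because a larger k can only add hits, top-ten accuracy is always at least as high as top-one accuracy. The gap between the two indicates how often the correct diagnosis is on the list but not at the head of it.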

The second is the Lancet Digital Health STANDING Together consensus, developed through a broad multi-stakeholder Delphi process involving hundreds of international contributors, which sets out the standards we should apply to clinical AI evaluation: representative datasets, transparency about training populations, explicit reporting of subgroup performance. A broader systematic review in npj Digital Medicine, spanning eighty-three studies, has clarified the picture: generative models perform comparably to or slightly better than non-expert clinicians on diagnostic tasks, while expert clinicians still significantly outperform them. The replacement narrative is, for now, not supported by the data. The augmentation narrative very clearly is.

The third is HealthBench, OpenAI's benchmark built with 262 physicians who have collectively practised in sixty countries, across five thousand realistic clinical conversations. It is the first widely adopted evaluation that prizes safe escalation, contextual judgement, and appropriate uncertainty over single-answer accuracy. Whatever we may think of any particular vendor, this is the right direction of travel. Medicine is not a multiple-choice test. The benchmarks we use ought to reflect that.

Taken together, these studies suggest a field that is maturing rapidly. We have moved beyond capability claims toward genuine evidence about what these systems can and cannot do, for whom, under which conditions. That is the foundation a serious clinical discipline requires.

The Cognitive Mirror: What Testing Models Like Patients Actually Tells Us

The BMJ Christmas paper, Age Against the Machine, remains one of our favourite pieces of medical AI writing. Subjecting ChatGPT-4 and 4o, Claude 3.5 Sonnet, and Gemini 1.0 and 1.5 to the Montreal Cognitive Assessment, the authors found signs of mild cognitive impairment in every model except ChatGPT-4o, with the oldest models faring worst: Gemini 1.0 scored 16 of 30. The result was a wry inversion: the same tools that we hoped might one day diagnose dementia were themselves failing the test we use to detect it.

It is worth pausing on what has changed. The models in that study are now two generations behind the frontier. Claude Opus 4.6, GPT-5, and Gemini 3.1 Pro perform substantially better on visuospatial and executive tasks, though independent re-runs of the protocol on newer systems are still emerging. The deeper insight, however, has not aged. The MoCA was designed for a particular kind of mind, one that perceives, attends, and reasons in ways grounded in embodied experience. Applying it to a language model exposes the gap between linguistic fluency and the kinds of cognition that medicine actually depends on. A model that can write a beautiful discharge summary may still be unable to draw a clock face, and the lesson is not that we should expect it to. The lesson is that fluency without grounding can be persuasive in ways that warrant clinical caution.

Empathy in Code, Reconsidered

Few claims about LLMs in medicine are as contested, or as consequential, as the empathy claim. The widely discussed 2023 JAMA Internal Medicine study, cited last year, found that clinicians preferred ChatGPT-style responses to physicians’ answers in about 79% of patient-question comparisons and rated them as more empathetic and higher quality. Subsequent work, including a 2025 Nature study of AMIE using OSCE-style patient-actor evaluations, found that an AI system could be rated as more empathetic than primary-care physicians on multiple dimensions in simulated consultations. In March 2025, an NEJM AI randomised controlled trial of Therabot, a generative-AI mental-health agent, reported clinically significant reductions in depressive and anxiety symptoms compared with a waitlist control, with patient-reported therapeutic alliance scores similar to those seen with human therapists.

These findings deserve to be taken seriously, and they should also be held against the warning issued by the Nature Machine Intelligence essay, Empathic AI Can’t Get Under the Skin. Models do not feel. They generate text that, statistically speaking, is associated with feeling. When patients perceive empathy in a model, they are perceiving something that is, in a strict sense, simulated. That does not necessarily make the comfort it provides illegitimate, particularly in low-resource settings where the alternative is no support at all. But it places a heavy responsibility on those of us deploying these systems to be honest about what they are.

Our own view is that empathy is not a metric. It is a relationship. AI may, and increasingly will, do an extraordinary amount of useful emotional and informational work in medicine. It will not, in any near-term horizon we can see, replace the act of one person bearing witness to another’s suffering. The two roles are different, and a healthcare system that confuses them does so at considerable cost.

Ambient Scribes: From Conflicting Signals to Converging Evidence

Of all the strands covered last year, the AI scribe story is the one that has resolved most decisively. In early 2025, the picture was genuinely confusing. An NEJM AI study of DAX Copilot at Atrium Health found no statistically significant differences in EHR-related or financial metrics between DAX users and controls, though it noted modest reductions for high-utilisation subgroups. A JAMIA study at Stanford, by contrast, reported reduced workload, decreased burnout, and better usability. Prof. Ahmed argued at the time that the discrepancy reflected differences in study design, metrics, and institutional context.

The evidence base since then has matured considerably. A pragmatic three-arm randomised controlled trial published in NEJM AI in late 2025, conducted across UCLA Health and led by Paul Lukac and colleagues, randomised 238 outpatient physicians from fourteen specialties to DAX Copilot, Nabla, or usual care. Nabla produced a statistically significant reduction in time-in-note of around 9.5% (P=0.02), while DAX did not (a non-significant 1.7% reduction, P=0.66). On secondary, non-hypothesis-tested measures, both arms showed potential improvements of roughly 7% in burnout, task load and work exhaustion scores. A separate JAMA Network Open study across 263 clinicians and six health systems found that burnout fell from 52% to 39% within thirty days of ambient AI adoption, a 13.1-percentage-point reduction that is, by any measure, clinically meaningful. By 2026, DAX Copilot alone serves more than six hundred organisations and processes over three million encounters per month. Abridge has been awarded Best in KLAS for ambient AI in two consecutive years.

So the answer to the question Prof. Ahmed raised last year is now clearer. Ambient AI scribes do work, in the sense that they meaningfully reduce documentation burden and improve clinician wellbeing in the right institutional setting. The conditions matter: hallucination rates remain non-trivial (with some published evaluations reporting rates in the high single digits), physical examination findings are particularly vulnerable to fabrication, and physician review of every note is non-negotiable. But the conflicting picture of early 2025 has given way to converging evidence. Ambient documentation has become, in less than two years, one of the few categories of clinical AI with prospective randomised evidence supporting routine adoption.

In the last 18 months, start-ups in this area have raised more than 2 billion dollars, and AI scribes remain the easiest rung on the ladder of AI implementation because they are perceived to carry low risk.

That said, we want to flag a more uncomfortable finding from this year’s literature. A stress-test of consumer-facing health AI, published in Nature Medicine, used 60 clinician-authored vignettes across 21 clinical domains and found that the leading consumer platform under-triaged 52% of gold-standard emergencies, redirecting patients with conditions including diabetic ketoacidosis and impending respiratory failure to 24-to-48-hour evaluation rather than the emergency department. When family members minimised symptoms, recommendations shifted further. These results do not invalidate consumer health AI. They do mandate vigilance, particularly as these tools reach hundreds of millions of users.

Where This Leaves Us

We find ourselves more convinced than in early 2025, and also more cautious. More convinced because the evidence for genuine clinical utility, in documentation, in differential diagnosis on hard cases, in patient-facing communication, in mental-health support, has accumulated faster than we anticipated. More cautious because the failure modes have also been clarified. Models that perform brilliantly in research conditions can under-triage emergencies at the consumer scale. Benchmarks that suggest superhuman performance can mask weaknesses in the populations and presentations that matter most. The covenant of care between healer and patient remains, in our view, irreducibly human, even as AI does more and more of the work that surrounds it.

The papers we would point any clinician or health-system leader toward in April 2026 are not the ones that promise the most. They are the ones that test most rigorously. AMIE in Nature. MAI-DxO on the NEJM cases. The UCLA randomised trial of DAX and Nabla. HealthBench as the new benchmark standard. The STANDING Together consensus on equity in evaluation. Read together, they sketch a discipline finding its feet, less infatuated with capability, more serious about evidence, and increasingly clear that the question is not whether AI will reshape medicine but how, and whether we will do it well.

That, we think, is a healthier place to be than the year of speculation we left behind. The job in front of us now is to ensure the next eighteen months are governed by the same standards we expect of any clinical intervention: prospective evidence, equitable access, transparency about limitations, and an unwavering commitment to the patients in whose service all of this work is supposed to be done.