LLMs Fail 80%+ of Differential Diagnoses When Patient Data Is Incomplete, JAMA Study Finds

TL;DR: A new JAMA Network Open study led by Mass General Brigham’s MESH Incubator found that all 21 leading large language models tested, including frontier models from OpenAI, Anthropic, Google, xAI and DeepSeek, failed more than 80% of the time at producing appropriate differential diagnoses when patient data was incomplete. By contrast, the best performers exceeded 90% accuracy on the final diagnosis once they had full clinical information.

The study used 29 clinical vignettes from a standard medical reference text, feeding models information in the sequence a real doctor would encounter it: history, physical examination, then laboratory results.

Context and Background

The finding matters because differential diagnosis, the generation of a testable list of candidate conditions, is where clinical reasoning actually lives. Converging on a final answer once the test results are back is the easy part. As lead author Arya Rao of Harvard Medical School put it: “These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn’t much information.”

Other coverage reinforces the picture. The EurekAlert summary of the same study adds that the researchers developed a stepwise scoring system called PrIME-LLM, which prevents strong performance in one phase from masking weak performance in another. Under that framework, model scores ranged from 64% for Gemini 1.5 Flash to 78% for Grok 4 and GPT-5. More recent models outperformed older ones, pointing to incremental improvement, but none reached the threshold for “unsupervised clinical-grade deployment”, in the words of corresponding author Dr Marc Succi.

For the NHS, the implications run in two directions. First, consumer-AI misdiagnosis is a real patient-safety concern: when people arrive at GP surgeries or A&E already anchored to a chatbot’s erroneous preliminary diagnosis, that anchoring can bias subsequent clinical reasoning. Sanjay Kinra of the London School of Hygiene & Tropical Medicine, quoted in the FT coverage, noted that general-purpose LLMs are unlikely to match clinical assessments that “rely heavily on the look and feel of the patient”, but acknowledged they “may have a role to play, particularly in situations or geographies in which access to doctors is limited”.

Second, the study implicitly sets the bar for specialised medical LLMs such as Google’s AMIE and MedFound. Its findings imply that any medical-grade model should demonstrate performance not just on fully specified cases but also on the messy, information-incomplete opening stages where real diagnoses are made.

Looking Forward

For NHS trusts piloting AI triage and symptom-checker tools, the study is a hard reminder that “90% accuracy” headlines typically refer to the easier second half of the diagnostic process. Procurement teams evaluating medical-AI vendors should specifically require performance data on stepwise differential-diagnosis tasks — not just final-diagnosis accuracy on complete case files.
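
As a rough illustration of why stage-level reporting matters, the sketch below tabulates accuracy separately for the differential-diagnosis stage and the final-diagnosis stage, alongside a single pooled figure. The record layout, the stage names and the function names are assumptions made for this example, and the records are invented. It is not the PrIME-LLM scoring method and uses no data from the study.

```python
from collections import defaultdict


def stage_accuracy(records):
    """Return {stage: fraction of records judged correct} for each stage present."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        totals[rec["stage"]] += 1
        correct[rec["stage"]] += int(rec["correct"])
    return {stage: correct[stage] / totals[stage] for stage in totals}


def pooled_accuracy(records):
    """Return one accuracy figure across all stages combined."""
    return sum(int(rec["correct"]) for rec in records) / len(records)


# Invented records for illustration only; not results from the study.
records = [
    {"case": 1, "stage": "differential", "correct": False},
    {"case": 1, "stage": "final_diagnosis", "correct": True},
    {"case": 2, "stage": "differential", "correct": False},
    {"case": 2, "stage": "final_diagnosis", "correct": True},
    {"case": 3, "stage": "differential", "correct": True},
    {"case": 3, "stage": "final_diagnosis", "correct": True},
    {"case": 4, "stage": "differential", "correct": False},
    {"case": 4, "stage": "final_diagnosis", "correct": True},
]

per_stage = stage_accuracy(records)
pooled = pooled_accuracy(records)

for stage, acc in per_stage.items():
    print(f"{stage}: {acc:.0%}")
print(f"pooled: {pooled:.0%}")
```

With these invented records the pooled figure sits well above the differential-stage figure, which is the kind of masking that a per-stage breakdown in a vendor submission would expose.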