Anyone following the emerging evidence on large language models (LLMs) in clinical medicine will notice something unusual. A systematic review of 30 studies found diagnostic accuracy ranging from 25% to 98%, a nearly fourfold spread across similar models on similar tasks 1 .
Minor changes in prompt wording and format can shift model performance substantially, even when the underlying clinical task is unchanged 2 . LLMs may also faithfully carry out nonsensical instructions. For example, when asked to write letters advising patients to switch from Tylenol to acetaminophen (the same drug) because of ‘new side effects’, models complied 100% of the time 3 .
Most evaluations (91.7%) have detected demographic biases in LLMs 4 . Others have found the same models “relatively invariant to race and ethnicity” 5 .
Across 3.4 million clinical responses, models made internally contradictory decisions: labeling Black unhoused patients as at the highest addiction risk while simultaneously prescribing them the most opioids 6 . The easy reading of these inconsistencies is contradiction.
Research is full of those.
Nature Medicine published a clinical update in Research Highlights on 23 Apr 2026.
The item focuses on how to meaningfully evaluate AI in clinical medicine.