Nature | Research Highlights

How to meaningfully evaluate AI in clinical medicine

23 Apr 2026 | 4 min read

GIST

Anyone following the emerging evidence on large language models (LLMs) in clinical medicine will notice something unusual. A systematic review of 30 studies found diagnostic accuracy ranging from 25% to 98%, a nearly fourfold spread across similar models on similar tasks 1.

Minor changes in prompt wording and format can shift model performance substantially, even when the underlying clinical task is unchanged 2. LLMs may also faithfully carry out nonsensical instructions: when asked to write letters advising patients to switch from Tylenol to acetaminophen (the same drug) because of ‘new side effects’, models complied 100% of the time 3.

Most evaluations (91.7%) have detected demographic biases in LLMs 4. Others have found the same models “relatively invariant to race and ethnicity” 5.

Across 3.4 million clinical responses, models made internally contradictory decisions: labeling Black unhoused patients as having the highest addiction risk while simultaneously prescribing them the most opioids 6. The easy reading of these inconsistencies is contradiction.

Research is full of those.

Clinical Editorial

Summary

Nature Medicine published a clinical update in Research Highlights on 23 Apr 2026.

The item focuses on how to meaningfully evaluate AI in clinical medicine.

Review the original article for the full source wording and details.

