AI can reason like a doctor — but that’s only the beginning of the problem

Artificial intelligence now scores at the top of medical reasoning tests. The harder question is what that actually means for patients and the future of medicine.

LongevityWatch editors, May 7, 2026

A few years ago, it was considered a landmark when an AI system passed the US medical licensing exam. Today, that’s a baseline expectation. Models like GPT-4 and its successors routinely score in the upper ranges of human performance on medical reasoning benchmarks — and in specific domains, such as reading radiology images or detecting certain skin cancers, they already outperform average specialists. A commentary in Science takes stock of where this leaves medicine: not with answers, but with a set of urgent, largely unresolved questions.

The central distinction the authors draw is between reasoning like a physician and understanding like one. Large language models are trained on vast quantities of medical text — they know the associations, the clinical patterns, the terminology. But they have no relationship with the patient, no sense of responsibility for the outcome, and no capacity to explain their reasoning in a way that a colleague could genuinely interrogate and challenge. Transparency and accountability — foundational to medical practice — remain structurally weak in current AI systems.

The gap between benchmarks and the bedside

There’s also a methodological issue worth taking seriously. The tests on which AI systems are evaluated largely measure the ability to select correct answers in multiple-choice clinical scenarios. That’s not the same as seeing a patient in person, reading non-verbal cues, managing real-time uncertainty, or making decisions with incomplete information under time pressure. Physicians do all of this constantly. AI systems are rarely tested on it systematically.

A second problem is data leakage: large language models may have been trained on material that resembles the questions they are later tested on, inflating apparent performance. This makes it difficult to assess whether impressive benchmark scores translate to reliable performance in genuinely novel clinical situations, exactly the kind a physician encounters most days.

Integration without guardrails

The commentary isn’t a call to keep AI out of medicine; far from it. The authors see substantial potential, particularly in supporting clinicians in resource-limited settings, accelerating diagnostics, and synthesising complex medical literature at scale. What they argue for is rigorous validation, clear regulatory frameworks, and workable structures of accountability.

That last point is the one that feels most unresolved. If an AI-assisted recommendation turns out to be wrong, who bears responsibility — the physician who followed it, the developer who built the system, the hospital that deployed it? This question is neither legally nor ethically settled. And it becomes more pressing as AI systems move from advisory roles toward direct involvement in diagnostic decisions. The fact that AI can now reason like a physician is remarkable. The fact that medicine has not yet figured out what to do with that is equally so.
