In medicine, reasoning is not just about being right or wrong; it is about how confident you should be given incomplete information. That kind of judgment has always set clinicians apart, and now researchers are testing whether AI in medicine can learn it too.
A new study in NEJM AI puts large language models (LLMs) to the test on exactly this skill, using Script Concordance Testing (SCT), a tool designed to measure how well a clinician updates a diagnostic hypothesis when presented with new information.
The results were intriguing: LLMs outperformed medical students, but fell short of residents and attendings. The gap wasn’t about factual knowledge—it was about judgment. The authors noted that models struggled to recognize when new information shouldn’t change a hypothesis, and tended to swing too far toward extreme answers. In other words, they lacked the nuance clinicians develop through experience: knowing when to stay steady and when to pivot.
As the paper put it, reasoning-tuned models "underused the 0 response," meaning they rarely said, "This new piece of data doesn't actually change my mind." Instead, they were overconfident, treating every clue as a plot twist.
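To make that concrete, here is a minimal sketch of the aggregate scoring typically used for SCT items, with a hypothetical item and panel; none of the numbers below come from the NEJM AI study.

```python
# Minimal sketch of aggregate SCT scoring. The examinee rates how a new
# finding affects a working hypothesis on a -2..+2 Likert scale, where 0
# means "this does not change my hypothesis." Credit is the fraction of the
# expert panel choosing the same response, normalized to the modal response.
from collections import Counter

def sct_item_score(examinee_response: int, panel_responses: list[int]) -> float:
    """Return partial credit in [0, 1] for one SCT item, scored against an expert panel."""
    counts = Counter(panel_responses)
    modal_count = max(counts.values())
    return counts.get(examinee_response, 0) / modal_count

# Hypothetical item: 10 panelists rate the effect of a new finding on the
# working diagnosis. Most choose 0: the finding shouldn't change the hypothesis.
panel = [0, 0, 0, 0, 0, 0, 0, -1, -1, 1]

print(sct_item_score(0, panel))   # 1.0 -- full credit for holding steady
print(sct_item_score(-2, panel))  # 0.0 -- an extreme swing no panelist endorsed
```

On a scale like this, "underusing the 0 response" means the models kept losing credit on items where the expert consensus was "no change."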
That theme echoes something we've been thinking about a lot on our Practicing with AI podcast. In a recent blog post, OpenAI reflected on one reason AI in the medical field still hallucinates: most language models are trained and evaluated in a test-taking mode. In that world, it's better to guess than to admit, "I don't know," and that reward structure directly encourages hallucination.
This might work fine for trivia questions, but that logic doesn't hold in medicine: confidently wrong answers are dangerous. OpenAI's researchers suggested that future models be trained and evaluated in ways that reward appropriate uncertainty and penalize overconfidence, essentially teaching them to reason more like clinicians.
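A quick back-of-the-envelope calculation (ours, not OpenAI's) shows how the grading scheme alone flips the rational strategy from guessing to admitting uncertainty.

```python
# Toy expected-score calculation illustrating why accuracy-only grading rewards
# guessing, and why penalizing wrong answers rewards saying "I don't know."
# This is an illustration of the incentive, not OpenAI's actual training setup.

def expected_score(p_correct: float, wrong_penalty: float) -> dict[str, float]:
    """Expected exam score for guessing vs. abstaining at a given confidence."""
    guess = p_correct * 1.0 - (1 - p_correct) * wrong_penalty
    abstain = 0.0  # "I don't know" earns nothing, but costs nothing
    return {"guess": round(guess, 2), "abstain": abstain}

# Accuracy-only grading (wrong answers cost nothing): guessing wins even at
# 20% confidence, so a model trained this way learns to always answer.
print(expected_score(p_correct=0.2, wrong_penalty=0.0))  # {'guess': 0.2, 'abstain': 0.0}

# Grading that penalizes confident errors: at 20% confidence, abstaining wins.
print(expected_score(p_correct=0.2, wrong_penalty=1.0))  # {'guess': -0.6, 'abstain': 0.0}
```

Under accuracy-only grading, a guess at 20% confidence still beats silence; add a penalty for wrong answers, and "I don't know" becomes the better play.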
Put together, these two ideas point to a deeper truth: the next big leap in medical AI might come not from more data, but from a better reasoning framework. For medical AI tools to become trusted clinical partners, they'll need to learn the art of calibrated uncertainty, the same skill good clinicians practice every day.
To learn more about hallucinations and reasoning in medical AI chatbots, check out Episode 4 of Hippo Education's Practicing with AI podcast. You can listen on Apple Podcasts and Spotify.