Is AI Really 4× Smarter, or Did We Handicap the Humans? (Again)
My take on the new "MAI-DxO" diagnostic paper that set LinkedIn ablaze.
AI’s “4×-Smarter-Than-Your-Doctor” Moment—And What It Really Means for Medicine
Last week my health-tech feeds lit up with a preprint from Microsoft Research describing MAI-DxO, an ensemble of large-language-model agents that scored 80% accuracy on 304 famously gnarly New England Journal of Medicine case conferences, while a cohort of seasoned internists averaged just 20%. Headlines trumpeted "AI is four times better at diagnosis than your doctor", catnip for investors and fuel for endless comment-section angst: some clinicians bristled at the idea that AI might replace them, investors heralded it as further evidence of a new age, and general commenters ran the spectrum from disbelief to unbridled enthusiasm for our new robot doctor overlords. As both a practicing surgeon and a founder building AI tools for clinicians and health systems, I found the work genuinely impressive and profoundly misinterpreted. Here's why.
A quick tour of what they actually built
This was actually really cool. Instead of giving the model a fully baked vignette and a multiple-choice list, the authors recreated the messy, question-by-question grind of real consults. A Gatekeeper agent controlled access to the chart, doling out test results only when explicitly requested. Over on the AI side, a single LLM was split into a five-member “virtual tumor board”—Dr Hypothesis, Dr Test-Chooser, Dr Challenger, Dr Stewardship, and Dr Checklist—that debated each turn before ordering another lab or committing to a diagnosis. That orchestration layer, not a secret new model, is the real novelty: it forces deliberation, cost awareness, and bias checking in a way off-the-shelf chatbots rarely do. This is definitely a fascinating approach, and something I’m excited to experiment with both in academic research and at Revel Ai Health in future agent workflows.
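For readers who want a feel for what that orchestration looks like in code, here is a minimal sketch of the gatekeeper-plus-panel loop as I read it. The role names come from the paper; everything else, including the call_llm placeholder, the prompts, and the turn limit, is my own simplification, not the authors' implementation.

```python
# Minimal sketch of a gatekeeper + role-playing panel loop, in the spirit of the
# paper's orchestration layer. call_llm is a placeholder to wire up to whatever
# model or API you actually use; the logic below is illustrative, not the paper's code.

ROLES = ["Dr Hypothesis", "Dr Test-Chooser", "Dr Challenger",
         "Dr Stewardship", "Dr Checklist"]


def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (hosted API, local model, etc.)."""
    return f"[model response to: {prompt[:60]}...]"


class Gatekeeper:
    """Releases chart findings only when a specific test is explicitly ordered."""

    def __init__(self, hidden_findings: dict[str, str]):
        self.hidden_findings = hidden_findings
        self.tests_ordered = 0

    def order(self, test: str) -> str:
        self.tests_ordered += 1  # stand-in for per-test cost accounting
        return self.hidden_findings.get(test, "No result available for that test.")


def panel_turn(case_summary: str, findings: list[str]) -> str:
    """Each role comments in turn; the pooled debate drives a single decision."""
    debate = ""
    for role in ROLES:
        debate += f"\n{role}: " + call_llm(
            f"You are {role}. Case: {case_summary}. Findings so far: {findings}. "
            "Argue for the next best action: order one test, or commit to a diagnosis."
        )
    return call_llm(
        "Given this debate, reply with either 'ORDER: <test>' or 'DIAGNOSIS: <dx>'."
        + debate
    )


def run_case(case_summary: str, gatekeeper: Gatekeeper, max_turns: int = 10) -> str:
    findings: list[str] = []
    for _ in range(max_turns):
        decision = panel_turn(case_summary, findings)
        if decision.startswith("DIAGNOSIS:"):
            return decision
        test = decision.removeprefix("ORDER:").strip()
        findings.append(f"{test}: {gatekeeper.order(test)}")
    return "DIAGNOSIS: undifferentiated (turn limit reached)"
```

The point of the pattern is that deliberation, cost awareness (Dr Stewardship), and bias checking (Dr Challenger) are enforced by the loop itself rather than left to a single prompt.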
The asterisk nobody is talking about
Buried in the methods is a sentence that, for me, changes the validity of the study entirely: participating physicians were forbidden to use any external resource (UpToDate, PubMed, Google, or ChatGPT) because the original cases are searchable online. Imagine asking an intensivist to run a code without the EHR, the bedside ultrasound, or their phone. We didn't handicap the computer; we blindfolded the humans. I suspect that in countless adjacent industries, humans would also be handily outperformed by large language models under a similar constraint. The result is still interesting (unaided clinical reasoning versus an agentic panel), but it is not a simulation of real-world practice, where every competent clinician cross-checks guidelines, taps colleagues, and, yes, increasingly pings an LLM.
Beyond the flashy number
Set aside the handicapping for a moment, and MAI-DxO still shows two truths:
Structured reasoning beats raw horsepower. The same base model scored lower and spent more money when it wasn’t corralled by the panel.
Sequential "ask-decide-test" benchmarks expose weaknesses hidden by static vignettes. That's good news for safety research, and an innovative way to structure assessments for clinical agents in particular (a rough sketch of what such a harness might look like follows below).
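To make that concrete, here is a minimal, hypothetical scoring harness for a sequential benchmark, building on the Gatekeeper/run_case sketch above; the case format, cost accounting, and string-matching grader are my own simplifications, not the paper's evaluation code.

```python
# Rough sketch of scoring a sequential "ask-decide-test" benchmark on both
# accuracy and spend, assuming the Gatekeeper/run_case sketch above. The case
# format and string-matching grader are invented for illustration only.

def score_sequential(cases: list[dict]) -> dict:
    correct, total_tests = 0, 0
    for case in cases:
        gatekeeper = Gatekeeper(case["findings"])
        final = run_case(case["summary"], gatekeeper)
        if case["answer"].lower() in final.lower():
            correct += 1
        total_tests += gatekeeper.tests_ordered
    return {
        "accuracy": correct / len(cases),
        "avg_tests_ordered": total_tests / len(cases),  # proxy for cost
    }
```

A static vignette only yields the accuracy column; the sequential version also surfaces how much the agent spends to get there, which is exactly the trade-off the virtual panel is supposed to manage.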
But let's be brutally honest about scale and the pragmatic application of these findings to our ailing healthcare system. Rare-disease differentials are not why most patients suffer or why most dollars evaporate. The real cost sinks are care-coordination failures, medication mismanagement, and social-needs bottlenecks after the diagnosis is obvious. Agentic architectures like MAI-DxO may end up more valuable chasing those operational headaches than chasing zebras.
What this means for medical education
The paper's most disruptive message isn't the 80% score; it's the demonstration that an LLM, when scaffolded correctly, can outperform experienced doctors at hypothesis-driven reasoning. If that's true today, imagine what version-next will do by the time a first-year med student finishes residency. Our training pipelines were built for the Gutenberg era, refined in the UpToDate era, and are now colliding head-on with the Generative era.
We need a radical re-write of curricula that treats LLMs the way aviation treats autopilot:
Systems literacy: every trainee should understand token limits, chain-of-thought prompts, and why models hallucinate dates (among other things).
Human–AI teaming: case-based drills where students alternate between driver and checker, explicitly tracking when to trust, when to verify, and how to document AI assistance.
Ethical & economic stewardship: if the model orders the MRI, who owns the downstream cost and liability? My bet is it will be us, the clinicians, and that is another reason leaving doctors out of the loop in this innovation supercycle is a glaring mistake.
Clinicians are already voting with their thumbs
I’ve watched colleagues enthusiastically weave LLM-powered tools like OpenEvidence into peri-operative huddles, curbside consults, and discharge-planning marathons (as well as the occasional off-label self-diagnosis odyssey). They aren’t surrendering judgment; they’re shaving minutes and broadening differentials. But there is a chorus of academic conversations in which stakeholders describe the models as “black boxes”—a disconnect that scares me. Tools we don’t understand become tools we either blindly obey or reflexively dismiss, neither of which ends well for patients.
The public perception gap
When I posted on TikTok asking "Is AI four times smarter than your doctor?" the overwhelming response wasn't fear but relief, plus a disconcerting degree of anger: finally, the "pharma hacks" might be replaced. The public's trust in healthcare, like their trust in many of our long-standing institutions, is eroding; an LLM, ironically, feels less compromised by billing targets or pharma reps. If clinicians, policymakers, and operational stakeholders abdicate the conversation, Silicon Valley will own the narrative, and we will wake up to find the doctor–patient relationship further intermediated by a chat window, or several for that matter.
So, what should doctors do tomorrow morning?
Skip the ten-point checklist; here's a single imperative: embed an LLM into a facet of your daily workflow this week and keep a diary of what it gets right, what it fumbles, and how it changes both your decision making and your everyday task workflows. That immersive, reflective loop will teach you more than any grand-rounds lecture.
A measured conclusion
MAI-DxO is a tour-de-force of prompt engineering and agent choreography, not proof that doctors are obsolete. It signals a future where the competitive edge belongs to “cyborg teams” that marry human context, empathy, and accountability with machine-speed capabilities. Flashy headlines will keep coming; our duty is to translate them into pragmatic improvements in care—and to make sure the next generation of clinicians can do the same in partnership with the silicon at their side.
In case you missed it in the article above, a link to the paper is here. I’d love to hear your experiences: have you integrated an LLM into clinic or the OR? Where did it shine, where did it scare you, and what would you teach trainees about it?