In a Single Month, AI Beat the Doctor and the Doctor Beat the AI Back

May 18, 2026 · Dr. Ramy Azzam

April 2026 was a strange month to work in this field. On the thirtieth, a paper appeared in Science in which Adam Rodman and his colleagues at Beth Israel Deaconess showed that a large language model from OpenAI outperformed physicians on case-based diagnostic and clinical reasoning evaluations, including a set of cases drawn from a real Boston emergency department. Rodman, who is one of the more careful researchers in this space, told STAT he was worried the paper would be misread as proof that AI is safe and effective in real clinical use. He published it anyway, because the evidence is what it is.

Six days earlier, the Utah Medical Licensing Board had written a letter demanding that the state immediately suspend its pilot programme with Doctronic, an AI startup that had been quietly authorised to evaluate patients and autonomously renew prescriptions for nearly two hundred chronic disease drugs. The board only learned about the programme after it had launched. Their language was unusually direct. Proceeding without consulting the medical board, they wrote, potentially places Utah citizens at risk.

The day before that, the American Medical Association called for a regulatory crackdown on wellness AI and consumer-facing chatbots. And on the eleventh of May, Alon Bergman, Robert Wachter, and Ezekiel Emanuel published a JAMA Viewpoint and a STAT First Opinion arguing, with a straight face, that autonomous AI clinicians should be licensed the way physicians are licensed, with an exam, a residency-equivalent supervised deployment, a defined scope of practice, biennial renewal, and a new federal Office of Clinical AI Oversight inside HHS.

Taken individually, each of these stories reads like a different referendum on AI in medicine. Taken together, they read like a single coherent story. The benchmark beat the doctor. The doctor beat the AI back. Both were correct. And the people building in this space need to learn, quickly, how to live inside that contradiction.

What the Benchmark Actually Showed

Let me be clear about the Rodman paper, because it is going to be cited badly for years. The headline result is that an OpenAI model outperformed physicians on diagnostic reasoning across a set of case-based evaluations. That is a meaningful scientific finding. It is also, on its own, almost entirely uninformative about whether that model should be allowed anywhere near a patient without a clinician in the loop.

Case-based evaluations are constructed environments. The cases are pre-filtered for the kind of information a clinician would have at the moment of the encounter. The endpoints are diagnostic and reasoning quality, not downstream outcomes. The patients are simulated or historical. The cases do not include the parts of medicine that consume most of a clinician's day, including communication, negotiation, advocacy for the patient against an insurer (yup), recognising when a story does not add up, knowing when to break a rule, and knowing when to escalate. Rodman acknowledges all of this. He is a good scientist publishing a good study and then warning the world about how it will be misused.

The Bergman, Wachter, and Emanuel piece adds two pieces of evidence that move the needle further. A prospective study of nearly forty thousand primary care visits in Kenya in 2025 showed that AI-supported clinicians made substantially fewer diagnostic and treatment errors than clinicians working alone. And the NOHARM trial, published in December, found that in head-to-head comparison on routine clinical tasks, doctors did not beat the strongest large language models on any measured dimension. These are not benchmark results. These are prospective, real-world studies. They are the closest thing we have to evidence that an AI system can do meaningful clinical work safely under defined conditions.

That is the state of the evidence base in May 2026. The systems are good. On certain tasks, they are better than physicians. And almost none of that evidence licenses anyone to deploy an AI as an autonomous clinician without supervision, because the conditions of safe performance have not been characterised.

[Image: A physician at a clinic desk reviews and edits an AI-generated diagnostic suggestion, shown alongside a clinical guideline, before signing off.]

What the Utah Suspension Actually Showed

Now the other direction. The Utah pilot is interesting not because it is reckless but because it is structurally honest. Doctronic and the state agreed that for Phase 1, every prescription renewal would be reviewed by a physician. The plan was to phase that review out as volume and safety benchmarks were met, moving first to retrospective audit and then to random sampling. That is, in fact, roughly how mature regulators think about autonomous systems in other safety-critical domains. You introduce them under supervision, you collect evidence, and you titrate the supervision down as the evidence accumulates.
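For builders who think in code, the titration logic can be sketched in a few lines. This is a minimal illustration, not the pilot's actual rules. The three phases follow the plan as described above, but the thresholds (`min_renewals`, `max_error_rate`), the function name, and the choice to fall back to full review when the safety benchmark is missed are all my own assumptions.

```python
from enum import Enum


class Phase(Enum):
    FULL_REVIEW = 1          # a physician reviews every renewal
    RETROSPECTIVE_AUDIT = 2  # renewals are audited after the fact
    RANDOM_SAMPLING = 3      # only a random sample is reviewed


def next_phase(phase: Phase, renewals: int, errors: int,
               min_renewals: int = 1000, max_error_rate: float = 0.01) -> Phase:
    """Relax supervision by one step only when volume and safety benchmarks are met.

    The default thresholds are illustrative assumptions, not values from the pilot.
    """
    if renewals == 0 or renewals < min_renewals:
        return phase  # not enough evidence yet, stay where we are
    if errors / renewals > max_error_rate:
        return Phase.FULL_REVIEW  # assumed policy: a missed benchmark restores full review
    if phase is Phase.RANDOM_SAMPLING:
        return phase  # already at the lightest level of supervision
    return Phase(phase.value + 1)
```

The point of the sketch is that supervision only ever loosens on evidence, and that the rule for tightening it again is written down before the system goes live.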

The problem was not the structure. The problem was the sequence. Utah's Office of Artificial Intelligence Policy launched the programme without bringing the Medical Licensing Board in beforehand. The board found out after the fact and reacted exactly the way you would expect a body of regulators to react when an autonomous clinician shows up in their state without anyone having asked them whether that was acceptable. Their letter does not say AI doctors are inherently dangerous. It says proceeding without our involvement creates risk. That is a procedural objection, and it is correct.

This matters because the wave of state-level activity is enormous. Utah is one of at least forty-seven states currently considering more than two hundred and fifty bills governing clinical AI. Each of those bills is being drafted in a different conversation, with different stakeholders, against a different incumbent regulatory baseline. California bars insurers from using AI to deny coverage based on medical necessity. Colorado mandates bias assessments for high-risk systems. Texas is moving in a different direction. New York is moving in yet another. The Food and Drug Administration's medical device pathway, which was designed for static products like imaging algorithms, is structurally incapable of regulating systems that update themselves every six weeks. We are about to have fifty different answers to a question that needs exactly one.

What the AMA Asked For

The AMA's call on the twenty-third of April was less specific than the Utah board's letter, but it was politically more consequential. The largest physician organisation in the country, after years of cautious engagement with AI vendors, asked openly for a regulatory crackdown on chatbots and wellness AI. It singled out direct-to-consumer products that are making therapy-adjacent or diagnosis-adjacent claims under the cover of a wellness exemption. The translation is straightforward. The profession has decided, formally, that the current loophole between FDA-regulated medical software and unregulated consumer wellness has been exploited beyond what the profession is willing to tolerate.

That is a signal to anyone building in this space. The next two years are going to be defined by the closing of the wellness loophole. The companies that built their business model on staying just inside the wellness exemption while making clinical-adjacent claims are going to have a harder time. The companies that built on the assumption that regulation was coming, and that the right posture was to welcome it and prepare for it, are going to find their lives unexpectedly easier.

The Framework That Fits the Evidence

The Bergman, Wachter, and Emanuel framework is the cleanest articulation I have seen of what licensure for autonomous clinical AI could actually look like, and it deserves to be read carefully rather than dismissed as academic. Its four pillars are straightforward.

The first is demonstrated competency. An autonomous AI clinician would have to perform at or above the median score of recent human test takers on the USMLE and on any specialty boards relevant to its intended scope, and then enter a supervised deployment phase analogous to residency, during which it would demonstrate non-inferior performance on real patients at scale.

The second is a defined scope of practice. Licensure would specify which conditions, settings, and tasks the AI is authorised to handle independently, and when it must escalate to a human clinician. This is exactly how nurse practitioners and physician assistants are licensed today.

The third is ongoing monitoring with periodic renewal. Authorisation would be time-limited, perhaps biennial, and contingent on continuous real-world performance tracking. A model that drifts below standard loses its licence. This is a profound shift from the FDA's current model, which assumes a product is fixed at the moment of approval.

The fourth is federal preemption with layered accountability. A new federal Office of Clinical AI Oversight inside HHS would certify competency. Developers would bear responsibility for model performance. Deploying institutions would be responsible for workflow integration, supervision, and adverse event reporting. States would retain authority over scope of practice and enforcement, but could not impose duplicative competency assessments.
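To make the second and third pillars less abstract, here is a rough sketch of how a deployed system might check its own authorisation before acting. Everything in it is hypothetical: the `Licence` fields, the task names, and the use of a single accuracy number as the performance standard are my assumptions, not part of the published framework.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Licence:
    authorised_tasks: frozenset[str]  # tasks the system may handle independently
    expires: date                     # authorisation is time-limited
    min_accuracy: float               # performance standard (hypothetical metric)


def decide(licence: Licence, task: str, today: date, observed_accuracy: float) -> str:
    """Return 'proceed' or 'escalate' for a single task request."""
    if today > licence.expires:
        return "escalate"  # authorisation has lapsed
    if observed_accuracy < licence.min_accuracy:
        return "escalate"  # performance has drifted below standard
    if task not in licence.authorised_tasks:
        return "escalate"  # outside the defined scope of practice
    return "proceed"


licence = Licence(frozenset({"statin_renewal"}), date(2028, 5, 1), 0.95)
print(decide(licence, "statin_renewal", date(2026, 6, 1), 0.97))     # proceed
print(decide(licence, "chest_pain_triage", date(2026, 6, 1), 0.97))  # escalate
```

The default in every failing branch is escalation to a human clinician, which is the behaviour the scope-of-practice pillar describes.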

This is not the only framework worth considering. It is, however, the first one I have seen that takes both the evidence and the political reality seriously. It would not have prevented the Utah situation, because Utah's problem was procedural, not substantive. It would, however, prevent the next forty-nine versions of the Utah situation, which is what is otherwise coming.

[Image: A regulatory blueprint with four pillars labelled Competency, Scope, Monitoring, and Accountability, on a desk next to a stethoscope and a laptop running a clinical AI interface.]

What This Means for Builders

I have built two platforms. One, CIGMA, is a social impact wellness company that has deliberately stayed inside narrow boundaries on what it claims and what it does. The other, WhatsHealth, is the culmination of my 13 years in digital health (stay tuned). I also advise companies, through EthicaLabs, on the governance architecture they need to operate inside the regimes that are now arriving. And I sit on the partnerships side of Partners In Digital Health, which means I see what the peer-reviewed literature is starting to demand from companies that want to be cited in it five years from now.

What I tell builders, when they ask, is this. The era of staying inside the wellness exemption while making clinical-adjacent claims is closing. The era of treating regulation as something that will arrive later is over. The companies that will be standing in 2028 are the ones that are building, today, the operational infrastructure that licensure-style oversight will require. Audit trails. Defined scope. Documented escalation. Continuous performance monitoring. Independent evidence. A clear answer to the question of which named human or named institution is accountable when the model gets something wrong.
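As one illustration of what "audit trails" and "a named accountable party" can mean in practice, here is a minimal sketch of a tamper-evident decision log. The field names and the hash-chaining design are my own assumptions about one reasonable approach, not a description of any regulator's requirement or any vendor's product.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    # Hash a canonical JSON form so the same record always yields the same digest.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_record(log: list, decision: str, model_version: str,
                  accountable_party: str, escalated: bool) -> dict:
    """Append a record that stores the hash of the previous record."""
    body = {
        "decision": decision,
        "model_version": model_version,
        "accountable_party": accountable_party,  # named human or institution
        "escalated": escalated,
        "prev_hash": log[-1]["hash"] if log else "",
    }
    record = {**body, "hash": _digest(body)}
    log.append(record)
    return record


def verify(log: list) -> bool:
    """Recompute every hash and check that the chain is unbroken."""
    prev_hash = ""
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev_hash"] != prev_hash or record["hash"] != _digest(body):
            return False
        prev_hash = record["hash"]
    return True
```

Because each record commits to the one before it, editing an earlier decision after the fact makes `verify` fail, which is the property an auditor needs.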

None of this is incompatible with building fast. It is incompatible with building lazily.

The Utah board did not object to AI clinicians as a category. They objected to AI clinicians being introduced into their state without anyone having asked them first.

The AMA did not object to clinical AI as a category. It objected to a market segment that has been making therapy-adjacent claims without doing the work to back them up.

The Rodman paper did not say AI is ready to replace physicians. It said AI is good enough at certain tasks that we now have to take the regulatory question seriously.

The next two years are not going to be a battle between AI in medicine and the medical profession. They are going to be a negotiation about the conditions under which AI is allowed to function inside the medical profession. The companies that come to that negotiation prepared, with the evidence, the architecture, and the humility to acknowledge what their systems cannot yet do, are going to do well. The companies that come to that negotiation insisting that benchmark performance should be sufficient are going to be on the wrong side of the AMA's call, the Utah board's letter, and every state law that follows.

The Sentence Worth Holding Onto

The cleanest sentence in the Bergman piece, the one I keep returning to, is this. Licensing AI does not equate it with a doctor. Clinicians will remain essential for complex judgement, moral reasoning, and the human elements of care that no language model replicates. Licensure simply acknowledges what the trial evidence already shows, that for defined lower-risk tasks, a well-regulated AI can practise safely.

That is, I think, the correct frame. Not whether AI will replace doctors. Not whether AI is dangerous. The frame is this: what are the defined tasks on which a regulated AI can practise safely, what evidence do we need before we authorise it, what supervision do we require during deployment, and what mechanism do we use to revoke the authorisation when the system drifts? Those are the questions that will define healthcare AI for the rest of this decade.

The benchmark beat the doctor. The doctor beat the AI back.

Neither result was a referendum on the technology. Both were a referendum on the process. The companies and institutions that build to fit a real process are going to outlast the ones still waiting for someone else to write the rules.

I would rather be in the first group. If you are also working on this, on operational maturity, governance architecture, evidence frameworks, or the boring infrastructure that makes safe deployment possible, the quieter conversation is starting to find its rooms. Reach out. There is more work to do than any one organisation can carry, and the next two years are going to settle a lot of questions that are still considered open today.