A team from Harvard Medical School and Beth Israel Deaconess Medical Center has published in the journal Science a direct comparison between OpenAI’s o1 and 4o models and two internal medicine physicians across 76 real emergency room cases. The o1 model offered the correct or near-correct diagnosis in 67% of triage cases, compared to 55% and 50% for the two human physicians. The researchers stop well short of claiming AI is ready for clinical deployment — but the numbers reignite a debate medicine can no longer defer.
Key Takeaways
- OpenAI’s o1 reached 67% diagnostic accuracy at triage, versus 55% and 50% for two internists
- The study, published in Science, covered 76 real ER cases at Beth Israel, assessed blind by other physicians
- The authors call for prospective trials before any clinical deployment
What the Study Actually Measured
The study covered 76 patients admitted to the emergency room at Beth Israel Deaconess Medical Center in Boston. The same electronic medical records were presented to o1, 4o, and two internal medicine attending physicians, with no adaptation or reformatting of the data. The diagnoses were then assessed by two other attending physicians who did not know whether each answer came from a human or a machine.
This blind protocol is one of the study’s methodological strengths. It reduces the classic evaluation bias of judging AI-generated outputs more harshly. The results appear in Science, one of the world’s most selective scientific journals, co-authored by physicians and computer scientists from Harvard Medical School and Beth Israel.
o1 offered the exact or near-correct diagnosis in 67% of triage cases; the first physician reached 55%, the second 50%. The researchers note the gap was most pronounced at the first diagnostic touchpoint, initial triage, where patient information is scarcest and urgency is highest.
Arjun Manrai, who leads an AI lab at Harvard Medical School and is a lead study author, stated that the model surpassed both prior models and physician baselines across virtually every benchmark tested. The performance advantage is not marginal, and it shows up precisely under the hardest conditions.
The researchers are also clear that no pre-processing of the data was performed. o1 and 4o received the same information available in medical records at each point in the diagnostic process. This matters for the external validity of the results.
What the Numbers Leave Out
The two physicians used for comparison are internists, not emergency room physicians. ER doctor Kristen Panthagani argues this comparison is flawed at its core: if the goal is to evaluate AI against clinicians, the comparison should involve specialists from the relevant field, in this case emergency physicians.
She raises an even more fundamental point. In the ER, a physician’s primary objective is not to arrive at a final diagnosis. It is to identify what might kill the patient in the next hour. These are different tasks, and the study measures the first, not the second.
Both models were evaluated exclusively on text-based data. The researchers acknowledge this directly: current large language models remain limited when it comes to reasoning over non-text inputs. A real emergency room visit involves imaging, physiological readings, and direct physical examination that models do not yet handle adequately.
Adam Rodman, a Beth Israel physician and study lead author, adds a further warning: there is currently no formal accountability framework for AI-generated diagnoses. The question of who is responsible when a model is wrong remains unanswered, and patients still want humans to guide them through life-or-death decisions.
What This Study Shifts in the Medical Debate
The study does not argue for replacing physicians. Its explicit recommendation is to conduct prospective trials in real clinical settings before any integration into patient care. That is a prerequisite, not an endorsement. But a publication in Science, with numbers this sharp, will fuel discussions in ethics committees, hospital administrations, and health regulatory agencies.
In the short term, hospitals already experimenting with diagnostic support tools will face pressure to take a formal position. Having a clear institutional stance on AI in emergency medicine is becoming a political and administrative necessity, not just a medical one.
Over the next several months, this study joins a series of converging findings. LLMs have shown comparable performance in radiology, dermatology, and ECG interpretation. The question is no longer whether models can compete on targeted text-based tasks. The question is defining under what conditions that performance is useful and safe enough to integrate into a care pathway.
For OpenAI, the results validate a clear strategic direction toward high-stakes professional applications. Appearing in Science with results of this kind shifts the conversation from commercial demonstration to scientific validation. That is a meaningful change of register for institutional decision-makers.
ER physicians are right to hold their ground. Their expertise encompasses a holistic reading of the patient, flow management, and decision-making under time and emotional pressure that the study made no attempt to model. What the numbers measure is real. What they leave aside is equally real.
Follow the story on Horizon.