camila, 4 May 2026

(AsiaGameHub) – A team from Harvard Medical School and Beth Israel Deaconess Medical Center evaluated OpenAI models against physicians on medical diagnostic tasks, using real emergency room cases and other clinical scenarios.
Good to Know

  • OpenAI o1 provided an exact or nearly exact diagnosis in 67% of initial ER triage cases.
  • Two attending physicians achieved scores of 55% and 50% on the same triage assessment.
  • Researchers stated that hospitals need to conduct real patient care trials before using AI for high-stakes diagnostic purposes.

Researchers Emphasize Need for Further Testing Before Real-World ER Use of AI

The most notable result emerged at the stage where doctors typically have the least information. During initial ER triage, OpenAI o1 delivered an exact or nearly exact diagnosis in 67% of cases. One attending physician scored 55%, while another reached 50%.

Researchers did not frame the findings as a green light for AI to manage emergency rooms. Instead, the study, published in Science, pointed to an “urgent need for prospective trials to evaluate these technologies in real-world patient care settings.”

This warning is important because the test was limited to text-based records. The team noted that “existing studies suggest that current foundation models are more limited in reasoning over nontext inputs.” In simple terms, charts, scans, images, physical exams, and bedside judgment still present greater challenges for AI diagnostic tools.

The study included 76 patients from Beth Israel’s emergency room. OpenAI o1 and 4o received the same electronic medical record details available at each diagnostic stage. Harvard Medical School stated that researchers did not “pre-process the data at all,” so the models did not get cleaned-up summaries or additional assistance.

Two additional attending physicians then graded the responses without knowing which diagnosis came from a human doctor and which from an AI.

The study said:

“At each diagnostic touchpoint, o1 either performed nominally better than or on par with the two attending physicians and 4o.”

It added that the performance gap was most distinct early in care, where pressure is high and information is scarce: the differences “were especially pronounced at the first diagnostic touchpoint (initial ER triage), where there is the least information available about the patient and the most urgency to make the correct decision.”

Arjun Manrai, who leads an AI lab at Harvard Medical School and co-led the study, said:

“We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines.”

Still, accountability remains a major challenge. Adam Rodman, a Beth Israel doctor and one of the study’s lead authors, told the Guardian that there is “no formal framework right now for accountability” around AI diagnoses. He also noted that patients still “want humans to guide them through life or death decisions [and] to guide them through challenging treatment decisions.”
