A Harvard study published in Science: OpenAI o1’s emergency diagnosis accuracy rate is 67%, beating two human doctors.

A joint team from Harvard Medical School and Beth Israel Deaconess Medical Center published a study in the journal Science, testing the diagnostic decision-making ability of the OpenAI o1 model on a sample of 76 emergency patients. The results showed that o1 achieved an accuracy rate of 67%, significantly outperforming two internal medicine attending physicians, who scored 55% and 50%. However, the researchers also issued an important caveat: the control group was not composed of emergency specialists, and the study does not claim that AI can make life-and-death decisions in real-world scenarios.

A paper from Harvard Medical School quietly made its mark in the top academic journal Science, officially bringing the discussion of medical AI from demo displays into clinical research circles.

This study, conducted jointly by Harvard Medical School and Beth Israel Deaconess Medical Center, used medical records from 76 real emergency patients as test samples, with OpenAI o1, GPT-4o, and two internal medicine attending physicians diagnosing each case. The evaluation criterion was: the proportion of diagnoses that were “accurate or very close to correct.”

The final numbers caught many people’s attention: o1’s accuracy reached 67%, while the two human doctors scored 55% and 50% respectively. GPT-4o was also included as a control but performed worse than o1.
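To put the percentages in concrete terms, a back-of-envelope sketch can convert them into approximate case counts out of the 76 records. Note that the counts below are inferred from the rounded percentages reported above, not stated in the study itself:

```python
# Back-of-envelope: convert the reported accuracy percentages into
# approximate case counts out of the 76 emergency records.
# Counts are inferred from rounded percentages, not given in the study.

TOTAL_CASES = 76

reported = {"o1": 0.67, "physician A": 0.55, "physician B": 0.50}

for name, rate in reported.items():
    approx_correct = round(rate * TOTAL_CASES)
    print(f"{name}: ~{approx_correct}/{TOTAL_CASES} cases ({rate:.0%})")
```

By this rough math, the gap between o1 and the stronger physician amounts to roughly nine cases out of 76, which is why the small sample size matters when weighing the result.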

Where exactly is o1 stronger?

The research team specifically pointed out that the most significant gap between o1 and human doctors occurred during the “initial triage” stage—when patients first arrive at the emergency department, with minimal information and the highest uncertainty.

In this scenario, o1 needs to synthesize a preliminary diagnosis based on textual descriptions of chief complaints, symptoms, and vital signs. This falls squarely within the strengths of large language models: pattern recognition in structured text, rapid integration of cross-disciplinary knowledge, and the ability to provide coherent reasoning paths even with incomplete information.

Although GPT-4o was also tested as a control, under the same conditions it performed less stably than o1, and the gap between it and the physicians was relatively smaller. The researchers believe this is directly related to o1’s more advanced reasoning chain architecture.

In terms of research significance, this is no longer just a story of “AI winning on benchmarks”—the samples come from real emergency records, not artificially designed test questions, giving this data a certain clinical reference value.

Don’t be led by headlines: three key premises you must know

Before this study sparks widespread discussion, three things are worth pausing to clarify.

First, the control group is not composed of emergency specialists. The two doctors used for comparison are “internal medicine attending physicians,” not ER doctors with emergency medicine training. The core difficulty of emergency diagnosis lies in high-pressure, multitasking, fragmented information judgment—internal medicine doctors are not the best benchmark in this scenario to begin with. The study’s comparison framework itself is open to challenge.

Second, this is “text-based triage,” not real multimodal emergency scenes. The study director explicitly stated: “This is just text triage, not equivalent to real multimodal ER.” Actual emergency care involves image interpretation, physical observation, on-site communication, urgent procedures—all aspects that large language models currently cannot handle.

Third, the research team itself does not claim that AI can make life-and-death decisions. Simultaneously with publishing the results, the researchers emphasized the limitations of this study and did not suggest AI diagnoses should be directly applied in clinical practice.

From an operational perspective: this study indeed marks a real technological milestone—in the “structured text diagnosis” track, AI has the capability to surpass human doctors in specific contexts. But moving from “lab accuracy” to “clinical deployment” still involves hurdles such as legal responsibility, multimodal integration, hospital system interfacing, and—most difficult—who takes responsibility when errors occur. The technical threshold may have been crossed, but the real challenge of implementing medical AI is just beginning.
