A Harvard study published in Science: OpenAI o1’s emergency diagnosis accuracy rate is 67%, beating two human doctors.
A joint team from Harvard Medical School and Beth Israel Deaconess Medical Center published a study in the journal Science testing the diagnostic decision-making ability of the OpenAI o1 model on a sample of 76 emergency patients. The results showed that o1 achieved an accuracy rate of 67%, significantly outperforming two internal medicine attending physicians at 55% and 50%. However, the researchers also issued an important warning: the control group was not composed of emergency specialists, and the study does not claim that AI can make life-and-death decisions in real-world scenarios.
(Background: UC research on “AI Brain Fog” phenomenon: 14% of office workers are driven crazy by agents and automation, with 40% considering resignation)
(Additional context: Author of Sapiens: AI is becoming a threat, breaking through human civilization’s operational systems! Like nuclear weapons)
A paper from Harvard Medical School has quietly made its mark in the top academic journal Science, formally moving the discussion of medical AI from demos into clinical research circles.
This study, conducted jointly by Harvard Medical School and Beth Israel Deaconess Medical Center, used medical records from 76 real emergency patients as test samples, with OpenAI o1, GPT-4o, and two internal medicine attending physicians each diagnosing every case. The evaluation criterion was the proportion of diagnoses judged “accurate or very close to correct.”
The final numbers caught many people’s attention: o1’s accuracy reached 67%, while the two human doctors scored 55% and 50% respectively. GPT-4o was also included as a control but performed worse than o1.
Where exactly is o1 stronger?
The research team specifically pointed out that the most significant gap between o1 and human doctors occurred during the “initial triage” stage—when patients first arrive at the emergency department, with minimal information and the highest uncertainty.
In this scenario, o1 needs to synthesize a preliminary diagnosis based on textual descriptions of chief complaints, symptoms, and vital signs. This falls squarely within the strengths of large language models: pattern recognition in structured text, rapid integration of cross-disciplinary knowledge, and the ability to provide coherent reasoning paths even with incomplete information.
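To make the task concrete, here is a minimal sketch of what such text-based triage could look like when wired up to a reasoning model through the OpenAI Python SDK. The patient record, prompt wording, and model name below are illustrative assumptions for this article, not the protocol the Harvard team actually used.

```python
# Minimal sketch of text-based triage with an LLM.
# Illustrative only: the patient record, prompt wording, and model name
# are assumptions, not the Harvard study's actual setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical case summary: chief complaint, symptoms, and vital signs as plain text.
case_summary = """
Chief complaint: sudden chest pain radiating to the left arm, onset 40 minutes ago.
Symptoms: shortness of breath, sweating, nausea.
Vital signs: BP 160/95 mmHg, HR 110 bpm, RR 22/min, SpO2 94% on room air, temp 36.8 C.
"""

prompt = (
    "You are assisting with initial emergency department triage. "
    "Based only on the text below, list the most likely preliminary diagnoses "
    "in order of probability and briefly explain your reasoning.\n\n"
    + case_summary
)

# Reasoning models are called through the standard chat completions endpoint;
# the model works through its reasoning internally before returning an answer.
response = client.chat.completions.create(
    model="o1",  # assumed model name; substitute whichever model is available
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```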
GPT-4o was also tested under the same conditions, but it performed less consistently than o1, and the gap between it and the physicians was smaller. The researchers believe this is directly related to o1’s more advanced reasoning-chain architecture.
In terms of research significance, this is no longer just a story of “AI winning on benchmarks”: the samples come from real emergency records rather than artificially designed test questions, which gives the data a degree of clinical reference value.
Don’t be misled by the headline: three key premises you need to know
Before this study sparks widespread discussion, three things are worth pausing to clarify.
First, the control group was not composed of emergency specialists. The two doctors used for comparison are “internal medicine attending physicians,” not ER doctors trained in emergency medicine. The core difficulty of emergency diagnosis lies in making judgments under high pressure, while multitasking, and from fragmented information; internal medicine physicians were never the ideal benchmark for this scenario, so the study’s comparison framework is itself open to challenge.
Second, this is “text-based triage,” not a real multimodal emergency scene. The study’s lead investigator explicitly stated: “This is just text triage, not equivalent to real multimodal ER.” Actual emergency care involves image interpretation, physical observation, on-site communication, and urgent procedures, none of which current large language models can handle.
Third, the research team itself does not claim that AI can make life-and-death decisions. Alongside the published results, the researchers emphasized the study’s limitations and did not suggest that AI diagnoses should be applied directly in clinical practice.
From an operational perspective, this study does mark a real technological milestone: in the “structured text diagnosis” track, AI has shown it can surpass human doctors in specific contexts. But moving from “lab accuracy” to “clinical deployment” still involves hurdles such as legal liability, multimodal integration, interfacing with hospital systems, and, hardest of all, deciding who takes responsibility when errors occur. The technical threshold may have been crossed, but the real challenge of putting medical AI into practice is just beginning.