Six days after launching ChatGPT Health, OpenAI was overtaken on its own medical benchmark

Author: Li Yuan

Have you ever asked your AI assistant about your health issues?

If you are a heavy AI user like me, chances are you have.

OpenAI’s own data shows that health has become one of the most common use cases for ChatGPT, with over 230 million people worldwide asking health and wellness-related questions every week.

For this reason, in 2026 healthcare is shaping up to be a key battleground in AI.

On January 7, OpenAI released ChatGPT Health, which lets users connect electronic medical records and various health apps to get more targeted medical responses. Days later, on January 12, Anthropic launched Claude for Healthcare, emphasizing its new model’s capabilities in medical scenarios.

Interestingly, this time Chinese companies did not fall behind. If anything, they are leading.

On January 13, Baichuan Intelligence announced the release of the Baichuan M3 model, which outperformed OpenAI’s GPT-5.2 High on HealthBench, an evaluation test set for the medical and health domain, achieving state-of-the-art (SOTA) results.

After its all-in bet on healthcare drew many doubts, Baichuan Intelligence seems to have finally proven itself. GeekPark spoke with Baichuan founder and CEO Wang Xiaochuan about how the company views the M3 model’s capabilities and the ultimate goal of AI in healthcare.

01 Surpassing OpenAI on a health-domain test set for the first time

One of the M3 model’s most impressive achievements is surpassing OpenAI’s GPT-5.2 High on HealthBench for the first time and achieving SOTA.

SOTA on HealthBench, HealthBench Hard, and Hallucination Evaluation

HealthBench is an evaluation test set for the medical and health domain released by OpenAI in May 2025, built in collaboration with 262 doctors from 60 countries. It includes 5,000 highly realistic multi-turn medical dialogues and is currently regarded as one of the most authoritative evaluation sets worldwide and one of the closest to real clinical scenarios.

Since its release, OpenAI’s models have dominated the leaderboards.

This time, Baichuan Intelligence’s new open-source medical large model, Baichuan-M3, scored 65.1 overall, ranking first globally. On HealthBench Hard, the complex decision-making test, M3 also took first place, setting a new high score.

Baichuan also published a hallucination rate test result, with M3 achieving just 3.5%, one of the lowest globally.

Notably, this hallucination rate is measured without relying on external retrieval tools, under purely model-based settings.

Baichuan attributes these results mainly to reinforcement learning algorithms tailored for medical applications.

In the M3 model, Baichuan used Fact Aware RL (fact-aware reinforcement learning) for the first time, which effectively keeps the model from either padding its answers with boilerplate or making things up.

This is critically important in the medical field.

In unoptimized models, asking medical questions often leads to one of two problems: either the model fabricates symptoms or diagnoses, or it gives a vague answer that amounts to suggesting you see a doctor, which helps neither doctors nor patients.

This is because many models optimize solely for hallucination rate, which lets a model dilute its overall hallucination rate simply by stacking up correct facts. Baichuan introduces semantic clustering and importance weighting mechanisms: clustering eliminates redundant expressions, and weighting gives core medical assertions higher priority.

Moreover, if only heavily weighted hallucination penalties are introduced, the model may fall into a conservative “say less, err less” strategy. The Fact Aware RL algorithm therefore also incorporates a dynamic weight adjustment mechanism that adaptively balances these two goals based on the model’s current capability: during the capability-building phase it focuses on learning and expressing medical knowledge (high Task Weight), and as the model matures it gradually tightens factual constraints (increasing Hallucination Weight).
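The interplay of clustering, importance weighting, and dynamic weighting described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Baichuan’s implementation: the names `Claim`, `weighted_hallucination_rate`, `dynamic_weights`, and `reward` are hypothetical, claims are assumed to arrive already labeled with a cluster ID, an importance weight, and a correctness verdict, and the linear weight schedule is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    cluster_id: int    # hypothetical: ID assigned by a semantic clustering step
    importance: float  # hypothetical: higher for core medical assertions
    is_correct: bool   # hypothetical: verdict from a fact-checking step


def weighted_hallucination_rate(claims):
    """Importance-weighted error rate over semantically deduplicated claims."""
    # Keep one claim per cluster so that restating a correct fact
    # many times cannot dilute the error rate.
    unique = {}
    for claim in claims:
        unique.setdefault(claim.cluster_id, claim)
    total = sum(c.importance for c in unique.values())
    if total == 0:
        return 0.0
    wrong = sum(c.importance for c in unique.values() if not c.is_correct)
    return wrong / total


def dynamic_weights(capability):
    """Return (task_weight, hallucination_weight) for a capability in [0, 1].

    Assumed linear schedule: early training favors the task reward,
    later training tightens the factual constraint.
    """
    capability = min(max(capability, 0.0), 1.0)
    task_weight = 1.0 - 0.5 * capability
    hallucination_weight = 0.5 * (1.0 + capability)
    return task_weight, hallucination_weight


def reward(task_score, claims, capability):
    """Combine task quality and factuality using the dynamic weights."""
    task_w, hall_w = dynamic_weights(capability)
    return task_w * task_score - hall_w * weighted_hallucination_rate(claims)
```

In this sketch, a response that makes one wrong core assertion (importance 3.0) and repeats one correct minor fact (importance 1.0) three times gets a weighted hallucination rate of 0.75, whereas a naive count over all four statements would report 0.25.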

For settings with internet access, Baichuan also added an online multi-turn search verification module, along with an efficient caching system, to align the model with vast amounts of medical knowledge.

02 Diagnostic ability surpassing human doctors, entering the usable stage

However, surpassing OpenAI on HealthBench is not the only highlight of this release.

An even more interesting point is that Baichuan built its own evaluation set, SCAN-bench. Compared with leaderboard-oriented evaluation sets such as OpenAI’s, Baichuan’s self-built set perhaps better reflects the direction in which Baichuan Intelligence aims to optimize in healthcare.

The key focus of Baichuan’s SCAN-bench is on optimizing “end-to-end diagnostic capability.” This stems from Baichuan’s own experimental insights: a 2% increase in diagnostic accuracy correlates with a 1% increase in treatment outcome accuracy.

In other words, compared to OpenAI’s HealthBench, which mainly assesses “whether AI answers questions,” Baichuan’s SCAN-bench aims to evaluate whether AI can extract effective information in a Q&A exchange and provide correct diagnoses and medical advice.

Typically, simply telling an AI assistant “You are an experienced doctor” does not yield very good results, because real doctors follow highly standardized diagnostic procedures. Baichuan summarizes these as the four quadrants of its SCAN principles: Safety Stratification, Clarity Matters, Association & Inquiry, and Normative Protocol.

Based on these principles, Baichuan borrowed the OSCE (Objective Structured Clinical Examination) method long used in medical education and worked with over 150 frontline doctors to build the SCAN-bench evaluation system. It breaks the diagnostic process into three main stages: medical history collection, auxiliary examinations, and precise diagnosis. The assessment is dynamic and multi-turn, simulating the entire process from consultation to diagnosis, so that the model can be optimized stage by stage.
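To make the three-stage, multi-turn structure more concrete, here is a toy sketch of how such an evaluation might be scored. The article does not describe SCAN-bench’s data format or rubric, so everything here is an assumption: the `Case` and `Transcript` classes, the `ask_patient` and `score` functions, and the simple coverage-based scoring are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """A toy case; every field here is hypothetical."""
    hidden_history: dict  # topic -> answer the patient gives only if asked
    required_exams: set   # exams the rubric expects to be ordered
    diagnosis: str        # reference diagnosis label


@dataclass
class Transcript:
    """What the examinee (model or doctor) did during the consultation."""
    asked_topics: list = field(default_factory=list)
    ordered_exams: set = field(default_factory=set)
    final_diagnosis: str = ""


def ask_patient(case, transcript, topic):
    """Record the question and reveal a fact only if that topic is asked about."""
    transcript.asked_topics.append(topic)
    return case.hidden_history.get(topic, "Nothing notable.")


def _coverage(found, expected):
    """Fraction of expected items that were found (1.0 if nothing was expected)."""
    expected = set(expected)
    if not expected:
        return 1.0
    return len(set(found) & expected) / len(expected)


def score(case, transcript):
    """Score the three stages separately so weak stages can be identified."""
    return {
        "history": _coverage(transcript.asked_topics, case.hidden_history),
        "exams": _coverage(transcript.ordered_exams, case.required_exams),
        "diagnosis": float(
            transcript.final_diagnosis.strip().lower() == case.diagnosis.lower()
        ),
    }
```

Because the simulated patient volunteers nothing, an examinee that asks only about the obvious symptom scores low on the history stage even if its final guess happens to be right, which is the kind of gap a per-stage breakdown is meant to expose.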

Baichuan also published the M3 model’s results on SCAN-bench.

The results are quite interesting. Baichuan not only compared the model with itself but also included comparisons with real doctors. In all four quadrants, the real doctors lagged behind the model.

GeekPark asked Baichuan about this specifically. The response: in the evaluation, real specialist doctors worked through cases in their own specialties and were compared with the model. The model wins partly because it is more patient but, more importantly, because it has broader interdisciplinary knowledge.

For example, one case involved a 10-year-old with recurrent fever. Fever is a complex medical phenomenon. If the doctor asks only about cough or lung issues, serious problems in the joints or urinary system might be overlooked, and the case could be misdiagnosed as a common infection.

Human doctors are usually most proficient in their own specialties, which is why complex symptoms often require a multi-specialist consultation or send specialists off to look things up.

A general-purpose model that has not been specially trained and is merely role-playing a doctor often cannot handle such questions well.

03 Next steps: gradually developing consumer-facing products and promoting more serious medical applications

For Baichuan Intelligence, surpassing human doctors is highly significant: it means AI has crossed the usability threshold and can be deployed in real scenarios.

Since January 13, users have been able to try answers from the M3 model on Baichuan’s website and app.

The website design is quite interesting. Although both use the M3 model, the site distinguishes between a doctor version and a user version. The doctor version gives more concise answers, cites more references, and sounds less conversational. The user version, aimed at general patients, rarely gives a one-shot answer; instead, it asks more follow-up questions to clarify the diagnosis.

Baichuan mentions that the model’s internal reasoning process is quite fascinating. “We often see the model mention in its reasoning chain, ‘This patient didn’t address my question, but I must ask this.’ We even saw extreme cases where it said, ‘I’ve already asked the patient 20 rounds, which exceeds the maximum set rounds, but I still need to ask this question.’ This is because, during training, the model is rewarded only when it provides sufficiently key information and reaches a correct diagnosis. This is a clear difference from other training approaches.”
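The reward rule described in the quote can be illustrated with a small sketch. It is an assumption-laden reading, not Baichuan’s training code: the function name `episode_reward` and its parameters are hypothetical, and “sufficiently key information” is simplified to mean that every key fact was elicited.

```python
def episode_reward(collected_facts, key_facts, predicted, reference):
    """Return 1.0 only if every key fact was elicited AND the diagnosis matches."""
    # Simplifying assumption: "sufficient" means all key facts were collected.
    has_key_info = set(key_facts) <= set(collected_facts)
    is_correct = predicted.strip().lower() == reference.strip().lower()
    return 1.0 if (has_key_info and is_correct) else 0.0
```

Under a rule like this, a correct guess made without the key facts earns nothing, which would be consistent with the model insisting on asking a question even after many rounds.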

Recently, many AI companies have started to enter the healthcare field. Baichuan believes its biggest differentiator is that it aims to do more serious medical work.

“This means Baichuan does not choose scenarios just because they are easy to implement. Instead, Baichuan insists on continuously pushing technological capabilities and challenging more difficult problems,” Wang Xiaochuan said.

A typical example is that Baichuan will prioritize developing solutions for oncology in the future, while mental health and therapy are lower on its priority list.

In popular opinion, providing AI-based psychological therapy is considered easier and more feasible. Baichuan’s reasoning, however, differs. They believe the oncology field has more rigorous scientific backing. Here, AI is more likely to produce serious medical results that can meet or surpass human doctors. In contrast, psychology lacks such definitive scientific anchors.

For example, some companies are working on creating AI “avatars” of doctors, but Wang Xiaochuan believes this is not the direction Baichuan wants to pursue. An AI doctor avatar cannot fully replicate or surpass a human doctor’s expertise. Such AI would ultimately only serve as a facade or customer acquisition tool, not truly advancing serious medical practice.

This insistence on seriousness deeply influences Baichuan’s business choices.

It directly relates to Wang Xiaochuan’s fundamental thoughts on the next stage of medical AI. He believes the most important task now is to enhance AI capabilities and gradually provide more medical services.

China has long attempted to implement hierarchical diagnosis and treatment and general practitioner systems. The original intention was to encourage people to seek care at primary healthcare institutions first, alleviating the difficulty of hospital appointments, long queues, and congestion at large hospitals.

The difficulty in implementing this system mainly stems from insufficient medical resources. Primary healthcare lacks high-level doctors. Even for common colds, people prefer to queue at top-tier hospitals because they lack confidence in primary care.

This is where medical AI can play a crucial role. Large models can distribute top-tier medical knowledge at scale, filling the gap in primary care, enabling every community and household to have diagnostic and treatment capabilities comparable to top hospitals.

In the long run, this could have a broader impact, gradually shifting decision-making power from doctors to users. In traditional medical scenarios, patients are beneficiaries but often lack decision-making authority. Power is concentrated in the hands of doctors, which can lead to communication costs and treatment frustrations.

Baichuan hopes to use AI to make high-quality medical resources more accessible to patients. “Many people think healthcare is too complicated and that patients will never understand it. But we think of it like the jury system in the US judicial system. Law is very specialized, and laypeople don’t understand it. So, judges, lawyers, and prosecutors lead the process, conducting thorough debates, explaining things clearly so that ordinary people can judge guilt or innocence based on logic. That way, ordinary people can make reasonable judgments,” Wang Xiaochuan explained.

This is also one of the reasons Baichuan does not want to limit itself to simple scenarios but aims to push into more complex, serious medical diagnosis and treatment.

When asked whether solving high-difficulty problems is the most commercially profitable path, Wang Xiaochuan gave a thoughtful answer.

He believes that solving only minor issues like colds and fevers makes it hard to build sufficient trust among users. Healthcare is a trust-dependent industry. Only when AI can address serious illnesses and complex cases can it establish a truly solid foundation of trust.

From a business perspective, patients facing serious health issues are more willing to pay for high-quality AI services. This trust is not only a prerequisite for commercial returns but also the core for scaling AI medical applications.

More fundamentally, healthcare still represents a path toward Artificial General Intelligence (AGI) for Baichuan and Wang Xiaochuan himself.

Wang Xiaochuan believes that AI has already found practical solutions in fields like literature, science, engineering, and arts, but medicine remains a highly unique domain. Human exploration of medicine is still ongoing, and AI is in an exploratory stage in this field.

Baichuan’s roadmap is very clear. First, use AI to improve diagnostic efficiency and address current shortages in medical supply. Based on this, Baichuan aims to build deep trust with patients. When patients are willing to use AI tools for long-term medical consultation, AI can accumulate real, high-quality medical data through long-term companionship.

The ultimate purpose of this data is to construct a mathematical model of life. This is a path that human doctors have not yet fully traversed, and in the future, AI is very likely to lead the way. If the essence of life can be modeled, it will be a crucial step toward artificial general intelligence.
