Introducing HealthBench

Opening Scene: ChatGPT in Your Doctor’s Office?
Imagine walking into your doctor’s office and seeing an AI chatbot helping with your visit. From tools that take notes during checkups to “virtual nurses” answering questions, artificial intelligence is moving quickly into hospitals and clinics.
But there’s a big question behind the buzz: can we actually trust these AI tools with our health?
Until recently, the answer was shaky. In 2024, researchers found that only 5% of AI health studies used real patient data. Most focused on how well a model could answer medical exam questions—not how it would behave in real-life situations. That means we’ve had lots of smart-sounding AI, but very little proof that it’s safe or helpful when stakes are high.
That’s where HealthBench comes in. Created by OpenAI, this new test was built with the help of 262 real doctors. Its mission is simple but crucial: find out if medical AI tools are ready for the real world. Not just in theory, but in practice.
What Makes HealthBench Different?
Most AI health tests today are a bit like pop quizzes. They check how well a model remembers facts—often using multiple-choice questions.
But real medicine isn’t about memorizing answers. It’s about conversations, problem-solving, and decisions that can affect someone’s life.
That’s why OpenAI built HealthBench, a smarter way to test medical AI. It doesn’t just ask trivia. It puts AI models into real-style chats that feel like actual doctor visits.
To make it truly useful, OpenAI teamed up with 262 doctors from 60 countries, covering 26 different specialties—from pediatrics to emergency medicine. These doctors didn’t just help test the AI. They helped design the whole system, ensuring it reflected what real clinicians care about.
HealthBench includes over 5,000 multi-turn conversations across 49 languages. That means the AI isn’t just answering once—it has to follow a back-and-forth discussion, just like a real doctor would. The test checks if the AI listens, asks follow-up questions, and avoids risky guesses.
To judge performance, the doctors wrote more than 48,000 rubric criteria. These act like scorecards for specific behaviors: Did the AI give safe advice? Did it ask for more information? Did it explain things clearly?
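Under the hood, the scoring works roughly like a weighted checklist: each criterion carries a point value (harmful behaviors can carry negative points), a grader marks which criteria a response meets, and the final score is the points earned divided by the maximum points possible. Here’s a minimal sketch of that idea, with invented criteria and point values rather than real HealthBench rubrics:

```python
# Minimal sketch of rubric-style scoring. The criteria and point values here
# are invented for illustration; real HealthBench rubrics are physician-written.
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str  # the behavior the grader looks for
    points: int       # positive for desired behavior, negative for harmful

def score_response(criteria: list[Criterion], met: list[bool]) -> float:
    """Sum points for criteria judged 'met', normalize by the maximum
    achievable positive points, and clip the result to the range [0, 1]."""
    earned = sum(c.points for c, m in zip(criteria, met) if m)
    max_points = sum(c.points for c in criteria if c.points > 0)
    return max(0.0, min(1.0, earned / max_points)) if max_points else 0.0

rubric = [
    Criterion("Advises urgent emergency care for chest pain with breathlessness", 9),
    Criterion("Asks how long the symptoms have lasted", 5),
    Criterion("Recommends a specific prescription drug without taking a history", -8),
]

# Grader verdicts for one hypothetical response: first criterion met, rest not.
print(score_response(rubric, [True, False, False]))  # 9 / 14, about 0.64
```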
But here’s the key: even the best model so far—OpenAI’s o3—only scored 60%. That’s a solid start, but far from perfect. And that’s the point. HealthBench was designed to leave room for growth, not declare victory too early. It challenges AI systems to improve and evolve.
By focusing on real conversations, real doctors, and real-world tasks, HealthBench sets a new standard for what medical AI should be able to do—not just in theory, but in practice.
Seven Skills Every “AI Clinician” Must Nail

Those physician-written checklists are organized around seven core skills that every AI clinician needs to master:
1. Emergency Referrals: Spot Danger Fast
Sometimes, a patient needs help right away. The AI must recognize symptoms that point to serious conditions—like chest pain or sudden weakness—and say clearly, “Go to the ER now.” HealthBench checks if the AI gives that urgent advice without delay or doubt.
2. Context Seeking: Ask Before You Answer
Good doctors don’t jump to conclusions. They ask more questions to understand the full situation. HealthBench measures whether the AI does the same—does it ask for details before offering advice? Or does it assume too much too soon?
3. Global Health: Adjust to the Setting
Medical advice that works in a high-tech hospital might not work in a rural clinic. HealthBench includes chats from around the world and in 49 languages, from English to Amharic. The AI needs to tailor its answers to fit local resources and cultural context.
4. Health-Data Tasks: Handle the Paperwork
Doctors spend a lot of time writing notes and checking records. Can the AI help? HealthBench tests whether models can summarize conversations, list medications correctly, and complete routine tasks—without making dangerous mistakes.
5. Audience Fit: Speak the Right Language
A doctor talks differently to a patient than to another doctor. HealthBench checks if the AI can do the same. It should use plain language when talking to patients, and switch to medical terms when talking to professionals.
6. Uncertainty: Say “I Don’t Know” When Needed
No one—human or AI—knows everything. HealthBench rewards models that admit when they’re unsure. It’s much safer to say, “I’m not certain, you should talk to a doctor,” than to make a risky guess.
7. Depth Control: Know When to Dive Deep
Some users want short answers. Others want all the details. A good AI should adjust its response to match the situation. HealthBench checks if the model can give both quick advice and deeper explanations when asked.
These seven skills cover much more than facts. They show whether an AI can think, communicate, and care like a clinician. So far, top models have done well in tone and safety, but many still miss key details or fail to ask enough questions.
That’s where HealthBench shows its value—by pointing out what still needs fixing before AI is ready for the real world.
Who’s Winning the Test So Far?

So, how are the top AI models doing on HealthBench? The scores tell a story of fast progress—but also a clear limit to what even the best models can handle today.
Top Scores So Far
OpenAI’s latest models have made big leaps. The earlier GPT-3.5 Turbo scored just 16% on HealthBench. That means it missed most of the important behaviors doctors were looking for.
Things improved with GPT-4o, which scored 32%, then jumped again with OpenAI’s o3 model, which hit 60%. That’s rapid progress in a short time.
Other major players also took the test:
- xAI’s Grok scored 54%
- Google’s Gemini 2.5 Pro reached 52%
These results show that newer models are learning fast. But even at 60%, the best model still misses 4 out of every 10 doctor-defined goals. That’s a sign of progress—but also a reminder that no AI is ready to replace real doctors yet.
Where They Did Well
The models did well in areas like emergency referrals. They could often spot red-flag symptoms and suggest going to the ER. They also showed strength in adjusting their tone and tailoring their language to different users—like switching between patient-friendly talk and medical terms for professionals.
This is important. It shows that AI is starting to handle high-stakes moments and audience differences with more skill.
Where They Still Struggle
But HealthBench also revealed weak spots. Most models stumbled in areas like:
- Context-seeking (asking follow-up questions)
- Completeness (giving full, clear answers)
- Health-data tasks (like summarizing notes)
- Global health (offering advice fit for low-resource settings)
The biggest issue? Many models don’t ask enough questions before giving advice. In medicine, that’s risky. If an AI gives answers based on missing information, it could make the wrong call—and someone could get hurt. That’s why context-seeking is one of HealthBench’s most critical skills.
Interestingly, “completeness” turned out to be the best predictor of a high score. That means the more thoroughly a model covers everything a user needs to know, the better it does. In health conversations, missing something important can be worse than getting a detail slightly wrong.
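To see what “best predictor” means in practice, imagine correlating each axis score with the overall score across several models. The toy sketch below does exactly that; every number and axis name here is invented for illustration, not an actual HealthBench result.

```python
# Toy illustration of "completeness is the best predictor of overall score":
# correlate each axis score with the overall score across a handful of models.

def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical axis scores for four models, plus their overall scores.
axes = {
    "completeness":  [0.30, 0.45, 0.55, 0.70],
    "communication": [0.60, 0.62, 0.65, 0.66],
    "accuracy":      [0.40, 0.50, 0.52, 0.63],
}
overall = [0.32, 0.46, 0.54, 0.60]

for name, scores in axes.items():
    print(f"{name:15s} r = {pearson(scores, overall):.2f}")
```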
Is the Grading Fair?
All responses are graded by GPT-4.1, a powerful AI model acting as the “referee.” But how reliable is it?
To check, researchers compared GPT-4.1’s grading with scores from real doctors. The results were promising: the AI grader often matched—or even outperformed—individual physicians in applying the scoring rules.
Still, doctors didn’t always agree with each other either. In fact, their agreement ranged from 55% to 75%, depending on the topic. This shows just how hard it is to judge complex medical conversations. So while HealthBench grading is consistent and scalable, some human review is still important, especially for high-risk use cases.
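One way to picture that check: if the model grader agrees with individual physicians about as often as physicians agree with one another, it is a reasonable stand-in. The sketch below computes simple pairwise agreement on toy verdicts; the real study used its own metrics and real physician labels, so treat this purely as an illustration of the idea.

```python
# Compare grader-vs-physician agreement against physician-vs-physician
# agreement on per-criterion "met / not met" verdicts (all labels invented).
from itertools import combinations

def agreement(a: list[bool], b: list[bool]) -> float:
    """Fraction of rubric criteria on which two graders gave the same verdict."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

physician_labels = {
    "dr_a": [True, True, False, True],
    "dr_b": [True, False, False, True],
    "dr_c": [True, True, False, False],
}
model_grader = [True, True, False, True]

doc_pairs = [agreement(x, y) for x, y in combinations(physician_labels.values(), 2)]
model_vs_docs = [agreement(model_grader, labels) for labels in physician_labels.values()]

print(f"physician-physician agreement: {sum(doc_pairs) / len(doc_pairs):.2f}")
print(f"model-physician agreement:     {sum(model_vs_docs) / len(model_vs_docs):.2f}")
```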
How Do Doctors Stack Up?

In an interesting twist, OpenAI also asked real doctors to write answers to the same HealthBench cases—without seeing what the AI said. The surprising result? Sometimes, the AI did better than the physicians, at least by the benchmark’s own rules.
Doctors could still improve older AI responses with edits. But with the newest models, it was harder to make major improvements. This suggests that today’s frontier models are getting really good—and even experts may need new tools or training to keep up with these rubric-based evaluations.
Why HealthBench Could Change the Game
Most older medical AI benchmarks, like MedQA or MMLU, rely heavily on multiple-choice questions pulled from exams. These formats test what a model remembers, but not how it behaves in a real conversation with a patient or doctor. They don’t reflect the messy, high-stakes world of healthcare, where context matters and the wrong tone or a missed detail can cause harm.
HealthBench changes that by using over 5,000 realistic, back-and-forth conversations. These aren’t just pop quizzes—they’re designed to feel like real interactions in a clinic or hospital. The responses are judged using detailed rubrics written by doctors.
These check not just for accuracy, but also for things like safety, communication clarity, and whether the AI asked enough questions before giving advice. This makes the scores far more meaningful, because they reflect how a real doctor would judge the AI’s performance.
| Benchmark Name | Primary Data Source | Main Task Type(s) | Key Evaluation Focus | Unique Contribution / Limitation (Brief) |
| --- | --- | --- | --- | --- |
| HealthBench | Simulated/adversarially generated health conversations | Conversational response generation | Physician-judged quality of conversational AI in diverse health scenarios (via rubrics) | Contribution: realistic multi-turn conversations, detailed physician rubrics, 7 themes. Limitation: simulated data, model-based grader. |
| MedQA/MultiMedQA | Medical exam questions (USMLE, Indian entrance exams, etc.) | Primarily multiple-choice QA; some open-ended QA | Medical knowledge recall and application (exam-style) | Contribution: broad medical knowledge coverage, well established. Limitation: lacks real-world interaction complexity, primarily exam-focused. |
| PubMedQA | PubMed abstracts | Yes/No/Maybe QA based on scientific text | Reasoning over biomedical research literature | Contribution: focus on scientific text comprehension. Limitation: not direct clinical interaction; specific to research questions. |
| CRAFT-MD | Simulated AI agent-LLM dialogues | Diagnostic dialogue, history-taking | Clinical conversational reasoning and diagnostic accuracy in simulated patient interactions | Contribution: natural diagnostic dialogues with AI-simulated patients. Limitation: scale and diversity differ from HealthBench; focused on diagnostic interactions. |
| MedHELM | Real electronic health records (EHRs), other datasets | Diverse: clinical decision support, note generation, patient communication, etc. | Real-world clinical applicability across a broad task taxonomy using EHR data | Contribution: use of real EHR data, extensive task taxonomy. Limitation: challenges of EHR data (privacy, noise, bias); complexity of integrating diverse datasets. |
Another key advantage of HealthBench is how it helps engineers. Instead of just saying how well a model did overall, it breaks things down into thousands of micro-critiques. Developers can see exactly where the AI struggled—maybe it didn’t clarify a vague symptom, or gave unsafe advice in a low-resource setting. This kind of feedback is incredibly useful for improving future models.
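As a rough illustration of how a developer might use that feedback, the sketch below groups per-criterion grades by theme and surfaces the weakest areas first. The record fields and numbers are hypothetical, not HealthBench’s actual output schema.

```python
# Hypothetical per-criterion grading records, grouped by theme to show where
# a model loses the most points. Field names and values are invented.
from collections import defaultdict

graded = [
    {"theme": "context_seeking",     "points": 5, "met": False},
    {"theme": "context_seeking",     "points": 4, "met": True},
    {"theme": "emergency_referrals", "points": 9, "met": True},
    {"theme": "global_health",       "points": 6, "met": False},
]

earned, possible = defaultdict(float), defaultdict(float)
for row in graded:
    if row["points"] > 0:
        possible[row["theme"]] += row["points"]
    if row["met"]:
        earned[row["theme"]] += row["points"]

# Print themes from weakest to strongest so developers see what to fix first.
for theme in sorted(possible, key=lambda t: earned[t] / possible[t]):
    print(f"{theme:20s} {earned[theme] / possible[theme]:.0%}")
```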
HealthBench is also open source, meaning anyone—from startups to hospitals—can use the same dataset and tools to test their models. But it’s still secure: it doesn’t include the actual answers used in evaluations, so models can’t “cheat” by memorizing them. That keeps the test fair and effective over time.
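In outline, running a rubric-based benchmark like this is a two-step loop: the model under test answers each multi-turn conversation, then a grader model checks the answer against every physician-written criterion. The sketch below shows that loop using the OpenAI Python client; the file name, field names, and grading prompt are placeholders rather than HealthBench’s actual tooling.

```python
# Outline of a rubric-based evaluation loop. File name, field names, and the
# grading prompt are placeholders, not HealthBench's actual tooling.
import json
from openai import OpenAI

client = OpenAI()

def grade_criterion(conversation, response, criterion) -> bool:
    """Ask a grader model for a yes/no verdict on a single rubric criterion."""
    verdict = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": (
                f"Conversation:\n{json.dumps(conversation)}\n\n"
                f"Candidate response:\n{response}\n\n"
                f"Criterion: {criterion['description']}\n"
                "Answer strictly 'yes' or 'no': does the response meet the criterion?"
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().lower().startswith("yes")

with open("healthbench_examples.jsonl") as f:  # placeholder file name
    examples = [json.loads(line) for line in f]

for ex in examples:
    # 1. The model under test answers the multi-turn conversation.
    answer = client.chat.completions.create(
        model="gpt-4o",
        messages=ex["conversation"],  # placeholder field name
    ).choices[0].message.content
    # 2. The grader checks each criterion; scoring then follows the earlier sketch.
    met = [grade_criterion(ex["conversation"], answer, c) for c in ex["rubric"]]
```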
Most importantly, HealthBench was built to grow. As AI models improve, the standard shouldn’t stay still. That’s why OpenAI created HealthBench Hard—a set of 1,000 extra-tough cases meant to challenge even the most advanced systems. While older benchmarks like MedQA are starting to hit their limits (with some models scoring nearly perfectly), HealthBench ensures the bar keeps rising.
In the end, what sets HealthBench apart is its focus on real-world usefulness. It doesn’t just ask what the AI knows—it asks how well it performs when lives might be on the line. That shift, from exam trivia to clinical trust, could help shape the future of AI in medicine.
Caveats & Critiques (Because This Is Healthcare)
HealthBench is a big step forward in testing AI for medicine. But no system is perfect—and even the creators of HealthBench openly acknowledge that it has its limits. Understanding these challenges helps keep expectations realistic and shows where more work is still needed.
Doctors Don’t Always Agree
One key challenge in building HealthBench was the natural variation in how doctors think. Medicine isn’t always black and white. Different doctors may read the same patient case and give slightly different advice based on their specialty, experience, or how they interpret the problem.
This shows up in the grading process, too. When multiple doctors scored the same conversation, they agreed between 55% and 75% of the time. That’s decent, but not perfect. And it reminds us that even human experts often see things differently.
Most of HealthBench’s rubrics were written by individual physicians and weren’t cross-checked by others—except in the “Consensus” set. That means some scoring criteria reflect just one doctor’s judgment. While the total number of rubrics (nearly 50,000) covers a lot of ground, it’s still a limitation in terms of consistency.
Doctor-Written Responses Have Their Own Issues
To compare AI answers with human ones, HealthBench also asked doctors to write “ideal” responses to the test conversations. But this task isn’t something most doctors do every day. And because of how the task was set up—with specific formats and some constraints—the responses might not reflect what doctors would naturally say in the real world.
Many of the physician-written answers were shorter and less detailed than what the top AI models produced. So while they serve as a baseline, they don’t always represent a gold standard, and should be interpreted with care.
Simulated Conversations Aren’t the Real World
HealthBench uses carefully designed, simulated health chats. These conversations are realistic, but they’re still artificial. They don’t include real patient records, real-time decision-making, or unpredictable medical emergencies.
This means HealthBench scores tell us how well a model performs in a test setting—but not necessarily how it will behave with real patients. Future studies need to link benchmark scores with outcomes in actual healthcare settings.
Biases Can Still Sneak In
Although the data was made to be broad and challenging, it still comes from synthetic generation and adversarial testing. These methods are smart, but they can still miss rare cases, unusual symptoms, or diverse cultural perspectives. There’s a risk of the AI being trained mostly on common patterns—and failing when something unexpected shows up.
Rubric Creation Is Expensive and Time-Consuming
Writing detailed rubrics for thousands of conversations took a huge effort from hundreds of doctors. That’s part of what makes HealthBench so valuable—but it also makes it hard to repeat. Expanding to more languages, specialties, or health systems could be a challenge because it requires so many expert hours and resources.
Grading AI with AI Has Risks
HealthBench uses GPT-4.1 to grade AI responses, because it can handle complex rubrics and apply them at scale. But that also creates a risk: future AI models might start writing responses that “sound right” to GPT-4.1, instead of being truly useful to a human doctor or patient. This is a kind of “teaching to the test” that could skew results.
Benchmarks Age Quickly
Medical knowledge changes fast. So do AI models. That means any benchmark, including HealthBench, can become outdated if it’s not regularly updated. OpenAI tried to address this with the “HealthBench Hard” set, but keeping benchmarks fresh is an ongoing challenge.
Models Might Act Differently Under Pressure
One final thing to consider: AI models might behave differently when they know they’re being judged. In a benchmark, they might be extra cautious, use more formal language, or ask more questions than they would in the real world. This can affect how we interpret their performance and raises the question: how do they behave when no one is watching?
Despite all these caveats, HealthBench is a serious leap forward. But just like in medicine, no test is complete without context. These limitations remind us to use AI benchmarks wisely—and never forget the human side of healthcare.
What HealthBench Means for…

HealthBench isn’t just a technical tool—it’s a turning point. Its design, openness, and growing use have wide-reaching implications for many different groups working at the intersection of healthcare and AI.
Researchers: A Better Way to Measure Progress
For researchers, HealthBench offers something that’s been missing: a realistic, detailed, and shared way to test medical AI.
Instead of relying on outdated multiple-choice benchmarks, they can now use a system that mirrors real-life conversations and clinical decision-making. Because it’s open source, researchers can run the same tests, adapt them to new medical specialties, and compare models head-to-head in a fair way.
This opens the door to more collaborative work, faster discovery of model flaws, and clearer evidence of true progress.
Hospitals & Startups: A Smarter First Gate
Hospitals and startups exploring medical AI need a way to screen tools before using them with real patients.
HealthBench acts like a first gate—a tough test that helps identify which models are safe, helpful, and worth moving to clinical trials. While passing HealthBench isn’t enough on its own, it’s a strong starting point for making smart, safe choices. It can save time, flag problems early, and support responsible deployment in high-stakes environments.
Regulators: A Transparent Yardstick
For regulators, HealthBench offers a clear and transparent standard. It doesn’t just say a model is “good”—it shows why, using thousands of specific rubrics grounded in physician judgment.
This kind of structure is essential for creating trust and building rules that protect patients while still allowing innovation. If HealthBench scores become widely recognized, they could help set the baseline for what “safe and effective” really means in medical AI.
Patients: Trust Behind the Scenes
Patients might not read HealthBench scores, but they benefit from what those scores represent.
When an AI chatbot or assistant in a clinic has passed a benchmark like this, it means it’s been tested in real conversations and judged by real doctors. That gives patients more confidence that the advice or information they’re getting is thoughtful, clear, and safe. Especially as AI becomes more common in care settings, benchmarks like HealthBench can help build trust one safe interaction at a time.
Closing Takeaway: A Smarter Stethoscope, Not a Stand-In
HealthBench doesn’t declare that AI is ready to replace doctors. Instead, it offers something much more valuable: a clear, grounded way to measure whether these tools are truly helpful, safe, and trustworthy in real medical conversations. By focusing on how AI performs in full, back-and-forth interactions—and judging that performance through detailed rubrics created by real physicians—HealthBench raises the standard for what good medical AI should look like.
The goal isn’t perfection. It’s progress. And with top models still scoring well below 100%, the message is clear: there’s plenty of room to improve. But thanks to HealthBench, we now have a tool that helps guide that improvement, points out weaknesses, and supports smarter development.
As AI continues to find its way into hospitals, clinics, and patient portals, HealthBench helps ensure that the technology acts more like a smarter stethoscope—a tool that supports care—rather than a shortcut around real medical expertise.