OpenAI’s newest AI models—called o3 and o4-mini—can solve logic puzzles and explain complex ideas better than ever. These models were built to “think” more clearly, using advanced reasoning skills.
But there’s a twist: they also make up facts far more often than older versions. According to The New York Times, these new models doubled or even tripled the number of wrong answers they gave in OpenAI’s own tests. That’s not what you’d expect from “smarter” AI.
This creates a strange and troubling paradox. The models are better at thinking through problems but worse at sticking to the truth. They can explain their reasoning in great detail—but sometimes it’s all based on things that aren’t real.
This raises a big question: What good is sharper thinking if the answers aren’t trustworthy?
OpenAI admits this is a serious issue and says more research is needed. Early ideas suggest that the way these models are trained—especially how they’re rewarded for giving confident, detailed answers—might make them more likely to bluff. For now, the problem seems to be growing faster than the fixes.
In the sections ahead, we’ll look at how these hallucinations work, what’s causing them, who’s most at risk, and what’s being done to stop them.
What is AI Hallucination?

AI hallucination happens when a language model confidently gives an answer that sounds real—but isn’t. These mistakes aren’t just small errors. The AI can make up facts, names, quotes, and even links that look and feel believable. Because it speaks with confidence, most people won’t realize the answer is false unless they check it carefully.
For example, an AI might say a medical treatment has a 90% success rate, but there’s no real study behind that number. Or it might cite a legal case that never existed. These aren’t just wrong—they’re completely invented.
This is a big problem because AI tools are designed to sound natural and persuasive. When they get something wrong, they don’t warn you. They just keep talking like they know what they’re doing. That’s what makes hallucinations so dangerous: false information can spread easily, especially when it sounds trustworthy.
OpenAI’s Benchmarks: PersonQA and SimpleQA
To measure how often this happens, OpenAI uses two internal tests: PersonQA and SimpleQA.
PersonQA checks how well a model can recall facts about people. It asks questions like where someone was born or what job they had. These are things that don’t usually change, so the right answer should be clear.
SimpleQA asks short, fact-based questions across a wide range of topics—like history, science, or geography. These questions have just one correct answer and are carefully written to test whether the AI will stay factual or start making things up.
These benchmarks help OpenAI stress-test how truthful its models are. They’re designed to catch hallucinations and see how each model handles questions with clear right and wrong answers.
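To make the difference between accuracy, hallucination rate, and declined answers concrete, here is a minimal scoring sketch. It is not OpenAI's grading code (the real benchmarks use more careful, model-assisted grading); the exact-string matching and the "I don't know" check are simplifying assumptions for illustration.

```python
# Illustrative only: OpenAI's real benchmarks use more sophisticated, model-based
# grading. This toy scorer just compares a model's short factual answers against
# known-correct ("gold") answers and separates wrong answers from declined ones.

def score_factual_qa(predictions: dict[str, str], gold: dict[str, str]) -> dict[str, float]:
    """Return accuracy, hallucination, and not-attempted rates for single-answer questions."""
    correct = wrong = not_attempted = 0
    for question, gold_answer in gold.items():
        answer = predictions.get(question, "").strip().lower()
        if not answer or "i don't know" in answer:
            not_attempted += 1   # declining to answer is not counted as a hallucination
        elif gold_answer.lower() in answer:
            correct += 1         # the answer contains the known fact
        else:
            wrong += 1           # a confident but incorrect answer counts as a hallucination
    total = len(gold)
    return {
        "accuracy": correct / total,
        "hallucination_rate": wrong / total,
        "not_attempted_rate": not_attempted / total,
    }

# Two questions: one answered correctly, one answered with an invented year.
gold = {"Where was Marie Curie born?": "Warsaw", "What year did Apollo 11 land?": "1969"}
preds = {"Where was Marie Curie born?": "Warsaw, Poland", "What year did Apollo 11 land?": "1972"}
print(score_factual_qa(preds, gold))  # accuracy 0.5, hallucination_rate 0.5
```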
How Much Worse Did It Get? Comparing o3 and o4-mini to Older Models
OpenAI’s latest models, o3 and o4-mini, were built to be smarter and better at reasoning. But in testing, they actually hallucinated more than older versions like o1 and o3-mini.
On the PersonQA benchmark:
- The older o1 model hallucinated on 16% of questions.
- The newer o3 jumped to 33%.
- The o4-mini did even worse, hallucinating on 48% of questions.
On the SimpleQA benchmark:
- o1 made up facts in 44% of answers.
- o3 was worse at 51%.
- o4-mini hallucinated on 79% of questions, nearly 8 out of 10.
This shows a clear and surprising trend: the newer the model, the more often it makes things up. The increase in hallucination rates is not small—it’s dramatic.
Table 1: Comparative Hallucination and Accuracy Rates in OpenAI Models (PersonQA & SimpleQA Benchmarks)
| Model | Benchmark | Hallucination Rate (%) (lower is better) | Accuracy (%) (higher is better) | Source(s) |
|---|---|---|---|---|
| o1 | PersonQA | 16 | 47 | 1 |
| o3-mini | PersonQA | 14.8 | N/A | 1 |
| o3 | PersonQA | 33 | 59 | 1 |
| o4-mini | PersonQA | 48 | 36 | 1 |
| GPT-4.5 | PersonQA | 19 | N/A | 12 |
| GPT-4o | PersonQA | 30 | N/A | 12 |
| o1 | SimpleQA | 44 | 47 | 2 |
| o3 | SimpleQA | 51 | 49 | 2 |
| o4-mini | SimpleQA | 79 | 20 | 2 |
N/A: Data not consistently available in the same comparative context across all cited sources for these specific models.
OpenAI says that part of the reason may be that models like o3 give longer, more detailed answers. That means they say more true things—but also more false ones. So users end up with more information overall, but it’s harder to tell what’s real and what’s not.
Leading Theories: Why Might Reasoning Models Hallucinate More?

OpenAI’s new models, o3 and o4-mini, were built to reason better. They were supposed to handle complex problems more logically. But as we’ve seen, they’re also more likely to make things up. So, why does “smarter” AI seem to lie more?
The truth is, no one fully knows yet. OpenAI says clearly in its reports: more research is needed. Still, researchers have shared a few working theories. These ideas help explain why better thinking might come with worse accuracy.
“More Words, More Risk” Hypothesis
OpenAI says the o3 model makes more claims overall. Some of these are true, but many are not. The more it says, the more chances it has to get things wrong. This model doesn’t just answer—it explains, justifies, and adds extra details. But those extras can be where the falsehoods slip in.
“Small-Brain Penalty” Hypothesis
The o4-mini model is fast and efficient because it’s smaller than others. But that smaller size comes with a trade-off: less knowledge. OpenAI suggests that because the model can’t “remember” as much about the world, it fills in the gaps—often incorrectly.
“Reward-for-Confidence” Hypothesis
These models are trained using a method called RLHF—short for Reinforcement Learning from Human Feedback. This process teaches the AI to give answers that are helpful, clear, and confident. But that push for confidence might backfire. When a model doesn’t know something, it may still answer boldly, leading to confident-sounding falsehoods.
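A toy scoring rule makes the incentive easier to see. This is not how OpenAI's reward models actually work; it is a deliberately crude sketch showing how rewarding length and a confident tone, while penalizing refusals, leaves "I don't know" as the worst-scoring answer.

```python
# A deliberately crude toy, NOT OpenAI's actual reward model. It shows how a scoring
# rule that favors length and a confident tone, and penalizes refusals, makes
# "I don't know" the worst-scoring option even when the model is genuinely unsure.

def toy_reward(answer: str) -> float:
    text = answer.lower()
    score = 0.0
    if "i don't know" in text:
        score -= 1.0                                    # refusing reads as "unhelpful"
    score += min(len(answer.split()), 50) / 50          # more detail scores higher
    if not any(w in text for w in ("maybe", "possibly", "might")):
        score += 0.5                                    # confident tone scores higher
    return round(score, 2)

print(toy_reward("I don't know."))                                             # -0.44: lowest
print(toy_reward("It might be around 80%, but I'm not certain."))              #  0.18: middle
print(toy_reward("The success rate is exactly 90%, per a large 2021 trial."))  #  0.72: highest, even if invented
```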
“Chain-of-Thought Drift” Hypothesis
Reasoning models are built to think step-by-step. That sounds great—until a small error early in the chain leads to bigger mistakes down the line. These “chains of thought” are meant to help the model reason like a human. But if one link in the chain is wrong, the final answer might be a fully structured, logical-sounding hallucination.
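A rough back-of-the-envelope calculation shows how quickly small per-step errors add up. The 95% figure below is an illustrative assumption, not a measured property of o3 or o4-mini, and it treats the steps as independent, which real reasoning chains are not.

```python
# Back-of-the-envelope illustration, not a measured property of any model: if each
# step in a reasoning chain is independently right 95% of the time, the chance that
# the whole chain is right falls off quickly as the chain gets longer.

per_step_reliability = 0.95
for steps in (1, 5, 10, 20):
    chain_reliability = per_step_reliability ** steps
    print(f"{steps:>2} steps: whole chain correct ~{chain_reliability:.0%}")
# 1 step ~95%, 5 steps ~77%, 10 steps ~60%, 20 steps ~36%
```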
These are only hypotheses, not proven facts. Even OpenAI admits it doesn’t fully understand why this is happening.
Some experts have compared today’s AI models to black boxes—even the people building them don’t always know how they work. As the industry shifts from making models bigger to making them better at reasoning, new problems are showing up. Hallucinations may be one of the costs of this new direction.
AI Hallucination: Who’s at Risk First?
AI hallucinations aren’t just a tech problem—they’re a real-world risk. When advanced models like o3 and o4-mini make up facts or processes, it can cause serious damage, especially in fields where accuracy matters most.
Professionals in High-Stakes Fields
People in jobs that depend on trustworthy information—like lawyers, doctors, and analysts—are especially vulnerable. These users often rely on AI tools to speed up research, review documents, or summarize complex topics. But when the AI makes up a law, a diagnosis, or a financial detail, the consequences can be severe.
In the legal world, some attorneys have already faced penalties for submitting fake case citations from ChatGPT. In medicine, even small mistakes in treatment suggestions could lead to harm. In finance, a wrong number or false prediction can lead to major losses.
Businesses That Use AI to Save Time
Companies are also at risk. Many businesses turn to AI to write contracts, answer customer questions, or generate reports. But if the AI gets facts wrong, the results could be costly—think compliance violations, legal trouble, or damage to the brand.
Some businesses now see hallucination as a dealbreaker. They want AI to save time, not create extra work by requiring teams to double-check everything it says. If workers have to constantly fact-check AI output, the return on investment starts to disappear.
Everyday Users
It’s not just experts and companies. Anyone using AI for schoolwork, health advice, or even basic research can be misled. The biggest problem is that the AI sounds so sure of itself—even when it’s wrong. That confidence makes it easy to trust and hard to spot the lies.
As hallucinations get more detailed and complex, they also get harder to catch. Some models now explain how they got their answers—even if those steps are completely made up. That makes human review harder, not easier. And it raises the bar for users: checking AI answers now means not just spotting mistakes, but spotting lies hidden inside good-sounding logic.
Fixes on the Table (Still Early Days)

OpenAI knows hallucinations are a serious problem. The company is actively working on ways to reduce them, but so far, there’s no perfect fix. The tools and techniques being used today can help, but they don’t solve the problem completely—especially not for the newest models like o3 and o4-mini.
OpenAI has added tools to ChatGPT to reduce hallucinations, especially for users on paid plans. These include:
- Web browsing: The model checks the internet for real-time facts.
- Code interpreter (data analysis): Helps with calculations, logic, and charts.
- Deep research tools: Pulls answers from trusted sources with real citations.
These tools can improve accuracy, but they don’t eliminate errors. The model still might say something false, especially when it’s not connected to live data.
Another promising fix is Retrieval-Augmented Generation (RAG). This method lets the model grab facts from verified sources before it gives an answer. With this setup, GPT-4o reached 90% accuracy on the SimpleQA benchmark.
RAG works well because it adds grounding—the model doesn’t rely only on memory, but checks its facts first.
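The pattern itself is simple enough to sketch. The snippet below is a minimal toy version: the document store, the keyword-overlap retrieval, and the prompt wording are all illustrative assumptions, and production RAG systems typically use vector embeddings and a dedicated retriever instead.

```python
# Minimal RAG sketch. Assumptions: a tiny in-memory "document store" and naive
# keyword-overlap retrieval; production systems typically use vector embeddings
# and a real retriever. The shape of the pattern is the point: look the facts up
# first, then ask the model to answer only from what was retrieved.

DOCUMENTS = [
    "PersonQA is an internal OpenAI benchmark of factual questions about people.",
    "SimpleQA is a benchmark of short, single-answer factual questions.",
    "Retrieval-Augmented Generation grounds a model's answer in retrieved source text.",
]

def retrieve(question: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Return the documents that share the most words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def build_grounded_prompt(question: str) -> str:
    context = "\n".join(f"- {s}" for s in retrieve(question, DOCUMENTS))
    return (
        "Answer using ONLY the sources below. "
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# The resulting prompt is what you would send to the language model of your choice.
print(build_grounded_prompt("What is SimpleQA?"))
```

The grounding comes from the instruction to answer only from the retrieved text; the model is no longer free to fall back on whatever it half-remembers from training.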
There’s also prompt engineering—changing how we ask questions. Giving clear instructions, examples, or asking for step-by-step logic can help reduce mistakes. Some prompts even tell the AI, “Only answer if you’re sure.” That can make a big difference, though it still requires human effort.
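Here is one hypothetical prompt along those lines. The wording is illustrative, not an official template; the point is simply that the instructions give the model explicit permission to admit uncertainty instead of guessing.

```python
# One hypothetical prompt pattern, not an official template. The instructions give
# the model explicit permission to admit uncertainty instead of guessing.

PROMPT_TEMPLATE = """You are a careful research assistant.
Answer the question below, following these rules:
1. Only state facts you are confident are true.
2. If you are not sure, say "I'm not sure" instead of guessing.
3. Do not invent citations, statistics, or quotes.

Question: {question}"""

print(PROMPT_TEMPLATE.format(question="What is the success rate of treatment X?"))
```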
SmythOS: Combining RAG, Web Scraping, and Workflow Control
Another promising direction is using platforms that combine retrieval, user-defined data, and structured automation. SmythOS is a leading example of this approach.
It allows users to scrape web content, upload trusted data sources, and store them in a system that supports Retrieval-Augmented Generation (RAG). Its LLM component can then interact directly with this information—responding based on real, grounded data instead of relying only on model memory.
This setup offers more than just accurate answers. It enables users to build repeatable, auditable AI workflows that reduce hallucinations at the source.
For teams that need both flexibility and control—like legal, compliance, research, or enterprise operations—SmythOS bridges the gap between powerful language models and practical, verifiable output.
Fixing the Root Problem (Still a Work in Progress)
OpenAI is also trying to improve the training process itself. This means better data, better rules for rewarding correct answers, and stricter penalties for making things up. They’re also updating models regularly and blocking certain risky behaviors—like trying to identify people in images.
But many of these fixes are what you might call “bolt-ons.” They help guide or correct the model after it’s already built. They don’t change the model’s core behavior. To make hallucinations truly rare, the AI systems themselves need to be fundamentally better at knowing what’s real and what’s not.
And even when these tools work, they come with trade-offs. For example, web browsing adds privacy concerns for businesses. Extra human checking takes time and money. And prompt engineering doesn’t scale well for casual users.
Fighting AI Hallucination: Practical Tips for Readers
Advanced AI tools are powerful—but not perfect. With hallucinations on the rise in newer models, it’s more important than ever to use these tools with care. Whether you’re a casual user, a developer, or a business leader, here are some smart habits to follow when working with AI.

What Users Can Do: Tips for Safer AI Use
Everyday users have a key role to play in keeping AI reliable. The first step is a mindset shift. Instead of accepting everything the model says at face value, users should approach each response with healthy skepticism. Just because an AI sounds confident doesn’t mean it’s correct.
When using AI for important tasks—like schoolwork, research, or health advice—it’s smart to treat the model’s output as a first draft or a rough idea, not a final answer. Independent verification is essential. If something sounds off, or even if it sounds too perfect, checking the facts against trusted sources is the safest move.
It also helps to understand the model’s limitations. Unless it’s connected to real-time data, its knowledge stops at a fixed point. It won’t know about current events or recent developments. And because it’s trained to sound fluent and helpful, it may give a smooth, confident answer even when it’s guessing. That makes critical thinking even more important.
Using grounded AI platforms like SmythOS adds another layer of protection. When accuracy matters, tools that fetch and reference real data give users far more confidence in the output.
What Organizations Should Prioritize
For developers and businesses building with AI, the stakes are even higher. When AI-generated mistakes affect legal, financial, or regulatory outcomes, the risks can be severe. That’s why organizations need stronger systems, not just smarter models.
Responsible deployment starts with oversight. Any critical AI workflow should include a human-in-the-loop review process to catch potential errors before they cause harm. But that’s only one piece of the puzzle. Grounded architectures like RAG—especially when integrated into platforms like SmythOS—should be the standard, not the exception. When the model pulls from verified external sources instead of guessing, accuracy improves significantly.
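A human-in-the-loop gate can be as simple as a rule that decides which drafts a person must check before they go out. The sketch below is an assumption-laden illustration: the citation check and the approved-source list stand in for whatever verification a real pipeline would perform.

```python
# Human-in-the-loop sketch. The citation check and the approved-source list are
# stand-ins for whatever verification a real pipeline would perform; field names
# are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Draft:
    text: str
    cited_sources: list[str] = field(default_factory=list)

def needs_human_review(draft: Draft, approved_sources: set[str]) -> bool:
    """Flag drafts that cite nothing, or cite anything outside the approved list."""
    if not draft.cited_sources:
        return True
    return any(src not in approved_sources for src in draft.cited_sources)

approved = {"internal-policy-v3.pdf", "contracts-db"}
draft = Draft(text="Clause 7 allows early termination.", cited_sources=["random-blog.example.com"])
if needs_human_review(draft, approved):
    print("Route to a human reviewer before this goes to the client.")
```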
Clear and consistent prompt design is also crucial. Teams should define best practices that guide the model toward clarity, precision, and honest uncertainty. When training or fine-tuning models, using clean and accurate datasets helps reduce built-in biases or misinformation.
Transparency is another key responsibility. Organizations should let users know how the AI works, what it can and can’t do, and that hallucinations are still possible. That way, users can engage with these tools safely and with realistic expectations.
Finally, ongoing monitoring is essential. Models need to be updated, tested, and refined as they’re used. Tracking hallucination rates in real-world applications can reveal patterns and inform future improvements.
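Even a lightweight log of review outcomes is enough to start tracking this. The record format below is hypothetical; the idea is just to compute, per period, what share of reviewed answers contained a fabrication.

```python
# Sketch of hallucination-rate tracking; record fields are hypothetical. Each entry
# is the outcome of a human or automated fact-check on one production response.

from collections import Counter, defaultdict

review_log = [
    {"week": "2025-W18", "verdict": "ok"},
    {"week": "2025-W18", "verdict": "hallucination"},
    {"week": "2025-W19", "verdict": "ok"},
    {"week": "2025-W19", "verdict": "ok"},
    {"week": "2025-W19", "verdict": "hallucination"},
]

per_week = defaultdict(Counter)
for record in review_log:
    per_week[record["week"]][record["verdict"]] += 1

for week, counts in sorted(per_week.items()):
    rate = counts["hallucination"] / sum(counts.values())
    print(f"{week}: {rate:.0%} of reviewed answers contained a fabrication")
```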
Closing Thought
Today’s most advanced AI models are impressive. They can break down complex problems, explain their reasoning step-by-step, and hold fluent conversations that sound almost human. But beneath all that polish lies a serious problem: the truth doesn’t always come with the answer.
As OpenAI’s latest models show, better reasoning doesn’t always mean better facts. In fact, the opposite might be happening. The more these models try to “think,” the more chances they have to make things up—whether it’s a fake statistic, a made-up process, or a citation that leads nowhere. And because they sound confident, even wrong answers can feel right.
This creates a challenge for everyone—from casual users to legal teams, researchers, and business leaders. If we can’t fully trust what the AI says, then using it safely requires more effort, more oversight, and better tools.
That’s why responsible engagement matters. Users need to stay skeptical, double-check key facts, and understand where a model’s knowledge begins and ends. Businesses need to build workflows that include validation, not just automation.