Tech giant OpenAI has hit an unexpected roadblock with its latest artificial intelligence models. The company’s new reasoning models, o3 and o4-mini, are showing a concerning spike in hallucination rates – essentially fabricating information – compared to their predecessors.
This development has surprised the company’s own engineers and industry watchers alike, as it reverses years of steady improvement in AI reliability. While each previous generation of OpenAI’s large language models had gradually gotten better at avoiding hallucinations, these new models are suddenly performing worse.
A Step Backward in AI Reliability
According to OpenAI’s internal testing, the new o3 model hallucinated in 33% of cases on the company’s PersonQA benchmark. That’s roughly double the rate of previous models like o1 (16%) and o3-mini (14.8%). Even more troubling, the o4-mini model performed worse still, hallucinating in nearly half of all test cases – a staggering 48%.
This setback has raised serious concerns throughout the AI research community. When AI systems confidently present false information as fact, it undermines user trust and limits how these technologies can be safely used in important applications.

“What we’re seeing is unusual for a company that has built its reputation on steady, measurable progress in AI safety,” said tech analyst Sarah Chen. “These hallucination rates could potentially undermine years of work building public trust in AI systems.”
Mystery Behind the Decline
Perhaps most concerning is that OpenAI itself doesn’t fully understand why this is happening. In its technical documentation, the company openly admits that “more research is needed” to figure out why scaling up these reasoning models is leading to more frequent hallucinations.
Neil Chowdhury, a researcher at the nonprofit AI lab Transluce and a former OpenAI employee, suggests the reinforcement learning methods used to develop these models might be amplifying problems that older techniques managed to avoid. His team found that o3 fabricates not only facts but also actions it claims to have taken – like pretending to run code on hardware that doesn’t exist.
“It’s as if the models are becoming more confident but not necessarily more accurate,” Chowdhury explained. “They’re generating more claims overall, which means both more correct answers and more incorrect ones.”
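Chowdhury’s point can be illustrated with a back-of-the-envelope calculation: if a model asserts more claims per answer at roughly the same precision, the absolute number of both correct and hallucinated claims grows. The sketch below is a hypothetical illustration of that arithmetic, not Transluce’s methodology; the claim counts and precision figures are invented for demonstration.

```python
# Hypothetical illustration: a model that asserts more claims per answer
# produces more correct claims AND more hallucinations, even if its
# precision (share of claims that are accurate) barely changes.
# All numbers below are invented for demonstration purposes.

def claim_breakdown(claims_per_answer: int, precision: float) -> tuple[float, float]:
    """Return (correct, hallucinated) claims per answer for a given precision."""
    correct = claims_per_answer * precision
    hallucinated = claims_per_answer * (1 - precision)
    return correct, hallucinated

# A more conservative model vs. a more talkative one (both hypothetical).
older = claim_breakdown(claims_per_answer=10, precision=0.84)
newer = claim_breakdown(claims_per_answer=25, precision=0.80)

print(f"conservative model: {older[0]:.1f} correct, {older[1]:.1f} hallucinated claims per answer")
print(f"talkative model:    {newer[0]:.1f} correct, {newer[1]:.1f} hallucinated claims per answer")
# The talkative model yields both more correct claims (20.0 vs 8.4)
# and far more hallucinated ones (5.0 vs 1.6) per response.
```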
Despite these issues, the new models do excel in certain areas. The o3 model achieved an impressive 69.1% score on the SWE-bench coding benchmark, with o4-mini close behind at 68.1%. These are significant improvements in coding and mathematical capabilities.
However, the practical problems are already evident. Kian Katanforoosh, a Stanford adjunct professor and CEO of the startup Workera, noted that while o3 performs exceptionally well on coding tasks compared to competitors, it frequently generates broken website links – URLs that simply don’t exist.
“For businesses relying on these models, such hallucinations can be more than just annoying – they can actively harm productivity and decision-making,” Katanforoosh said. “Imagine building a product roadmap based on AI research that includes references to non-existent studies or tools.”
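For teams hit by the broken-link problem Katanforoosh describes, one pragmatic stopgap is to check model-generated URLs before they reach users. The snippet below is a minimal sketch of such a check using Python’s `requests` library; the example URLs are placeholders, and a production setup would also want retries, rate limiting, and allow-lists.

```python
# Minimal sketch: flag model-generated URLs that don't resolve before
# surfacing them to users. The example URLs are placeholders, not real output.
import requests

def url_is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL responds with a non-error status code."""
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        return resp.status_code < 400
    except requests.RequestException:
        return False

candidate_links = [
    "https://example.com/real-docs",         # placeholder for a valid link
    "https://example.com/made-up-page-xyz",  # placeholder for a hallucinated link
]

for link in candidate_links:
    status = "ok" if url_is_reachable(link) else "possibly hallucinated"
    print(f"{link}: {status}")
```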
Industry Impact and Future Challenges
This spike in hallucination rates comes at a crucial moment for OpenAI, which faces intense competition from rivals like Google, Meta, xAI, Anthropic, and DeepSeek. The company had been counting on these new reasoning models to set a new industry standard, but the unexplained rise in hallucinations could damage user trust.
AI ethics researcher Maya Johnson points out the fundamental challenge: “While some creative ‘hallucination’ can be useful for brainstorming or generating novel ideas, these rates are simply too high for enterprise or scientific applications where accuracy is non-negotiable.”
OpenAI has acknowledged the seriousness of the issue and is dedicating resources to understanding and addressing the root causes. The company has also called on the broader AI research community to help investigate this phenomenon.
As the race for more capable AI continues, this development serves as a sobering reminder that models can grow more sophisticated in some ways while simultaneously struggling with basic reliability. For now, users of these advanced models may need to exercise extra caution and verify outputs before relying on them.