OpenAI's recent System Card, dated April 16, 2025, details the capabilities and safety evaluations of its new o3 and o4-mini AI models. Among the various safety challenges assessed, the report provides specific findings on the models' tendency to "hallucinate" (generate inaccurate or fabricated information), as measured by the PersonQA evaluation benchmark.
The PersonQA Benchmark
The evaluation focused on the PersonQA dataset, which consists of questions paired with publicly available facts. This benchmark is designed to measure how often models generate incorrect answers (hallucinate) when attempting to respond, alongside their overall accuracy. Two key metrics were reported:
Accuracy: Measures the proportion of questions the model answered correctly (higher is better).
Hallucination Rate: Measures how often the model produced inaccurate or fabricated information in its response (lower is better).
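To make the two metrics concrete, here is a minimal sketch of how a PersonQA-style score could be tallied. This is not OpenAI's grading code; the labels and the assumption that each answer is graded as correct, hallucinated, or declined are purely illustrative.

# Hypothetical sketch of PersonQA-style scoring; not OpenAI's actual pipeline.
from collections import Counter

def score(graded_answers):
    # graded_answers: one label per question, assumed to be
    # 'correct', 'hallucinated', or 'declined' (no attempt).
    counts = Counter(graded_answers)
    total = len(graded_answers)
    accuracy = counts["correct"] / total                  # higher is better
    hallucination_rate = counts["hallucinated"] / total   # lower is better
    return accuracy, hallucination_rate

# Example using o3's reported rates on a notional 100-question set:
answers = ["correct"] * 59 + ["hallucinated"] * 33 + ["declined"] * 8
print(score(answers))  # (0.59, 0.33)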
o4-mini Performance: Expected Trade-offs for Size
The smaller o4-mini model demonstrated significantly lower performance compared to both the new o3 model and the existing o1 model on the PersonQA evaluation.
Accuracy: o4-mini scored 0.36.
Hallucination Rate: o4-mini scored 0.48.
The report notes that this underperformance is largely expected: smaller models generally possess less world knowledge, which often correlates with a higher tendency to hallucinate when faced with factual queries. o4-mini's hallucination rate (0.48) was considerably higher than both o1's (0.16) and o3's (0.33), and its accuracy (0.36) lagged behind o1 (0.47) and o3 (0.59).
o3 Performance: Higher Accuracy but Also More Hallucinations
The performance of the larger o3 model presented a more nuanced picture when compared to the existing o1 model:
Accuracy: o3 scored 0.59.
Hallucination Rate: o3 scored 0.33.
While o3 achieved higher accuracy (0.59) than o1 (0.47), indicating it could answer more questions correctly, it also exhibited a notably higher hallucination rate (0.33) than o1 (0.16).
Interpreting the o3 Results
OpenAI offers a specific observation regarding the difference between o1 and o3. The report states that o3 appears to "make more claims overall" during its responses. This tendency has a dual effect:
It leads to more accurate claims, contributing to its higher overall accuracy score compared to o1.
Simultaneously, it results in more inaccurate/hallucinated claims, explaining its higher hallucination rate compared to o1.
Essentially, while o3 seems capable of accessing and providing more factual information correctly, it also takes more "risks" by making additional claims, some of which turn out to be incorrect. The report explicitly states that "More research is needed to understand the cause of this result."
What does this mean?
“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability,” said OpenAI spokesperson Niko Felix in an email to TechCrunch.
In other words, OpenAI continues to treat hallucination as an open research problem. o3 and o4-mini are nonetheless important, high-performing state-of-the-art models, and both contribute to the ongoing pace of improvement across the AI industry.
The hallucination findings from the OpenAI System Card reveal distinct profiles for the new models. o4-mini performs as anticipated for a smaller model, exhibiting lower accuracy and a higher tendency to hallucinate. The o3 model improves on o1's accuracy but comes with a trade-off: an increased propensity to generate hallucinations. This specific characteristic of o3, making more claims overall and thereby producing both more correct and more incorrect statements, is highlighted by OpenAI as an area requiring further investigation. These results underscore the ongoing challenge in large language model development of enhancing knowledge and capability while simultaneously minimizing factual inaccuracies.
OpenAI's full o3 and o4-mini System Card is available here:
https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf
OpenAI's Model Information on o3 and o4-mini:
https://openai.com/index/introducing-o3-and-o4-mini/