OpenAI’s recent claims of record-breaking performance on the EpochAI FrontierMath benchmark have sparked significant criticism. Allegations have emerged that OpenAI had access to a substantial portion of the test data, raising concerns about the validity of the results, and critics have drawn parallels between this controversy and the infamous Theranos scandal.
EpochAI’s associate director, Tamay Besiroglu, acknowledged that OpenAI’s involvement in developing the benchmark was not disclosed upfront. Besiroglu admitted, “We made a mistake in not being more transparent about OpenAI’s involvement.” He also revealed that a contractual agreement restricted disclosure of OpenAI’s role until the launch of the o3 model.
According to Besiroglu, OpenAI accessed a large portion of the FrontierMath dataset, though an unseen hold-out set was reportedly used to verify the model’s capabilities. Despite this, six mathematicians who contributed to the benchmark said they were unaware of OpenAI’s exclusive access to the data, and some indicated they might not have participated had they known about the arrangement.
Record-Breaking Results Under Fire
In December 2024, OpenAI announced that its o3 model achieved a groundbreaking 25% accuracy on the FrontierMath benchmark, compared to the previous high score of 2% by other models. The benchmark involves solving highly complex mathematical problems.
Besiroglu had earlier stated that the benchmark data was private and not used for training. However, a footnote in the updated FrontierMath research paper revealed OpenAI’s support in creating the benchmark.
AI experts, including Gary Marcus, have questioned OpenAI’s bold claims, citing a lack of independent validation of the results. Mikhail Samin, of the AI Governance and Safety Institute, criticized OpenAI’s history of secretive practices.
Disputes Over ARC-AGI Benchmark
OpenAI also claimed that the o3 model achieved nearly 90% accuracy on the ARC-AGI benchmark, outperforming human capabilities. However, François Chollet, creator of the ARC-AGI benchmark, dismissed these claims, stating that the model still struggles with simpler tasks in the benchmark.
Marcus echoed concerns about the lack of independent evaluations of o3’s robustness across diverse problem sets.
Launch of o3-Mini and Future Plans
Amid the controversy, OpenAI CEO Sam Altman announced the upcoming release of the o3-mini model, a smaller and more efficient version of the o3 model. The launch is scheduled to take place within two weeks on both the API and ChatGPT platforms.
The o3-mini is designed to provide advanced capabilities while being more resource-efficient. It features adjustable reasoning modes, allowing users to optimize performance for specific tasks.
Altman also hinted at future developments, including a potential merger of OpenAI’s GPT and o-series models by 2025. Expressing confidence in OpenAI’s progress, Altman said the company is well-positioned to be the first to achieve artificial general intelligence (AGI).
Developers and Customization Updates
OpenAI has introduced updates to its Realtime API, enabling developers to create voice applications using multi-agent flows in under 20 minutes. Additionally, ChatGPT’s custom instructions have been enhanced to allow users to specify how the AI interacts, including desired traits and response styles.
Besiroglu’s admission that OpenAI’s involvement was hidden due to contractual restrictions until after the o3 model’s release further fuels these concerns. This lack of openness damages not only the credibility of the benchmark but also trust in OpenAI’s research.