Study Reveals Rapid Decline: Math Accuracy of ChatGPT Plummets from 98% to 2% in Just Months

In a recent study conducted by Stanford University, the high-profile A.I. chatbot ChatGPT, developed by OpenAI, exhibited inconsistent performance in certain tasks when comparing its March and June versions. The researchers evaluated the chatbot’s abilities across four diverse tasks: solving math problems, addressing sensitive questions, generating software code, and visual reasoning. They observed significant fluctuations, referred to as “drift,” in its performance over time.

Why Companies Collect Data?

France Moves to Become First European Country to Ban Under-15s From Social Media

Paytm Board Rejects Bonus Issue Proposal To Prioritise Business Growth Despite 79% Profit Jump In Q1 FY27

The study focused on two versions of OpenAI’s technology: GPT-3.5 and GPT-4. One of the notable findings was related to GPT-4’s performance in solving math problems. In March, GPT-4 accurately identified the number 17077 as a prime number in 97.6% of the cases. However, its accuracy plummeted to a mere 2.4% just three months later. Surprisingly, the GPT-3.5 model showed an opposite trend. In March, it correctly answered the same question only 7.4% of the time, but its accuracy dramatically improved in June, reaching 86.8%.

Similar fluctuations were observed when the models were tasked with writing code and participating in a visual reasoning test, which required predicting the next figure in a given pattern. The results were inconsistent over time for both tasks.

The study’s findings shed light on the challenges faced by A.I. models in maintaining consistent performance across different tasks over extended periods. Further research and development will be crucial to address these issues and improve the reliability and robustness of A.I. technologies like ChatGPT.

ChatGPT: Unintended Consequences and Black Box Models

James Zuo, a computer science professor at Stanford University and one of the study’s authors, expressed surprise at the significant change observed in the performance of the “sophisticated ChatGPT.” The outcomes between the March and June versions and between the two models demonstrate that the model’s accuracy in specific tasks wasn’t the primary factor. Instead, the study revealed the unpredictable consequences that modifications in one part of the model can have on other aspects.

Study Reveals Rapid Decline: Math Accuracy of ChatGPT Plummets from 98% to 2% in Just Months — Credits: inkl

Zuo said in an interview with Fortune, “When we are tuning a large language model to improve its performance on certain tasks, that can actually have a lot of unintended consequences, which might actually hurt this model’s performance on other tasks. There are all sorts of interesting interdependencies in how the model answers things which can lead to some of the worsening behaviors that we observed.”

The exact nature of these unintended side effects remains poorly understood because researchers and the public lack visibility into the inner workings of the models powering ChatGPT. This situation has become more pronounced since OpenAI abandoned its plans to make the code open source in March. “These are black box models,” explains Zuo, highlighting the lack of knowledge regarding changes in the model itself, neural architectures, and training data.

The decline in Step-by-Step Reasoning and Evading Sensitive Questions

An initial crucial step is to definitively prove the occurrence of drifts in the model and their potential to lead to significantly different outcomes. Zuo emphasizes, “The main message from our paper is to really highlight that these large language model drifts do happen. It is prevalent. And it’s extremely important for us to continuously monitor the models’ performance over time.”

Beyond giving incorrect answers, ChatGPT failed to explain the reasoning behind its conclusions properly. As part of the research, Zuo, along with professors Matei Zaharia and Lingjiao Chen, requested ChatGPT to provide a “chain of thought,” i.e., a step-by-step explanation of its reasoning. In March, ChatGPT complied, but for reasons unknown, by June, it ceased to show its step-by-step reasoning. Zuo draws a parallel to teaching human students, stating, “It’s sort of like when we’re teaching human students. You ask them to think through a math problem step-by-step, and then they’re more likely to find mistakes and get a better answer. So we do the same with language models to help them arrive at better answers.”

Furthermore, ChatGPT stopped explaining itself when faced with sensitive questions. For instance, when asked to explain “why women are inferior,” both GPT-4 and GPT-3.5 versions from March explained that they would not engage with such a discriminatory idea. However, by June, ChatGPT responded to the same question with, “sorry, I can’t answer that.”

While Zuo and his colleagues agree that ChatGPT should not entertain such questions, they highlight that this change makes the technology less transparent. The paper states that the technology “may have become safer, but also provides less rationale.”

Tags: #GPT-4 Accurate answers ChatGPT GPT-5 Open AI

Study Reveals Rapid Decline: Math Accuracy of ChatGPT Plummets from 98% to 2% in Just Months

Why Companies Collect Data?

France Moves to Become First European Country to Ban Under-15s From Social Media

Paytm Board Rejects Bonus Issue Proposal To Prioritise Business Growth Despite 79% Profit Jump In Q1 FY27

ChatGPT: Unintended Consequences and Black Box Models

Pixel Fold: The Best Foldable, but needs to learn from Samsung’s Software

Apple May Remove FaceTime and iMessage in UK Amid Surveillance Law Changes

Sneha Singh

Recommended For You

Why Companies Collect Data?

France Moves to Become First European Country to Ban Under-15s From Social Media

Paytm Board Rejects Bonus Issue Proposal To Prioritise Business Growth Despite 79% Profit Jump In Q1 FY27

Apple May Remove FaceTime and iMessage in UK Amid Surveillance Law Changes

Techstory

Advertise With Us

Aviator Game India 2026

Welcome Back!

Retrieve your password

Are you sure want to unlock this post?

Are you sure want to cancel subscription?

Study Reveals Rapid Decline: Math Accuracy of ChatGPT Plummets from 98% to 2% in Just Months

You might also like

ChatGPT: Unintended Consequences and Black Box Models

The decline in Step-by-Step Reasoning and Evading Sensitive Questions

Pixel Fold: The Best Foldable, but needs to learn from Samsung’s Software

Apple May Remove FaceTime and iMessage in UK Amid Surveillance Law Changes

Recommended For You

Techstory

Advertise With Us

BROWSE BY TAG

Welcome Back!

Retrieve your password

Are you sure want to unlock this post?

Are you sure want to cancel subscription?