High-profile A.I. chatbot ChatGPT performed worse on certain tasks in June than its March version, a Stanford University study found.
The study compared the performance of the chatbot, created by OpenAI, over several months at four “diverse” tasks: solving math problems, answering sensitive questions, generating software code, and visual reasoning.
Researchers found wild fluctuations, called drift, in the technology’s ability to perform certain tasks. The study looked at two versions of OpenAI’s technology over the time period: a version called GPT-3.5 and another known as GPT-4. The most notable results came from research into GPT-4’s ability to solve math problems. Over the course of the study, researchers found that in March GPT-4 correctly identified the number 17077 as prime 97.6% of the time it was asked. But just three months later, its accuracy plummeted to a lowly 2.4%. Meanwhile, the GPT-3.5 model had virtually the opposite trajectory. The March version got the answer to the same question right just 7.4% of the time, while the June version was consistently right, answering correctly 86.8% of the time.
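For readers who want to check the arithmetic behind that test question, here is a minimal Python sketch. It confirms by trial division that 17077 is prime and shows how an accuracy figure like those above could be computed from repeated answers. The function names and the short list of sample answers are hypothetical placeholders, not the study's code or data.

```python
def is_prime(n: int) -> bool:
    """Trial division: test odd divisors up to the square root of n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def accuracy(answers: list[bool], truth: bool) -> float:
    """Fraction of answers that match the known correct answer."""
    return sum(a == truth for a in answers) / len(answers)


truth = is_prime(17077)  # ground truth for "Is 17077 a prime number?"

# Hypothetical yes/no replies from a chatbot, made up for illustration only.
sample_answers = [True, True, False, True]

print(truth)
print(accuracy(sample_answers, truth))
```

Running the same scoring on snapshots of a model taken months apart is one simple way to put a number on the kind of drift described here.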
Similarly varying results happened when the researchers asked the models to write code and to do a visual reasoning test that asked the technology to predict the next figure in a pattern.
James Zou, a Stanford computer science professor who was one of the study’s authors, says the “magnitude of the change” was unexpected from the “sophisticated ChatGPT.”
The vastly different results from March to June and between the two models reflect not so much each model’s accuracy in performing specific tasks, but rather the unpredictable effects of changes in one part of a model on others.
“When we are tuning a large language model to improve its performance on certain tasks, that can actually have a lot of unintended consequences, which might actually hurt this model’s performance on other tasks,” Zou said in an interview with Fortune. “There’s all sorts of interesting interdependencies in how the model answers things which can lead to some of the worsening behaviors that we observed.”
The exact nature of these unintended side effects is still poorly understood because researchers and the public alike have no visibility into the models powering ChatGPT. It’s a reality that has only become more acute since OpenAI decided to backtrack on plans to make its code open source in March. “These are black box models,” Zou says. “So we don’t actually know how the model itself, the neural architectures, or the training data have changed.”