
A pair of new studies presents a troubling paradox for OpenAI’s ChatGPT large language models. Although its popular generative text responses are now all but indistinguishable from human answers, according to multiple studies and sources, GPT appears to be getting less accurate over time. Perhaps more distressing, no one has a good explanation for the deterioration.

A team from Stanford and UC Berkeley noted in a research study published on Tuesday that ChatGPT’s behavior has noticeably changed over time—and not for the better. What’s more, the researchers are somewhat at a loss to explain exactly why this deterioration in response quality is happening.

To examine the consistency of ChatGPT’s underlying GPT-3.5 and GPT-4 models, the team tested the AI’s tendency to “drift,” i.e., offer answers of varying quality and accuracy, as well as its ability to properly follow given commands. Researchers asked both GPT-3.5 and GPT-4 to solve math problems, answer sensitive and dangerous questions, visually reason from prompts, and generate code.

[Related: Big Tech’s latest AI doomsday warning might be more of the same hype.]

In their review, the team found that “Overall… the behavior of the ‘same’ LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLM quality.” For example, GPT-4 in March 2023 identified prime numbers with a nearly 98 percent accuracy rate. By June, however, GPT-4’s accuracy on the same task reportedly cratered to less than 3 percent. Meanwhile, GPT-3.5’s June 2023 version actually improved on prime number identification compared to its March 2023 release. Both models’ ability to generate computer code, however, got worse between March and June.
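To make the drift measurement concrete: the comparison essentially amounts to running an identical battery of prompts against each model snapshot and scoring the replies against ground truth computed locally. The sketch below is a hypothetical illustration of that setup, not the study’s actual code. The `ask_model` function is a stand-in for whichever chat API snapshot is being queried, and the prompt wording and sample range are arbitrary; a real evaluation would also need more robust parsing of the model’s answers.

```python
# Minimal sketch of a prime-identification drift check (illustrative only).
import random


def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def ask_model(prompt: str) -> str:
    """Placeholder for a call to the chat model under test
    (e.g., a request to whichever API snapshot is being evaluated)."""
    raise NotImplementedError


def prime_accuracy(numbers: list[int]) -> float:
    """Fraction of numbers the model classifies correctly as prime or composite."""
    correct = 0
    for n in numbers:
        reply = ask_model(f"Is {n} a prime number? Answer Yes or No.")
        model_says_prime = reply.strip().lower().startswith("yes")
        if model_says_prime == is_prime(n):
            correct += 1
    return correct / len(numbers)


# Example usage: fix one sample of odd numbers, then run the same battery
# against each model snapshot and compare the resulting accuracy scores.
sample = [random.randrange(1_000, 20_000) | 1 for _ in range(500)]
# march_accuracy = prime_accuracy(sample)  # against the March snapshot
# june_accuracy = prime_accuracy(sample)   # rerun later against the June snapshot
```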

These discrepancies could have real world effects—and soon. Earlier this month, a paper published in the journal JMIR Medical Education by a team of researchers from NYU indicated that ChatGPT’s responses to healthcare-related queries are nearly indistinguishable from those of human medical professionals in tone and phrasing. The researchers presented 392 people with 10 patient questions and responses, half of which came from a human healthcare provider and half from OpenAI’s large language model (LLM). Participants had “limited ability” to distinguish human- and chatbot-penned responses. This comes amid increasing concerns about AI’s ability to handle medical data privacy, as well as its propensity to “hallucinate” inaccurate information.

Academics aren’t alone in noticing ChatGPT’s diminishing returns. As Business Insider noted on Wednesday, OpenAI’s developer forum has hosted an ongoing debate about the LLM’s progress—or lack thereof. “Has there been any official addressing of this issue? As a paying customer it went from being a great assistant sous chef to dishwasher. Would love to get an official response,” one user wrote earlier this month.

[Related: There’s a glaring issue with the AI moratorium letter.]

OpenAI’s LLM research and development is notoriously walled off from outside review, a strategy that has prompted intense pushback and criticism from industry experts and users. “It’s really hard to tell why this is happening,” tweeted Matei Zaharia, one of the ChatGPT quality review paper’s co-authors, on Wednesday. Zaharia, an associate professor of computer science at UC Berkeley and CTO of Databricks, went on to surmise that reinforcement learning from human feedback (RLHF) could be “hitting a wall” alongside fine-tuning, but also conceded it could simply be bugs in the system.

So, while ChatGPT may pass rudimentary Turing Test benchmarks, its uneven quality still poses major challenges and concerns for the public—all while little stands in the way of its continued proliferation and integration into daily life.