A pair of new studies presents a troubling contrast for OpenAI’s ChatGPT. Although its popular generative text responses are now all but indistinguishable from human answers, according to multiple studies and sources, GPT appears to be getting less accurate over time. Perhaps more distressingly, no one has a good explanation for the deterioration.

A team from Stanford and UC Berkeley noted in a research study published on Tuesday that ChatGPT’s behavior has noticeably changed over time—and not for the better. What’s more, researchers are somewhat at a loss for exactly why this deterioration in response quality is happening.

To examine the consistency of ChatGPT’s underlying GPT-3.5 and GPT-4 programs, the team tested the AI’s tendency to “drift,” i.e., to offer answers of varying quality and accuracy, as well as its ability to properly follow given commands. Researchers asked both GPT-3.5 and GPT-4 to solve math problems, answer sensitive and dangerous questions, visually reason from prompts, and generate code.

In their review, the team found that “Overall… the behavior of the ‘same’ LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLM quality.” For example, GPT-4 in March 2023 identified prime numbers with a nearly 98 percent accuracy rate. By June, however, GPT-4’s accuracy on the same task had reportedly cratered to less than 3 percent. GPT-3.5, meanwhile, improved at prime number identification between its March 2023 and June 2023 versions. Code generation got worse in both versions between March and June.
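
The kind of continuous monitoring the researchers call for can be illustrated with a minimal Python sketch. This is not the study's actual test harness: the two "snapshot" functions are hypothetical stand-ins for dated versions of a model, and the sample numbers are chosen only for illustration. The idea is to score each snapshot against a known ground truth on the same prime-identification task and report the change in accuracy.

```python
from typing import Callable, Iterable


def is_prime(n: int) -> bool:
    """Ground truth: trial division up to the square root of n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def accuracy(model: Callable[[int], bool], numbers: Iterable[int]) -> float:
    """Fraction of numbers on which the model's answer matches ground truth."""
    numbers = list(numbers)
    correct = sum(model(n) == is_prime(n) for n in numbers)
    return correct / len(numbers)


def drift(before: Callable[[int], bool],
          after: Callable[[int], bool],
          numbers: Iterable[int]) -> float:
    """Change in accuracy between two snapshots (negative means worse)."""
    numbers = list(numbers)
    return accuracy(after, numbers) - accuracy(before, numbers)


# Hypothetical stand-ins for two dated snapshots of the same service.
def snapshot_earlier(n: int) -> bool:
    return is_prime(n)  # assumed: answers every question correctly


def snapshot_later(n: int) -> bool:
    return False  # assumed: always answers "not prime"


if __name__ == "__main__":
    sample = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
    print("earlier accuracy:", accuracy(snapshot_earlier, sample))
    print("later accuracy:", accuracy(snapshot_later, sample))
    print("drift:", drift(snapshot_earlier, snapshot_later, sample))
```

In practice, each snapshot function would wrap a call to the model under test and parse its answer. Re-running the same fixed question set on a schedule is what would surface a change like the one the study reports.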

These discrepancies could have real-world effects, and soon. Earlier this month, a paper published in the journal JMIR Medical Education by a team of NYU researchers indicated that ChatGPT’s responses to healthcare-related queries are ostensibly indistinguishable from those of human medical professionals in tone and phrasing. The researchers presented 392 people with 10 patient questions and responses, half of which came from a human healthcare provider and half from OpenAI’s large language model (LLM). Participants had “limited ability” to distinguish human- and chatbot-penned responses. This comes amid increasing concerns about AI’s ability to handle medical data privacy, as well as its propensity to “hallucinate” inaccurate information.

Academics aren’t alone in noticing ChatGPT’s diminishing returns. As Business Insider noted on Wednesday, OpenAI’s developer forum has hosted an ongoing debate about the LLM’s progress, or lack thereof. “Has there been any official addressing of this issue? As a paying customer it went from being a great assistant sous chef to dishwasher. Would love to get an official response,” one user wrote earlier this month.

OpenAI’s LLM research and development is notoriously walled off from outside review, a strategy that has prompted intense pushback and criticism from industry experts and users. “It’s really hard to tell why this is happening,” tweeted Matei Zaharia, one of the ChatGPT quality review paper’s co-authors, on Wednesday. Zaharia, an associate professor of computer science at UC Berkeley and CTO of Databricks, went on to surmise that reinforcement learning from human feedback (RLHF) could be “hitting a wall” alongside fine-tuning, but also conceded that the cause could simply be bugs in the system.

So, while ChatGPT may pass rudimentary Turing Test benchmarks, its uneven quality still poses major challenges and concerns for the public, even as little stands in the way of the continued proliferation of such programs and their integration into daily life.