Amid the rapid adoption of generative AI programs, many educators have voiced concerns about students misusing the systems to ghostwrite their written assignments. It didn’t take long for multiple digital “AI detection” tools to arrive on the scene, many of which claimed to accurately distinguish original human writing from text authored by large language models (LLMs) such as OpenAI’s ChatGPT. But a new study indicates that such solutions may only create more headaches for both teachers and students. The authors found that these AI detection tools are severely biased against non-native English speakers and frequently misjudge their writing.
A Stanford University team led by senior author James Zou, an assistant professor of Biomedical Data Science, as well as Computer Science and Electrical Engineering, recently amassed 91 non-native English speakers’ essays written for the popular Test of English as a Foreign Language (TOEFL) assessment. They then fed the essays into seven GPT detector programs. According to Zou’s results, over half of the writing samples were misclassified as AI-authored, while native speaker sample detection remained nearly perfect.
“This raises a pivotal question: if AI-generated content can easily evade detection while human text is frequently misclassified, how effective are these detectors truly?” asks Zou’s team in a paper published on Monday in the journal Patterns.
The main issue stems from what’s known as “text perplexity,” a measure of how surprising or unpredictable a text’s word choices are to a language model. AI programs like ChatGPT tend to produce “low perplexity” text because they are built to mimic more generalized human speech patterns. Of course, this poses a potential problem for any human writer who happens to use more standardized, common sentence structures and word choices. “If you use common English words, the detectors will give a low perplexity score, meaning my essay is likely to be flagged as AI-generated,” said Zou in a statement. “If you use complex and fancier words, then it’s more likely to be classified as ‘human written’ by the algorithms.”
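For readers curious about the arithmetic, here is a minimal sketch of how perplexity is commonly computed: the exponential of the average negative log-probability that a language model assigns to each word. The probabilities below are invented for illustration and the function name is hypothetical. This is not the method of any particular detector in the study.

```python
import math

def perplexity(token_probs):
    """Return exp(mean negative log-probability) for a sequence of tokens.

    token_probs: the probability a language model assigned to each token
    that actually appeared in the text (hypothetical values here).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# Made-up probabilities: common wording is easy for a model to predict,
# unusual wording is not.
common_wording = [0.5, 0.4, 0.5, 0.4]
unusual_wording = [0.05, 0.02, 0.1, 0.01]

print(perplexity(common_wording))   # lower score: reads as more "predictable"
print(perplexity(unusual_wording))  # higher score: reads as more "surprising"
```

A detector that leans on this score would treat the first, more predictable sequence as more machine-like, which is the pattern Zou describes for writers who favor common words.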
Zou’s team then went a step further to test the detection programs’ limits by feeding those same 91 essays into ChatGPT and asking the LLM to punch up the writing. Those more “sophisticated” edits were then run back through the seven detection programs, only to have many of them reclassified as written by humans.
So, while AI-generated written content often isn’t great, neither apparently are the currently available tools to identify it. “The detectors are just too unreliable at this time, and the stakes are too high for the students, to put our faith in these technologies without rigorous evaluation and significant refinements,” Zou recently argued. Regardless of his statement’s perplexity rating, it’s a sentiment that’s hard to refute.