The magic of generative artificial intelligence projects like ChatGPT and Bard relies on data scraped from the open internet. But now, the sources of training data for these models are starting to close up. The New York Times has banned any of the content on its website from being used to develop AI models like OpenAI’s GPT-4, Google’s PaLM 2, and Meta’s Llama 2, according to a report last week by Adweek.
Earlier this month the Times updated its terms of service to explicitly exclude its content from being scraped to train “a machine learning or artificial intelligence (AI) system.” While this won’t affect the current generation of large language models (LLMs), if tech companies respect the prohibition, it will prevent content from the Times being used to develop future models.
The Times’ updated terms of service ban using any of its content (including text, images, audio and video clips, “look and feel,” and metadata) to develop any kind of software, including AI. They also explicitly prohibit using “robots, spiders, scripts, service, software or any manual or automatic device, tool, or process” to scrape its content without prior written consent. It’s pretty broad language, and apparently breaking these terms of service “may result in civil, criminal, and/or administrative penalties, fines, or sanctions against the user and those assisting the user.”
Given that content from the Times has been used as a major source of training data for the current generation of LLMs, it makes sense that the paper is trying to control how its data is used going forward. According to a Washington Post investigation earlier this year, the Times was the fourth-largest source of content for one of the major databases used to train LLMs. The Post analyzed Google’s C4 dataset, a modified version of Common Crawl that includes content scraped from more than 15 million websites. Only Google Patents, Wikipedia, and Scribd (an ebook library) contributed more content to the database.
Despite the prevalence of its content in training data, the Times has “decided not to join” a group of media companies, including the Wall Street Journal, in an attempt to jointly negotiate an AI policy with tech companies, Semafor reported this week. Seemingly, the paper intends to make its own arrangements, as the Associated Press (AP) did when it struck a two-year deal with OpenAI last month that would allow the ChatGPT maker to use some of the AP’s archives from as far back as 1985 to train future AI models.
Although there are multiple lawsuits pending against AI makers like OpenAI and Google over their use of copyrighted materials to train their current LLMs, the genie is really out of the bottle. The training data has now been used and, since the models themselves consist of layers of complex algorithms, can’t easily be removed or discounted from ChatGPT, Bard, and the other available LLMs. Instead, the fight is now over access to training data for future models—and, in many cases, who gets compensated.
Earlier this year, Reddit, which is also a large and unwitting contributor of training data to AI models, shut down free access to its API for third-party apps in an attempt to charge AI companies for future access. This move prompted protests across the site. Elon Musk similarly cut OpenAI’s access to Twitter (sorry, X) over concerns that OpenAI wasn’t paying enough to use its data. In both cases, the objection was that AI makers could turn a profit from the social networks’ content (even though that content is actually generated by users).
Given all this, it’s noteworthy that last week OpenAI quietly released details on how to block its web scraping GPTBot by adding a short rule to the robots.txt file, the set of instructions most websites have for search engines and other web crawlers. While the Times has blocked the Common Crawl web scraping bot, it hasn’t yet blocked GPTBot in its robots.txt file.
Whatever way you look at things, the world is still reeling from the sudden explosion of powerful AI models over the past 18 months. There is a lot of legal wrangling yet to happen over how data is used to train them going forward, and until laws and policies are put in place, things are going to be very uncertain.
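For readers curious what blocking a crawler through robots.txt looks like in practice, here is a minimal sketch in Python. The robots.txt contents are a made-up example rather than the file of the Times or any other real site, and the helper name `is_blocked` and the `example.com` URL are hypothetical. The sketch assumes the crawler identifies itself as `GPTBot` and that Common Crawl's bot identifies itself as `CCBot`, and it only shows what a well-behaved crawler would conclude, since nothing in robots.txt technically forces a bot to comply.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration only; not any real site's file.
# The first group tells the crawler named GPTBot to stay out of the
# whole site; the second group lets every other crawler in.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if the rules tell `user_agent` not to fetch `url`.

    `is_blocked` is a hypothetical helper for this sketch; it wraps the
    standard library's robots.txt parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot"):
        status = "blocked" if is_blocked(EXAMPLE_ROBOTS_TXT, agent) else "allowed"
        print(f"{agent}: {status}")
```

In this made-up file, GPTBot is turned away while CCBot falls through to the catch-all rule and is let in, the mirror image of the situation described for the Times, which has blocked Common Crawl's bot but not GPTBot.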