The New York Times is the latest to go to battle against AI scrapers

The development adds to the growing pile of lawsuits and pushback that AI makers are facing from copyright owners.
Image: the New York Times building. The NYT has provided valuable training data for generative AI. Credit: Marco Lenti


The magic of generative artificial intelligence projects like ChatGPT and Bard relies on data scraped from the open internet. But now the sources of training data for these models are starting to close up. The New York Times has banned the use of any content on its website to develop AI models like OpenAI’s GPT-4, Google’s PaLM 2, and Meta’s Llama 2, according to a report last week by Adweek.

Earlier this month, the Times updated its terms of service to explicitly exclude its content from being scraped to train “a machine learning or artificial intelligence (AI) system.” While this won’t affect the current generation of large language models (LLMs), it will, provided tech companies respect the prohibition, prevent Times content from being used to develop future models. 

The Times’ updated terms of service ban the use of any of its content—including text, images, audio and video clips, “look and feel,” and metadata—to develop any kind of software, including AI. They also explicitly prohibit using “robots, spiders, scripts, service, software or any manual or automatic device, tool, or process” to scrape its content without prior written consent. It’s broad language, and breaking these terms apparently “may result in civil, criminal, and/or administrative penalties, fines, or sanctions against the user and those assisting the user.” 

Given that content from the Times has been used as a major source of training data for the current generation of LLMs, it makes sense that the paper is trying to control how its data is used going forward. According to a Washington Post investigation earlier this year, the Times was the fourth-largest source of content for one of the major datasets used to train LLMs. The Post analyzed Google’s C4 dataset, a modified version of Common Crawl that includes content scraped from more than 15 million websites. Only Google Patents, Wikipedia, and Scribd (an ebook library) contributed more content to the dataset. 

Despite the prevalence of its content in training data, the Times has “decided not to join” a group of media companies, including the Wall Street Journal, that is attempting to jointly negotiate an AI policy with tech companies, Semafor reported this week. Seemingly, the paper intends to make its own arrangements, like the Associated Press (AP), which struck a two-year deal with OpenAI last month allowing the ChatGPT maker to use some of the AP’s archives from as far back as 1985 to train future AI models. 

Although there are multiple lawsuits pending against AI makers like OpenAI and Google over their use of copyrighted materials to train their current LLMs, the genie is already out of the bottle. The training data has been used, and because the models encode it as layers of learned parameters rather than as retrievable documents, it can’t easily be removed from ChatGPT, Bard, and the other available LLMs. Instead, the fight is now over access to training data for future models—and, in many cases, over who gets compensated. 

[Related: Zoom could be using your ‘content’ to train its AI]

Earlier this year, Reddit, which is also a large and unwitting contributor of training data to AI models, shut down free access to its API for third-party apps in an attempt to charge AI companies for future access. This move prompted protests across the site. Elon Musk similarly cut OpenAI’s access to Twitter (sorry, X) over concerns that the company wasn’t paying enough to use its data. In both cases, the sticking point was that AI makers could turn a profit from the social networks’ content (even though that content is actually generated by users, not the platforms themselves).

Given all this, it’s noteworthy that last week OpenAI quietly released details on how to block its web crawler, GPTBot, by adding a couple of lines to a site’s robots.txt file—the plain-text file most websites use to give instructions to search engines and other web crawlers (an example is shown at the end of this article). While the Times has blocked Common Crawl’s scraping bot, it hasn’t yet blocked GPTBot in its robots.txt file.

Whatever way you look at it, the world is still reeling from the sudden explosion of powerful AI models over the past 18 months. There is a lot of legal wrangling yet to happen over how data can be used to train them going forward—and until laws and policies are put in place, things are going to be very uncertain.
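For publishers that do want to opt out today, the change is short. According to OpenAI’s documentation, its crawler identifies itself with the user agent token GPTBot, and Common Crawl’s crawler identifies itself as CCBot. A robots.txt entry blocking both would look roughly like this (a minimal sketch; check each crawler’s documentation before relying on it):

```
# Block OpenAI's GPTBot from the entire site
User-agent: GPTBot
Disallow: /

# Block Common Crawl's CCBot as well
User-agent: CCBot
Disallow: /
```

Compliance is voluntary, of course: robots.txt is a convention that well-behaved crawlers choose to honor, not an enforcement mechanism—which is part of why publishers like the Times are also leaning on their terms of service and, potentially, the courts.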