2023 marked the rise of generative AI, and 2024 could well be the year its makers reckon with the fallout of the industry-wide arms race. Currently, OpenAI is aggressively pushing back against recent lawsuits claiming that its products, including ChatGPT, are illegally trained on copyrighted texts. What’s more, the company is making some bold legal claims as to why its programs should have access to other people’s work.
In a blog post published on January 8, OpenAI accused The New York Times of “not telling the full story” in the media company’s major copyright lawsuit filed late last month. Instead, OpenAI argues its scraping of online works falls within the purview of “fair use.” The company additionally claims that it currently collaborates with various news organizations (excluding, among others, The Times) on dataset partnerships, and dismisses any “regurgitation” of outside copyrighted material as a “rare bug” it is working to eliminate. This is attributed to “memorization” issues that can be more common when content appears multiple times within training data, such as when it can be found on “lots of different public websites.”
“The principle that training AI models is permitted as a fair use is supported by a wide range of [people and organizations],” OpenAI representatives wrote in Monday’s post, linking out to recently submitted comments from several academics, startups, and content creators to the US Copyright Office.
In a letter of support filed by Duolingo, for example, the language learning software company wrote that it believes that “Output generated by an AI trained on copyrighted materials should not automatically be considered infringing—just as a work by a human author would not be considered infringing merely because the human author had learned how to write through reading copyrighted works.” (On Monday, Duolingo confirmed to Bloomberg it has laid off approximately 10 percent of its contractors, citing its increased reliance on AI.)
On December 27, The New York Times sued both OpenAI and Microsoft—which currently utilizes the former’s GPT in products like Bing—for copyright infringement. Court documents filed by The Times claim OpenAI trained its generative technology on millions of the publication’s articles without permission or compensation. Products like ChatGPT are now allegedly used in lieu of their source material, to the detriment of the media company. More readers opting for AI news summaries presumably means fewer readers subscribing to source outlets, argues The Times.
Meanwhile, OpenAI is lobbying government regulators over its access to copyrighted material. According to The Telegraph on January 7, a recent letter submitted by OpenAI to the UK House of Lords Communications and Digital Committee argues that access to copyrighted materials is vital to the company’s success and product relevancy.
“Because copyright today covers virtually every sort of human expression—including blog posts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today’s leading AI models without using copyrighted materials,” OpenAI wrote in the letter, while also contending that limiting training data to public domain work “might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.” The letter states that this is part of OpenAI’s “mission to ensure that artificial general intelligence benefits all of humanity.”
Meanwhile, some critics have swiftly mocked OpenAI’s claim that its program’s existence requires the use of others’ copyrighted work. On the social media platform Bluesky, historian and author Kevin M. Kruse likened OpenAI’s strategy to selling illegally obtained items in a pawn shop.
“Rough Translation: We won’t get fabulously rich if you don’t let us steal, so please don’t make stealing a crime!” AI expert Gary Marcus posted to X on Monday.