OpenAI wants to devour a huge chunk of the internet. Who’s going to stop them?

The AI giant plans to buy WordPress and Tumblr data to train ChatGPT. What could go wrong?
WordPress supports around 43 percent of the internet you're most likely to see. DepositPhotos


You probably don’t know about Automattic, but they know you.

Automattic is the parent company of WordPress, whose content management systems host around 43 percent of the internet's 10 million most popular websites. It also owns a vast suite of mega-platforms including Tumblr, where a massive number of embarrassing personal posts live. All this is to say that, through countless Terms & Conditions and third-party consent forms, Automattic potentially has access to a huge chunk of the internet's content and data.

[Related: OpenAI’s Sora pushes us one mammoth step closer towards the AI abyss.]

According to a 404 Media report earlier this week, Automattic is finalizing deals with OpenAI and Midjourney to provide a large share of that information for their ongoing artificial intelligence training pursuits. Most people see the results in chatbots, since tech companies need the text of millions of websites to train the conversational abilities of large language models. But the data can also be used to train facial recognition algorithms on your selfies, or to improve image and video generation capabilities by analyzing original artwork you uploaded online. It's hard to know exactly what data is used, and how much, since companies like Midjourney and OpenAI keep their products a black box, and this imminent business deal is no exception.

So, what if you wanna opt out of ChatGPT devouring your confessional microblog entries or daily workflows? Good luck with that.

When asked to comment, a spokesperson for Automattic directed PopSci to its “Protecting User Choice” page, published Tuesday afternoon after 404 Media’s report. The page attempts to offer you a number of assurances. There’s now a privacy setting to “discourage” search engines from indexing sites on WordPress.com and Tumblr, and Automattic promises to “share only public content” hosted on those platforms. Additional opt-out settings will also “discourage” AI companies from trawling data, and Automattic plans to regularly update its partners on which users “newly opt out,” so that their content can be removed from future training runs and past source sets.

There is, however, one little caveat to all this:

“Currently, no law exists that requires crawlers to follow these preferences,” says Automattic.
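The “preferences” Automattic describes are the same kind of voluntary signal the web has long used for search crawlers, most familiarly robots.txt: a site publishes rules, and each crawler decides whether to honor them. Below is a minimal sketch of that mechanism in Python, using an illustrative robots.txt rather than Automattic’s actual settings; “GPTBot” is the user agent OpenAI has documented for its crawler, while the other bot name is hypothetical.

```python
# A minimal sketch of how voluntary crawler "preferences" work, using the
# standard robots.txt convention. These directives are illustrative only,
# not Automattic's actual configuration.
from urllib import robotparser

# Example robots.txt a site might publish to discourage an AI crawler.
# "GPTBot" is OpenAI's documented crawler user agent; "SomeSearchBot"
# is a hypothetical stand-in for any other bot.
robots_txt = """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(robots_txt)

# A well-behaved crawler checks the rules before fetching a page...
print(parser.can_fetch("GPTBot", "https://example.com/post/123"))         # False
print(parser.can_fetch("SomeSearchBot", "https://example.com/post/123"))  # True

# ...but nothing in the protocol compels that check. A crawler that
# ignores it can still request the same URL.
```

The check is entirely on the honor system; a crawler that skips it can fetch the same pages anyway, which is exactly the gap Automattic’s caveat acknowledges.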

“From what I have seen, I’m not exactly sure what could be shared with AI,” says Erin Coyle, an associate professor of media and communication at Temple University. “We do have a confusing landscape right now, in terms of what data privacy rights people have.”

To Coyle, nebulous access to copious amounts of online user information “absolutely speaks” to an absence of cohesive privacy legislation in the US. One of the biggest challenges impeding progress is that laws, by and large, are reactive rather than preventative.


“It’s really hard for legislators to get ahead of the developments, especially in technology,” she adds. “While there are arguments to be made for them to be really careful and cautious… it’s also very challenging in times like this, when the technology is developing so rapidly.”

As companies like OpenAI, Google, and Meta continue their AI arms race, it’s the everyday people providing the bulk of the internet’s content—both public and private—who are caught in the middle. Clicking “Yes” to the manifesto-length terms and conditions prefacing almost every app, site, or social media platform is often the only way to access those services.

“Everything is about terms of service, no matter what website we’re talking about,” says Christopher Terry, a University of Minnesota journalism professor focused on regulatory and legal analysis of media ownership, internet policy, and political advertising.

Speaking to PopSci, Terry explains that basically every terms of service agreement you have signed online is a legally binding contract with whoever runs the website. Delve deep enough into the legalese, and “you’re gonna see you agreed to give them, and allow them to use, the data that you generate… you allowed them to monetize that.”

Of course, when was the last time you actually read any of those annoying pop-ups?

“There is no data privacy in general,” Terry says.

“With the digital lives that we have been living for decades, people have been sharing so much information… without really knowing what happens to that information,” Coyle adds. “A lot of us signed those agreements without any idea of where AI would be today.”

And all it takes to sign away your data for potential AI training is a simple Terms of Service update notification—another pop-up that, most likely, you didn’t read before clicking “Agree.”

You either opt out, or you’re in

Should Automattic complete its deal with OpenAI, Midjourney, or any other AI company, some of those very same update alerts will likely pop up across millions of email inboxes and websites, and most people will reflexively shoo them away. But according to some researchers, even offering voluntary opt-outs in such situations isn’t enough.

“It is highly probable that the majority of users will have no idea that this is an option and/or that the partnership with OpenAI/Midjourney is happening,” Alexis Shore, a Boston University researcher focused on technology policy and communication studies, writes to PopSci. “In that sense, giving users this opt-out option, when the default settings allow for AI crawling, is rather pointless.”


Experts like Shore and Coyle think one potential solution is a reversal in approach—changing voluntary opt-outs to opt-ins, as is increasingly the case for internet users in the EU thanks to its General Data Protection Regulation (GDPR). Unfortunately, US lawmakers have yet to make much progress on anything approaching that level of oversight.

The next option, should you have enough evidence to make your case, is legal action. And while copyright infringement lawsuits continue to mount against companies like OpenAI, it will be years before their legal precedents are established. By then, it’s anyone’s guess what the AI industry will have done to the digital landscape, and your privacy. Terry compares the moment to a 19th-century gold rush.

“They’re going all in on it right now while they still can,” he says. “You’re going out there to stake out your claim right now, and you’re pouring everything you can into that machine so that later, when that’s a [legal] problem, it’s already done.”

Neither OpenAI nor Midjourney had responded to multiple requests for comment at the time of writing.

 
