Polite warnings are surprisingly good at reducing hate speech on social media

NYU researchers tested a set of warnings that involved tagging the accounts using hateful language in their tweets.

Hate speech is a sprawling, toxic problem that plagues social media. The Council on Foreign Relations says that it even percolates into real violence against minorities, a serious issue that governments around the world are racking their brains to solve. Many tech companies have been thinking up new ways to stem its spread, but it’s a difficult and complicated task.

Researchers from NYU’s Center for Social Media and Politics had an idea: What if you tracked the followers of accounts that were banned for hate speech, and sent a warning to those followers who also used hate speech in their own tweets? Would these users be pressured into changing what they posted? It turns out the answer is yes, at least for a brief time after receiving the warning. The researchers’ findings were published Monday in the journal Perspectives on Politics.

“One of the tradeoffs that we always face in these public policy conversations about whether to suspend accounts is what’s going to happen to these people on other platforms,” says Joshua Tucker, co-director of NYU’s Center for Social Media and Politics and an author on the paper. “There’s been more recent research showing that after a bunch of right-wing white nationalists in Britain were suspended, there was a big uptick in the amount of activity among these groups on Telegram.” 


They wanted to come up with a solution that would hit the “sweet spot” where the accounts wouldn’t necessarily be banned, but would receive some kind of push to stop them from using hate speech, says Mikdat Yildirim, a PhD student at NYU and the first author on this study. This way, the intervention would “not limit their rights to express themselves, and also prevent them from migrating to more radical platforms.” In other words, it was a warning, not a silencing.  

Collecting “suspension candidates” 

The plan? Create a set of six Twitter accounts that operated like virtual vigilante patrollers, finding, tagging, and calling out offenders on their public feeds. The warnings these accounts posted shared a similar structure: each tagged the full username of an account that had used hate speech, told the user that an account they followed had recently been suspended for similar language, and cautioned that they could be next if they kept tweeting that way. Each account worded its warning slightly differently.

But first, the researchers had to identify the potential offenders who were likely to get suspended. On July 21, 2020, the team downloaded more than 600,000 tweets that had been posted during the previous week and narrowed them down to tweets that contained at least one word from hateful language dictionaries used in previous research (these were centered around racial or sexual hate). They followed around 55 accounts and were able to gather the follower lists of 27 of them before those accounts were suspended.

“We didn’t send these messages to all of their followers; we sent these messages to only those followers who employed hate speech in more than 3 percent of their tweets,” Yildirim explains. That resulted in a total of around 4,400 users who became part of the study. Out of these, 700 were in a control group that received no warnings at all, and 3,700 users were warned by one of the six researcher-run Twitter accounts. 
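The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' actual code: the word list `HATE_TERMS` is a made-up placeholder (the article does not reproduce the dictionaries), the function names are hypothetical, and a tweet counts as hateful here if it contains at least one dictionary word.

```python
import re

# Hypothetical placeholder terms; the study's real dictionaries are not
# given in the article.
HATE_TERMS = {"slur_a", "slur_b"}


def contains_hate_term(text, terms=HATE_TERMS):
    """Return True if any dictionary word appears as a whole word in the tweet."""
    words = re.findall(r"[a-z_']+", text.lower())
    return any(word in terms for word in words)


def hateful_share(tweets, terms=HATE_TERMS):
    """Fraction of a user's tweets that contain at least one dictionary word."""
    if not tweets:
        return 0.0
    flagged = sum(contains_hate_term(tweet, terms) for tweet in tweets)
    return flagged / len(tweets)


def select_candidates(tweets_by_user, threshold=0.03, terms=HATE_TERMS):
    """Keep users whose hateful share exceeds the threshold (3 percent here)."""
    return sorted(
        user
        for user, tweets in tweets_by_user.items()
        if hateful_share(tweets, terms) > threshold
    )
```

A whole-word dictionary match is a blunt instrument: it cannot tell a slur from a quotation or a reclaimed use, so a real pipeline would likely need more care than this sketch shows.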


“We received more than 200 reactions to our tweets. Some of them were kinda angry with our warning, thinking that they have the right to express themselves however they wanted,” says Yildirim. “Some of them were careful in the sense that they wanted to know which tweets caused us to send them these warnings.”


For each user in the study, the researchers measured the ratio of hateful to non-hateful tweets at four points: one month before the warning, one week before, one week after, and one month after.
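To make that measurement concrete, here is a minimal sketch of how the four windows might be computed for a single user. Several details are assumptions, since the article does not spell them out: a "month" is taken as 30 days, each window is measured outward from the moment of the warning, and the figure computed is the share of hateful tweets among all tweets in the window (which avoids dividing by zero when a user posts no non-hateful tweets). The function names are hypothetical.

```python
from datetime import datetime, timedelta


def hateful_ratio(tweets, start, end):
    """Share of hateful tweets with start <= timestamp < end.

    `tweets` is a list of (timestamp, is_hateful) pairs.
    Returns None if the user posted nothing in the window.
    """
    window = [is_hateful for ts, is_hateful in tweets if start <= ts < end]
    if not window:
        return None
    return sum(window) / len(window)


def ratios_around_warning(tweets, warned_at,
                          week=timedelta(days=7), month=timedelta(days=30)):
    """Hateful-tweet share in the four windows around a warning."""
    return {
        "month_before": hateful_ratio(tweets, warned_at - month, warned_at),
        "week_before": hateful_ratio(tweets, warned_at - week, warned_at),
        "week_after": hateful_ratio(tweets, warned_at, warned_at + week),
        "month_after": hateful_ratio(tweets, warned_at, warned_at + month),
    }
```

Comparing these per-user figures between the warned group and the control group is what would yield a difference like the one the researchers report.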

“What we found out is that our treatment group, on average, tweeted around 10 percent less tweets that contained hateful language than our control group,” Yildirim says. Moreover, the most politely worded warnings had the largest effect in reducing hateful language. “But of course, one month after the treatment, the effects dissipated.”

It would be great if the effect lasted longer, Tucker notes. But seeing any effect after a single tweet from an account that the users didn’t even follow was kind of “incredible.”

“It might be the case that all that really matters is people realizing that somebody’s watching them, or somebody says something to them,” he says. “Or, the powerful treatment is just really learning that an account you chose to follow was suspended for something you’re doing and that’s kind of a wakeup call for people.”

Twitter itself has been actively trying to combat hate speech on its platform. Earlier this year, the company debuted a feature that warns users before they tweet something potentially harmful or offensive. “But we don’t know the effectiveness of that against our experiment,” says Tucker. “Maybe they tried our experiment and never reported it to anybody.”

The fact is, it’s hard to gauge the benefits of certain features without extensive research. The researchers might have observed dramatically different effects if Twitter itself had been running these accounts, says Tucker. That’s why it’s “critical for social media companies to be more transparent about the research they’re running internally,” he says, “but also share more data with independent researchers and participate in projects with independent researchers where the research design is collaborative.”