Meta releases new anti-bias AI training dataset

With the likes of OpenAI’s ChatGPT and Google’s Bard, tech industry leaders are continuing to push their (sometimes controversial) artificial intelligence systems alongside AI-integrated products to consumers. Still, many privacy advocates and tech experts remain concerned about the massive datasets used to train such programs, especially when it comes to issues like data consent and compensation from users, informational accuracy, as well as algorithmically enforced racial and socio-political biases.

Meta hoped to help mitigate some of these concerns via Thursday’s release of Casual Conversations v2, an update to its 2021 AI audio-visual training dataset. Guided by a publicly available November literature review, the data offers more nuanced analysis of human subjects across diverse geographic, cultural, racial, and physical demographics, according to the company’s statement.

Meta states v2 is “a more inclusive dataset to measure fairness,” and is derived from 26,467 video monologues recorded in seven countries, offered by 5,567 paid participants from Brazil, India, Indonesia, Mexico, Vietnam, Philippines, and the United States who also provided self-identifiable attributes including age, gender, and physical appearance. Although Casual Conversations’ initial release included over 45,000 videos, they were drawn from just over 3,000 individuals residing in the US and self-identifying via fewer metrics.

Tackling algorithmic biases in AI is a vital hurdle in an industry long plagued by AI products offering racist, sexist, and otherwise inaccurate responses. Much of this comes down to how algorithms are created, cultivated, and provided to developers.

But while Meta touts Casual Conversations v2 as a major step forward, experts remain cautiously optimistic, and urge continued scrutiny for Silicon Valley’s seemingly headlong rush into an AI-powered ecosystem.

“This is [a] space where almost anything is an improvement,” Kristian Hammond, a professor of computer science at Northwestern University and director of the school’s Center for Advancing the Safety of Machine Intelligence, writes in an email to PopSci. Hammond believes Meta’s updated dataset is “a solid step” for the company—especially considering past privacy controversies—and feels its emphasis on user consent and research participants’ labor compensation is particularly important.

“But an improvement is not a full solution. Just a step,” he cautions.

To Hammond, a major question remains regarding exactly how researchers enlisted participants in making Casual Conversations v2. “Having gender and ethnic diversity is great, but you also have to consider the impact of income and social status and more fine-grained aspects of ethnicity,” he writes, adding, “There is bias that can flow from any self-selecting population.”

[Related: The FTC has its eyes on AI scammers.]

When asked about how participants were selected, Nisha Deo of Meta’s AI Communications team told PopSci via email, “I can share that we hired external vendors with our requirements to recruit participants,” and that compensatory rates were determined by these vendors “having the market value in mind for data collection in that location.”

When asked to provide concrete figures regarding pay rates, Meta stated it was “[n]ot possible to expand more than what we’ve already shared.”

Deo, however, additionally stated Meta deliberately incorporated “responsible mechanisms” across every step of data cultivation, including a comprehensive literature review in collaboration with academic partners at Hong Kong University of Science and Technology on existing dataset methodologies, as well as comprehensive guidelines for annotators. “Responsible AI built this with ethical considerations and civil rights in mind and are open sourcing it as a resource to increase inclusivity efforts in AI,” she continued.

For industry observers like Hammond, improvements such as Casual Conversations v2 are welcome, but far more work is needed, especially when the world’s biggest tech companies appear to be entering an AI arms race. “Everyone should understand that this is not the solution altogether. Only a set of first steps,” he writes. “And we have to make sure that we don’t get so focused on this very visible step… that we stop poking at organizations to make sure that they aren’t still gathering data without consent.”