
Text-to-image generators powered by artificial intelligence, like DALL-E 2 and Stable Diffusion, have had a huge year. It’s almost impossible to scroll through Twitter without seeing some images generated from an (often ridiculous) written prompt. Researchers, though, are already looking at the next generation of generators: text-to-video.

In a paper published this week, researchers at Meta AI revealed a text-to-video generator they call Make-A-Video. It takes a written prompt like “a teddy bear painting a portrait” or “a dog wearing a superhero outfit with red cape flying through the sky” and returns a short video clip depicting the machine learning model’s best attempt at recreating it. The videos are clearly artificial, but very impressive all the same. 

As well as written prompts, Make-A-Video can make videos based on other videos or images. It can add motion to a static image and create a video that links two images.

[Image: AI-generated sample from Make-A-Video. Credit: Meta AI]

At the moment, Make-A-Video’s silent clips are composed of 16 frames output at 64 x 64 pixels that are then upscaled using another AI model to 768 x 768 pixels. They’re only five seconds long and just depict a single action or scene. While we’re a long way from an AI creating a feature film from scratch (though AI has previously written screenplays and even directed movies), the researchers at Meta intend to work on overcoming some of these technical limits with future research.

Like the best text-to-image generators, Make-A-Video works using a technique called “diffusion”. It starts with randomly generated noise and then progressively adjusts it to get closer to the target prompt. The accuracy of the results largely depends on the quality of the training data. 
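To make that idea concrete, here is a heavily simplified Python sketch of a reverse-diffusion loop. It is not Meta's code: the `denoiser`, the `text_embedding`, and the one-line update rule are illustrative stand-ins for the trained model, the prompt encoder, and the real noise schedule.

```python
import torch

def generate(denoiser, text_embedding, steps=50, shape=(1, 3, 16, 64, 64)):
    """Start from pure noise and repeatedly subtract the noise the model predicts."""
    x = torch.randn(shape)                      # random noise: batch, channels, frames, height, width
    for t in reversed(range(steps)):            # walk the noise schedule backwards
        predicted_noise = denoiser(x, t, text_embedding)
        x = x - predicted_noise / steps         # crude stand-in for the real denoising update
    return x                                    # later stages upscale and sharpen the result

# Toy stand-ins so the loop actually runs; a real model conditions on the prompt embedding.
dummy_denoiser = lambda x, t, emb: 0.1 * x
video = generate(dummy_denoiser, text_embedding=None)
print(video.shape)  # torch.Size([1, 3, 16, 64, 64])
```

The point of the sketch is the overall shape of the process: each pass through the loop removes a little of the predicted noise, so the output drifts from static toward something that matches the text prompt.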

According to the blog post announcing it, Make-A-Video’s AI learned “what the world looks like from paired text-image data and how the world moves from video footage with no associated text.” It was trained with more than 2.3 billion text-image pairs from the LAION-5B database and millions of videos from the WebVid-10M and HD-VILA-100M databases.

[Image: Meta’s AI text-to-video generator. Credit: Meta AI]

Meta claims that static images with paired text are sufficient for training text-to-video models as motion, actions, and events can be inferred from the images—like a woman drinking a cup of coffee or an elephant kicking a football. Similarly, even without any text describing them, “unsupervised videos are sufficient to learn how different entities in the world move and interact.” The results from Make-A-Video suggest they are right. 

The researchers say they did what they could to control the quality of the training data, filtering out of the LAION-5B dataset all text-image pairs that contained NSFW content or toxic words. Still, they acknowledge that, like “all large-scale models trained on data from the web, [their] models have learnt and likely exaggerated social biases, including harmful ones.” Preventing AIs from creating racist, sexist, and otherwise offensive, inaccurate, or dangerous content is one of the biggest challenges in the field.

For now, Make-A-Video is only available to researchers at Meta (although you can register your interest in getting access here). Although the videos the team has shown off are impressive, we have to accept they were probably selected to show the algorithm in the best possible light. Still, it’s hard not to recognize how far AI image generation has come. Just a few years ago, DALL-E’s results were only mildly interesting—now they’re photorealistic.

Text-to-video is definitely more challenging for AI to get accurate. As Mark Zuckerberg said in a Facebook post, “It’s much harder to generate video than photos because beyond correctly generating each pixel, the system also has to predict how they’ll change over time.” The videos have an abstract, janky quality to them, and the motion they depict doesn’t look quite natural.

Despite the low quality, Zuckerberg called this tool “pretty amazing progress.”