Text-to-image generators powered by artificial intelligence, like DALL-E 2 and Stable Diffusion, have had a huge year. It’s almost impossible to scroll through Twitter without seeing some images generated from an (often ridiculous) written prompt. Researchers, though, are already looking at the next generation of generators: text-to-video.
In a paper published this week, researchers at Meta AI revealed a text-to-video generator they call Make-A-Video. It takes a written prompt like “a teddy bear painting a portrait” or “a dog wearing a superhero outfit with red cape flying through the sky” and returns a short video clip depicting the machine learning model’s best attempt at recreating it. The videos are clearly artificial, but very impressive all the same.
As well as written prompts, Make-A-Video can make videos based on other videos or images. It can add motion to a static image and create a video that links two images.
At the moment, Make-A-Video’s silent clips are composed of 16 frames output at 64 x 64 pixels that are then upscaled using another AI model to 768 x 768 pixels. They’re only five seconds long and just depict a single action or scene. While we’re a long way from an AI creating a feature film from scratch (though AI has previously written screenplays and even directed movies), the researchers at Meta intend to work on overcoming some of these technical limits with future research.
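To put those resolution figures in perspective, here is a small Python sketch of the arithmetic. It uses only the 64 x 64 and 768 x 768 numbers quoted above; the function name `upscale_stats` is made up for illustration.

```python
def upscale_stats(low_side, high_side):
    """Return (per-side scale factor, pixel-count factor) for square frames."""
    side_factor = high_side // low_side
    # Pixel count grows with the square of the side length.
    pixel_factor = (high_side * high_side) // (low_side * low_side)
    return side_factor, pixel_factor

side_factor, pixel_factor = upscale_stats(64, 768)
print(side_factor, pixel_factor)  # 12 144
```

In other words, each final frame has 144 times as many pixels as the frame the generator first produced, so most of the visible detail comes from the upscaling step.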
Like the best text-to-image generators, Make-A-Video works using a technique called “diffusion”. It starts with randomly generated noise and then progressively adjusts it to get closer to the target prompt. The accuracy of the results largely depends on the quality of the training data.
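The "start from noise, then refine" idea can be shown with a toy Python sketch. This is not Meta's model or a real diffusion sampler: `toy_denoiser` is a hypothetical stand-in that already knows the target, whereas a real system uses a trained neural network guided by the text prompt. The names `sample`, `steps`, and `distance` are also invented for this example.

```python
import random

def toy_denoiser(x, target):
    """Hypothetical stand-in for a trained network: estimate the noise in x.

    A real model would predict this from x and the prompt. Here we cheat and
    use the known target so the example stays self-contained.
    """
    return [xi - ti for xi, ti in zip(x, target)]

def sample(target, steps=50, seed=0):
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per "pixel".
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        predicted_noise = toy_denoiser(x, target)
        # Remove a growing fraction of the estimated noise at each step,
        # so the final step removes whatever is left.
        fraction = 1.0 / (steps - step)
        x = [xi - fraction * ni for xi, ni in zip(x, predicted_noise)]
    return x

def distance(a, b):
    """Euclidean distance between two equal-length lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

target = [0.2, 0.8, 0.5, 0.1]  # made-up "image" of four pixel values
result = sample(target)
print(distance(result, target) < 1e-9)
```

Because the stand-in denoiser is perfect, the loop lands on the target almost exactly. A real generator only has a learned estimate, which is one reason the quality of the training data matters so much.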
According to the blog post announcing it, Make-A-Video’s AI learned “what the world looks like from paired text-image data and how the world moves from video footage with no associated text.” It was trained with more than 2.3 billion text-image pairs from the LAION-5B database and millions of videos from the WebVid-10M and HD-VILA-100M databases.
Meta claims that static images with paired text are sufficient for training text-to-video models as motion, actions, and events can be inferred from the images—like a woman drinking a cup of coffee or an elephant kicking a football. Similarly, even without any text describing them, “unsupervised videos are sufficient to learn how different entities in the world move and interact.” The results from Make-A-Video suggest they are right.
The researchers said they have done what they can to control the quality of the training data, removing from LAION-5B all text-image pairs that contained NSFW content or toxic words. Still, they acknowledge that, like “all large-scale models trained on data from the web, [their] models have learnt and likely exaggerated social biases, including harmful ones.” Preventing AIs from creating racist, sexist, and otherwise offensive, inaccurate, or dangerous content is one of the biggest challenges in the field.
For now, Make-A-Video is only available to researchers at Meta (although you can register your interest in getting access). While the videos the team has shown off are impressive, we have to accept they were probably selected to show the algorithm in the best possible light. Still, it’s hard not to recognize how far AI image generation has come. Just a few years ago, DALL-E’s results were only mildly interesting; now they’re photorealistic.
Text-to-video is definitely harder for AI to get right. As Mark Zuckerberg said in a Facebook post, “It’s much harder to generate video than photos because beyond correctly generating each pixel, the system also has to predict how they’ll change over time.” The videos have an abstract, janky quality to them, and the motion they depict looks unnatural.
Despite the low quality, Zuckerberg called this tool “pretty amazing progress.”