Why you can't use Google's Imagen

This article was originally featured on Popular Photography.

A cute corgi lives in a house made of sushi. A dragon fruit wearing a karate belt in the snow. A brain riding a rocket ship heading towards the moon. These are just a few of the AI-generated images produced by Google’s Imagen text-to-image diffusion model, and the results are incredibly accurate, sometimes humorously so. Researchers from Google unveiled these results in a paper published last month, where they also discussed the moral repercussions that come with using this latest technology.

Google’s Imagen beats the competition

In their research paper, Google computer scientists found that existing large language models, pre-trained on text alone, work well as text encoders for generating images from text input. With Imagen, they simply increased the size of the language model and found that doing so led to more accurate results.

Imagen’s FID score beat those of other text-to-image synthesizers. *Google Research, Brain Team*

To measure results, the Imagen team used the Common Objects in Context (COCO) dataset, which is an open-source compendium of visual datasets on which companies and researchers can train their AI algorithms in image recognition. The models receive a Fréchet Inception Distance (FID) score, which measures how closely the images they generate from the dataset’s prompts match real images. A lower score indicates that there are more similarities between the real and generated images, with a perfect score being 0.0. Google’s Imagen diffusion model can create 1024-by-1024-pixel sample images with an FID score of 7.27.
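For readers curious how an FID-style score behaves, here is a minimal Python sketch, not the Imagen team's evaluation code. It assumes NumPy is installed and that image features have already been extracted; a real FID pipeline takes those features from a pre-trained Inception network, while the random arrays below are hypothetical stand-ins. The function name `frechet_distance` is our own. It fits a Gaussian to each feature set and computes the Fréchet distance between the two.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_images, feature_dim).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g. Tiny negative or imaginary parts are
    # numerical noise, so keep the real part and clip at zero.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = float(np.sum((mu_r - mu_g) ** 2))
    cov_term = float(np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)
    return mean_term + cov_term


# Hypothetical stand-in features (NOT real Inception activations).
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 8))   # "real image" features
close = rng.normal(0.0, 1.0, size=(2000, 8))  # drawn from the same distribution
far = rng.normal(1.5, 1.0, size=(2000, 8))    # drawn from a shifted distribution

fid_same = frechet_distance(real, real)    # identical sets: essentially zero
fid_close = frechet_distance(real, close)  # similar sets: small score
fid_far = frechet_distance(real, far)      # dissimilar sets: larger score
```

In this toy setup, comparing a feature set with itself gives a score of essentially zero, and the score grows as the generated features drift away from the real ones, which is why lower is better.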

According to the research paper, Imagen tops the charts with its FID score when compared to other models including DALL-E 2, VQ-GAN+CLIP, and Latent Diffusion Models. Findings indicated that Imagen was also preferred by human raters.

A dragon fruit wearing a karate belt is just one of the many images Imagen is capable of creating. *Google Research, Brain Team*

“For photorealism, Imagen achieves 39.2% preference rate indicating high image quality generation,” Google computer scientists report. “On the set with no people, there is a boost in the preference rate of Imagen to 43.6%, indicating Imagen’s limited ability to generate photorealistic people. On caption similarity, Imagen’s score is on-par with the original reference images, suggesting Imagen’s ability to generate images that align well with COCO captions.”

In addition to the COCO dataset, the Google team also created their own, which they called DrawBench. The benchmark consists of rigorous scenarios that tested different models’ ability to synthesize images based on “compositionality, cardinality, spatial relations, long-form text, rare words, and challenging prompts,” going beyond the more limited COCO prompts.

Though fun, the technology presents moral and ethical dilemmas. *Google Research, Brain Team*

Moral implications of Imagen and other AI text-to-image software

There’s a reason why all the sample images have no people. In their conclusion, the Imagen team discusses the potential moral repercussions and societal impact of the technology, and that impact is not always for the best. Already, the program exhibits a Western bias and viewpoint. While the team acknowledges the potential for endless creativity, there are, unfortunately, also those who may attempt to use the software for harm. It is for this reason, among others, that Imagen is not available for public use, but that could change.

“On the other hand, generative methods can be leveraged for malicious purposes, including harassment and misinformation spread, and raise many concerns regarding social and cultural exclusion and bias,” the researchers write. “These considerations inform our decision to not release code or a public demo. In future work, we will explore a framework for responsible externalization that balances the value of external auditing with the risks of unrestricted open-access.”

The researchers acknowledge that more work is required before Imagen can be responsibly released to the public. *Google Research, Brain Team*

Additionally, the researchers noted that due to the available datasets on which Imagen is trained, the program exhibits bias. “Dataset audits have revealed these datasets tend to reflect social stereotypes, oppressive viewpoints, and derogatory, or otherwise harmful, associations to marginalized identity groups.”

While the technology is certainly fun (who wouldn’t want to whip up an image of an alien octopus floating through a portal while reading a newspaper?), it’s clear that it requires more work and research before Imagen (and other programs) can be responsibly released to the public. Some, like DALL-E 2, have deployed safeguards, but their efficacy remains to be seen. The Imagen team acknowledges the gargantuan, though necessary, task of thoroughly mitigating negative consequences.

“While we do not directly address these challenges in this work, an awareness of the limitations of our training data guides our decision not to release Imagen for public use,” they finish. “We strongly caution against the use of text-to-image generation methods for any user-facing tools without close care and attention to the contents of the training dataset.”