How Artificial Intelligence Will Translate Facebook Photos For The Blind

The machine helps the users, and the users help the machine
Facebook's new automatic alternative text uses artificial intelligence to recognize objects and people in photos. Facebook

While it’s easy to dwell on the potential threats of artificial intelligence, much more often the field promises to make humans’ lives better. A.I. algorithms are meant to help us connect with our friends, find information, and even transport us through the physical world.

Starting today, Facebook is using artificial intelligence to automatically generate text captions for every photo on Facebook, to provide much-needed accessibility for the blind or visually impaired.

Because the developers wanted the text generated by the A.I. to be extremely accurate, they trained it intensively on images of just 100 different types of objects. At present it is limited to identifying concepts such as humans, pizza and baseballs, but as research progresses the captions will become increasingly versatile and complex.

To surf the internet, the visually impaired often rely on screen readers, which read aloud the words on the screen. However, screen readers are only as good as the content they can read. If text is missing, they can’t read it. Web standards dictate that images should have a field called alt text, describing in words what the image depicts. However, on most Facebook images, the only text available for screen readers is the status posted along with the photos.
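To make the alt text gap concrete, here is a minimal sketch of how a page could be checked for images that leave a screen reader with nothing to say. It uses Python's standard `html.parser` module; the class name `MissingAltFinder` and the helper `images_missing_alt` are hypothetical names chosen for illustration, and nothing here reflects how Facebook or any screen reader actually works.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collects the src of every <img> tag that has no usable alt text."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        # Assumption for this sketch: an absent or blank alt counts as missing.
        # (An empty alt is legitimate for purely decorative images.)
        if alt is None or not alt.strip():
            self.missing.append(attributes.get("src"))


def images_missing_alt(html):
    """Return the src values of images a screen reader could not describe."""
    finder = MissingAltFinder()
    finder.feed(html)
    return finder.missing


page = '<img src="a.jpg" alt="pizza"><img src="b.jpg"><img src="c.jpg" alt="  "/>'
print(images_missing_alt(page))
```

In this example only the first image carries a description; the other two are the kind of photo the new feature is meant to fill in.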

By applying artificial intelligence algorithms, Facebook is able to scan each image and pull out some information about what it depicts. If someone posts a picture of a pizza, the algorithm will be able to automatically put the word “pizza” into the alt text of the image, so the screen reader can read it to the user. The captions won’t be seen by most of the social network’s 1.5 billion users, but they mark a shift for those who can’t see photos on an increasingly visual platform.

Facebook is using this opportunity to democratize the way it does research. The company’s Accessibility and A.I. teams will get feedback from users and use it to direct further research. In March, Facebook published a study in tandem with Cornell University exploring how blind people used Facebook, in hopes of making a product geared towards what the community needs.

“It should be what people want that drives the research, rather than what we have in the research that drives the usage,” said Paluri. “Feedback allows us to investigate more.”

The challenge of recognizing and describing images is a prominent category of research in the field of artificial intelligence. New techniques and hardware are enabling deep learning, using layers of artificial neural networks, or tiny clusters of mathematical equations that mimic the brain’s neurons, to sort through data and look for patterns. These techniques can be applied to images, audio, text, or nearly any kind of data. In images, the pattern within a photograph of a cat is different from the pattern for a dolphin.

But individual objects are simple. When objects interact with each other, or when there’s context around an action, that’s much more difficult, because the machine needs to actually understand something about the physical world, and know relationships between objects. To a naive machine, there’s no gravity or family relationships or love. There’s only data.

So to understand that a father and daughter are walking on a hiking trail, or that a cat is on a bed, a machine must first learn about the physical world.

And that’s just what Facebook’s Accessibility team needs, too. Right now, they have these recognized objects, called tags. A tag is a cat, a tag is a bed, a tag is a person. With that information, they can say there are four people with ice cream cones in a photo, or a pizza pie.
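As a rough illustration of how a flat list of tags could become a sentence a screen reader can speak, here is a minimal sketch. The function name `compose_alt_text` and the "Image may contain:" wording are assumptions made for this example, not a description of Facebook's actual code.

```python
from collections import Counter


def compose_alt_text(tags):
    """Turn a list of recognized tags into one short description."""
    counts = Counter(tags)  # keeps tags in the order they were first seen
    if not counts:
        return ""  # nothing recognized, so nothing to announce
    # Repeated tags are collapsed into a single entry with a count.
    parts = [tag if n == 1 else f"{tag} ({n})" for tag, n in counts.items()]
    return "Image may contain: " + ", ".join(parts)


print(compose_alt_text(["person"] * 4 + ["ice cream cone"] * 4))
print(compose_alt_text(["pizza"]))
```

Note what the sketch cannot do: it counts four people and four ice cream cones, but it has no way to say who is holding what.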

“Our goal is getting to a point where it’s describing much much more than tags. How do tags interact? What are the relationships between tags?” says Paluri. “Not just saying ‘cat’ and ‘bed.’ You want to say ‘cat on the bed,’ or ‘cat jumping over the bed.’ So this is a starting point.”

This is a starting point in many ways. The team dreams not only of more context-based object recognition, but also of making the recognition more interactive. Paluri suggests a potential feature where users could tap on different parts of the image to hear specific information.

But at the scale Facebook works at, precision needs to be a top priority. Two billion images are shared over Facebook, Instagram, Messenger and WhatsApp every day, so even a one percent error rate can mean millions of mistakes. Engineers hand-tuned each of the roughly 100 concepts that the algorithm can detect, based on the importance of correctly classifying the object. For instance, the algorithm needs to be much more certain about something like gender than about whether an object is pizza. It can recognize objects from its library of 100 with between 80 and 99 percent confidence. Facebook says that it can recognize at least one of the objects in more than 50 percent of the photos on Facebook.
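One plausible reading of that tuning is a separate confidence threshold for each concept, so that sensitive labels are announced only when the model is nearly certain. The sketch below assumes that reading; the threshold values, the concept names in `THRESHOLDS`, and both function names are hypothetical, chosen only to fall within the 80 to 99 percent range mentioned above. It also works through the scale arithmetic.

```python
# Hypothetical per-concept thresholds: stricter where a mistake would hurt more.
THRESHOLDS = {"pizza": 0.80, "baseball": 0.85, "woman": 0.99}
DEFAULT_THRESHOLD = 0.90  # assumed fallback for concepts not listed above


def confident_tags(scores, thresholds=THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Keep only the concepts whose score clears that concept's own threshold."""
    return [
        concept
        for concept, score in scores.items()
        if score >= thresholds.get(concept, default)
    ]


def expected_mistakes(images_per_day, error_rate):
    """Rough count of mislabeled images per day at a given error rate."""
    return round(images_per_day * error_rate)


print(confident_tags({"pizza": 0.82, "woman": 0.95, "eyeglasses": 0.91}))
print(expected_mistakes(2_000_000_000, 0.01))
```

Here a score of 0.82 is enough to announce pizza, while 0.95 is not enough for the gendered label. At two billion images a day, a one percent error rate works out to 20 million mistakes daily.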

Most of the concepts that the machine can understand are about people and physical objects. It knows eyeglasses, baseballs, and even selfies. However, there are some that the team purposefully didn’t put in, according to Paluri. Among them, certain animals.

Mistakes made by A.I. systems, especially when classifying images, can be culturally sensitive, like when Google’s Photos app labeled black people as gorillas last year. To avoid that sort of situation, “we want to start where we’re super confident and there’s a lot of positive feedback,” Paluri says.

Confidence can also be more innocuous. Paluri mentions cat paws.

“There might be a cat paw in the corner. Is there still a cat in the picture? This is an open question,” he said. “And maybe the image is about the paw, and that’s what makes it funny.”

There are many directions that the research can take, including trying to detect humor. But regardless, any improvement will rest on better algorithms, informed by real people’s needs. The promise of artificial intelligence is to make life easier for humans. We’re outsourcing the parts of our brains that machines can replace. By using software to augment ourselves, the world becomes a more accessible place.

The feature is available now on Facebook’s iOS app, and will be rolling out to other platforms soon, as well as to languages other than English.