I’m picturing a bike next to a fence. It’s in a European city somewhere, with narrow cobblestone streets, and the fence is in front of an old-looking brick building. The bike is shiny and blue, with a basket, sort of old fashioned. You can’t see the sky, but you can tell it’s a somewhat sunny day.
There’s no way I could possibly find a picture of a scene like this one on the Internet. Sure, I can type in keywords like “blue bike next to fence in Europe” and it will show me some results that are tangentially related if I’m lucky. My chances are slightly better if I happen to have such an image already at my disposal—that way, I can do a reverse image search and can crawl across sites not limited to English. But oftentimes the results will seem weird, with the wrong feeling or missing key components of the scene in my head.
Computers still can’t read our minds. But stock image website Shutterstock has created a whole new way to categorize images. The company’s new tool, which launches today on their web site, is one of many innovations in a recent but rapidly growing field called computer vision. And Shutterstock is hoping that it can transform your frustrating process of matching the image in your head to the one on your screen into something that’s actually fun.
A picture is worth a thousand words
It’s hard to find the right images online because most search engines rely on keywords. If a user is uploading that bike image to Shutterstock’s web site, for example, she provides all the keywords. If she’s uploading a batch of images that are similar, some of those keywords might not pertain to each individual image.
“All of these keywords together can be strange—that’s one of the problems that’s inherent when you treat media like a bag of words,” says Kevin Lester, the vice president of engineering for search and discovery at Shutterstock, one of the engineers behind the new computer vision tool.
So a lot of image databases fill in those gaps with user behavior. If people searching the words “bike” and “fence” download a particular image more often, that image probably contains both. It’s a simple concept, Lester says, but it’s still imperfect.
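That behavioral signal can be sketched with a toy scoring function. Everything here, from the download log to the image IDs, is an illustrative assumption, not how Shutterstock’s actual system works:

```python
from collections import Counter

# Hypothetical download log: which image users grabbed after which searches.
download_log = [
    ({"bike", "fence"}, "img_001"),
    ({"bike", "fence"}, "img_001"),
    ({"bike"}, "img_002"),
    ({"fence", "europe"}, "img_001"),
]

def score_images(query_terms, log):
    """Count keyword overlap between a query and past searches that
    led to a download; images downloaded after similar searches rank higher."""
    scores = Counter()
    for search_terms, image_id in log:
        overlap = len(query_terms & search_terms)
        if overlap:
            scores[image_id] += overlap
    return scores.most_common()

print(score_images({"bike", "fence"}, download_log))
```

The weakness is visible even in the toy version: the ranking only knows what past users clicked, not what the pixels actually show.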
Computer vision can change all that by eliminating the need for keywords in the first place. Using a series of algorithms, a model can progressively survey each pixel in an image to pick out different features in it—the color, the shapes, the sharpness of the angles. Each calculation is a layer of the deep learning network. At the end of this process, the program distills the image into a compact list of numbers called a vector. If the model is good, the closer two vectors are, the more similar the images they describe. The model trains itself to recognize these features, so the more images plugged into it, the better the model becomes.
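Comparing those vectors is simple arithmetic. A common choice is cosine similarity, shown here with made-up three-number vectors; a real model’s embeddings run to hundreds of dimensions, and these image names and values are purely illustrative:

```python
import math

def cosine_similarity(a, b):
    """Score how closely two vectors point the same way: 1.0 is identical
    direction, values near 0 mean the images share little."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for a deep network's output.
blue_bike = [0.9, 0.1, 0.4]
blue_bicycle = [0.8, 0.2, 0.5]
sunset_beach = [0.1, 0.9, 0.2]

print(cosine_similarity(blue_bike, blue_bicycle))  # close to 1
print(cosine_similarity(blue_bike, sunset_beach))  # much lower
```

The point is that “similar” becomes a number the computer can sort by, with no keywords involved.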
As a field, this style of computer vision has really only been around since 2012, when three researchers from the University of Toronto published a paper that is now considered a watershed moment for the discipline.
And yet, in just four years, computer vision has become crucial for a number of tech companies. Facebook’s model can identify faces in pictures with more than 97 percent accuracy; Google’s can solve those CAPTCHA puzzles—designed to weed out robots by verifying that a user is human—with 99 percent accuracy.
A model for computer vision can be used for a number of different applications, but it’s usually trained for a particular task. Shutterstock is using it to detect visually similar images and do a reverse image search.
Seeing like a computer
One of the main ways people discover images on Shutterstock’s web site is in this category called visually similar. It’s those images that come up at the bottom when you click on one. Like this:
If the system relies on keywords, the images it returns are sometimes related, but sometimes not. It’s inconsistent and spotty. For Shutterstock’s first computer vision model, the engineers used the architecture first outlined in that 2012 paper and trained it on the site’s 70 million stock images. Even then, it wasn’t very good.
Visually Similar Shutterstock 1
“I don’t think anyone would consider those extremely similar, other than that the color palette seems to be somewhat consistent,” says Lester.
The engineers tweaked the model, then gave it weeks to retrain on the data to learn about particular features of images. And it got a little better:
Visually Similar Shutterstock 2
There were a few more iterations, but here are the results turned up in the final version of the tool:
Visually Similar Shutterstock 3
Through internal testing, Shutterstock found that its new visually similar tool was significantly better than the old one that relied on keywords. Now, every time someone clicks on an image on the site (and that happens a lot—the company sells 4.7 images per second), the algorithm searches through the 70 million photos to serve up those it deems most similar. The site also uses the tool on its 4 million film clips, a growing area of business for the company.
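Conceptually, that search is just ranking every vector in the library by distance to the clicked image’s vector. A brute-force sketch (with invented image IDs and two-number vectors) looks like this; at 70 million images, real systems lean on approximate nearest-neighbor indexes to keep latency down:

```python
import math

def nearest_images(query_vec, library, top_k=3):
    """Rank a library of (image_id, vector) pairs by Euclidean distance
    to the query vector and return the closest top_k IDs."""
    def dist(vec):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(query_vec, vec)))
    ranked = sorted(library, key=lambda pair: dist(pair[1]))
    return [image_id for image_id, _ in ranked[:top_k]]

# A toy image library; each vector stands in for a learned embedding.
library = [
    ("red_barn", [0.2, 0.8]),
    ("blue_bike", [0.9, 0.1]),
    ("blue_canal", [0.8, 0.3]),
]

print(nearest_images([0.85, 0.15], library, top_k=2))
```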
Importantly, it does this search in just 200 milliseconds—half the time the company’s old model took. And while a difference of 200 milliseconds might not sound like much, Lester says it makes a world of difference to impatient customers. “When we quicken the speed, we found that people search more, because what we did was reduce the cost of them doing the search, which means they were exploring our site more. And that in turn meant that they were more likely to sign up as customers,” he says.
Some types of images are more challenging than others for Shutterstock’s computer vision tool. It took a little longer to train the system on abstract images, Lester says, and sometimes it can interpret watermarks as essential parts of an image.
“The system is only as smart as what you train it on,” Lester says. “If there are things that are outside of its wheelhouse, it’s going to do not quite as well because it’s going to shoehorn it into something it understands.” But with such a large database that is changing all the time as contributors add more images, the company’s good model can only get better.
Simon Lucey, a computer vision professor at Carnegie Mellon University, was impressed with the results when he used Shutterstock’s web site. “What they’re doing is representative of what’s happening in computer vision at the moment: Big advances in deep learning,” he says. “For many tasks these models are achieving human-like performance.”
Getting a computer to understand an image, not just capture it, has been the holy grail for computer science for some time, Lucey says, and improvements to hardware and software have brought the technology there. Shutterstock’s tool is riding that wave of progress, he adds.
The limit does not exist
As models like Shutterstock’s improve, engineers run into semantic or philosophical issues. At a certain point, Lester says, people simply differ in how they define images as similar—that’s when he’ll know his team can stop improving its model. Then there’s the inevitability of offending someone, like when Google’s tool labeled a woman as a gorilla last year.
“When the computer makes false assumptions, that this image is really this thing, and that’s a bad, potentially offensive relationship, that’s when you start getting into the trouble areas with computer vision,” Lester says. To avoid issues like the one Google ran into, Shutterstock’s team identified potentially problematic distinctions and retrained its model on those images. If the model is smart enough in those areas, Lester says, it no longer makes those offensive associations.
Eventually, sites like Shutterstock could use computer vision to power new types of searching, or new ways to interact with images. Someday you could search for a pair of shoes you see on a celebrity by dragging a box over that part of a photo—you wouldn’t need to describe the shoes or even know the name of their wearer.
“When you start changing your discovery experience to be more based on pixels, you can affect your searches in ways that so far the industry hasn’t seen,” Lester says.
For computer vision in general, the applications seem limitless. When combined with other types of technology like robotics and artificial intelligence, computer vision can help self-driving cars see pedestrians, enable robots to properly grip an object, or help the blind see.
And while Lucey anticipates we’ll run into some more issues—with privacy, with people losing their jobs as industries transform—he believes that computer vision is a force that can be used to make the world better.
“I think much like video compression, computer vision is eventually just going to be not noticeable. We’ll take it for granted. It’s just going to work,” he says. “I think that’s a sign of good tech, where the tech itself is no longer noticeable.”
With Shutterstock’s new tool launching today, we’re one step closer to that reality. And Lester and his colleagues are excited to see how customers will put it to use. “The exciting thing about putting it into the market is to see what people want from it and how they use it,” says Lawrence Lazare, the product director for Search and Discovery at Shutterstock. “Sometimes people don’t use it how you think they will.”