Forget ‘Deep Dream,’ Google’s ‘Deep Stereo’ Can Recreate the Real World

Time to stop dreaming
Screenshot from Google's Deep Stereo video

John Flynn/Google/YouTube

Google’s artificial neural network has run rampant throughout the internet in the last few weeks, turning demure Twitter photos into surrealistic nightmares and taking the already-hellish Fear and Loathing in Las Vegas to a level only Hunter S. Thompson himself could have imagined.

But let us not forget: Google’s AI also serves practical ends. Google engineers are using their layered artificial neural network, also called a deep network, to create unseen views from two or more images of a scene. They call it “Deep Stereo.” For example, given photos taken from the left and the right of a scene, the deep network can tell you what it looks like from anywhere in the middle. Or, given five photos of a room, the deep network can render views of the room from angles never photographed, based on what it thinks should be there.

The network does this by estimating the depth and color of each pixel in the input images, creating a 3D space that uses each input photo as a reference plane. It then fills in the gaps based on the color and depth information from the original photos. At its current settings, the deep network works with 96 planes of depth gathered from the images.
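To make the idea concrete, here is a minimal, hand-coded sketch of that plane-sweep approach using NumPy. It is not Google’s method, which uses a learned deep network to choose and blend depth planes; this toy version simply tests a stack of candidate planes (modeled as horizontal disparities for a rectified image pair, an assumption for simplicity) and, for each pixel, keeps the plane where the two views agree best. The function name and parameters are illustrative.

```python
import numpy as np

def plane_sweep_blend(left, right, num_planes=96, max_disp=16):
    """Toy plane-sweep view blending for a rectified stereo pair.

    For each pixel, test `num_planes` candidate depth planes
    (represented as disparities), pick the plane where the two
    views agree best, and blend their colors there. Deep Stereo
    learns this plane selection with a deep network; this sketch
    uses a simple hand-coded argmin instead.
    """
    h, w = left.shape[:2]
    disparities = np.linspace(0, max_disp, num_planes)
    best_cost = np.full((h, w), np.inf)
    output = np.zeros_like(left, dtype=float)
    for d in disparities:
        # Warp the right image toward the left view by shifting
        # it d pixels (a crude stand-in for reprojection).
        warped = np.roll(right, int(round(d)), axis=1)
        cost = np.abs(left.astype(float) - warped.astype(float))
        if cost.ndim == 3:          # average over color channels
            cost = cost.mean(axis=2)
        # Where this plane explains the pixel better than any
        # plane seen so far, keep the blended color from it.
        better = cost < best_cost
        best_cost[better] = cost[better]
        blend = 0.5 * (left.astype(float) + warped.astype(float))
        output[better] = blend[better]
    return output
```

With a synthetic pair where the right image is just the left shifted five pixels, the sweep recovers the original view almost exactly; real scenes need true reprojection and, as the researchers found, far more compute per pixel.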

Researchers initially trained the neural network to do this (much like they trained it to produce its own images) by running 100,000 sets of photos through the network and having it create unique views. The rendered images had to be small: according to the study, producing even a tiny 512×512-pixel image takes 12 minutes, making the RAM required to process a full image “prohibitively expensive.” The training images were “street scenes captured by a moving vehicle,” which we can only assume was a Google Street View car.

Researchers note the two biggest drawbacks: the speed at which these images are processed, and the fact that the network can only take five input images at a time—limiting resolution and accuracy.

An obvious application for this technique would be making Google’s Street View a more fluid experience, letting users glide through a scene rather than jump from photo to photo taken by the car. But if 512×512-pixel images are the current practical limit, and Google has already imaged most of the driven world, we’re not expecting this feature any time soon.