Google I/O recap: All the cool AI-powered projects in the works

Added language options in Translate, creating a sense of place in Maps, and futuristic interpreter glasses.
"Scene exploration" was teased as a feature that might be coming to Search soon. Google

Google held its annual I/O developers conference today, announcing hardware such as new Pixel phones and a round Pixel Watch, and teasing futuristic glasses that display real-time language translation in augmented reality. The company also revealed new features, like a summarize option coming to Google Docs (think of it as an AI-generated TL;DR) and a Wallet app that can also hold a digital ID or vaccine card.

Notably, the tech giant also highlighted how AI has allowed it to build new features across a range of its services and apps, including Translate, Search, Maps, and more. Here’s which updates users can expect to come down the pike, both soon and in the future.


Google’s work on language models has enabled it to expand its translation capabilities. Google said that it is adding 24 new languages to Google Translate, including Bhojpuri, Lingala, and Quechua. Along with these new languages, Google has also published research on how it intends to build machine translation systems for languages that lack large translation datasets, using high-quality monolingual datasets instead. The company calls this technique Zero-Shot Machine Translation.

This technique produces translations without relying on a thorough, traditional translation dictionary. According to a Google blog post, the company trained a language model to “learn representations of under-resourced languages directly from monolingual text using the MASS task,” where solving the task requires the model to establish “a sophisticated representation of the language in question, developing a complex understanding of how words relate to other words in a sentence.”
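For a rough sense of what a MASS-style (masked sequence-to-sequence) training example looks like, here is a toy Python sketch. It is not Google's code: the function name, the `[MASK]` placeholder, and the 50 percent span size are assumptions made for illustration. The idea is that a contiguous run of words is hidden from the model, which must then predict the hidden words from the surrounding context.

```python
import random

MASK = "[MASK]"  # placeholder token; the name is an assumption for this sketch


def make_mass_example(tokens, span_fraction=0.5, rng=None):
    """Hide one contiguous span of a sentence, MASS-style.

    Returns (encoder_input, target, start): the sentence with the span
    replaced by MASK tokens, the hidden words the model should predict,
    and the position where the span begins.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    rng = rng or random.Random()
    # Hide at least one token, and roughly span_fraction of the sentence.
    span_len = max(1, int(len(tokens) * span_fraction))
    start = rng.randint(0, len(tokens) - span_len)
    encoder_input = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    target = tokens[start:start + span_len]
    return encoder_input, target, start


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    masked, hidden, start = make_mass_example(sentence, rng=random.Random(0))
    print("model sees:   ", masked)
    print("model predicts:", hidden)
```

Because the model only ever sees text in one language here, this kind of exercise needs no translated sentence pairs, which is what makes it usable for languages with little parallel data.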

Google is also rolling out auto-translated captions in 16 languages on YouTube, in addition to the speech recognition models it already uses to create text transcriptions for videos. The feature will come to Ukrainian content next month as part of an effort to increase access to accurate information about the war.

Over the past few years, Google Search has introduced a variety of tools to make it easier for people to find what they want in different ways, including voice search, hum to search, Google Lens, and, more recently, multisearch, which allows users to combine photos with text prompts in queries. Multimodal technology also uses text, audio, and video to create auto-generated “chapters” in YouTube videos.

Today, Google introduced a feature called search “near me.” Here’s how that would work: In the Google app, users can take a picture or upload a screenshot, and add the text “near me” to find local retailers and restaurants that may have the apparel, goods, or food that they’re looking for. For example, if you’re fixing a broken faucet, you can take a photo of the faulty part and find it at a nearby hardware store.

As another example, if you come across a tasty-looking dish online that you would like to try, you can take a picture of it and Google can tell you what it is, and point you to highly rated local restaurants that offer it through delivery. Google multisearch will “understand the intricacies of this dish, it will combine it with your intent, the fact that you’re looking for local restaurants, and then it will scan millions of images, reviews, and community contributions on maps to find that nearby local spot,” Nick Bell, the lead of search experience at Google, explained in a press call. Local information via multisearch will be available globally in English later this year and roll out to more languages over time.  

Google teased another feature currently in development called “search within a scene,” or “scene exploration.” Typically, Google searches work with objects captured in a single frame, but scene exploration will allow users to pan their cameras around and get instant insights on multiple objects within the camera’s view. Imagine you’re at a bookstore: Using this function, you would be able to see info overlaid on the books in front of you. “To make this possible, we bring together computer vision, natural language understanding, and bring that together with the knowledge of the web and on-device technology,” Bell said.


Google Maps started as a simple navigation app in 2005, but over the past few years, the company has been pushing to “redefine what a map can be,” Miriam Daniel, VP of Google Maps, said in a press call before I/O. Recent additions include info about fuel-efficient routes (available now in the US and Canada and expanding to Europe later this year), the busyness of a destination, and notes about restaurants, like whether they have outdoor seating.

Additionally, Google’s work with 3D mapping and computer vision has enabled it to add more depth and realism to Street View and aerial imagery by fusing together billions of officially collected and user-generated images. Instead of gray blocks of varying heights representing buildings, “immersive view” in Maps will show you the detailed architecture of landmarks like Big Ben up close, as well as what they look like at different times of day with a “time slider.” Maps will also bring together information about weather and traffic conditions to show you what the place is going to be like. Users can also glide down to street level, where they will be able to virtually go inside restaurants or other spaces to get a sense of what they feel like before deciding to visit. This feature will be available on smartphones and other devices.

Immersive view is slated to roll out for landmarks, neighborhoods, restaurants, popular venues, and places in Los Angeles, London, New York, San Francisco, and Tokyo by the end of the year, with more cities coming soon. 

The Google Maps team announced that it will also be releasing the ARCore Geospatial API, which is based on its Live View technology, for third-party developers. Live View and the corresponding global localization software have been used in AR to overlay arrows and directions on the real world, viewed through a live camera stream. Opening this API will let developers integrate the tech into their own apps. Daniel noted that some early developers have already found different ways to apply it. For example, micro-mobility company Lime has used this API to help commuters in London, Paris, Tel Aviv, Madrid, San Diego, and Bordeaux find parking spots for their e-scooters and e-bikes.


A major research area at Google is natural language processing: that is, how to get machines to understand the nuances and imperfections of human speech (which is full of ums and pauses) and hold conversations. Some of its findings are helping make the Google Assistant better. “We really focused on the AI models and we realized we needed 16 different machine learning models processing well over 100 signals,” Nino Tasca, product manager at Google for Speech, said in a press call. “That’s everything like proximity, head orientation, gaze detection, and even the user’s intent with the phrase, just to understand if they’re really talking to the Google Assistant.”

Today, Google introduced a feature called “Look and Talk” on its Nest Hub Max device. If users opt in, they can just look at their device to activate Google Assistant and have it listen to what they want, without saying “Hey, Google.” This feature uses Face Match and Voice Match technology to identify who’s talking, and video from these interactions is processed on-device (as with the Tensor chip). “Look and Talk” will roll out on Android this week and on iOS devices soon.
