Google's DeepMind has a new model for robots

Tech giant Google and its subsidiary AI research lab, DeepMind, have created a basic human-to-robot translator of sorts. They describe it as a “first-of-its-kind vision-language-action model.” The pair said in two separate announcements Friday that the model, called RT-2, is trained with language and visual inputs and is designed to translate knowledge from the web into instructions that robots can understand and respond to.

In a series of trials, the robot demonstrated that it can distinguish between the flags of different countries, tell a soccer ball from a basketball, and recognize pop icons like Taylor Swift and items like a can of Red Bull.

“The pursuit of helpful robots has always been a herculean effort, because a robot capable of doing general tasks in the world needs to be able to handle complex, abstract tasks in highly variable environments — especially ones it’s never seen before,” Vincent Vanhoucke, head of robotics at Google DeepMind, said in a blog post. “Unlike chatbots, robots need ‘grounding’ in the real world and their abilities… A robot needs to be able to recognize an apple in context, distinguish it from a red ball, understand what it looks like, and most importantly, know how to pick it up.”

That means that training robots traditionally required generating billions of data points from scratch, along with specific instructions and commands. A task like telling a bot to throw away a piece of trash involved programmers explicitly training the robot to identify the object that is the trash, the trash can, and what actions to take to pick the object up and throw it away.

For the last few years, Google has been exploring various avenues of teaching robots to do tasks the way you would teach a human (or a dog). Last year, Google demonstrated a robot that can write its own code based on natural language instructions from humans. Another Google subsidiary called Everyday Robots tried to pair user inputs with a predicted response using a model called SayCan that pulled information from Wikipedia and social media.

Some examples of tasks the robot can do. *DeepMind*

RT-2 builds on a similar precursor model called RT-1 that allows machines to interpret new user commands through a chain of basic reasoning. Additionally, RT-2 possesses skills related to symbol understanding and human recognition, skills that Google thinks will make it adept as a general-purpose robot working in a human-centric environment.
More details on what robots can and can’t do with RT-2 are available in a paper DeepMind and Google put online.

RT-2 also draws on work with vision-language models (VLMs), which have been used to caption images, recognize objects in a frame, and answer questions about a picture. So, unlike SayCan, this model can actually see the world around it. But for a VLM to control a robot, it needs an added component for outputting actions. This is done by representing the different actions the robot can perform as tokens in the model. With this, the model can not only predict the answer to someone’s query, but also generate the action most likely associated with it.
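To make the idea of actions as tokens a little more concrete, here is a minimal sketch of one way a continuous robot command could be turned into discrete tokens and back. It is an illustration only, not DeepMind's implementation: the function names, the bin count of 256, and the action range of -1.0 to 1.0 are all assumptions made for this example.

```python
# Hypothetical sketch: NUM_BINS and the [-1.0, 1.0] action range are
# illustrative assumptions, not values taken from the article.
NUM_BINS = 256


def action_to_tokens(action, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map each continuous action value to an integer token in [0, num_bins - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)       # keep the value inside the range
        fraction = (clipped - low) / (high - low)  # rescale to [0, 1]
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens


def tokens_to_action(tokens, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Recover approximate action values as the centre of each token's bin."""
    width = (high - low) / num_bins
    return [low + (token + 0.5) * width for token in tokens]


if __name__ == "__main__":
    # A made-up three-value command, e.g. small arm movements along x, y, z.
    command = [-1.0, 0.0, 1.0]
    tokens = action_to_tokens(command)
    print(tokens)
    print(tokens_to_action(tokens))
```

Once actions are plain integers like these, a model that already predicts the next token of text can, in principle, predict the next action in the same way. The price is a small rounding error, since each token stands for a narrow range of values rather than one exact number.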

DeepMind notes that, for example, if a person says they’re tired and want a drink, the robot could decide to get them an energy drink.


