Tech giant Google and its subsidiary AI research lab, DeepMind, have created a basic human-to-robot translator of sorts. They describe it as a “first-of-its-kind vision-language-action model.” The pair said in two separate announcements Friday that the model, called RT-2, is trained with language and visual inputs and is designed to translate knowledge from the web into instructions that robots can understand and respond to.
In a series of trials, the robot demonstrated that it can recognize and distinguish between the flags of different countries, a soccer ball from a basketball, pop icons like Taylor Swift, and items like a can of Red Bull.
“The pursuit of helpful robots has always been a herculean effort, because a robot capable of doing general tasks in the world needs to be able to handle complex, abstract tasks in highly variable environments — especially ones it’s never seen before,” Vincent Vanhoucke, head of robotics at Google DeepMind, said in a blog post. “Unlike chatbots, robots need ‘grounding’ in the real world and their abilities… A robot needs to be able to recognize an apple in context, distinguish it from a red ball, understand what it looks like, and most importantly, know how to pick it up.”
Traditionally, that meant training robots required generating billions of data points from scratch, along with specific instructions and commands. A task like telling a bot to throw away a piece of trash involved programmers explicitly training the robot to identify the trash item, the trash can, and the actions needed to pick the object up and throw it away.
For the last few years, Google has been exploring various avenues of teaching robots to do tasks the way you would teach a human (or a dog). Last year, Google demonstrated a robot that can write its own code based on natural language instructions from humans. Another Google subsidiary called Everyday Robots tried to pair user inputs with a predicted response using a model called SayCan that pulled information from Wikipedia and social media.
RT-2 builds on a similar precursor model called RT-1, which allows machines to interpret new user commands through a chain of basic reasoning. Additionally, RT-2 possesses symbol-understanding and human-recognition skills, which Google thinks will make it adept as a general-purpose robot working in a human-centric environment.
More details on what robots can and can’t do with RT-2 are available in a paper DeepMind and Google put online.
RT-2 also draws from work on vision-language models (VLMs), which have been used to caption images, recognize objects in a frame, or answer questions about a given picture. So, unlike SayCan, this model can actually see the world around it. But for a VLM to control a robot, a component for outputting actions has to be added. This is done by representing the different actions the robot can perform as tokens in the model. With this, the model can not only predict what the answer to someone’s query might be, but also generate the action most likely associated with it.
DeepMind notes that, for example, if a person says they’re tired and wants a drink, the robot could decide to get them an energy drink.
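The actions-as-tokens idea described above can be sketched in a few lines. The following is an illustrative example, not Google’s code: it shows how a continuous robot action (say, an arm-movement delta) can be mapped to discrete token IDs by uniform binning, so a language model can emit actions the same way it emits words. The bin count and value ranges here are assumptions chosen for the example.

```python
# Illustrative sketch of representing robot actions as discrete tokens.
# NUM_BINS and the action ranges are assumptions, not RT-2's actual config.
import numpy as np

NUM_BINS = 256  # assumed per-dimension resolution

def action_to_tokens(action, low, high, num_bins=NUM_BINS):
    """Map a continuous action vector to integer token IDs,
    binning each dimension uniformly between low and high."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)      # scale to [0, 1]
    return (scaled * (num_bins - 1)).round().astype(int)

def tokens_to_action(tokens, low, high, num_bins=NUM_BINS):
    """Invert the binning: recover approximate continuous values."""
    return low + (tokens / (num_bins - 1)) * (high - low)

# Example: a 3-D positional delta, each dimension bounded to [-0.1, 0.1].
low, high = np.full(3, -0.1), np.full(3, 0.1)
tokens = action_to_tokens(np.array([0.05, -0.02, 0.0]), low, high)
recovered = tokens_to_action(tokens, low, high)
```

Because the tokens are just integers, they can sit in the model’s output vocabulary alongside word tokens, which is what lets a single model answer a question or move an arm.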