Ask someone to chart the progression of artificial intelligence (AI) models over the past few decades and you’ll likely hear some reference to how good they are at playing games. IBM shocked the world in 1997 when its Deep Blue computer vanquished chess grandmaster Garry Kasparov at his own game. Nearly two decades later, Google DeepMind’s AlphaGo model trounced a human champion at the game of Go, a feat some thought impossible at the time.
Since then, increasingly data-rich AI models have graduated from board games to video games. Various models have used a training method called reinforcement learning—a technique that also plays a key role in training AI chatbots like ChatGPT—to teach machines how to learn and outperform humans at a range of Atari games. More recently, reinforcement learning has taught machines how to master incredibly complex strategy games, including Dota 2 and StarCraft II.
But there’s one area of gaming—at least for now—where computers still can’t hold a candle to flesh-and-blood humans: they remain poor at quickly learning unfamiliar, more open-ended games. When it comes to picking up a random title from a game store that they’ve never seen before and getting the gist, human gamers still learn the ropes much more quickly than even the most advanced AI models.
That’s the key argument made in a recent paper authored by New York University computer science professor Julian Togelius and his colleagues. They note this distinction isn’t just a pat on the back for Homo sapiens. It may also shed light on a key element of what makes human intelligence so unique and why AI still has a long way to go before it can truly claim human-level intelligence—let alone surpass it.
“If you pit an LLM [large language model] against a game it has not seen before, the result is almost certain failure,” the authors write.
AI has been hooked on games from the beginning
Games have been useful testbeds for AI models for decades because they typically have predictable rules, defined goals, and varying mechanics. Those basic tenets track particularly well for reinforcement learning, where a model plays a game in simulation over and over again—sometimes millions of times—using trial and error to gradually improve until it reaches proficiency. This, in a basic sense, was how DeepMind was able to master Atari games in 2015. That same logic influences today’s popular large language models, albeit with the entire internet serving as training data.
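The trial-and-error loop described above can be sketched in a few lines of code. The example below is not DeepMind's Atari system (which used deep neural networks); it is a minimal tabular Q-learning agent on a made-up toy game—walk from cell 0 to cell 4—chosen purely to illustrate how repeated play gradually improves a model's value estimates:

```python
import random

random.seed(0)

N_STATES = 5          # cells 0..4; cell 4 is the goal
ACTIONS = [-1, +1]    # move left or right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

# Value estimates for every (state, action) pair, all zero at first.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Apply an action; reward 1.0 only on reaching the goal cell."""
    nxt = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

# Play the game over and over, nudging estimates toward what works.
for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # Explore occasionally; otherwise exploit the best-known move.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward = step(state, action)
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        # Q-learning update: shift the estimate toward the observed return.
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = nxt

# After many plays, the greedy policy heads straight for the goal.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

After training, the agent moves right from every cell—but only for this exact game. Change the layout or the reward and, as the article's authors note, the learned table is useless, which is precisely the generalization problem discussed below.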
And yet, that method runs into problems when asked to generalize. AI models crush humans at board games and certain video games because the constraints are clear and the goals are relatively straightforward. At the end of the day, Togelius and his colleagues argue that those models, impressive as they may seem, are still getting exceptionally good at a very specific task—and not much more. Even small variations to a game’s overall design can cause the whole thing to break down. A model might be superhuman when playing a specific game, but prove pretty incompetent when asked to improvise.
That distinction becomes even clearer considering the broader trend in modern gaming toward more open-ended and abstract titles. Take chess versus a high-budget, open-world third-person adventure game like the western “Red Dead Redemption.” While both are games in the basic sense, what it means to succeed or win in each is wildly different. “Red Dead Redemption” has many missions with clearly defined resolutions—shoot the bad guy, steal the horse. However, the overarching goal of the game is far less straightforward. What does it mean to win when the central drive is to embody a morally troubled Western outlaw?
Human gamers can intuit that; machines, not so much. Even in simpler games like “Minecraft,” the researchers note, an AI model may know to jump from one block to another while having absolutely no concept of what it actually means to jump.
“In sum, all well-designed games are expertly tailored to human capabilities, intuition, and common sense,” the authors write.
Lived experience appears to be our greatest advantage when playing against machines. The average gamer downloading a new release may not have been scrupulously trained by an office full of well-paid, Patagonia-clad engineers, but they do have years of interacting with and understanding objects and more abstract concepts that they will then encounter in the game. The authors note that human babies learn to recognize and identify individual objects somewhere around 18 to 24 months, simply by existing in the world. Machines need more hand-holding.
All of this translates to humans learning new games faster. Past studies show that a game-playing AI model using curiosity-based reinforcement learning may require four million keyboard interactions to finish a game. That translates to around 37 hours of continuous play. The average human gamer, by contrast, will usually figure out even totally new mechanics in under 10 hours.
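As a quick sanity check on those figures, four million interactions spread over 37 hours works out to roughly 30 inputs per second—about one input per frame at a typical game frame rate (the per-frame interpretation is an assumption, not something the studies state):

```python
# Check the arithmetic above: 4 million interactions over 37 hours
# implies roughly 30 interactions per second of continuous play.
interactions = 4_000_000
hours = 37
rate = interactions / (hours * 3600)  # interactions per second
print(round(rate))
```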
That said, game-playing AI is definitely still improving, even in more general settings. Just last year, Google DeepMind unveiled a model called SIMA 2, which the company describes as a significant step forward in AI learning to play 3D games in ways more similar to humans, including games it wasn’t specifically trained on. The key breakthrough involved taking an existing model and integrating reasoning capabilities from Google’s Gemini large language model. That combination helped it better understand and interact with new environments.
Togelius and his colleagues say those models still have real ground to cover before they can be considered on par with a human gamer. Their proposed benchmark involves taking a model and having it play and win the top 100 games on Steam or the iOS App Store, without having been previously trained on any of them—and doing so in roughly the same time it would take a human. That’s a tall order.
“General video game playing, in the sense of being able to play any game of the top 100 on Steam or iOS App Store after only the same amount of playing time that a human would need, is a very hard challenge that we are nowhere near solving and not even seriously attempting,” the authors write. “It is not at all clear that current methods and models are suited to this problem.”
Beating that challenge isn’t just of interest to the gaming world. Togelius argues that a machine capable of generalizing in that way would likely need to excel at true creativity, forward planning, and abstract thinking, all qualities that feel far more distinctly human than what current AI models possess.
In other words, the true test of how well AI can achieve “human-level intelligence” might not come from generating deepfakes or writing trite novels, but from playing a whole lot of games.