Exclusive: Inside Project Natal’s Brain

The artificial intelligence behind Microsoft's Xbox 360 motion-sensing game controller

Deep in Microsoft’s lairs, the Xbox 360 team is working on more than just a new video-game system. They’re actually trying to solve an incredibly difficult problem in artificial intelligence. Their prototype Project Natal lets you control a game just with your body movements—no buttons or Wii-like wands—by watching you with a 3-D video camera. Sounds simple enough, but most cameras just snap images without having any idea what they’re looking at. To make Natal work, Microsoft has to teach its camera to understand what it sees.

Here at CES, Microsoft announced last night that Natal will go on sale “by the holidays.” Before the show, we were given an exclusive look at the smarts that make Natal tick.

For a closer look at how Natal learns, launch the gallery

The Brain

The part of Natal that players see looks like a webcam. (Microsoft’s not divulging details about this hardware yet, presumably because the release is many months in the future, but we do know that it measures relative distances using a black-and-white camera sensor and a near-infrared beam.) But it’s the software inside, which Microsoft casually refers to as “the brain,” that makes sense of the images captured by the camera. It’s been programmed to analyze images, look for a basic human form, and identify about 30 essential parts, such as your head, torso, hips, knees, elbows, and thighs.

In programming this brain, a process that’s still going on, Microsoft relies on an advancing field of artificial intelligence called machine learning. The premise is this: Feed the computer enough data (in this case, millions of images of people) and it can learn for itself how to understand that data. That saves programmers the near-impossible task of coding rules that describe all the zillions of possible movements a body can make.

The process is a lot like a parent pointing to many different people’s hands and saying “hand,” until a baby gradually figures out what hands look like, how they can move, and that, for instance, they don’t vanish into thin air when they’re momentarily out of sight.

How To Teach A Machine To See

Microsoft is currently training and improving the version of the brain that will ultimately go into the final product. How? By painstakingly gathering pictures of people in many different poses, and then running all this data through huge clusters of computers (as shown in the gallery) where the learning brain resides.

The process of gathering the data actually requires a lot of manual labor. First, reps went into homes around the world and recorded people moving in front of a specially built rig. The images captured show real people moving the way any ordinary person would. But those recordings can’t tell the computer anything useful about joints and limbs on their own, so programmers dive into the raw data and hand-code it to label each body part (at each frame!).

Microsoft also uses professionally staged motion-capture scenes, which provide similar data but without all the manual labor of coding by hand (since the systems use sensors that mark individual body parts). And Microsoft has a mini mo-cap studio of its own, where staff can make a quick recording when a new chunk of data is needed.

All of these marked-up images comprise tens of terabytes of information. Microsoft’s computer farms sift through this huge data set, letting the brain come up with probabilities and statistics about the human form. Once the brain is done learning, it and its stats get packaged into the Natal system. An early version is now making the rounds of trade shows, and later, more-accurate versions will eventually show up in your living room. Next, read about how it applies its hard-earned knowledge to decipher your game-playing moves.
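To make "probabilities and statistics" concrete, here is a minimal sketch of learning from hand-labeled examples by simple counting. Microsoft has not said which learning algorithm Natal uses, so the counting approach, the function name `learn_part_stats`, the coarse "height band" feature, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def learn_part_stats(labeled_samples):
    """Estimate P(body part | feature) by counting hand-labeled examples."""
    counts = defaultdict(Counter)
    for feature, part in labeled_samples:
        counts[feature][part] += 1
    stats = {}
    for feature, part_counts in counts.items():
        total = sum(part_counts.values())
        stats[feature] = {part: n / total for part, n in part_counts.items()}
    return stats

# Hypothetical training data: (coarse height band of a pixel, labeled part).
samples = [
    ("top", "head"), ("top", "head"), ("top", "hand"),
    ("middle", "torso"), ("middle", "torso"),
    ("middle", "elbow"), ("middle", "hand"),
    ("bottom", "knee"), ("bottom", "knee"),
]
stats = learn_part_stats(samples)
print(stats["middle"])
```

The more labeled examples go in, the better these estimates reflect how bodies really look, which is why the size of the data set matters so much.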

Inside Natal’s Thought Process

What’s the brain thinking as it watches you jump around, swinging imaginary bats or head-butting imaginary soccer balls? The above screenshot shows what’s going on in its head: the different images represent different stages of Natal’s computational process. Here’s the step-by-step:

Step 1: As you stand in front of the camera, it judges the distance to different points on your body. In the image on the far left, the dots show what it sees, a so-called “point cloud” representing a 3-D surface; a skeleton drawn there is simply a rudimentary guess. (The image on the top shows the image perceived by the color camera, which can be used like a webcam.)
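For readers curious how a grid of distances becomes a point cloud, the sketch below uses the standard pinhole-camera model. Natal's actual optics and processing are undisclosed, so the function name and the focal-length and image-center parameters (`fx`, `fy`, `cx`, `cy`) are hypothetical values chosen only to show the geometry.

```python
def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (rows of distances in meters) to 3-D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # treat non-positive depth as "no reading"
                continue
            x = (u - cx) * z / fx  # horizontal offset from the optical axis
            y = (v - cy) * z / fy  # vertical offset from the optical axis
            points.append((x, y, z))
    return points

# Tiny made-up depth image: one missing reading, one pixel closer than the rest.
depth = [
    [0.0, 2.0, 2.0],
    [2.0, 1.5, 2.0],
]
cloud = depth_to_point_cloud(depth, fx=2.0, fy=2.0, cx=1.0, cy=0.5)
print(len(cloud), "points")
```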

Step 2: Then the brain guesses which parts of your body are which. It does this based on all of its experience with body poses, the training described above. Depending on how similar your pose is to things it’s seen before, Natal can be more or less confident of its guesses. In the color-coded person above [bottom center], the darkness, lightness, and size of different squares represent how certain Natal is that it knows what body part that area belongs to. (For example, the three large red squares indicate that it’s highly probable that those parts are “left shoulder,” “left elbow” and “left knee”; as the squares become smaller and muddier in color, such as the grayish ones around the hands, that’s an indication that Natal is hedging its bets and isn’t very sure which body part it is looking at.)
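The idea of a confident guess versus a hedged one can be sketched in a few lines. The probabilities, region names, and the 0.6 confidence cutoff below are invented for illustration and do not come from Microsoft.

```python
def most_likely_part(part_probs, min_confidence=0.6):
    """Return (part, confidence, is_confident) for one image region."""
    part, confidence = max(part_probs.items(), key=lambda kv: kv[1])
    return part, confidence, confidence >= min_confidence

# Hypothetical per-region probabilities over candidate body parts.
regions = {
    "region_a": {"left shoulder": 0.90, "left elbow": 0.07, "left hand": 0.03},
    "region_b": {"left hand": 0.40, "right hand": 0.35, "left elbow": 0.25},
}
for name, probs in regions.items():
    print(name, most_likely_part(probs))
```

Here `region_a` would be drawn as a large, saturated square, while `region_b` is the kind of muddy, low-confidence area the article describes around the hands.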

Step 3: Then, based on the probabilities assigned to different areas, Natal comes up with all possible skeletons that could fit with those body parts. (This step isn’t shown in the image above, but it looks similar to the stick-figure drawn on the left, except there are dozens of possible skeletons overlaid on each other.) It ultimately settles on the most probable one. Its reasoning here is partly based on its experience, and partly on more formal kinematics models that programmers added in.
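One way to picture how experience and kinematics combine is a scoring function: reward skeletons whose joints sit in high-confidence areas, and penalize ones with anatomically implausible bone lengths. This is a simplified sketch, not Natal's method; the bone lengths, the penalty weight, and the two candidate skeletons are assumptions.

```python
import math

# Hypothetical expected bone lengths in meters (a stand-in "kinematics model").
BONE_LENGTHS = {("shoulder", "elbow"): 0.30, ("elbow", "hand"): 0.25}

def score(candidate, penalty_weight=10.0):
    """Higher is better: joint log-confidences minus bone-length violations."""
    total = sum(math.log(conf) for _, conf in candidate.values())
    for (a, b), expected in BONE_LENGTHS.items():
        actual = math.dist(candidate[a][0], candidate[b][0])
        total -= penalty_weight * abs(actual - expected)
    return total

def most_probable_skeleton(candidates):
    return max(candidates, key=score)

# Each joint maps to ((x, y, z) position, confidence from the labeling step).
candidates = [
    # Plausible arm: both bones match the expected lengths.
    {"shoulder": ((0.0, 1.4, 2.0), 0.9),
     "elbow": ((0.0, 1.1, 2.0), 0.8),
     "hand": ((0.25, 1.1, 2.0), 0.5)},
    # Slightly more confident hand, but an impossibly long forearm.
    {"shoulder": ((0.0, 1.4, 2.0), 0.9),
     "elbow": ((0.0, 1.1, 2.0), 0.8),
     "hand": ((0.9, 1.1, 2.0), 0.6)},
]
best = most_probable_skeleton(candidates)
print(best["hand"])
```

The second candidate loses despite its higher hand confidence, because the kinematic penalty outweighs the small gain in probability.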

Step 4: Once Natal has determined it has enough certainty about enough body parts to pick the most probable skeletal structure, it outputs that shape to a simplified 3-D avatar [image at right]. That’s the final skeleton that will be skinned with clothes, hair, and other features and shown in the game.

Step 5: Then it does this all over again—30 times a second! As you move, the brain generates all possible skeletal structures at each frame, eventually deciding on, and outputting, the one that is most probable. This thought process takes just a few milliseconds, so there’s plenty of time for the Xbox to take the info and use it to control the game.
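The arithmetic behind "plenty of time" is easy to check. At 30 frames per second each frame gets about 33 milliseconds; the 3 ms tracking cost below is an assumed stand-in for "a few milliseconds," not a figure from Microsoft.

```python
FRAMES_PER_SECOND = 30
frame_budget_ms = 1000 / FRAMES_PER_SECOND  # time available per frame

tracking_ms = 3.0  # assumed stand-in for "a few milliseconds"
remaining_ms = frame_budget_ms - tracking_ms

print(f"{frame_budget_ms:.1f} ms per frame, {remaining_ms:.1f} ms left for the game")
```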

(If you want to get into more details on the science, check out the machine-learning papers of Microsoft researcher Andrew Blake, on whose work Natal is partly based.)