An ‘acoustic fingerprint’ should keep Alexa from waking during Amazon’s Super Bowl ad
The same technique helps when Stephen Colbert gets mischievous.
If you tune into the Super Bowl this weekend to consume some football, music, and advertisements, you’ll see a too-clever-by-half commercial for Alexa, Amazon’s voice assistant. When the celebrities and actors in the ad say the word “Alexa,” it shouldn’t trigger any Echo devices you have in your home. Here’s why.
Devices like Amazon Echo Dots, Google Home speakers, and Apple’s HomePod listen for wake words: “Alexa,” “Hey, Google,” or “Hey, Siri.” Ideally, they should only wake up when they hear those words or phrases spoken by someone in your house who actually wants to use the voice assistant to do something, like check the weather. The systems need to avoid false positives.
In the case of Amazon, for the Super Bowl ad (and other moments on television when people say “Alexa”) the company uses a strategy called “acoustic fingerprinting” to try to keep your device from triggering. With an ad that the company produced, creating the fingerprint and programming the Alexa system to ignore those instances can happen ahead of time. “When we have audio samples in advance — as we do with the Super Bowl ad — we fingerprint the entire sample and store the result,” Mike Rodehorst, a machine learning scientist with Amazon, said in a blog item. Amazon can then put that information, and fingerprints from other commercials, on the Echo devices themselves, and not in the cloud, so hopefully your device doesn’t wake up at all.
In general, an audio fingerprint is “a connected sequence,” says Alex Rudnicky, a research professor emeritus and expert in the field of speech processing at Carnegie Mellon University. “Sounds develop over time,” he says; that fact is a key aspect of what makes up the identity of a sound. Think about someone slowly saying the word “Alexa,” and imagine the variation of their voice as they say it. An acoustic fingerprint is thus a sequence of slices that overlap with each other and may begin every 10 milliseconds, he says. (Amazon has a more technical explanation of their approach in the fourth paragraph of their blog item.)
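To make the idea of overlapping slices concrete, here is a minimal Python sketch. It is an illustration only, not Amazon’s method: the 10-millisecond spacing comes from Rudnicky’s description, while the 25-millisecond slice length, the 16 kHz sample rate, and the function name slice_audio are assumptions made for the example. A real fingerprint would go on to summarize each slice in some compact form, a step this sketch leaves out.

```python
def slice_audio(samples, sample_rate_hz, slice_ms=25, hop_ms=10):
    """Cut audio into overlapping slices that begin every hop_ms milliseconds.

    slice_ms (the length of each slice) is an assumed value for illustration;
    only the 10 ms spacing between slice starts comes from the article.
    """
    slice_len = sample_rate_hz * slice_ms // 1000  # samples per slice
    hop_len = sample_rate_hz * hop_ms // 1000      # samples between slice starts
    slices = []
    start = 0
    # Keep only full-length slices; a trailing partial slice is dropped.
    while start + slice_len <= len(samples):
        slices.append(samples[start:start + slice_len])
        start += hop_len
    return slices


# One second of placeholder audio at an assumed 16 kHz sample rate.
one_second = [0.0] * 16000
slices = slice_audio(one_second, 16000)
```

Because each slice is longer than the gap between slice starts, neighboring slices share most of their samples, which is what lets the sequence capture how a sound develops over time.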
Rodehorst, of Amazon, said that when they’re processing information like this in the cloud from commercials they know about, and trying to avoid those false positives, they can also use “the audio that follows the wake word,” meaning that they have more data to work with.
Instructing Amazon devices to ignore a specific acoustic fingerprint from a commercial that the company itself made is likely more straightforward than dealing with a character on TV using the word “Alexa” in an organic, unexpected way.
In those cases, in the cloud, the company can take advantage of the fact that many devices would all hear the same “Alexa” simultaneously. For example, in late January, Stephen Colbert said in a “Midnight Confessions” bit, “Alexa, buy 20 bundles of Bounty paper towels, overnight shipping!” In instances like that, an “Alexa” hitting multiple devices helps the company (hopefully) realize what’s happening and prevent Alexa from actually ordering those paper towels. It can store that information to prevent an Echo device from waking up when the same bit is replayed later; I tried playing the same Colbert moment out loud, and my Echo Dot woke up briefly upon hearing the wake word and then turned off.
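The logic of spotting one “Alexa” landing on many devices at once can be sketched in a few lines of Python. This is a hypothetical illustration, not Amazon’s system: the event format, the function name find_broadcast_fingerprints, the two-second window, and the three-device threshold are all assumptions chosen to keep the example small.

```python
from collections import defaultdict


def find_broadcast_fingerprints(events, window_seconds=2.0, min_devices=3):
    """Return fingerprints heard by at least min_devices within window_seconds.

    events is a list of (device_id, fingerprint, timestamp_seconds) tuples.
    The window and threshold are made-up values for illustration.
    """
    hits_by_fingerprint = defaultdict(list)
    for device_id, fingerprint, timestamp in events:
        hits_by_fingerprint[fingerprint].append((timestamp, device_id))

    flagged = set()
    for fingerprint, hits in hits_by_fingerprint.items():
        hits.sort()  # order by timestamp
        start = 0
        for end in range(len(hits)):
            # Shrink the window until it spans at most window_seconds.
            while hits[end][0] - hits[start][0] > window_seconds:
                start += 1
            devices = {device for _, device in hits[start:end + 1]}
            if len(devices) >= min_devices:
                flagged.add(fingerprint)
                break
    return flagged


# Hypothetical wake events: three homes hear the same clip within a second.
events = [
    ("echo-a", "colbert-clip", 100.0),
    ("echo-b", "colbert-clip", 100.4),
    ("echo-c", "colbert-clip", 100.9),
    ("echo-d", "kitchen-request", 101.0),
]
flagged = find_broadcast_fingerprints(events)
```

A fingerprint flagged this way could then be stored, so that a later replay of the same clip is recognized even when only one device hears it.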
Amazon also has said it can use other strategies to avoid an “Alexa” coming from your television waking up your device. For example, since your TV doesn’t move around the room, but you might be in motion, it can take into account the timing of the audio hitting various microphones on your device. “Sound will of course reach closer microphones sooner than it does more distant microphones, so arrival-time differential indicates the distance and direction of the sound source,” two other Amazon scientists wrote in a blog item from last year.
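The arrival-time idea follows from simple geometry. As a simplified sketch, assume just two microphones and a sound source far enough away that its sound arrives as a flat wavefront; the angle of arrival θ then satisfies sin θ = c·Δt / d, where c is the speed of sound, Δt is the difference in arrival times, and d is the spacing between the microphones. The Python below works through that formula. The function name and the 10-centimeter spacing are assumptions for illustration, and the sketch recovers direction only, not distance; it does not describe Amazon’s implementation.

```python
import math

SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value in room-temperature air


def arrival_angle_degrees(delay_s, mic_spacing_m):
    """Estimate the direction of a distant sound from a two-microphone delay.

    Returns the angle, in degrees, between the sound's direction and the line
    perpendicular to the microphone pair (0 means straight ahead).
    """
    ratio = SPEED_OF_SOUND_M_PER_S * delay_s / mic_spacing_m
    # Measurement noise can push the ratio slightly outside asin's domain.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


# Hypothetical example: microphones 10 cm apart, sound arriving 0.1 ms apart.
angle = arrival_angle_degrees(delay_s=0.0001, mic_spacing_m=0.10)
```

A television that stays put would produce roughly the same angle every time, while a person walking around the room would not.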
Amazon, Carnegie Mellon’s Rudnicky comments, is “figuring out how not to screw up, and I like that.”
Amazon isn’t the only company that makes a voice assistant that could be spoofed by media coming from your television or computer; however, neither Apple nor Google would comment on their approach to this issue.