AI-powered headphones can tune into a single voice in a crowd

Noise cancelling tech just got more targeted.

By Mack DeGeurin

Posted on May 24, 2024

A University of Washington team has developed an artificial intelligence system that lets a user wearing headphones look at a person speaking for three to five seconds and then hear just the enrolled speaker’s voice in real time, even as the listener moves around in noisy places and no longer faces the speaker. Pictured is a prototype of the headphone system: binaural microphones attached to off-the-shelf noise canceling headphones. CREDIT: Kiyomi Taguchi/University of Washington

Active Noise Canceling (ANC) technology embedded in leading brand name headphones and earbuds has made the world quieter. With a simple flip of a switch, coffee shop patrons and air travelers can press mute on cacophonous background noises and otherwise distracting chatter. But what happens when you do want to hear one person speak amid an otherwise silenced crowd? Currently, headphone users have to make a choice: continue muting their entire soundscape or switch off noise canceling to hold a conversation.

That choice between noise canceling and conversation could one day be a thing of the past thanks to a new AI-enabled “Target Speech Hearing” system devised by researchers from the University of Washington. In a recent paper published by the Association for Computing Machinery, the researchers claim their custom-made, proof-of-concept headphones can pick out a specific voice from a crowd and then lock on to that voice while simultaneously canceling out surrounding sounds. Headphone wearers simply gaze directly at the intended speaker’s face and let AI systems capture sound signals carrying that speaker’s unique speech traits. The end result: a reality where headphone wearers can hold sustained conversations with an individual while noise canceling stays switched on. Researchers believe this system could one day help people with partial hearing loss or simply make conversing in noisy areas a little less chaotic.

“In this project, we develop AI to modify the auditory perception of anyone wearing headphones, given their preferences,” University of Washington Paul G. Allen School of Computer Science & Engineering Professor and senior author Shyam Gollakota said in a statement. “With our devices you can now hear a single speaker clearly even if you are in a noisy environment with lots of other people talking.”

How the modified headphones capture an individual’s voice

To build out their system, the researchers took a pair of off-the-shelf commercial headphones and equipped them with microphones and an onboard AI neural network. In practice, a person wearing the headphones who wants to single out a speaker simply needs to look directly at them while pressing down on a button on the side of the device. That button initiates a process called “enrollment,” in which the headphones take in sound signals emanating from the targeted speaker. That signal is centered between the microphones located on the left and right sides of the headphones. The neural network analyzes that signal in real time to identify specific speech traits connected to that particular person. That data is then sent to another neural network, which is tasked with continuously separating the targeted speaker’s signal from everything else.
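To make the idea of a signal being “centered between” the two microphones concrete, here is a minimal sketch in Python. It is not the researchers’ method, which relies on neural networks; it swaps in a classical technique, cross-correlation, to estimate how many samples later a sound reaches one microphone than the other. A sound from straight ahead should show a delay near zero. The function names, the synthetic pulse, and the one-sample tolerance are all illustrative assumptions.

```python
import math

def make_pulse(n=100, center=50, width=10.0):
    """Synthetic stand-in for a short burst of speech (illustrative only)."""
    return [math.sin(0.3 * i) * math.exp(-((i - center) / width) ** 2)
            for i in range(n)]

def delay(signal, samples):
    """Shift a signal later in time by `samples`, padding with silence."""
    return [0.0] * samples + signal[: len(signal) - samples]

def estimate_lag(left, right, max_lag):
    """Return the lag (in samples) at which `right` best matches `left`.

    A positive result means the sound reached the right microphone later.
    """
    best_lag, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation score for this candidate lag.
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def is_centered(left, right, max_lag=8, tolerance=1):
    """Hypothetical check: treat a near-zero delay as 'straight ahead'."""
    return abs(estimate_lag(left, right, max_lag)) <= tolerance

voice = make_pulse()
print(is_centered(voice, voice))            # same arrival time at both ears
print(is_centered(voice, delay(voice, 5)))  # arrives 5 samples later on the right
```

In this toy setup the first check passes and the second fails, mirroring the intuition that only a speaker the wearer is facing reaches both microphones at once.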

The entire enrollment process only takes around three to five seconds. Once enrolled and focused on a targeted speaker, the system will actually improve over time as it continually receives more real-time training data. Headphone wearers don’t need to awkwardly remain stationary, staring into a person’s eyes, for the system to work either. After that brief initial enrollment process, researchers say the AI headphone system is able to “latch on” to the voice signal and continue tracking it even after the wearer turns their head. That means a person wearing the headphones can hear the isolated voice even if they are no longer face to face with the speaker.

“The advantage of our approach is that the wearer only needs to look at the target speaker for a few seconds during which we enroll the target speaker,” the researchers write. “Subsequently, the wearer can look in any direction, move their head, or walk around while still hearing the target speaker.”

In the video above, University of Washington PhD candidate Malek Itani demonstrates using the headphones to lock on to a colleague in a campus common space peppered with other people. After Itani looks at his colleague for several seconds, the speaker’s somewhat muted voice breaks through the noise canceling fog and is heard clearly. The pair repeated the test outside, this time in front of a noisy fountain, with similar results. Once the speaker was enrolled, the headphone wearer turned away from him and could continue hearing him as they strolled through the university’s campus.

“The headphone system used AI technology to extract the voice Malek wants to hear while ignoring all sounds in the environment from that point on,” University of Washington PhD student and paper co-author Bandhav Veluri said.

Target Speech Hearing could prove useful for both convenience and accessibility

Previously, a system like this would have tried to first capture clean, noise-free audio from a speaker and use that to help the system identify the speaker’s characteristics. Here, researchers took a different approach and opted to build a system that could quickly capture signals from a speaker even when they are surrounded by a noisy environment. The results were significant. Researchers claim their system achieves a signal clarity improvement of 7.01 dB using less than five seconds of training data. On a more human level, the researchers had 21 test subjects spend around 420 minutes rating signal clarity while using the modified headphones in real-world outdoor and indoor environments. These test subjects, on average, rated the quality of the target speaker’s voice nearly twice as high while using the system compared to without it.
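For a rough sense of what a 7.01 dB improvement means, the short Python sketch below converts decibels into plain ratios. It assumes the figure behaves like an ordinary decibel ratio, which the article does not spell out, so treat the numbers as a back-of-the-envelope reading and not as the paper’s exact metric. The function names are illustrative.

```python
def db_to_power_ratio(db):
    """Convert a decibel gain to a ratio of signal powers."""
    return 10 ** (db / 10)

def db_to_amplitude_ratio(db):
    """Convert a decibel gain to a ratio of signal amplitudes."""
    return 10 ** (db / 20)

gain_db = 7.01  # improvement reported in the article
print(round(db_to_power_ratio(gain_db), 2))      # about 5.02
print(round(db_to_amplitude_ratio(gain_db), 2))  # about 2.24
```

Under that assumption, the target voice ends up with roughly five times the power, relative to competing sound, that it had before processing.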

This system isn’t perfect. For now, the enrollment process only works if the target speaker is the loudest voice in the room. Still, researchers are optimistic they can modify future systems to address that shortcoming. Travelers could one day use these headphones to focus in on a tour guide while blocking out background conversation in a busy museum. A pair of friends taking a stroll along a busy city street could similarly use the technology to carry on a conversation free from potentially disruptive traffic noise. Looking forward, the researchers say they are exploring the possibility of embedding this new system in brand name headphones and earbuds. One day, they hope the system could be included as an accessibility feature in hearing aids.

“AI, and especially neural networks, have made great strides in speech processing,” Veluri said. “That application is really exciting and especially useful for people with hearing challenges where they want to amplify the voice of the person they want to hear.”

“This could be big, this could impact lots of people,” Itani added.