This AI can harness sound to reveal the structure of unseen spaces

It's called a neural acoustic field model, and it can also consider what noises would sound like as you traveled through virtual reality.

Imagine you’re walking through a series of rooms, circling closer and closer to a sound source, whether it’s music playing from a speaker or a person talking. The noise you hear as you move through this maze will distort and fluctuate based on where you are. With a scenario like this in mind, a team of researchers from MIT and Carnegie Mellon University has been working on a model that can realistically depict how the sound around a listener changes as they move through a given space. They published their work on this subject in a new preprint paper last week.

The sounds we hear in the world can vary depending on factors like the kind of space the sound waves are bouncing around in, what material they’re hitting or passing through, and how far they need to travel. These characteristics can influence how sound scatters and decays. But researchers can reverse engineer this process as well. They can take a sound sample and use it to deduce what the environment is like (in some ways, it’s similar to how animals use echolocation to “see”).

“We’re mostly modeling the spatial acoustics, so the [focus is on] reverberations,” says Yilun Du, a graduate student at MIT and an author on the paper. “Maybe if you’re in a concert hall, there are a lot of reverberations, maybe if you’re in a cathedral, there are many echoes versus if you’re in a small room, there isn’t really any echo.”

Their model, called a neural acoustic field (NAF), is a neural network that can account for the position of both the sound source and listener, as well as the geometry of the space through which the sound has traveled. 
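
To make the idea of a network that is queried by position more concrete, here is a minimal sketch in Python. It is not the researchers' implementation: the class name `TinyAcousticField`, the 2D coordinates, the sine and cosine encoding, the layer sizes, and the random untrained weights are all assumptions chosen for illustration. It only shows the shape of the interface, in which a listener position, a source position, and a time and frequency go in and one predicted sound intensity value comes out.

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    """Map each coordinate to sines and cosines at several frequencies."""
    x = np.asarray(x, dtype=float)
    freqs = 2.0 ** np.arange(num_freqs)           # 1, 2, 4, 8
    angles = x[..., None] * freqs                 # shape (..., D, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)         # shape (..., D * 2 * num_freqs)

class TinyAcousticField:
    """Hypothetical stand-in for an acoustic field: an untrained two-layer MLP."""

    def __init__(self, hidden=32, num_freqs=4, seed=0):
        rng = np.random.default_rng(seed)
        # Six inputs: 2D listener, 2D source, time, frequency (all assumed).
        in_dim = 6 * 2 * num_freqs
        self.num_freqs = num_freqs
        self.w1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, listener_xy, source_xy, time, freq):
        # Pack the query into one vector, encode it, and run the MLP.
        query = np.concatenate([listener_xy, source_xy, [time, freq]])
        features = positional_encoding(query, self.num_freqs)
        hidden = np.maximum(0.0, features @ self.w1 + self.b1)  # ReLU
        return float((hidden @ self.w2 + self.b2)[0])

field = TinyAcousticField()
# Query the same source from two listener positions.
near = field([1.0, 1.0], [0.0, 0.0], time=0.1, freq=0.5)
far = field([4.0, 3.0], [0.0, 0.0], time=0.1, freq=0.5)
```

Because the weights here are random, the two outputs carry no acoustic meaning. In a trained model of this kind, fitting the weights to recorded audio is what would make the predictions track the room.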

To train the NAF, the researchers fed it visual information about the scene and a few spectrograms (visual representations that capture the amplitude, frequency, and duration of sounds) of audio recorded at different vantage points and positions, capturing what a listener there would hear.
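
For readers unfamiliar with spectrograms, the following sketch shows one common way to compute one, using a short-time Fourier transform in NumPy. The function name `spectrogram`, the frame length, the hop size, and the 1,000 Hz test tone are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Return magnitudes with shape (num_frames, frame_len // 2 + 1)."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)                 # taper each frame's edges
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    # One row per time step, one column per frequency bin.
    return np.abs(np.fft.rfft(frames, axis=1))

sample_rate = 8000                                 # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate           # one second of time stamps
tone = np.sin(2 * np.pi * 1000 * t)                # a pure 1,000 Hz tone

spec = spectrogram(tone)
peak_bin = int(spec.mean(axis=0).argmax())         # strongest frequency bin
peak_hz = peak_bin * sample_rate / 256             # convert bin index to Hz
```

Each row of `spec` describes a short slice of time and each column a band of frequencies, so the array holds the amplitude, frequency, and duration information mentioned above. For this test tone, the strongest bin corresponds to 1,000 Hz.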

“We have a sparse number of data points; from this we fit some type of model that can accurately synthesize how sound would sound like from any location position from the room, and what it would sound like from a new position,” Du says. “Once we fit this model, you can simulate all sorts of virtual walk-throughs.”

The team used audio data obtained from a virtually simulated room. “We also have some results on real scenes, but the issue is that gathering this data in the real world takes a lot of time,” Du notes. 

Using this data, the model can learn to predict how the sounds the listener hears would change if they moved to another position. For example, if music was coming from a speaker at the center of the room, this sound would get louder if the listener walked closer to it, and would become more muffled if the listener walked into another room. The NAF can also use this information to predict the structure of the world around the listener. 
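
The "louder as you walk closer" effect has a simple textbook baseline that helps show what the model must go beyond. For an idealized point source in open space with no reflections, sound pressure falls off in proportion to distance, so the level changes by about 6 decibels for every halving or doubling of distance. The sketch below computes that; the function name `level_change_db` is made up for illustration, and the free-field assumption ignores the walls, echoes, and muffling between rooms that the NAF is meant to capture.

```python
import math

def level_change_db(ref_distance_m, new_distance_m):
    """Change in sound pressure level for a point source in a free field.

    Positive values mean louder. Assumes pressure is proportional to 1/distance.
    """
    return 20.0 * math.log10(ref_distance_m / new_distance_m)

closer = level_change_db(4.0, 2.0)    # walking from 4 m to 2 m away
farther = level_change_db(2.0, 4.0)   # walking back out again
```

Halving the distance gives a gain of roughly 6 dB, and doubling it gives the same loss. Real rooms deviate from this rule because reflected sound adds to the direct sound.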

One big application of this type of model is in virtual reality, where sounds could be accurately generated for a listener moving through a space. The other big use Du sees is in artificial intelligence.

“We have a lot of models for vision. But perception isn’t just limited to vision, sound is also very important. We can also imagine this is an attempt to do perception using sound,” he says. 

Sound isn’t the only medium that researchers are playing around with using AI. Machine learning technology today can take 2D images and use them to generate a 3D model of an object, offering different perspectives and new views. This technique comes in handy especially in virtual reality settings, where engineers and artists have to architect realism into screen spaces. 

Additionally, models like this sound-focused one could enhance current sensors and devices in low light or underwater conditions. “Sound also allows you to see across corners. There’s a lot of variability depending on lighting conditions. Objects look very different,” Du says. “But sound kinda bounces the same most of the time. It’s a different sensory modality.”

For now, a main limitation to further development of their model is the lack of information. “One thing that was surprisingly difficult was actually getting data, because people haven’t explored this problem that much,” he says. “When you try to synthesize novel views in virtual reality, there’s tons of datasets, all these real images. With more datasets, it would be very interesting to explore more of these approaches especially in real scenes.”
