Audio engineering is making call center robots more ‘human’ and less annoying

There's more to it than smarter A.I.

By Sonia Weiser | Published Feb 22, 2017 6:30 PM EST


Say you’re on the phone with a company and the automated virtual assistant needs a few seconds to “look up” your information. And then you hear it. The sound is unmistakable. It’s familiar. It’s the clickity-clack of a keyboard. You know it’s just a sound effect, but unlike hold music or a stream of company information, it’s not annoying. In fact, it’s kind of comforting.

Michael Norton and Ryan Buell of the Harvard Business School studied this idea—that customers appreciate knowing that work is being done on their behalf, even when the only “person” “working” is an algorithm. They call it the labor illusion.

Now that interactive voice response (IVR) systems are becoming the norm for customer support lines and can handle increasingly complex transactions, callers expect service that is as good as, if not better than, what they once received from human operators. At the same time, customers still want the benefits of a live interaction, namely personality. Even when we know there’s not a real person on the line, we want to feel “heard” and trust that we’re getting the best results possible.

“Even though technically you shouldn’t really care whether the website shows you its work or not, it really resonates with us,” explains Norton. “It makes us humanize the website, makes us feel like work is being done for us, and then it makes us actually like the product or service more.”

Sound design

A good IVR system gives customers a virtual assistant that clearly responds to their needs and keeps them informed throughout the process, through both verbal and non-verbal audio cues.

So what goes into designing a successful IVR system?

To put it succinctly: a lot. Like, more than you probably thought.

Because it’s not just about making a working navigation system that gets the job done efficiently. It’s also about composing audio, finding voice talent that reflects the brand, and creating an experience that mimics a natural human-to-human interaction.

Take Delta Air Lines as an example. In 2013, in partnership with Nuance Communications, the company launched its custom IVR system.

The process of designing the IVR started with identifying Delta’s “key brand attributes”—in this case, “optimism, determination, leadership, innovation, and passion,” says Gorm Amand, the Director and Global Discipline Leader/User Interface Design at Nuance. “What we wanted to do was come up with audio [for non-verbal cues] that reflected and promoted those attributes” while people navigate the customer service hotline.

How exactly do you take the word “determination” and turn it into song?

That responsibility fell to Nuance Senior Audio Engineer Dan Castellani. He started by studying the music and sounds Delta has used in its advertising campaigns, in order to understand “what Delta wanted in their brand from a musical standpoint.” From there, Castellani sat down at the piano and composed around 30 different iterations of possible filler sound, eventually narrowing them down to the four or five that best aligned with the pre-existing materials and were the least intrusive for customers. The final result is a trance-like sound they call percolation, somewhere between a piece of music and a basic sound effect.

“It’s really analogous to the process of selecting a voice talent for one of these systems,” says Amand. “The voice embodies the system and automatically conveys brand, and most of us draw conclusions [from it] very quickly.”

When you call Delta, a male voice answers. It’s in the tenor range, seems friendly enough, and inspires a sense of trust, for me anyway. This impression is in keeping with the results of a 2014 study from the University of Glasgow, Scotland, that looked into how different voices are perceived. Of two male voice samples, the higher-pitched sample was seen as less threatening. Think about it: Would you rather have a virtual assistant with Vin Diesel’s voice helping you plan your trip, or one with Paul Rudd’s? I’d personally rather give my credit card information to a Ruddbot.

Along with the pitch and gender association of the voice itself, brands need to consider the pacing. If the virtual assistant talks too fast, it sounds like it’s reciting canned responses. Too slow, and people get impatient. Many times, “the technology can go faster than a human could, but it’s often not the right thing for establishing trust,” explains Jane Price of Interactions, a Massachusetts-based speech recognition and virtual assistant technology company. Slowing it down a little “helps to put it on track with [a customer’s] normal expectations and how they would prefer to communicate.”

This brings us back to the automated typing sound. Michael Pell, Interactions’ director of design services, decided that the company’s signature filler audio would be keyboard clacking. They’ve licensed the audio to other companies, including Hyatt, Humana, and LifeLock.

“With the filler you want it to do two things,” explains Pell. “You want people to understand that you’re still there and that you’re doing something for them. In the computer age, you can say work is being done by typing. It has the immediate naturally understood connotation of ‘I’m doing something for you.’”
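For the curious, the basic mechanics Pell describes can be sketched in a few lines of Python: start the filler, do the slow work, and stop the filler the moment the answer is ready. This is only an illustrative sketch, not how Interactions or Nuance build their systems. The function names (`play_filler`, `look_up_account`, `handle_request`) and the timings are hypothetical, and the "audio" is just entries in a list standing in for a looping sound clip.

```python
import asyncio


async def play_filler(events, tick=0.01):
    """Hypothetical stand-in for looping a filler clip (e.g. keyboard clacking)."""
    try:
        while True:
            events.append("filler")       # pretend we played one chunk of audio
            await asyncio.sleep(tick)
    except asyncio.CancelledError:
        events.append("filler stopped")   # the lookup finished, so go quiet
        raise


async def look_up_account(delay=0.05):
    """Hypothetical slow backend call; the delay is an assumed value."""
    await asyncio.sleep(delay)
    return "account found"


async def handle_request(events):
    # Start the filler in the background, then do the real work.
    filler = asyncio.create_task(play_filler(events))
    try:
        result = await look_up_account()
    finally:
        # Stop the filler whether the lookup succeeded or failed.
        filler.cancel()
        try:
            await filler
        except asyncio.CancelledError:
            pass
    events.append(result)
    return result


def run_demo():
    events = []
    result = asyncio.run(handle_request(events))
    return result, events
```

Calling `run_demo()` returns the lookup result along with the sequence of events: a run of filler entries, a marker showing the filler stopped, and then the answer. A real system would swap the list for an audio stream, but the shape (filler runs only while work is pending) would likely stay the same.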

Filler forever?

A well-designed IVR really combines the best of two worlds. It’s a human interaction without a potential attitude problem. Robots can’t have bad days, and they’re never hangry. Plus, reducing the number of human call center assistants saves companies a substantial amount of money.

Here’s a question, though: what will happen to filler sounds when computers can handle complex, context-sensitive tasks in a second or less? Bill Byrne, the inventor of what he calls “fetch audio” for GOOG-411, Google’s first speech recognition effort, thinks filler may soon become a thing of the past. GOOG-411 and other early speech recognition programs required extra time to execute customer requests, so filler was a necessity. But already, as processes speed up, we’re hearing less of it.

Still, the teams at Nuance and Interactions have faith that filler will never become obsolete. Not because computers won’t get faster, but because innovation in the field of speech recognition will continue. Computers will be tasked with handling even more complex demands as the algorithms’ capabilities evolve, and will, once again, need additional time to do so.

Plus, as Norton says, “we really like other people doing what we tell them to do,” especially when it’s a no-fun, time-sucking, labor-intensive task. So being able to sit back and hear someone else do the typing will always be a gratifying experience. Even if it’s just a bot.
