A Speech Recognition System Has Reached Human Parity

Microsoft made speech-to-text software that's as good as a professional transcriptionist

By Samantha Cole | Published Oct 19, 2016 8:07 PM EDT

Speech recognition software isn’t perfect, but it is a little closer to human this week, as a Microsoft Artificial Intelligence and Research team reached a major milestone in speech-to-text development: The system reached a historically low word error rate of 5.9 percent, on par with the error rate of a professional (human) transcriptionist. The system can discern words as clearly and accurately as two people having a conversation might understand one another.
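For readers curious about what a word error rate measures: it is conventionally the number of word substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below illustrates that standard definition with a word-level edit distance. It is not Microsoft's evaluation code, and the function name and sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one wrong word out of four.
print(word_error_rate("we reached human parity", "we reached human parody"))
```

By this measure, a 5.9 percent error rate works out to roughly six word-level mistakes for every 100 words in the reference transcript.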

By using Microsoft’s open-source Computational Network Toolkit, and by being a little bit over-obsessed with the project, the team reached its goal of human parity in just months, years ahead of schedule, according to Microsoft’s blog. They hit the parity milestone around 3:30 a.m., when Xuedong Huang, the company’s chief speech scientist, woke up to the breakthrough.

This isn’t a breakthrough that’s only for AI wonks and researchers pulling all-nighters, however. It’s a difference you’ll likely notice when you’re talking to an AI assistant in the near future, says Huang, as speech recognition becomes a mainstream user interface. “The recognition accuracy is foundational to any successful user interaction.” It’s the difference between cursing at your phone’s AI assistant when it mistakes “parity” for “parody” three times in a row, and being understood the first time, as if you’re speaking to a real human.

It’s highly accurate, but still imperfect, much like human transcriptionists might be. The biggest area where humans and the system disagreed was in more nuanced signals, as the researchers note in their paper:

“We find that the artificial errors are substantially the same as human ones, with one large exception: confusions between backchannel words and hesitations. The distinction is that backchannel words like “uh-huh” are an acknowledgment of the speaker, also signaling that the speaker should keep talking, while hesitations like “uh” are used to indicate that the current speaker has more to say and wants to keep his or her turn. As turn-management devices, these two classes of words therefore have exactly opposite functions.”

It could be argued that a lot of people have this problem as well, but ideally our robots will be better active listeners than our fellow humans. The system also tripped up on the word “I,” often omitting it completely, which could make a great plot for a dystopian sci-fi story. Who does the system think “I” is?
