Graphing the Voice of Terror
The Osama tapes highlight a technical challenge: verifying the voice of the enemy.
Courtesy of Owl Investigations, Photography by John B. Carnett
The bottom spectrogram is a clean bin Laden sample taken from an ABC News interview. Note the distinct vocal formants and their harmonics arcing right-to-left across the page. The top graph is of the dirty November telephone recording, in which formants are obscured by background noise — so obscured, Tom Owen says, that no computer could ID the voice.
Last November’s split verdict on the Osama bin Laden tape was more than another disagreement between the United States and Europe over the al Qaeda threat. It was a salvo in a war that is heating up over the future of forensic voice analysis, or voiceprinting.
On November 12, the independent Arabic station al Jazeera broadcast a recording of a call it claimed to have received from bin Laden, in which the al Qaeda leader praised recent terrorist attacks and promised more of the same. The CIA and National Security Agency immediately turned to their voice analysts. We don’t know exactly what tools the top-secret NSA brought to bear, but it’s very likely the agency’s experts were, like their peers in the private sector, trained to parse speech by comparing spectrograms, a kind of graphic speech rendering that has changed little since the 1940s. Picture scratchy inkblots etched across a ribbon of paper and you have an idea what they were poring over.
The television networks turned to independent but agency-connected experts for their own judgment: Was this tape real? Within days, the verdict was in: bin Laden, alive and plotting.
Across the sea, Switzerland’s IDIAP (Dalle Molle Institute for Perceptual Artificial Intelligence) turned to biometric software to analyze the tape. The institute’s computers boiled the problem down to a shiny turquoise data point on the “non bin Laden” side of an algorithmically derived decision boundary. The Swiss analysis came with the qualifier that the study was motivated by “pure scientific curiosity, to . . . see what conclusion our state-of-the-art speaker authentication system would reach.” The Swiss biometrics program put the likelihood of the voice being that of a bin Laden impostor at 55 to 60 percent. Equivocal at best, but enough to throw cold water on the American verdict, and by implication on traditional methods of forensic voice identification.
Back in the New World, the Old School wasn’t impressed.
To show me why, Tom Owen, one of North America’s busiest forensic voice analysts and one of only eight certified by the American Board of Recorded Evidence, invited me to his basement sound lab in Colonia, New Jersey. It was Owen to whom the major U.S. television networks turned for verification of the government claim about the bin Laden tapes. On the afternoon of my visit, Owen had just finished teaching a month-long class in voice identification for a group of Saudi intelligence officers. Conveniently, a captain of the Saudi Interior Ministry’s forensics department had been on hand when Owen received the bin Laden tape for analysis last November. Translation was not a problem.
A former audio engineer for New York’s Lincoln Center, Owen fell into forensics in the 1980s, when an NYPD detective showed up at his sound studio with a “dirty” recording of a bomb threat. Owen cleaned up the background noise, as he had on countless old recordings of singers from Enrico Caruso to Dionne Warwick. It gave him a taste for forensic work.
Floor-to-ceiling racks of spectrum analyzers, signal processors, equalizers, mixers, amplifiers and record-playback systems wrap around the walls of Owen’s soundproof basement. But as is often the case in forensics, the master’s favorite tool remains a piece of vintage equipment — a reel-to-reel Voice Identification 700 spectrograph built in 1973. It differs little from the analog machines U.S. Army intelligence officers built to identify and track German radio operators during World War II.
Before my arrival, Owen had cranked up the machine to produce a neat pile of spectrograms from a 1998 ABC News interview with bin Laden, one of the few samples of the al Qaeda leader’s voice that Owen considers 100 percent verified. The machine’s stylus translated the acoustic energy of bin Laden’s voice into a voiceprint, etching data across a paper strip attached to the machine’s spinning drum.
Looking at the voiceprints, I can easily make out the scratchy, bar-shaped formants, the bands of frequencies at which a voice resonates, produced by each syllabic utterance. The smudges resemble so many boxy notes stacked on an eight-line measure. The human voice doesn’t emit single notes, Owen explains, but chords, or harmonics.
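Owen's point about chords can be illustrated with a toy calculation. The sketch below is not how the Voice Identification 700 works; it is a minimal digital stand-in, assuming a made-up "voice" built from a 120 Hz fundamental and three overtones, an 8,000-samples-per-second recording, and a 50-millisecond analysis window. Every name and number in it is hypothetical. It measures how much energy sits at each frequency in a single window, which is roughly what one vertical slice of a spectrogram records.

```python
import cmath
import math

SAMPLE_RATE = 8000   # samples per second (assumed for illustration)
FRAME_SIZE = 400     # one 50 ms window, giving 20 Hz per frequency bin

def synth_voice(f0, n_harmonics, n_samples):
    """Sum a fundamental and its overtones, each weaker than the last."""
    return [
        sum(math.sin(2 * math.pi * f0 * k * t / SAMPLE_RATE) / k
            for k in range(1, n_harmonics + 1))
        for t in range(n_samples)
    ]

def magnitude_spectrum(frame):
    """Naive discrete Fourier transform: energy found in each frequency bin."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * b * t / n)
                for t, x in enumerate(frame)))
        for b in range(n // 2)
    ]

def strongest_frequencies(frame, count):
    """Return the `count` loudest bin frequencies in Hz, lowest first."""
    mags = magnitude_spectrum(frame)
    bin_hz = SAMPLE_RATE / len(frame)
    loudest = sorted(range(len(mags)), key=lambda b: mags[b], reverse=True)[:count]
    return sorted(b * bin_hz for b in loudest)

# A hypothetical "voice": 120 Hz fundamental plus three overtones.
frame = synth_voice(f0=120, n_harmonics=4, n_samples=FRAME_SIZE)
peaks = strongest_frequencies(frame, count=4)
print(peaks)  # the fundamental and its overtones, a chord rather than a note
```

Repeating the measurement window after window and stacking the results side by side would give the digital equivalent of the paper strip. Real speech is far messier than this synthetic tone, and background noise adds energy to every bin at once, which is why the formant bars on a dirty recording are so hard to see.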
Owen hands me a spectrogram of the November al Jazeera broadcast. A storm of black lines covers the paper strip from top to bottom, end to end. With Owen’s coaching, I imagine I can see the underlying formant bars, all but obscured behind a dark veil of background noise and broadcast carrier signals. A biometrics program could never sort through the noise, Owen insists. “They’re designed to work with perfect samples.” Cleaning up the tape won’t work either, he says. “That’s fine if all you want to do is hear what he’s saying more clearly. But cleaning up background noise removes the high and low frequencies I need to make my identification.” A biometric system demands the same frequencies, he says, and while he believes the NSA has obtained samples of bin Laden’s voice that he is not privy to, he doesn’t believe the agency has made biometric breakthroughs on the analysis side.
“I know for a fact they have things the FBI and the CIA don’t have. But their technology is mostly devoted to listening,” Owen says.
How certain can Owen’s methods be with a short, poor-quality recording? Not only was the tape dirty, but there were only a half-dozen words in common between the November tape and the ABC interview. (The standards of the American Board of Recorded Evidence demand no fewer than 20 identical words — preferably spoken in the same order — to verify a positive voice identification.)
Owen notes that examining a spectrogram is only half of his job. His is the art of listening for the multitude of quirky mannerisms and pronunciation foibles peculiar to each voice. A trained ear can detect the subtle whistle caused by a missing tooth, a person’s tendency to swallow in the middle of a sentence, even the way someone sets his or her jaw when speaking.
Owen plays me what he calls a short-term memory tape, a crucial tool in aural, or by-ear, voice identifications. The spliced tape toggles between 2.5-second segments of bin Laden’s ABC interview and the scratchy al Jazeera broadcast; what Owen listens for — what voice identification is based on — are peculiarities in the way a voice expresses the formant structure, especially the vowels. “Same guy,” says Owen. He insists bin Laden’s voice is plenty peculiar but refuses to elaborate on those vocal quirks and risk giving impostors a road map.
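The splicing itself is simple to picture in digital terms. The sketch below is an assumption about the mechanics, not a description of Owen's actual procedure: it treats each recording as a plain list of audio samples and interleaves fixed 2.5-second segments, first from the known voice, then from the questioned one. The function names and the tiny sample rate are hypothetical and chosen only to keep the example readable.

```python
def chunks(samples, size):
    """Cut a recording (a list of samples) into consecutive fixed-size segments."""
    return [samples[i:i + size] for i in range(0, len(samples), size)]

def splice_alternating(known, questioned, sample_rate, segment_seconds=2.5):
    """Interleave segments: known, questioned, known, questioned, and so on."""
    size = int(sample_rate * segment_seconds)
    tape = []
    for known_seg, questioned_seg in zip(chunks(known, size),
                                         chunks(questioned, size)):
        tape.extend(known_seg)
        tape.extend(questioned_seg)
    return tape

# Stand-in "recordings": labels instead of audio, at a toy rate of
# 4 samples per second, so each 2.5-second segment is 10 samples long.
known = ["K"] * 20
questioned = ["Q"] * 20
tape = splice_alternating(known, questioned, sample_rate=4)
print("".join(tape))
```

The listener hears the two voices back to back, before the memory of the first has faded. The judgment about whether they match still rests entirely with the trained ear.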
To my untrained ear, it could be Darth Vader behind the static. All this seems somewhat ineffable — a mixture of art and science understood by only eight sanctioned experts in the country. This is the sort of gray area that tends to make legal observers worry about the state of forensic science.
“Too often, I’ve seen cases of people wrongly accused of making threatening calls,” admits retired Michigan detective Lonnie Smrkovsky, the acknowledged grandfather of forensic audio analysis. “I think at some point in time, we have to find a way to fully automate voice identification.”
Way back in the 1980s, Smrkovsky eagerly lent his expertise to efforts by the Los Angeles County Sheriff’s Department to do just that. Funded by a National Institute of Justice grant, the project fizzled after two years when sexier projects such as DNA analysis siphoned away federal money.
But corporate America threw plenty of money at the problem when it saw the potential for voice-activated bank and credit card accounts and voice-based security systems. The last decade has seen tremendous progress, says Larry Heck, director of speech R&D at Nuance Communications, a commercial leader in voiceprint technology. “We’ve got the algorithms to measure the physical characteristics of a person’s voice,” he explains. “But we’re still working on the behavioral stuff.”
In other words, a good biometric program can ace the spectrographic analysis of a human voice — the first half of a human expert’s assessment. That’s sufficient for identifying a nice, clean sample of someone repeating his or her name into a high-quality microphone. Under ideal circumstances, the error rates of the best biometric speaker verification systems come in under one-half of 1 percent. The problems arise when the samples are dirty.
Which brings us back to the Swiss analysis of the purported bin Laden broadcast. IDIAP is an internationally renowned biometrics institute, and it calibrated its voice recognition software to recognize the al Qaeda leader’s voice using 15 authenticated recordings. Its researchers then tested the program’s accuracy against 15 other authenticated bin Laden recordings and 16 recordings of other Arabic speakers. The latter included two recordings of a person deliberately mimicking portions of the authenticated tapes. The recordings used to tune and test the system ranged in quality from good to mediocre to poor.
The system correctly rejected all 16 “non bin Ladens,” including the bin Laden imitators, and mistakenly excluded just 1 out of the 15 authenticated recordings — a success rate of 97 percent. It ranked the certainty of each determination by generating data points on a graph bisected by a yes-or-no decision boundary. (The farther from the bisecting line, the more mathematically certain the decision.) In the end, its analysis of the disputed broadcast produced a data point just to the “not bin Laden” side of the decision boundary; hence the 55 to 60 percent probability that the voice was not that of the al Qaeda leader.
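The 97 percent figure follows from the test counts reported above, and the arithmetic can be checked in a few lines. The second half of the sketch is purely illustrative: the `decide` function, its score scale and its zero boundary are hypothetical stand-ins for the idea of a decision boundary, not IDIAP's actual model, and the article does not say how the institute converted distance from the boundary into its 55 to 60 percent estimate.

```python
# Test counts as reported for the IDIAP evaluation.
impostor_trials = 16      # recordings of other Arabic speakers
impostors_rejected = 16   # all correctly rejected
genuine_trials = 15       # authenticated bin Laden recordings
genuine_rejected = 1      # one mistakenly excluded

correct = impostors_rejected + (genuine_trials - genuine_rejected)
total = impostor_trials + genuine_trials
accuracy = correct / total                                  # 30 of 31
false_rejection_rate = genuine_rejected / genuine_trials    # 1 of 15
false_acceptance_rate = (impostor_trials - impostors_rejected) / impostor_trials

print(f"accuracy: {accuracy:.1%}")
print(f"false rejections: {false_rejection_rate:.1%}")
print(f"false acceptances: {false_acceptance_rate:.1%}")

def decide(score, boundary=0.0):
    """Hypothetical yes-or-no rule: which side of the boundary, and how far.

    A larger margin means a more confident call; a margin near zero
    means the sample landed close to the line.
    """
    label = "bin Laden" if score > boundary else "not bin Laden"
    return label, abs(score - boundary)

# A made-up score just on the negative side of the boundary.
label, margin = decide(-0.1)
print(label, margin)
```

A point that lands barely across the line, as the disputed broadcast did, yields a verdict with very little margin behind it, which is why the Swiss result reads as equivocal rather than as a rebuttal.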
The system has a long way to go, admits IDIAP general director Hervé Bourlard. “There are things you can do to confuse a speaker verification system that are not going to confuse the human ear,” he says. “On the other hand, there are people who can fool the human ear with voice mimicry. But they will never confuse a computer.”
At this point, says Bourlard, biometrics should complement, not replace, forensic voice experts. But he has no doubt that computers will in many cases surpass the best-trained human ear.
“I don’t know if it’s two or five years away,” he says. “But we’re going to get there. For sure.”