A Computer Program Uncovers The Evolutionary History Of Words

Languages are hard; it takes a trained ear to tease out not just the vocabulary but the idiomatic expressions, the tone, the regional trends and ever-shifting insults that make a person truly fluent. This is one reason why even the best apps and Google Translate just can’t hack it. Similarly, it takes a trained linguist to know how words that all sprouted from one root can still grow into endless forms, all signifying the same thing. Can a cunning computer solve this problem as well as a smart linguist can? The answer, in this case, may be yes.

A new machine-learning algorithm can use sound rules to suss out the most likely phonetic changes in a shifting language. All words shift over time and place, but certain vowels and pronunciations are going to shift more than others–you say tomato, I say tomahtoe, Canadians say “aboot,” and so on. Alexandre Bouchard-Côté and colleagues at the University of British Columbia in Vancouver developed a system that can suggest how words may have sounded in the past, and which sounds were the most likely to shift. Then they compared the results with analysis by human experts, and found that 85 percent of the computer’s suggestions were within a single character of the correct words.
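
One common way to make "within a single character" concrete is edit distance: the smallest number of single-character insertions, deletions or substitutions needed to turn one word into another. The Python sketch below is only an illustration of that idea, not the researchers' code, and the exact scoring they used is not described here. The function names and the sample spellings are made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def within_one_character(reconstructed: str, reference: str) -> bool:
    """Hypothetical helper: True if the two spellings differ by at most one edit."""
    return edit_distance(reconstructed, reference) <= 1


# Illustrative inputs only; these are not taken from the study's data.
print(within_one_character("bituquen", "bitaquen"))  # one substitution
print(within_one_character("bituquen", "biten"))     # three deletions
```

Under this measure a guess counts as a near miss if one letter is wrong, missing or extra, which is a fairly strict test for words that have drifted over thousands of years.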

They looked at 637 distinct Austronesian languages, which span the Pacific from the Philippines to Hawaii. They would start, for example, with the word for “star.” In Fijian, the word is kalokalo. In Pazeh, a Taiwanese aboriginal language, it’s mintol. People who speak the Bornean tongue of Melanau call it biten, and those who speak the Filipino dialect called Inabaknon know it as bitu’on. The root word, in the ancestral language from which all of these languages evolved, is bituquen. The computer deduced that correctly.

The catch is that there’s a lot of front-end work before the computer can do its analysis. Linguists have to input a list of words in each language, plus their meanings, and generate a sort of “tree of life” for language–a phylogenetic map showing how each word is related to the others. (It resembles in both form and function the phylogenetic map used by botanists and biologists to show how life is related.) But when it gets to work, the algorithm is efficient. It can recognize cognates, which are words with the same root, across languages, and then figure out the probable root.

The researchers acknowledge there’s still more advanced work to be done, but they hope it will be a boon to historical linguists the way genetic information has changed biology. Instead of relying on morphological change–looking at a thing and seeing how it changes or compares to other things–biologists can now look at the genes. This algorithm can work in a similar fashion, computationally studying the roots of words and languages rather than using a specially trained ear. The paper appears this week in the Proceedings of the National Academy of Sciences.