Algorithms for Searching Among Chinese Characters Could Provide Effective Genome Search Engine
As scientists decode more and more genomes, the tree of life gets pretty complicated. It makes tough work for geneticists … Continued
As scientists decode more and more genomes, the tree of life gets pretty complicated. It makes tough work for geneticists or other researchers who want to understand which organisms share which genes — there are just so many comparisons. So there’s a growing need for a better, easily searchable bioinformatics database.
A Chinese computer scientist has a suggestion: mimic the way search engines index Chinese characters.
Technology Review’s blog helpfully describes why search engines like Google are so fast and why current bioinformatics search systems are not. Most search engines use an inverted index — rather than compiling a list of every single Web page and all its words, for every single word, they compile a list of the places where it appears.
Bioinformatics searches, by contrast, use a couple algorithms that basically compare the data from one genome to the data from another. This is relatively fast when there are only a few genomes, but as they grow exponentially, the searches take much longer.
A simple solution would be to switch to the Google approach — for every base pair “word,” make a list of the genes where they appear. But words are easy to spot, because they have spaces between them. Base pairs do not.
As it happens, Chinese characters don’t, either, but search engines have gotten around this. Wang Liang, a computer scientist at SOSO.com, one of the big three search engines in China, says the trick is to segment the words into “n-grams,” words that are n letters long.
Tech Review explains: There are 1-grams for one-letter words, 2-grams for two-letter words and so on. A search for a 3-letter word, like ABC, can be done by searching for AB and BC. Some Chinese search engines work this way, by indexing all the 2-gram combinations.
OK, then, how many n-grams are in a genetic word? The nucleotides A, T, G and C are only 1-grams, which makes them pretty useless as search terms. So some fuzzy math is required. Liang says DNA sequences follow Zipf’s law, which basically states that in any long document, half the words appear only once. This theory can be used to find an average length for DNA “words.”
Liang studied the genomes of arabidopsis, aspergillus, the fruit fly and the mouse, and found that a good average word length is 12 letters. Therefore, the best way to index genome data is to use 12-grams — that is, 12-letter combinations of A, T, G and C.
With that vocabulary, a Google-like inverted index becomes possible.