A Google For Genomes?

A Chinese computer scientist has come up with a way to index genomic data that mimics the way search engines index Chinese characters. It could pave the way for a more easily searchable bioinformatics database.

Image: Wikimedia Commons/Webridge

As scientists decode more and more genomes, the tree of life gets pretty complicated. It makes tough work for geneticists or other researchers who want to understand which organisms share which genes -- there are just so many comparisons. So there's a growing need for a better, easily searchable bioinformatics database.

A Chinese computer scientist has a suggestion: mimic the way search engines index Chinese characters.

Technology Review's blog helpfully describes why search engines like Google are so fast and why current bioinformatics search systems are not. Most search engines use an inverted index -- rather than compiling a list of every single Web page and all its words, for every single word, they compile a list of the places where it appears.
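To make the inverted-index idea concrete, here is a minimal Python sketch. The page names and text are invented for illustration, and real search engines add ranking, compression and much more; this only shows the word-to-pages mapping described above.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to the set of page ids on which it appears."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


# Hypothetical pages, made up for this example.
pages = {
    "page1": "genome search is slow",
    "page2": "web search is fast",
}

index = build_inverted_index(pages)

# Looking up a word is a single dictionary access; no page is rescanned.
print(sorted(index["search"]))
print(sorted(index["genome"]))
```

The cost of a lookup depends on how many pages contain the word, not on how many pages exist in total, which is the property the article credits for search-engine speed.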

Bioinformatics searches, by contrast, use a couple of algorithms that basically compare the data from one genome to the data from another. This is relatively fast when there are only a few genomes, but as the number of sequenced genomes grows exponentially, the searches take much longer.

A simple solution would be to switch to the Google approach -- for every base-pair "word," make a list of the genes where it appears. But words in ordinary text are easy to spot, because they have spaces between them. DNA sequences have no such separators.

As it happens, Chinese text doesn't have spaces between words, either, but search engines have gotten around this. Wang Liang, a computer scientist at SOSO.com, one of the big three search engines in China, says the trick is to segment the text into "n-grams," overlapping strings that are n characters long.

Tech Review explains: There are 1-grams for one-letter words, 2-grams for two-letter words and so on. A search for a 3-letter word, like ABC, can be done by searching for AB and BC. Some Chinese search engines work this way, by indexing all the 2-gram combinations.
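The ABC example can be sketched in a few lines of Python. This is an illustrative toy, not the code of any actual search engine: the document ids and strings are made up, and the index stores positions so that AB and BC are only accepted when they sit next to each other.

```python
from collections import defaultdict


def ngrams(text, n):
    """Return all overlapping n-character substrings of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_bigram_index(docs):
    """Map each 2-gram to the set of (doc_id, position) pairs where it starts."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pos, gram in enumerate(ngrams(text, 2)):
            index[gram].add((doc_id, pos))
    return index


def search(index, query):
    """Find (doc_id, start) hits for query by chaining its 2-grams."""
    grams = ngrams(query, 2)
    if not grams:  # queries shorter than 2 characters are not indexed here
        return set()
    hits = set(index.get(grams[0], ()))
    for offset, gram in enumerate(grams[1:], start=1):
        postings = index.get(gram, set())
        # Keep a hit only if the next 2-gram starts exactly `offset` later.
        hits = {(d, p) for (d, p) in hits if (d, p + offset) in postings}
    return hits


# Hypothetical documents: d2 contains AB and BC, but not adjacent.
docs = {"d1": "XABCX", "d2": "ABXBC"}
bigram_index = build_bigram_index(docs)
print(search(bigram_index, "ABC"))
```

Storing positions is a design choice made for this sketch. Without them, a document containing AB in one place and BC in another would be reported as a false match for ABC.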

OK, then, how long is a genetic word? The nucleotides A, T, G and C are only 1-grams, which makes them pretty useless as search terms. So some fuzzy math is required. Liang says DNA sequences follow Zipf's law, which states that a word's frequency is roughly inversely proportional to its rank in the frequency list; one consequence is that in any long document, roughly half the distinct words appear only once. This regularity can be used to find an average length for DNA "words."
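One way to get a feel for that reasoning is to count, for each candidate word length n, what share of the distinct n-grams in a sequence occur exactly once. The sketch below does only that. It is an assumption-laden illustration, not Liang's actual procedure (which the article does not spell out), and the toy sequence is invented.

```python
from collections import Counter


def singleton_share(sequence, n):
    """Fraction of distinct n-grams in sequence that occur exactly once."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    if not counts:  # sequence shorter than n
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Made-up toy sequence; a real analysis would read whole genomes.
toy = "ATGATGATGCCATGA"

for n in (1, 2, 3, 4):
    # 4**n is the number of possible n-letter words over A, T, G and C.
    print(n, 4 ** n, round(singleton_share(toy, n), 2))
```

Short n-grams repeat constantly, so almost none are singletons; very long ones are almost all singletons. A word length where the share lands near one half would be a candidate under the "half the words appear once" rule of thumb.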

Liang studied the genomes of Arabidopsis, Aspergillus, the fruit fly and the mouse, and found that a good average word length is 12 letters. Therefore, the best way to index genome data is to use 12-grams -- that is, 12-letter combinations of A, T, G and C.

With that vocabulary, a Google-like inverted index becomes possible.
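As a rough sketch of what such an index might look like, the Python below maps every 12-gram to the genes that contain it and uses the index to shortlist genes for a query. The gene names and sequences are invented, and the final substring check is an assumption added here to weed out false candidates; nothing in it is taken from Liang's implementation.

```python
from collections import defaultdict

K = 12  # word length suggested in the article

# With four letters there are 4**12 = 16,777,216 possible 12-grams.
VOCABULARY_SIZE = 4 ** K


def kmers(seq, k=K):
    """Return all overlapping k-letter substrings of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(genes, k=K):
    """Map each k-gram to the set of gene names whose sequence contains it."""
    index = defaultdict(set)
    for name, seq in genes.items():
        for gram in kmers(seq, k):
            index[gram].add(name)
    return index


def find_genes(index, genes, query, k=K):
    """Return names of genes containing query (query must be >= k letters)."""
    grams = kmers(query, k)
    if not grams:
        return set()
    # Shortlist: genes that contain every k-gram of the query.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Confirm the k-grams are contiguous by checking the actual sequence.
    return {name for name in candidates if query in genes[name]}


# Hypothetical genes, made up for this example.
genes = {
    "geneA": "ATGCGTACGTTAGCCTAGGA",
    "geneB": "TTTTATGCGTACGTTAGAAAA",
    "geneC": "GGGGGGGGGGGGGGGG",
}
gene_index = build_index(genes)
print(sorted(find_genes(gene_index, genes, "ATGCGTACGTTAG")))
```

Only the shortlisted genes are ever compared against the query directly, so the expensive sequence-by-sequence comparison is limited to a handful of candidates rather than the whole database.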

[Technology Review]

3 Comments

In so specialized an area, it would be worth developing a new mathematical set of numbers and laws corresponding to the numbers of genes. Something like Fibonacci number sequences, but perhaps even without numbers, using glyphs instead.
So computers would only have to "look" at the puzzle like pictures and see how the pieces fit through channels in slices, similar to Tetris or some other games.

OK this is getting too easy and I'm already a little drunk. Xspot... Much loves man.
I've often pondered on how to categorize genome sequences. Do genes really become exponentially complicated as they get longer or do they follow a rigid if-this-then-this formula? I have no idea. But at some point in time we will have to make computer algorithms to process and understand genes rapidly. So I can only hope that this technology makes gene understanding more readily available to the public.
As I said... I'm drunk at this point... PopSci ppl. :P

This is a truly valuable invention. It gives researchers the ability to quickly look for genetic matches and test their hypotheses in minutes instead of weeks, across people and different species. If this can be turned into a service available to researchers around the world, it has tremendous value! I am sure that universities and research centers around the world would be willing to pay a couple hundred dollars per year to have access to such a service. Also, the companies that created the maps of the human genome should put this on their list of ways to monetize their investments. Finally, this ties into the article below on sequencing an ancient human's genes: being able to quickly compare gene sets between current and ancient humans would lead to some great discoveries!

www.popsci.com/science/article/2010-02/first-genome-ancient-human

Innovation Strategist and Manager
Ph.D. in Innovation Management from Purdue University
My Visual CV
www.visualcv.com/ik5pn40
www.linkedin.com/in/drbrianglassman
www.amazon.com/gp/pdp/profile/A9L4HI981KP09


