The Glory of Big Data

Suddenly, we can know the world completely. Next, we reprogram it
Jesse Lenz

Late in the first day of this year’s TED Conference, its understated curator, Chris Anderson, took the stage and made a pronouncement. “The computing power in some of the things that we’re seeing is really startling,” he said. “It feels to me as if things have suddenly notched up a level in an unexpected way. We’re used to Moore’s Law. We’re used to things getting better and better and better. And then some years, it suddenly just feels as if—kapow!—there’s a step change.”

Back when it was just an exclusive conference for geek geniuses and venture capitalists, TED stood for Technology, Entertainment, Design. Now that more than 1,000 TED Talks are available free online and have been watched 300 million times, those three initials are less meaningful than is the ethos expressed in the conference’s motto, “Ideas Worth Spreading.” Either way, TED has never been about gathering its speakers together to support some predigested “trend.”

Which I guess is why Chris sounded so surprised and delighted by his own statement. We’d just heard from Mattias Astrom, whose company digitally maps the world’s cities and every building in them in gloriously faithful three-dimensional renderings. After that, the digital artist Aaron Koblin explained how he visualizes massive data sets (for instance, by tracing every flight in North America and then breaking down the data by time, type of plane, altitude and so on, and presenting it all in a sequentially spoolable rendering that is both revelatory and beautiful). Koblin specializes in crowdsourced projects, such as a Johnny Cash music video drawn by thousands of strangers, frame by frame. The level of detail is breathtaking; one fan lavished 31,000 brushstrokes on a single frame. And because this is digital space and each frame’s creation is recorded and cataloged online, you can watch every stroke, just as each artist drew it.

But it wasn’t until the next morning that we truly understood what Chris had been getting at. That’s when Deb Roy, who directs the Cognitive Machines group at the MIT Media Lab, took the stage and introduced us to the ultimate home movie, 240,000 hours of video and audio covering almost every interaction his son had with anyone throughout the Roy household from the moment the newborn arrived home from the hospital. This provides a complete 1:1 scale map of how the boy learned, especially how he learned to talk—to navigate the world of abstraction, language, data. Using a raw data set that exceeds 200 terabytes (more than 20 times the size of the complete printed collection of the Library of Congress in 2000), Roy can trace exactly how his son experienced each of the words he eventually uttered, and he has teased out some fascinating insights about language acquisition.

Roy has also shown that the methods he developed to store and analyze one child’s speech lessons can be applied more broadly, and he has begun doing so. In particular, he has turned his sweeping computational eye to the social-media sphere, watching, say, a presidential pronouncement and all of its multiplying repercussions (tweets, retweets, abbreviations, distortions and rebuttals) in real time, and in doing so drawing a detailed map of large social networks and how they evolve.

The amount of data available to us is increasingly vast. In 2010 we played, swam, wallowed, and drowned in 1.2 zettabytes of the stuff, and in 2011 the volume is predicted to continue along its exponential growth curve to 1.8 zettabytes. (A zettabyte is a trillion gigabytes; that’s a 1 with 21 zeros trailing behind it.) The IDC Digital Universe study from which I’ve plucked these numbers helpfully notes that if you were inclined to store all that data on the hard drives of 32-gigabyte iPads, doing so would require 57.5 billion devices—enough to erect a 61-foot-high wall 4,005 miles long, from Miami all the way to Anchorage.

One tiny part of that vast wall would house Google’s effort to create as complete a census as possible of the published word since 1500. The company has already gathered enough data—some 500 billion words from more than five million books—to plausibly claim the emergence of a new science, culturomics. Eventually the coinage, evolution and decline of every word and phrase could be traced across centuries. Using Google’s handy Ngram Viewer, we can already observe the explosion of the word “sex” after 1960. Or watch Rembrandt citations gradually grow, exceeding those of Cezanne in 1940, only to witness Picasso blow past both of them less than a decade later. These are not scholarly samples and inferences drawn painstakingly from a few great books; this is the exacting examination of how a word or phrase’s spelling and use actually mutated year by year.

So this is the paradigm shift whose fruits I witnessed in presentation after presentation at TED: a shift from a world of data sampling and extrapolation to one in which all data in a given realm can be collected and analyzed. That is Big Data.

AND BIG DATA is about to get much, much bigger, as we enter an era in which digital data merges with biology. This synthesis of codes takes the abstract world of digits and brings it back into the physical world. We of course know quite a bit about how life is expressed—in the four letters of DNA, in more than 20 amino acids, in thousands of proteins. We can copy life through cloning. Now we are beginning to be able to rewrite life, not just gene by gene, but entire genomes at a time. This is the difference between inserting a single word or paragraph into a Tolstoy novel (which is what biotechnology does) and writing the entire book from scratch (which is what synthetic biology does). It is far easier to fundamentally change the meaning and outcome of a novel, seed, animal or human organ if you write the entire thing.

We’ve come a long way, very quickly, to reach this point. A decade ago, simply reading the entire life code of a single organism was a breakthrough achievement in processing enormous dollops of data. In 1999, gene sequencers could read only a few hundred base pairs of DNA at a time, so Craig Venter’s human genome project relied on shotgun sequencing: Copy portions of a genome over and over again. Break them into random pieces. Feed these into a gene sequencer. Read the output and then use a computer to compare every sequence with every other sequence, looking for overlaps. When you find an overlap, begin to build up the whole of the genome, much as one builds a brick wall, overlaying brick by brick. A nifty trick, but one that most people until then had thought to be impossible because of the staggering computations involved. Yet Venter and his team built one of the most powerful private computers in the world (in the process becoming one of the largest users of electricity in Maryland) and solved the problem. Theirs is now the standard approach to reading genomes.
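To make the brick-wall image concrete, here is a toy sketch of the compare-everything-with-everything step. It is not the software Venter's team ran; it is a simple greedy merge, and the function names, the sample reads and the minimum overlap of three letters are all assumptions chosen for illustration. Real assemblers also have to cope with read errors and repeated stretches of DNA, which this sketch ignores.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is also a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(reads, min_overlap=3):
    """Greedy toy assembler: repeatedly merge the pair of pieces with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        # Compare every piece with every other piece, looking for overlaps.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left that overlaps; stop with several pieces
        # Lay one brick over the next: keep a whole, add the part of b past the overlap.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three made-up reads, deliberately out of order.
reads = ["CTGAAGTC", "ATGCGTAC", "GTACCTGA"]
print(assemble(reads))
```

Even in this miniature form the cost is visible: every round compares all pieces against all others, which is why the real problem demanded so much computing power.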

But sequencing the genome was a trivial computational exercise compared with the modeling of protein-protein interactions that is being attempted today. To begin with, you have to compare 20 amino acids, instead of four DNA bases. And because proteins can take so many more shapes than a strand of DNA, mapping the shape of their every combination is vastly more complex. Today’s computers are barely able to deal with a few of these variables. In spite of the achievements that Moore’s Law has wrought, life-sciences data is exceeding the scope and power of all current computer capabilities and storage.

In other words, in this new era—the transition from digital code to digital-plus-life code—the capacity to generate data exceeds our capacity to store and process it. In fact, life code is accumulating at a rate 50 percent faster than Moore’s Law; it at least doubles every 12 months. Without extraordinary advances in data storage, transmission and analysis, within the next five years we may simply be unable to keep up.
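A rough back-of-the-envelope sketch shows how quickly that gap opens. It assumes Moore's Law means a doubling every 18 months, which is the reading under which a 12-month doubling is 50 percent faster; the five-year horizon comes from the text, and the function name is made up for illustration.

```python
def growth_factor(months, doubling_months):
    """How many times larger a quantity gets if it doubles every doubling_months."""
    return 2 ** (months / doubling_months)


HORIZON_MONTHS = 60    # the five-year window mentioned above
DATA_DOUBLING = 12     # life code: doubles at least every 12 months
MOORE_DOUBLING = 18    # assumed doubling period for Moore's Law

data_growth = growth_factor(HORIZON_MONTHS, DATA_DOUBLING)
compute_growth = growth_factor(HORIZON_MONTHS, MOORE_DOUBLING)

print(f"data grows {data_growth:.0f}x, computing grows {compute_growth:.1f}x")
print(f"gap after five years: {data_growth / compute_growth:.1f}x")
```

Under these assumptions the data grows 32-fold while computing power grows about 10-fold, so the shortfall roughly triples in five years.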

Then again, there’s good reason to expect that we’ll achieve the necessary technology breakthroughs. Because there is one other, absolutely fundamental change going on in the world of Big Data. When you marry life code and digital code, the emerging applications differ from the merely digital in one revolutionary way: This software builds its own hardware. No matter how you create or program a computer, you will not come downstairs the next morning to find a thousand new computers. Life code is different. In 2008, three scientists—Venter, Hamilton Smith and John Glass—and their colleagues took a basic gene sequence from a computer, programmed robots to pick the four chemicals that make up DNA from jars, and assembled the world’s largest organic molecule. They then developed techniques to insert this new molecule into a cell. Bottom line, they programmed a cell to become a different species. Some called it the world’s first synthetic life-form. It is really the first fully programmable life-form. And it reproduces.

Programmable cell platforms are like computer chips. They could eventually be designed to help create or do anything, if you figure out the right code for what you wish to make. I’m a cofounder and investor in a Venter spinout company, Synthetic Genomics, that’s attempting to program algae to generate gasoline (with Exxon), extract gas from coal (with BP), rapid-prototype vaccines (with Novartis), and breed faster-growing plants (with Plenus). Life programming may also solve the problem of how to store gargantuan data sets. All digital data can be coded into life-forms, and all life-forms can be coded as digital data. In theory, this means you could eventually store, and copy, all the words and images from every issue of the New York Times in the gene code of a few bacteria.
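The two-way translation is simple in principle. Because DNA has four letters, each letter can stand for two bits, so one byte becomes four letters. The sketch below uses an arbitrary mapping (00 to A, 01 to C, 10 to G, 11 to T) that is purely an assumption for illustration, not a scheme used by any lab; practical encodings add error correction and avoid sequences that are hard to synthesize or read.

```python
BASES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def encode(data: bytes) -> str:
    """Turn each byte into four DNA letters, two bits per letter, high bits first."""
    return "".join(
        BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0)
    )


def decode(strand: str) -> bytes:
    """Reverse encode(): read four letters at a time and rebuild each byte."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


headline = b"New York Times"
strand = encode(headline)
print(strand)
print(decode(strand) == headline)
```

At two bits per letter, the density is what makes the idea attractive: the same text that fills a printed page fits in a strand far too small to see.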

I was blown away by the Big Data parade at TED 2011. But a new era of digital life code promises to dwarf today’s most glorious data achievements.