This Man Wants To Save Science Research From Itself

All DARPA's Paul Cohen needs to do is get past the problem of people

DARPA, the federal government's Defense Advanced Research Projects Agency, is to tech geeks what the CIA is to spy buffs: an alluring, mythologized, occasionally cryptic organization. Reading this, you're using at least one project to come out of the agency (the internet), and statistically, you have another within arm's reach: GPS and Siri were based on DARPA projects, too. If you want a look at the future five years out, there are worse places to start. So this summer, when a gaggle of other reporters and I were invited to the Pentagon for a DARPA science fair, it wasn't surprising that so many of us took the agency up on the offer.

Go to the Pentagon's check-in counter, whip out your driver's license, punch your social security number into a machine, walk down the halls with a chatty guard who asks where your shirt is from, and you've made it to the courtyard. When I visited, it was a warm May afternoon, intermittently overcast, threatening to rain out the picnic that DARPA had graciously set up.

"Science fair" turned out to be a more accurate description than I thought. There were dozens of tables, organized by category: big data, cyberwar, even a side-tent showcasing a virtual reality project. There was software that could track down terrorists, an unhackable drone. And there was a balding, bearded, bespectacled man in the middle standing next to a simple sign that read Big Mechanism. Despite the self-effacing display, he seemed to always be talking with someone. I made a few laps around the booths before I introduced myself.

His name was Paul Cohen, a University of Arizona information science professor on loan to DARPA, tasked with executing an ambitious project: using data to fundamentally change the way science is done by creating a program that will read and store every paper ever written on a topic, then crunch that data to find new, possibly brilliant solutions to problems. It's a task no human or organization could accomplish alone -- and the pilot program will focus on treating cancer.

He kept explaining. He absorbs questions thoughtfully, takes his time to reply. The most excited I saw him was when another journalist passed through with a question: But machines can't really have original thoughts, right?

Cohen lit up faster than the uniformed personnel on a smoke break behind us. Original thoughts? Did the computer that beat Kasparov at chess have an original thought?

"You bet it had an original thought!"

***

There may be as many as 1.8 million scientific papers published per year, but their reach, though exact statistics are debated, is undoubtedly dismal: one study determined that more than half of papers are never read by anyone beyond their authors and reviewers. In many ways, science still resembles the Middle Ages: small enclaves of researchers, myopically passing their papers to each other without connecting them into a grand vision.

Now we have the internet. We have research hubs, private corporations, universities. The problem isn't connectivity; it's volume. Experts read many papers that are key to understanding their fields, and institutions collectively read many more, but no person or organization can read them all, drawing connections between them along the way. We'll never see the forest with so many trees in the way.

The computer scientist Allen Newell eloquently, almost searchingly, put the problem like this in comments on papers at a 1973 symposium. The essay is called "You can't play 20 questions with nature and win," and Cohen is a fan:

We never seem in the experimental literature to put the results of all experiments together. The paper by Posner in the present symposium is an excellent example -- excellent both in showing the skillful attempts we do currently make and in showing how far short this falls of really integrating the results. We do--Posner does--relate sets of experiments. But the linkage is extraordinarily loose. One picks and chooses among the qualitative summaries of a given experiment what to bring forward and juxtapose with the concerns of a present treatment.

The director of DARPA's Information Innovation Office, where Cohen is stationed, gave a talk to reporters after the demo festivities. A charming, unrelentingly excited man, he skipped to a PowerPoint slide showing the massive government warehouse from the end of Raiders of the Lost Ark, filled with shipping crates, comparing it to the wealth of information we constantly sift through. Imagine if the next great cure for a disease was in there.

"There! It's in there," he said. "Go find it, Indiana Jones."

Cohen wants to bring that overwhelming amount of research together into a whole. This isn't as difficult as it sounds. It is exceptionally more difficult than it sounds.

Software that can truly understand language is still in its infancy. When reading an article -- either a scientific paper or, say, an article like the one you're reading now -- software picks up on certain cues. Gauging the prominence of a word here, a phrase there, it can make some coherent sense out of language, at least enough to suggest a related article or paper. But that comprehension is fleeting. As anyone who has attempted to use Google Translate can tell you, machines still have trouble with context. They might accurately translate a sentence or phrase, but trip up when making the intuitive leaps required of a paragraph or longer text.
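To make that concrete, here is roughly what cue-based matching looks like in practice -- a minimal sketch, not DARPA's code, that scores how prominent each word is in each paper and suggests the closest match. The toy abstracts and the helper function are invented for illustration.

```python
# A minimal sketch of cue-based "related paper" suggestion: score word
# prominence with TF-IDF, then rank papers by cosine similarity.
# The abstracts and names here are illustrative, not Big Mechanism's code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = {
    "paper_a": "EGFR mutations drive uncontrolled cell growth in lung cancer.",
    "paper_b": "Inhibiting EGFR slows tumor cell proliferation in vitro.",
    "paper_c": "Ocean iron fertilization stimulates plankton blooms.",
}

def suggest_related(query_id, abstracts):
    ids = list(abstracts)
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(abstracts.values())
    sims = cosine_similarity(tfidf[ids.index(query_id)], tfidf).ravel()
    # Rank every other paper by similarity to the query and return the best match.
    ranked = sorted((score, pid) for pid, score in zip(ids, sims) if pid != query_id)
    return ranked[-1][1]

print(suggest_related("paper_a", abstracts))  # -> "paper_b"
```

This is the fleeting kind of comprehension the paragraph above describes: enough overlap in vocabulary to say "these two belong together," nothing more.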

Cohen thinks, or at least hopes, he can overcome these hurdles. He'll start small: a hundred or so papers on cancer. If all goes according to plan, the machine, known as Big Mechanism, will read them, and begin developing an understanding of the disease. Then Cohen will set it loose, allowing it to scrape the world's research databases for more and more information, thousands of papers. Eventually, it will establish a Grand Unified Theory of cancer, an understanding of how cells mutate and become cancerous, how they ravage the body, as well as which drugs slow the spread. One paper will describe the beginnings of the illness, another the treatment; Big Mechanism would place them all in harmony.
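What "placing them all in harmony" might look like in miniature: each paper contributes a small causal claim, and software folds those claims into one shared model of the disease. This is only a sketch under generous assumptions -- the hard part, actually reading the papers, is skipped, and every paper, entity, and relation below is made up.

```python
# A toy sketch of the curation step: each paper yields a causal assertion
# (subject, effect, object), and the program folds them into one shared model.
# In reality the extraction itself is the hard problem; it is hand-waved here.
from collections import defaultdict

# Hypothetical assertions "extracted" from three papers.
extracted = [
    ("paper_a", "KRAS mutation", "activates", "MAPK signaling"),
    ("paper_b", "MAPK signaling", "promotes", "cell proliferation"),
    ("paper_c", "drug_X", "inhibits", "MAPK signaling"),
]

# The model: for each entity, what it acts on, how, and which paper says so.
model = defaultdict(list)
for paper, subject, effect, obj in extracted:
    model[subject].append({"effect": effect, "target": obj, "evidence": paper})

# Papers that never cite each other now sit in the same causal chain.
for subject, edges in model.items():
    for edge in edges:
        print(f"{subject} --{edge['effect']}--> {edge['target']}  ({edge['evidence']})")
```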

The end result would be a constellation of papers, all connected, each making its own contribution to the bigger picture. Researchers, armed with that model, could begin adding tweaks to it, experimenting to see what might happen when a new drug is added to the equation. Even better, Big Mechanism might one day be able to create potential treatments on its own. Our relationship to research, machines, and science could change, as the sketch below suggests.
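Once the claims live in a connected model, "what happens if we add this drug" becomes a question a program can answer by walking the graph. Again, a hypothetical sketch with invented entities and edges, not the project's actual machinery.

```python
# A toy "what-if" query over a mechanism graph: if drug_X hits a node,
# which downstream processes does that touch? All edges are hypothetical.
edges = {
    "KRAS mutation": ["MAPK signaling"],
    "MAPK signaling": ["cell proliferation", "cell survival"],
    "drug_X": ["MAPK signaling"],  # the proposed intervention
}

def downstream(node, edges, seen=None):
    """Collect everything reachable from `node` with a simple depth-first walk."""
    seen = set() if seen is None else seen
    for target in edges.get(node, []):
        if target not in seen:
            seen.add(target)
            downstream(target, edges, seen)
    return seen

# What does adding drug_X to the system plausibly affect?
print(downstream("drug_X", edges))
# -> {'MAPK signaling', 'cell proliferation', 'cell survival'}
```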

The only unavoidable kink in the plan is people.

***

DARPA headquarters is a giant slab of glass in Arlington, across from a mall. It's imposing, but unmarked apart from the 675 street number above the entrance. The day after the demos, I met Cohen there. Security in the building was predictably tight; I passed through a metal detector, checked my phone, and was ushered up to the offices.

The floor where Cohen works is a standard industrial office plan -- more university department than mad scientist's laboratory. We made our way to a conference room thankfully marked Unclassified and continued where we left off. Cohen sat cross-legged on the chair across from me.

"When you hear a sentence," he explained, "you have to do some mental work."

There are countless assumptions and interpretations required of us to make sense out of language. There are vagaries in grammar and chunks that are meaningless without complete context.

Even in the seemingly dry world of health research, there are well-documented examples of what Cohen calls "minor crimes": the temptation to bury the fact that a study was only conducted on worms, or that a correlation may not be as decisive as the prose would make you think. Big Mechanism has to account for those problems. "Scientists are not above rhetorical tricks," he said.

There will be an expected amount of noise, of course. Research may turn out to be irreproducible; a flaw in the data might call for a retraction. But the sheer volume of data in Big Mechanism would force the signal through the noise. Every data point, Cohen says, will be based on multiple papers, meaning it's backed up by more than a stray study. Once the machine has that foundation, the work of absorbing new information into the model begins: the software will take the findings from a paper and work out how they fit into the grand scheme of research. With that model, research previously thought to be incompatible could suddenly cohere. Say the findings of Paper X appear to contradict the findings of Paper Y. A machine model could reconcile the two by finding a third variable that explains their diverging results. After that, it's only a short leap to developing new cancer treatments, or conducting an even more complicated experiment.
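A cartoon of both ideas in that paragraph -- corroboration across papers, and the reconciling third variable -- might look like this. The findings, the papers, the cell lines, and the two-paper threshold are all invented for illustration.

```python
# Toy corroboration and conflict resolution. Each finding records a claim,
# the paper behind it, and a context variable (here, the cell line studied).
# Data, names, and the two-paper threshold are invented for illustration.
from collections import defaultdict

findings = [
    {"claim": ("drug_X", "inhibits", "MAPK"), "paper": "paper_1", "cell_line": "A549"},
    {"claim": ("drug_X", "inhibits", "MAPK"), "paper": "paper_2", "cell_line": "A549"},
    {"claim": ("drug_X", "activates", "MAPK"), "paper": "paper_3", "cell_line": "HeLa"},
]

# 1. Corroboration: keep only claims backed by at least two independent papers.
support = defaultdict(set)
for f in findings:
    support[f["claim"]].add(f["paper"])
corroborated = {claim for claim, papers in support.items() if len(papers) >= 2}
print("corroborated:", corroborated)

# 2. Apparent contradiction: drug_X both inhibits and activates MAPK.
#    Grouping by the third variable shows the claims need not conflict at all.
by_context = defaultdict(set)
for f in findings:
    by_context[f["cell_line"]].add(f["claim"][1])
print("effect by cell line:", dict(by_context))
# -> inhibition in A549, activation in HeLa: the cell line explains the split.
```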

***

As unlikely as it sounds, cancer is a relatively easy system to understand, at least for the Big Mechanism project. The disease grows and spreads on the molecular level, and can be understood through a narrow lens: some chemistry, some biology. But if Big Mechanism succeeds, the project could move on to more complicated systems, where we truly have no clue how changes will affect an entire system. Cohen mentions climate: "Do you know how the climate works?"

Hm. No?

"You're in good company."

He brings up the specter of geoengineering, the controversial idea that climate change can be curbed by manipulating the environment. What would happen if we dumped 100 tons of iron into the ocean to stimulate plankton growth, as one rogue geoengineer did? The results could be catastrophic, or not. The relationships in the system -- between the land, the ocean, the plants and animals, the countless chemical processes -- are too complicated to foresee all the consequences of a drastic measure. But Big Mechanism might one day be able to. This is what drew DARPA, a Department of Defense agency, into the project. Another complicated system might be the hierarchy of a terrorist group; Big Mechanism could work out where the main players sit in a criminal organization, then continue incorporating new information on them into a constantly updating model.
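For the organizational case, the sketch is the same shape: represent who communicates with whom as a graph, rank who everything flows through, and re-rank as new information arrives. Every name and connection below is fictional, and this is only a guess at the kind of model the idea implies.

```python
# A sketch of a "constantly updating model" of an organization: a directed
# graph of who communicates with whom, re-ranked as new reports arrive.
# All names and edges here are fictional.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("courier_1", "lieutenant_A"),
    ("courier_2", "lieutenant_A"),
    ("lieutenant_A", "leader"),
    ("lieutenant_B", "leader"),
])

def main_players(graph, top_n=2):
    """Rank members by betweenness centrality -- who the traffic flows through."""
    scores = nx.betweenness_centrality(graph)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(main_players(G))

# A new intercept arrives; fold it in and the ranking updates immediately.
G.add_edge("courier_3", "lieutenant_B")
print(main_players(G))
```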

Much of this is speculative. The project is still in the early stages, and to what degree it works -- if it works at all -- is anyone's guess. Big Mechanism may prove to be a theoretically fascinating idea that ultimately falls apart in practical application. I pointed out to Cohen that we seem to be some years away from machines analyzing _Moby Dick_ for metaphor. He said he would agree, although he's optimistic about predictions of what machines will accomplish.

Either way, those theoretical ideas? They're really something.

If the project does work out, if it not only proves worthwhile but "succeeds spectacularly," Cohen says it will make the case for a new kind of research language, one with the ambiguities of ordinary prose excised. A block of standardized code could be placed at the top of every paper when it's published, detailing its findings and their relevance. A program would sort the information into the proper theory instantly -- no human language to worry about. Scientists would never again have to pore over studies to find the relevant bits; they'd be available the moment a paper's findings were entered. Of course, all of that requires fighting human inertia, which is another battle entirely.
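One way to picture that standardized block: a small machine-readable header carrying a paper's claims, which a program could file into a shared model without parsing any prose. The schema below is entirely invented -- nothing like it has been published -- and serves only as illustration.

```python
# A hypothetical machine-readable header for a paper -- the "block of
# standardized code" idea. The schema is invented for illustration only.
import json

header = """
{
  "paper_id": "doi:10.0000/example",
  "system": "cancer",
  "claims": [
    {"subject": "drug_X", "relation": "inhibits", "object": "MAPK signaling",
     "organism": "human", "evidence": "in vitro"}
  ]
}
"""

def file_claims(raw_header, model):
    """Parse the header and drop each claim straight into the shared model."""
    record = json.loads(raw_header)
    for claim in record["claims"]:
        key = (claim["subject"], claim["relation"], claim["object"])
        model.setdefault(key, []).append(record["paper_id"])
    return model

model = file_claims(header, {})
print(model)  # no human language to interpret -- the claims arrive pre-structured
```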

Newell, in those comments from more than 40 years ago, put his ambivalence toward the problem like this:

That the same human subject can adopt many (radically different) methods for the same basic task, depending on goal, background knowledge, and minor details of payoff structure and task texture--all this--implies that the 'normal' means of science may not suffice. As to the first question, the harshness, I restate my initial point: this is my confused and distressed half speaking. My other half is tickled pink at how fast and how far we have come in the last decade, not to speak of the last two days.