How a US intelligence program created a team of ‘Superforecasters’

In Overmatched, we take a close look at the science and technology at the heart of the defense industry—the world of soldiers and spies.

AROUND 2011, when Warren Hatch’s job was moving money around on Wall Street, he read a book that stuck with him. Called Expert Political Judgment: How Good Is It? How Would We Know?, it was written by psychologist Phil Tetlock, who was then working as a business professor at the University of California, Berkeley. 

The book was a few years old, and Hatch was curious about what Tetlock had been up to since, so he went to the academic’s website. On the page, to his surprise, he found an invitation. Tetlock was looking for people who wanted to forecast geopolitical events. Did he want a chance to try to predict the future? Hatch says he remembers thinking, Who wouldn’t?

Hatch signed up right away and soon joined a virtual team of people who were trying to predict the likelihood of various hypothetical future global happenings. They were giving probability-based answers to questions like: Will Countries A and B declare war on each other within the next six months? Or: Will X vacate the office of president of Country Y before May 10, Year Z? The answers would take this type of form: There is a 75 percent chance the answer to the question is yes, and a 25 percent chance it is no.

“I just thought it was a fun way to while away some time,” says Hatch.

It may have been fun for Hatch, but it was serious business for the US intelligence community, whose R&D arm—the Intelligence Advanced Research Projects Activity (IARPA)—was sponsoring the project. Tetlock, along with a team of scholars, was a participant in the spy agency’s Aggregative Contingent Estimation, or ACE, program.

The ACE program aimed to, as the description page put it, “dramatically enhance the accuracy, precision, and timeliness of intelligence forecasts.” 

Tetlock was leading an ACE team with colleague Barbara Mellers to try to make that dramatic enhancement—and to do it better than four other competing teams. Tetlock’s secret sauce ended up being a set of expert forecasters like Hatch. 

Becoming a Superforecaster

Hatch, at the time, didn’t know much about the grand vision that the head researchers, or IARPA, had in mind. After he’d been making predictions for Tetlock for a while, though, something strange happened. “Some of the better team members disappeared,” Hatch says.

It wasn’t nefarious: The researchers had deemed these skilled predictors to be “Superforecasters,” because of their consistent accuracy. Those predictors, Hatch later learned, had moved along and been placed in teams with other people who were as good as they were. 

Wanting to be among their ranks, Hatch began to mimic their behaviors. He started being more active in his attempts, leaving comments on his forecasts to explain his reasoning, revising his fortune-telling as new information came in. “And after a couple of months, it clicked,” he says. “I started to get it.” 

In the second year, Hatch was invited to become a Superforecaster.

Meet the Good Judgment group

The team, then headquartered at the University of Pennsylvania, called itself Good Judgment. It was winning ACE handily. “The ACE program started with this idea of crowd wisdom, and it has sought ways of going beyond the standard wisdom of the crowd,” says Steven Rieber, who managed the program for IARPA. 

The teams’ forecasts had to be increasingly accurate with each year of the competition. By the end of the first year, Good Judgment had already achieved the final year’s required level of accuracy in its forecasts. 

Eva Chen, a postdoc on one of the other (losing) ACE teams, was watching with interest as the first year transitioned into the second. “It’s a horse race,” she says. “So every time a question closes, you get to see how your team is performing.” Every time, she could see the Good Judgment group besting both her team and the crowd. What are they doing? she recalls wondering.

Chen’s group ended up shuttering, as did the rest of the teams except Good Judgment, which she later joined. That group was the only one IARPA continued working with. Chen made it her mission to discover what they were doing differently. 

And soon she found out: Her previous team had focused on developing fancy computational algorithms—doing tricky math on the crowd’s wisdom to make it wiser. Good Judgment, in contrast, had focused on the human side. It had tracked the accuracy of its forecasts and identified a group that was consistently better than everyone else: the so-called Superforecasters. It had also trained its forecasters, teaching them about factors like cognitive biases. (One of the most well-known such errors is confirmation bias, which leads people to seek and put more weight on evidence that supports their preexisting ideas, and dismiss or explain away evidence to the contrary.) And it had put them in teams, so they could share both knowledge about the topics they were forecasting and their reasoning strategies. And only then, with trained, teamed, tracked forecasts, did it statistically combine its participants’ predictions using machine learning algorithms.
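The tracking-and-combining step above can be sketched in miniature. This is an illustrative toy, not Good Judgment's actual pipeline: it scores each forecaster's resolved predictions with the Brier score (a standard accuracy measure for probability forecasts) and then weights the crowd average toward forecasters with better track records. The weighting scheme and the 0.01 floor are assumptions chosen for illustration.

```python
def brier_score(forecast: float, outcome: int) -> float:
    # Squared error between a probability (0..1) and a 0/1 outcome;
    # lower is better, and 0.0 is a perfect forecast.
    return (forecast - outcome) ** 2

def track_record(history: list[tuple[float, int]]) -> float:
    # Mean Brier score over a forecaster's resolved questions.
    return sum(brier_score(f, o) for f, o in history) / len(history)

def weighted_aggregate(forecasts: list[float], records: list[float]) -> float:
    # Weight each forecaster inversely to their mean Brier score, with a
    # small floor so a perfect record doesn't yield an infinite weight.
    weights = [1.0 / (r + 0.01) for r in records]
    total = sum(weights)
    return sum(w * f for w, f in zip(weights, forecasts)) / total

# Three forecasters answer the same question; the one with the best
# track record (mean Brier 0.05) pulls the aggregate toward 0.8.
aggregate = weighted_aggregate([0.8, 0.6, 0.3], [0.05, 0.15, 0.40])
```

Here the performance-weighted aggregate lands above the simple mean of the three forecasts, reflecting the idea of "going beyond the standard wisdom of the crowd."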

While that process was important to Good Judgment’s success, the Superforecaster (now a trademarked term) element gets the most attention. But, curiously, Superforecasters—people consistently better at forecasting the future than even experts in a field, like intelligence analysts—were not an intended outcome of Good Judgment’s IARPA research. “I didn’t expect it,” says Rieber, “and I don’t think anyone did.”

These forecasters are better in part because they employ what Rieber calls “active open-minded thinking.” 

“They tend to think critically about not just a certain opinion that comes to mind but what objections are, or counterexamples to an opinion,” Rieber says. They are also good at revising their judgment in the face of new evidence. Basically, they’re skilled at red-teaming themselves, critiquing, evaluating, and poking holes in all ideas, including their own—essentially acting as devil’s advocate no matter where the opinion came from.

Seeing dollar signs in Superforecasters, the Good Judgment ACE team soon became Good Judgment Inc., spinning a company out of a spy-centric competition. Since then, curious fortune-seekers in sectors like finance, energy, supply chain logistics, philanthropy, and—as always—defense and intelligence have been interested in paying for the future these predictive elites see.

Chen stayed on and eventually became Good Judgment’s chief scientist. The company currently has three main revenue streams: consulting, training workshops, and providing access to Superforecasters. It also has a website called Good Judgment Open, where anyone can submit predictions for crowdsourced topics, for fun and for a shot at being recruited as an official, company-endorsed Superforecaster.

Not exactly magic

But neither Good Judgment nor the Superforecasters are perfect. “We don’t have a crystal ball,” says Rieber. And their predictions aren’t useful in all circumstances: For one, they never state that something will happen, like a tree will definitely fall in a forest. Their forecasts are probability based: There is an 80 percent chance that a tree will fall in this forest and a 20 percent chance it won’t. 

Hatch admits the forecasts also don’t add much when there are already lots of public probability-based predictions—as is the case with, say, oil prices—and also when there isn’t much public information, like when political decisions are made based on classified data.

From an intelligence perspective (where the intelligence community’s own ultrapredictors might have access to said classified information), forecasts nevertheless have other limitations. For one, guessing the future is only one aspect of a spy’s calculus. Forecasting can’t deal with the present (Does Country X have a nuclear weapons program at this particular moment?), the past (What killed Dictator Z?), or the rationale behind events (Why will Countries A and B go to war?). 

Second, questions with predictive answers have to be extremely concrete. “Some key questions that policymakers care about are not stated precisely,” says Rieber. For example, he says, this year’s intelligence threat assessment from the Office of the Director of National Intelligence states: “We expect that friction will grow as China continues to increase military activity around the island [of Taiwan].” But friction is a nebulous word, and growth isn’t quantified. 

“Nevertheless, it’s a phrase that’s meaningful to policymakers, and it’s something that they care about,” Rieber says.

The process also usually requires including a date. For example, rather than ask, “Will a novel coronavirus variant overtake Omicron in the US and represent more than 70 percent of cases?” the Good Judgment Open website currently asks, “Before 16 April 2023, will a SARS-CoV-2 variant other than Omicron represent more than 70.0% of total COVID-19 cases in the US?” It’s not because April is specifically meaningful: It’s because the group needs an expiration date.

That’s not usually the kind of question a company, or intelligence agency, brings to Good Judgment. To get at the answer it really wants, the company works around the problem. “We work with them to write a cluster of questions,” Chen says, that together might give the answer they’re looking for. So for example, a pet store might want to know if cats will become more popular than dogs. Good Judgment might break that down into “Will dogs decrease in popularity by February 2023?” “Will cats increase in popularity by February 2023?” and “Will public approval of cats increase by February 2023 according to polls?” The pet store can triangulate from those answers to estimate how they should invest. Maybe.

And now, IARPA and Rieber are moving into the future of prediction, with a new program called REASON: Rapid Explanation, Analysis, and Sourcing Online. REASON throws future-casting in the direction it was probably always going to go: automation. “The idea is to draw on recent artificial intelligence breakthroughs to make instantaneous suggestions to the analysts,” he says. 

In this future, silicon suggestions will do what human peers did in ACE: team up with humans to improve their reasoning, and so their guesses at what’s coming next, so those hopefully better forecasts can be passed on to the other humans—the ones who make the decisions that help determine what happens to the world.

Seeding doubt

Outside the project, researcher Konstantinos Nikolopoulos, of Durham University Business School in England, had a criticism of Superforecasting that wasn’t about its accuracy—a quality he had seen others follow up on and confirm. Nevertheless, he says, “something didn’t feel right.” 

His qualm was about the utility. In the real world, actual Superforecasters (from Good Judgment itself) can only be so useful, because there are so few of them, and it takes so long to identify them in the first place. “There are some Superforecasters locked in a secret room, and they can be used at the discretion of whoever has access to them,” he says. 

So Nikolopoulos and colleagues undertook a study to see whether Good Judgment’s general idea—that some people are much better than others at intuiting the future—could be applied to a smaller pool of people (314, rather than 5,000), in a shorter period of time (nine months, rather than multiple years). 

Among their smaller group and truncated timeframe, they did find two superforecasters. And Nikolopoulos suggests that, based on this result, any small-to-medium-size organization could forecast its own future: Hold its own competition (with appropriate awards and incentives), identify its best-predicting employees, and then use them (while compensating them) to help determine the company’s direction. The best would just need to be better than average forecasters. 

“There is promising empirical evidence that it can be done in any organization,” says Nikolopoulos. Which means, although he doesn’t like this word, Good Judgment’s findings can be democratized.

Of course, people can still contract with Good Judgment and its trademarked predictors. And the company actually does offer a Staffcasting program that helps identify and train clients’ employees to do what Nikolopoulos suggests. But that approach still routes through this one name-brand company. “If you can afford it, by all means, do it,” he says. “But I definitely believe it can be done in-house.”

Good Judgment would like you to visit its house and pay for its services, of course, although it does offer training for outsiders and is aiming to make more of that available online. In the future, the company is also aiming to get better at different kinds of problems—like those having to do with existential risk. “The sorts of things that will completely wipe out humanity or reduce it so much that it’s effectively wiped out,” says Hatch. “Those can be things like a meteor hitting the planet. So that’s one kind. And another kind is an alien invasion.” 

On the research side, the company hopes to improve its ability to see early evidence not of “black swans”—unexpected, rare events—but “really, really dark gray swans,” says Hatch. You know, events like pandemics.

Five years from now, will Good Judgment be successful at its version of predicting the future? Time will tell. 


Sarah Scoles

Contributing Editor

Sarah Scoles is a freelance science journalist and regular Popular Science contributor, who’s been writing for the publication since 2014. She covers the ways that science and technology interact with societal, corporate, and national security interests. The author of the books Making Contact, They Are Already Here, Astronomical Mindfulness, and the forthcoming Mass Defect, she lives in Denver and escapes to the mountains to search for abandoned mines and ghost towns as often as she can.