Courts use algorithms to help determine sentencing, but random people get the same results

There are plenty of things you shouldn’t leave up to random people on the internet: boat names (see: Boaty McBoatface), medical diagnoses (see: everyone on Twitter who thought your cold was pneumonia), and predicting whether convicted criminals are likely to reoffend based on demographic data (see: this story).

But according to a new study in Science Advances, we may as well be doing just that.

Though most of us live in blissful ignorance, algorithms run quite a few aspects of our existence. Bank loans, music recommendations, and the ads we’re served are often determined not by human judgment, but by a mathematical equation. This is not inherently problematic. The ability to process large quantities of data and condense them into a single statistic can be powerful in a positive way—it’s how Spotify can personally recommend music to all its subscribers every single week. If your new playlist misses the mark, it doesn’t really matter. But if you’re sentenced to 10 years in jail instead of five because some algorithm told the judge you were likely to reoffend in the near future, well, that’s a tad more impactful.

Judges often get a recidivism score as part of a report on any given convicted criminal, where a higher number indicates the person is more likely to commit another crime in the near future. The score is intended to influence a judge’s decision about how much jail time someone should get. A person who’s unlikely to commit another crime is less of a threat to society, so a judge will generally give them a shorter sentence. And because a recidivism score feels impartial, these numbers can carry a lot of weight.

Algorithms sold to courts across the United States have been crunching those numbers since 2000. And they did so without much oversight or criticism, until ProPublica released an investigation showing the bias of one particular system against black defendants. The algorithm, called COMPAS, could single out those who would go on to reoffend with roughly the same accuracy for each race. But its mistakes were not evenly distributed. COMPAS mislabeled black defendants who didn’t go on to reoffend as “high risk” almost twice as often as it did white defendants who didn’t reoffend. And COMPAS also mistakenly assigned “low risk” labels more often to white convicts who went on to commit more crimes. So the system essentially demonizes black offenders while simultaneously giving white criminals the benefit of the doubt.
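The arithmetic behind that finding is easier to see with numbers. The Python sketch below uses invented counts for two hypothetical groups (they are not ProPublica's figures, and the `Confusion` class exists only for illustration) to show how a score can be equally accurate for both groups while making very different kinds of mistakes.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Outcome counts for one group. All numbers used below are made up."""
    tp: int  # labeled high risk, did reoffend
    fp: int  # labeled high risk, did NOT reoffend
    fn: int  # labeled low risk, did reoffend
    tn: int  # labeled low risk, did NOT reoffend

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total

    def false_positive_rate(self) -> float:
        # Share of people who did not reoffend but were labeled high risk.
        return self.fp / (self.fp + self.tn)

    def false_negative_rate(self) -> float:
        # Share of people who did reoffend but were labeled low risk.
        return self.fn / (self.fn + self.tp)


# Hypothetical groups of 1,000 people each (illustrative counts only).
group_a = Confusion(tp=300, fp=150, fn=200, tn=350)
group_b = Confusion(tp=140, fp=90, fn=260, tn=510)

for name, group in (("group A", group_a), ("group B", group_b)):
    print(
        name,
        f"accuracy={group.accuracy():.2f}",
        f"false positives={group.false_positive_rate():.2f}",
        f"false negatives={group.false_negative_rate():.2f}",
    )
```

In this made-up example both groups come out at 65 percent accuracy, yet group A's false positive rate is double group B's, while group B's false negative rate is much higher. Equal accuracy says nothing about who bears the cost of the errors.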

That’s exactly the kind of systemic racism that algorithms are supposed to remove from the equation, which is pretty much what Julia Dressel thought when she read the ProPublica story. So she went to see Hany Farid, a computer science professor at Dartmouth, where Dressel was a student at the time. As computer scientists, they thought they might be able to do something about it—maybe even fix the algorithm. So they worked and they worked, but they kept coming up short.

“No matter what we did,” Farid explains, “everything got stuck at 55 percent accuracy, and that’s unusual. Normally when you add more complex data, you expect accuracy to go up. But no matter what Julia did, we were stuck.” Four other teams trying to solve the same problem all came to the same conclusion: it’s mathematically impossible for the algorithm to be completely fair.

The problem is not in our algorithms (sorry, Shakespeare), but in our data.

So they tried a different approach. “We realized there was this underlying assumption that these algorithms were inherently superior to human predictions,” says Dressel. “But we couldn’t find any research that proved these tools actually were better. So we asked ourselves: what’s the baseline for human prediction?” The pair had a hunch that humans could get pretty close to the same accuracy as this algorithm. After all, it’s only right 65 percent of the time.

That led Dressel and Farid to an online tool used by researchers everywhere: Mechanical Turk, a strangely named Amazon service that allows scientists to set up surveys and tests and pay users to take them. It’s an easy way to access a large group of basically randomized people to do studies just like this one.

The full COMPAS algorithm uses 137 features to make its prediction. Dressel and Farid’s group of random humans only had seven: sex, age, criminal charge, degree of the crime, non-juvenile prior count, juvenile felony count, and juvenile misdemeanor count. Based on just these factors and given no instructions on how to interpret the data to make a conclusion, a group of 462 people were simply asked whether they thought the defendant was likely to commit another crime in the next two years. They did so with almost exactly the same accuracy—and bias—as the COMPAS algorithm.

What’s more, the researchers found they could get very close to the same predictive ability by using just two of the original 137 factors: age and number of prior convictions. Those are the two biggest determining factors in whether or not a criminal will reoffend (or really, whether a criminal is likely to commit another crime and also get caught and convicted again).
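For a sense of how little machinery a two-factor predictor needs, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the defendants are synthetic, the relationship between age, priors, and reoffending is invented, and logistic regression is simply one common choice of classifier, not necessarily the method the researchers used.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_synthetic_defendants(n, seed=0):
    """Invent (age, prior_count, reoffended) rows. No real data involved."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(18, 70)
        prior_count = rng.randint(0, 10)
        # Assumed relationship: younger people with more priors reoffend more.
        p = sigmoid(-0.06 * (age - 35) + 0.35 * (prior_count - 3))
        rows.append((age, prior_count, 1 if rng.random() < p else 0))
    return rows


def standardize(values):
    """Rescale a feature to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def fit_logistic(features, labels, steps=300, learning_rate=0.5):
    """Plain batch gradient descent on the logistic loss, two features."""
    w_age, w_priors, bias = 0.0, 0.0, 0.0
    n = len(labels)
    for _ in range(steps):
        g_age = g_priors = g_bias = 0.0
        for (age, prior_count), y in zip(features, labels):
            error = sigmoid(w_age * age + w_priors * prior_count + bias) - y
            g_age += error * age
            g_priors += error * prior_count
            g_bias += error
        w_age -= learning_rate * g_age / n
        w_priors -= learning_rate * g_priors / n
        bias -= learning_rate * g_bias / n
    return w_age, w_priors, bias


rows = make_synthetic_defendants(1000)
ages = standardize([row[0] for row in rows])
prior_counts = standardize([row[1] for row in rows])
features = list(zip(ages, prior_counts))
labels = [row[2] for row in rows]

w_age, w_priors, bias = fit_logistic(features, labels)

# Accuracy is measured on the same synthetic rows used for fitting.
predictions = [
    1 if sigmoid(w_age * a + w_priors * p + bias) >= 0.5 else 0
    for a, p in features
]
accuracy = sum(pred == y for pred, y in zip(predictions, labels)) / len(labels)
print(f"w_age={w_age:.2f} w_priors={w_priors:.2f} accuracy={accuracy:.2f}")
```

The fitted weights only recover the pattern that was baked into the fake data, so the output says nothing about real defendants. The point is the size of the model: two numbers in, one probability out.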

Recidivism rates may seem like they’re directly measuring how likely a person is to commit a crime, but we don’t actually have a way to measure the number of people who break the law. We can only measure the ones we catch. And the ones we choose to convict. That’s where the data gets gummed up by our own systemic biases.

“It’s easy to say ‘we don’t put race into the algorithms’,” says Farid. “Okay, good. But there are other things that are proxies for race.” Namely, explains Dressel, conviction rates. “On a national scale, black individuals are more likely to have prior crimes on their record,” she says, “and this discrepancy is most likely what caused the false positive and false negative error rate.” For any given white and black person who commit exactly the same crime, the black person is more likely to get arrested, convicted, and incarcerated.

Let’s take an example. Two criminals, one white and one black, commit the same crime and both go to jail for it. Those same two are released after a year, and each commits another crime a few months later. By any rational definition, both have reoffended—but the black person is more likely to be arrested, tried, and convicted again. Because the dataset that informed both COMPAS and the online human participants is biased against black individuals already, both predictions will be biased as well.
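The same point can be made with back-of-the-envelope arithmetic. In this Python sketch, the reoffense rate and the chances of being caught and convicted are invented numbers, chosen only to show how identical behavior can produce different recorded recidivism rates.

```python
def recorded_recidivism(true_reoffense_rate, p_caught_and_convicted):
    """Share of a group that ends up in the data as having reoffended."""
    return true_reoffense_rate * p_caught_and_convicted


# Assumption: both hypothetical groups reoffend at exactly the same rate.
TRUE_REOFFENSE_RATE = 0.40

# Assumption: one group is more likely to be arrested, tried, and convicted.
P_CAUGHT_AND_CONVICTED = {"group_a": 0.50, "group_b": 0.75}

recorded = {
    group: recorded_recidivism(TRUE_REOFFENSE_RATE, p)
    for group, p in P_CAUGHT_AND_CONVICTED.items()
}

for group, rate in recorded.items():
    print(f"{group}: true rate {TRUE_REOFFENSE_RATE:.0%}, recorded rate {rate:.0%}")
```

Any predictor trained on the recorded rates, human or software, would learn that group B is the riskier one, even though the underlying behavior in this toy setup is identical.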

Bias in algorithms doesn’t necessarily mean they’re useless. But Dressel and Farid—like many others in their field—are trying to warn against putting too much faith in these numbers.

“Our concern is that when you have software like COMPAS that’s a black box, that sounds complicated and fancy, that the judges may not be applying the proportional amount of confidence as they would if we said ‘12 people online think this person is high risk’,” Farid says. “Maybe we should be a little concerned that we have multiple commercial entities selling algorithms to courts that haven’t been analyzed. Maybe someone like the Department of Justice should be in the business of putting these algorithms through a vetting process. That seems like a reasonable thing to do.”

One solution might be to ask people with criminal justice experience to predict recidivism. They might have better insight than random people on the internet (and COMPAS). If actual experts can weigh in to help fix the flawed dataset, Farid and Dressel agree that these sorts of algorithms could have their uses. The key, they say, is that the companies making said algorithms be transparent about their methods, and upfront with courts about the limitations and biases that abound. It seems reasonable to assume that turning our decisions over to a data-crunching computer would save us from potential human biases against people of color, but that’s not the case. The algorithms are just doubling down on the same systemic mistakes we’ve been making for years, but churning out results with the misleading veneer of impartiality.

It’s entirely possible that we won’t ever be able to predict recidivism well. That sounds so obvious, but it’s easy to forget. “Predicting the future is really hard,” says Farid, and the fact that adding complex data to this algorithm didn’t make it any more accurate might mean there just isn’t a signal to detect in the first place. “And if that’s the case,” he says, “we should think carefully about the fact that we’re making decisions that affect people’s lives based on something that’s so hard to predict.”