Machine learning is finally helping us track COVID deaths faster and more accurately

A major update to the software the CDC uses to code deaths should offer more timely information about diseases.

In 2020 and 2021, COVID-19 became the third leading cause of death in the US. This May, the country passed the grim milestone of 1 million known COVID deaths. Although fewer people are dying from the virus now than during the height of the Omicron surge this winter or previous waves, new strains have continued to take lives.

As the pandemic drags on, understanding how many people are dying and who is most vulnerable remains crucial for efforts to avert further deaths. To that end, the Centers for Disease Control and Prevention (CDC) recently updated the software it uses to process all of the country’s mortality data. The change, powered by advanced computing techniques like machine learning, could supply health officials and the public with more up-to-date information about the disease.

“Civil registration of births and deaths and understanding causes of death are really key to a functioning health system,” says Emily Smith, an assistant professor of global health at George Washington University. “There are a lot of ways to use this information.” 

Tracking the leading causes of death in a community and identifying where those deaths are concentrated helps public health officials direct resources, she adds. During a crisis like the COVID pandemic, having prompt information is particularly crucial. But the national statistics system has been slow to process and post mortality figures. When the US passed a million deaths from the virus earlier this year, the CDC’s tracker was still weeks behind.

“Effective epidemic response is getting the right resources—whether that’s drugs or vaccines or prevention programs—to the right people at the right time,” Smith says. “Data helps us do that.” 

The CDC upgrade represents an important step forward. “It’s great to see the US moving ahead with this,” Smith notes. “More transparent, faster data is a great advance.”

Coding COVID-19

For decades, the CDC has relied on computers to analyze death certificates and assign four-digit codes to each report based on the underlying cause so they can be tracked by the National Vital Statistics System.

However, only about 70 to 75 percent of the country’s death certificates could be coded automatically; the rest were flagged for review, which means a staff member would have to input the cause of death into the system by hand. “When you’re dealing with 2 to 3 million deaths [every year], 25 to 30 percent of records is quite a substantial number and requires quite a lot of resources,” says Robert Anderson, chief of the Mortality Statistics Branch at the National Center for Health Statistics.

The updated cause of death coding system, known as MedCoder, can handle a greater proportion of these records: It currently codes 85 percent of records automatically, and with continued improvements, “has the potential to code better than 90 percent of records,” Anderson says. “These records can be autocoded in minutes, whereas manual review might take a couple of weeks,” he adds. “It just means more information is available in a more timely fashion.”
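To put those percentages in terms of workload, here is a rough back-of-the-envelope sketch using only the figures quoted above (2 to 3 million deaths a year, and automatic coding rates of 70 to 75 percent before the update, 85 percent now, and a possible 90 percent later). The function name `manual_review_count` is made up for illustration; this is simple arithmetic, not anything from the CDC's software.

```python
def manual_review_count(total_deaths, auto_coded_percent):
    """Records left for hand coding, given the share coded automatically.

    Integer percentages keep the arithmetic exact (no floating-point rounding).
    """
    return total_deaths * (100 - auto_coded_percent) // 100


# Figures quoted in the article; the pairing of rates and totals is illustrative.
for label, percent in [("old system (low end)", 70),
                       ("old system (high end)", 75),
                       ("MedCoder today", 85),
                       ("MedCoder potential", 90)]:
    low = manual_review_count(2_000_000, percent)
    high = manual_review_count(3_000_000, percent)
    print(f"{label}: {low:,} to {high:,} records reviewed by hand per year")
```

Under these assumptions, moving from 70 percent to 85 percent automatic coding halves the number of certificates that need a person to review them.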

MedCoder is more adept than past systems at dealing with variations in the terms that physicians, medical examiners, and coroners use to describe mortalities, Anderson explains. The computer assigns one of 10,000 possible codes for causes of death to a record. For example, when COVID is mentioned on a death certificate, it chooses U07.1. To improve the results, Anderson and his team used machine learning techniques that drew upon a decade’s worth of national death certificate data to train MedCoder to recognize mistakes and other aberrations. So, when a doctor fills out the death certificate with “Coronavirus 2019,” “SARS-CoV-2,” “Delta variant,” or another name for the disease, the computer still codes it as U07.1. “The old system would say, ‘I don’t find that term in the dictionary,’ and kick it out for somebody to look at,” Anderson explains. “[Now] the computer says, ‘Okay, I know what to do with this and what code to assign.’”
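The difference Anderson describes can be illustrated with a toy example. The sketch below is not MedCoder's method, which the article says was trained on a decade of death certificate data; it swaps in simple fuzzy string matching from Python's standard library as a stand-in. The four-entry dictionary and the function names `exact_lookup` and `tolerant_lookup` are hypothetical. Only the code U07.1 and the example terms come from the text above.

```python
import difflib
import re

# Hypothetical, tiny term dictionary. U07.1 is the COVID-19 code named in the
# article; the real system chooses among roughly 10,000 codes.
TERM_TO_CODE = {
    "covid-19": "U07.1",
    "coronavirus 2019": "U07.1",
    "sars-cov-2": "U07.1",
    "delta variant": "U07.1",
}


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_lookup(text):
    """Dictionary-only behavior: return a code only on an exact hit."""
    return TERM_TO_CODE.get(normalize(text))


def tolerant_lookup(text, cutoff=0.8):
    """Stand-in for learned matching: fall back to the closest known term.

    Returns None when nothing is similar enough, which in this sketch means
    the record would be flagged for manual review.
    """
    term = normalize(text)
    if term in TERM_TO_CODE:
        return TERM_TO_CODE[term]
    close = difflib.get_close_matches(term, list(TERM_TO_CODE), n=1, cutoff=cutoff)
    return TERM_TO_CODE[close[0]] if close else None


# A certificate entry with a missing hyphen is not in the dictionary,
# so the exact lookup gives up while the tolerant lookup still finds a code.
print(exact_lookup("SARS-CoV2"))
print(tolerant_lookup("SARS-CoV2"))
```

In this toy version, an entry that resembles no known term still returns nothing and would go to a human reviewer, which mirrors the share of records that the article says are still coded by hand.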

While installing the upgrades from June 6 to 24, the National Center for Health Statistics paused its processing of death data reported by states and didn’t refresh the COVID surveillance datasets on the National Vital Statistics System’s public page. Counts from weeks earlier in 2022 may temporarily seem low while the system catches up and reprocesses these records, the agency’s website notes.

“Once we get over this backlog here the system is going to function pretty much the way the old system did,” Anderson says. “I don’t want people to worry that the data that we’re putting out now is not comparable to the data we were putting out before. It is comparable; it’s just going to be a little more timely.”

Mortality numbers matter

It’s unusual for death certificates to mention which variant of SARS-CoV-2 afflicted the deceased person. But looking for patterns in more precise mortality data can help health experts understand how dangerous a new strain might be—and whether extra precautions are needed.

“If deaths are rising it increases the urgency,” Anderson says. “If the data aren’t as timely, then our situational awareness degrades by a week or two or maybe three.” 

It’s also possible that having speedier data would have allowed the US to recognize that it had reached 1 million COVID-19 deaths sooner. “Having better real-time data hypothetically should matter on a lot of different fronts,” Smith says. “It matters for public perception; it matters for political will.”

Reported deaths tend to lag behind other warning signs such as a rise in positive COVID tests or hospitalizations. However, these measures can be difficult to interpret. An uptick in hospitalizations can indicate that more people are becoming seriously ill, but might not capture the full scope of the problem because not everybody with severe disease has access to hospitals.

“Those are softer outcomes that incorporate both the severity of the disease and other social and economic factors, whereas death is a hard outcome,” Smith says. “Mortality is the ultimate indicator: it’s black and white.”