When scientists want to understand how individual human genomes vary, they turn to a single, central genetic sequence: the reference genome. That genome serves as a kind of standardized measurement, a yardstick, against which all other human variation can be measured.

But here’s the surprise: About 70 percent of that reference genome comes from a single man in Buffalo, New York, whose DNA was sequenced during the 1990 to 2003 Human Genome Project, the first attempt to record the full genome of a person. That raises obvious questions: Are variations from the reference genome actually abnormal? The man behind the reference genome, known as RP11, is likely of mixed African and European ancestry, but how much information can one genome give about variation among 7 billion of us?

Geneticists have toyed with a variety of fixes for the problem. Sometimes, genetic medicine practitioners use population-specific reference genomes that might be more representative of someone with sub-Saharan African or East Asian ancestry. Others have proposed developing a “consensus reference,” which would be a Frankenstein-style assembly of the most common genetic variants, all stitched together. There could even be a reference genome based on that of humanity’s most recent common ancestor.

But all of those share a central limitation: reference genomes rely on the assumption that there is a baseline human genetic blueprint, and genetic diversity must be understood as variations from that baseline.

This week, research in Science lays out a new tool for investigating the human “pangenome.” The pangenome allows geneticists to map differences in an unlimited number of genomes all at once, which researchers say could capture complex variations and better tailor genetic medicine to people who aren’t European.

“What would be better would instead be, let’s compare to a whole diverse collection of a sampling of what we think humanity looks like,” says Benedict Paten, a computational biologist at the University of California, Santa Cruz, and the senior author on the research.

Instead of looking at one single genome, says Paten, “we map out a network of possibilities.” Imagine two people with slightly different sequences: AGTCA and ATTGA. In the pangenomic point of view, variations are represented as a series of branching paths: A leads to G or T, either of which leads to T, which leads to C or G, either of which leads to A. Where two genomes are identical, they follow the same path. Where the genomes are different, the paths split off. Many people with similar genomes would be a bit like a bundle of strings, following the same pathway through a network of possible sequences.
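To make the branching picture concrete, here is a minimal Python sketch that builds the network for the two example sequences, AGTCA and ATTGA. It is an illustration only, not the tool described in the research: it assumes the sequences are already aligned and the same length, so it handles single-letter swaps but not insertions, deletions, or structural variants. The function names (`build_sequence_graph`, `path_of`, `bases_at_each_position`) are hypothetical.

```python
from collections import defaultdict


def build_sequence_graph(sequences):
    """Build a toy variation graph from equal-length, pre-aligned sequences.

    Each node is a (position, base) pair. An edge links a node to every
    node that directly follows it in at least one sequence.
    """
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("this sketch only handles equal-length sequences")
    edges = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - 1):
            edges[(i, seq[i])].add((i + 1, seq[i + 1]))
    return edges


def path_of(seq):
    """Return the path one genome takes through the graph."""
    return [(i, base) for i, base in enumerate(seq)]


def bases_at_each_position(sequences):
    """List the distinct bases seen at each position (the 'branches')."""
    length = len(sequences[0])
    return [sorted({seq[i] for seq in sequences}) for i in range(length)]


genomes = ["AGTCA", "ATTGA"]
graph = build_sequence_graph(genomes)
branches = bases_at_each_position(genomes)

# Nodes that both genomes pass through: where the two paths overlap.
shared_nodes = set(path_of(genomes[0])) & set(path_of(genomes[1]))

print(branches)              # [['A'], ['G', 'T'], ['T'], ['C', 'G'], ['A']]
print(sorted(shared_nodes))  # [(0, 'A'), (2, 'T'), (4, 'A')]
```

The two genomes share the first, third, and fifth nodes and split at the second and fourth, which matches the description above: neither sequence is the baseline, and each is simply one path through the same network.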

That makes it much easier to see variations in context, rather than as deviations from a norm. “Traditionally, when we have a reference, we talk about edits,” says Paten. “So we say, position one million and blah, there was a flip from an A to G.” In a pangenome, “instead of being described as edits, they’re just a sequence. They’re just a point in that network.”

The benchmark for human diversity is based on one man’s genome. A new tool could change that.
The conventional approach measures variation in an individual’s genome against a reference. Credit: Philip Kiefer
In this pangenomic approach, the genomes are seen as taking different pathways through a series of possible variations. Credit: Philip Kiefer

Most immediately, that will help researchers understand deep patterns in our genes. The simplest changes—swaps of a single letter, or short insertions and deletions—are easy to identify using a reference genome. But there are more complicated patterns, which scientists call structural variants. An entire stretch of DNA might be reversed or repeated, or cut out and plopped down elsewhere. And even the best reference genome is a bad tool for understanding the full complement of structural variation.

Because genomic patterns vary somewhat by ancestry, the reference genome is especially bad at explaining variation in undersampled communities, from Tuscans to Yoruba—it may simply not have an analogue for a common feature of genomes in those communities. (It’s important to remember that ancestry doesn’t generally map onto cultural definitions of race, and that variations between populations are superficial or minor next to overwhelming commonalities.)

“When you’re looking at structural variants,” says Stephanie Fullerton, a bioethicist at the University of Washington who studies genetic medicine, scientists ask whether a variant is so unusual that it “is probably breaking something super important? Or is this just something floating around in the human genome that is effectively neutral?”

Because the vast majority of genomic research has looked at people of European ancestry, researchers often don’t understand what population-specific variants mean for the health of non-Europeans.

Ambroise Wonkam, a human geneticist at the University of Cape Town, wrote in Nature earlier this year that in people of African descent, biased research means that “the likelihood of cardiomyopathies [a heart disease] or schizophrenia can be unreliable or even misleading using tools that work well in Europeans.” And, he pointed out, fewer than 2 percent of human genome sequences come from individuals in sub-Saharan Africa.

In the new paper, the researchers applied the tool to a variety of genomic databases from across the planet. They were able to pick out one structural variant, a deletion of a gene called RAMACL, that showed up in half of people of African descent, four percent of Americans with mixed ancestry, and just one percent of people in other groups. That suggests that the variant is a perfectly normal part of human diversity, when it otherwise might have been flagged as unusual, and potentially harmful.
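The reasoning behind that conclusion can be sketched in a few lines of Python. The three frequencies are the ones reported above; the 5 percent cutoff for calling a variant "common" and the function names are assumptions made for illustration, not values or methods from the paper.

```python
# Frequencies of the RAMACL deletion as reported in the article.
deletion_frequency = {
    "African descent": 0.50,
    "Americans with mixed ancestry": 0.04,
    "other groups": 0.01,
}

# Hypothetical cutoff for "common"; chosen only for this illustration.
COMMON_THRESHOLD = 0.05


def looks_rare_in(frequencies, group, threshold=COMMON_THRESHOLD):
    """True if the variant would look unusual when only this group is studied."""
    return frequencies[group] < threshold


def common_somewhere(frequencies, threshold=COMMON_THRESHOLD):
    """Return the groups in which the variant is at or above the cutoff."""
    return [group for group, freq in frequencies.items() if freq >= threshold]


rare_in_other_groups = looks_rare_in(deletion_frequency, "other groups")
groups_where_common = common_somewhere(deletion_frequency)

print(rare_in_other_groups)  # True
print(groups_where_common)   # ['African descent']
```

Seen only in the groups where it appears in one percent of people, the deletion looks rare. Seen across all the groups, it turns out to be common in one of them, which is the context a single reference cannot supply.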

“This has been a problem up and down,” says Paten, “where people have studied one subpopulation and found a variant that looks interesting, and might be associated with something, but they haven’t had the context of how common that variant is in other populations.”

Fullerton agrees. “But does that help us help individual patients from underrepresented groups?” she asks. “That’s a far bigger question.”

On the one hand, it could give patients clarity on whether a feature of their genome is something to worry about, and give doctors tools for understanding the links between genes and illness. “If you’ve ever had any health concerns and had a doctor tell you, we don’t know what that means, it’s very frustrating, right?” she says. As genetic counseling, to guide management of breast cancer risk or inform complicated diagnoses, becomes more common, patients who aren’t represented by the reference genome could be left out. “So it can help with that information problem. But at the end of the day, knowing that this [gene] is causing disease doesn’t get you to, this is what we do about it. Particularly if you’re talking about patients who are lower socioeconomic status, or don’t have social capital to navigate the healthcare system, getting it answered is important, but it’s the very first step of a very long odyssey.”

And without more sequences from people who are underrepresented—particularly in the global south and Indigenous communities—there won’t be the underlying data to understand the link between disease and genetics. How to collect and share those sequences is a whole different set of questions: the history of genetics is full of ethical failures by academic researchers. Wonkam, the South African researcher, is calling for a project to sequence 2 million genomes in Africa—and to give the owners of those genomes power over how they will be used. The pangenome provides a framework for understanding human diversity, but people should decide how to fill it in.