Open data is a blessing for science—but it comes with its own curses

Large, open datasets are enabling new discoveries and fun citizen science tools like iNaturalist's Seek. But wrangling big data is no small feat.
Modern technologies are changing the way science is done. Nico Baum / Unsplash


Imagine that you’re hiking, and you encounter an odd-looking winged bug that’s almost bird-like. If you open the Seek app by iNaturalist and point it at the mystery critter, the camera screen will inform you that what you’re looking at is called a hummingbird clearwing, a type of moth active during the day. In a sense, the Seek app works a lot like Pokémon Go, the popular augmented reality game from 2016 that had users searching outdoors for elusive fictional critters to capture. 

Launched in 2018, Seek has a similar feel. Except when users point their cameras at their surroundings, instead of encountering a Bulbasaur or a Butterfree, they might encounter real-world plant bulbs and butterflies that the app identifies in real time. Users can learn about the types of plants and animals they come across, and can collect badges for finding different species, like reptiles, insects, birds, plants, and mushrooms. 

Seek’s ability to correctly recognize different living organisms (most of the time, at least) comes from a machine-learning model trained on data collected through the original app, which debuted in 2008 and is simply called iNaturalist. Its goal is to help people connect to the richly animated natural world around them. 

The iNaturalist platform, which boasts around 2 million users, is a mashup of social networking and citizen science where people can observe, document, share, discuss, and learn more about nature, and create data for science and conservation. Beyond taking photos, the iNaturalist app has extended capabilities compared to the gamified Seek: it has a news tab and local wildlife guides, and organizations can use the platform to host data-collection “projects” that focus on certain areas or certain species of interest. 

When new users join iNaturalist, they’re prompted to check a box that allows them to share their data with scientists (although you can still join if you don’t check the box). Images and location information that users agree to share are tagged with a Creative Commons license; otherwise, they’re held under an all-rights-reserved license. About 70 percent of the data on the platform is classified as Creative Commons. “You can think of iNaturalist as this big open data pipe that just goes out there into the scientific community and is used by scientists in many ways that we’re totally surprised by,” says Scott Loarie, co-director of iNaturalist. 

This means that every time a user logs or photographs an animal, plant, or other organism, that becomes a data point that’s streamed to a hub in the Amazon Web Services cloud. It’s one out of over 300 datasets in the AWS open data registry. Currently, the hub for iNaturalist holds around 160 terabytes of images. The data collection is updated regularly and open for anyone to find and use. iNaturalist’s dataset is also part of the Global Biodiversity Information Facility, which brings together open datasets from around the world. 
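For a sense of what “open for anyone to find and use” means in practice, here is a minimal sketch of browsing a public AWS open data bucket with anonymous, unsigned requests via Python’s boto3 library. The bucket name and folder prefix below are illustrative assumptions rather than documented paths; the dataset’s entry in the AWS open data registry lists the real ones.

```python
# Minimal sketch: browse a public AWS Open Data bucket anonymously.
# The bucket name and prefix are assumptions for illustration; check the
# dataset's registry entry for the actual values.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned requests: no AWS account or credentials needed for public data.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

response = s3.list_objects_v2(
    Bucket="inaturalist-open-data",  # assumed bucket name
    Prefix="photos/",                # assumed folder layout
    MaxKeys=10,
)

for obj in response.get("Contents", []):
    print(f"{obj['Key']}  ({obj['Size']} bytes)")
```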

iNaturalist’s Seek is a great example of an organization doing something interesting that would be impossible without a large, open dataset. These kinds of datasets are both a hallmark and a driving force of scientific research in the information age, a period defined by the widespread use of powerful computers. They have become a new lens through which scientists view the world around us, and have enabled the creation of tools that also make science accessible to the public.

[Related: Your Flickr photos could help scientists keep tabs on wildlife]

iNaturalist’s machine learning model, for one, can help its users identify around 60,000 different species. “There’s two million species living around the world, we’ve observed about one-sixth of them with at least one data point and one photo,” says Loarie. “But in order to do any sort of modeling or real synthesis or insight, you need about 100 data points [per species].” The team’s goal is to have 2 million species represented. But that means they need more data and more users. They’re also trying to create new tools that help them spot weird data, correct errors, or even identify emerging invasive species. “This goes along with open data. The best way to promote it is to get as little friction as possible in the movement of the data and the tools to access it,” he adds.
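As a rough illustration of the 100-data-points-per-species threshold Loarie mentions, here is a hedged sketch of counting observations per species and checking which ones have enough records to be useful for modeling. The file name and column names are hypothetical stand-ins for an export of observation records, not an actual iNaturalist file layout.

```python
# Sketch: which species have enough observations to model?
# "observations.csv" and its columns are hypothetical stand-ins for an
# export of observation records (one row per logged sighting).
import pandas as pd

MIN_OBSERVATIONS = 100  # rough threshold quoted by Loarie

obs = pd.read_csv("observations.csv")  # assumed column: species_name
counts = obs.groupby("species_name").size().sort_values(ascending=False)

well_sampled = counts[counts >= MIN_OBSERVATIONS]
print(f"{len(well_sampled)} of {counts.size} species have at least "
      f"{MIN_OBSERVATIONS} observations")
```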

Loarie believes that sharing data, software code, and ideas more openly can create further opportunities for science to advance. “My background is in academia. When I was doing it, it was very much this ‘publish or perish, your data stays on your laptop, and you hope no one else steals your data or scoops you’ [mindset],” he says. “One of the things that’s really cool to see is how much more collaborative science has gotten over the last few decades. You can do science so much faster and at such bigger scales if you’re more collaborative with it. And I think journals and institutions are becoming more amenable to it.” 

Open data boom

Over the last decade, open data (data that can be used, adapted, and shared by anyone) has been a boon for the scientific community, riding on a growing trend of more open science. Open science means that any raw data, analysis software, algorithms, papers, and documents used in a project are shared early as part of the scientific process. In theory, this would make studies easier to reproduce.

In fact, many government organizations and city offices are releasing open datasets to the public. A 2012 law requires New York City to share all of the non-confidential data collected by various agencies for city operations through an accessible web portal. In early spring, NYC hosts an open data week highlighting datasets and research that has used them. A central team at the Office of Technology and Innovation, along with data coordinators from each agency, helps establish standards and best practices and maintains and manages the infrastructure for the open data program. But for researchers who want to outsource their data infrastructure, places like Amazon and CERN offer services to help organize and manage data.
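To give a sense of how accessible a city portal like New York’s can be, here is a minimal sketch of pulling a few rows from the NYC open data portal’s public JSON API. The dataset identifier and field names below are assumptions used purely for illustration; a real query should use an ID taken from the portal itself.

```python
# Sketch: fetch a handful of rows from the NYC Open Data portal's JSON API.
# The dataset ID and field names are illustrative assumptions; browse the
# portal to find the identifier for the dataset you actually want.
import requests

DATASET_ID = "erm2-nwe9"  # assumed ID (311 service requests)
url = f"https://data.cityofnewyork.us/resource/{DATASET_ID}.json"

rows = requests.get(url, params={"$limit": 5}, timeout=30).json()
for row in rows:
    print(row.get("created_date"), row.get("complaint_type"))
```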

[Related: The Ten Most Amazing Databases in the World]

This push towards open science was greatly accelerated during the recent COVID-19 pandemic, when an unprecedented number of discoveries, from COVID-related research to equipment designs, were shared near-instantaneously. Scientists rapidly publicized genetic information on the virus, which aided vaccine development efforts. 

“If the folks who had done the sequencing had held it and guarded it, it would’ve slowed the whole process down,” says John Durant, a science historian and director of the MIT Museum. 

“The move to open data is partly about trying to ensure transparency and reliability,” he adds. “How are you going to be confident that results being reported are reliable if they come out of a dataset you can’t see, or an algorithmic process you can’t explain, or a statistical analysis that you don’t really understand? Then it’s very hard to have confidence in the results.” 

Growing datasets bring opportunities and concerns

Open data cannot exist without lots and lots of data in the first place. In the glorious age of big data, this is an opportunity. “From the time when I trained in biology, way back, you were using traditional techniques, the amount of information you had—they were quite important, but they were small,” says Durant. “But today, you can generate information on an almost bewildering scale.” Our ability to collect and accrue data has increased exponentially in the last few decades thanks to better computers, smarter software, and cheaper sensors.

“A big dataset is almost like a universe of its own,” Durant says. “It has a potentially infinite number of internal mathematical features, correlations, and you can go fishing in this until you find something that looks interesting.” Having the dataset open to the public means that different researchers can derive all kinds of insights from varying perspectives that deviate from the original intention for the data. 

“All sorts of new disciplines, or sub-disciplines, have emerged in the last few years which are derived from a change in the role of data,” he adds, with data scientists and bioinformaticians as just two out of numerous examples. There are whole branches of science that are now sort of “meta-scientific,” where people don’t actually collect data, but instead go into a number of datasets and look for higher-level generalizations. 

Many of the traditional fields have also undergone technological revamps. Take the environmental sciences. If you want to cover more ground, more species, over a longer period of time, that becomes “intractable for one person to manage without using technology tools or collaboration tools,” says Loarie. “That definitely pushed the ecology field more into the technical space. I’m sure every field has a similar story like that.” 

[Related: Project Icarus is creating a living map of Earth’s animals]

But with an ever-growing amount of data, wrangling these numbers and stats manually becomes virtually impossible. “You would only be able to handle these quantities of data using very advanced computing techniques. This is part of the scientific world we live in today,” Durant adds. 

That’s where machine learning algorithms come in. These are software programs or computer commands that can calculate statistical relationships in the data. Simple algorithms using limited amounts of data are still fairly comprehensible. If the computer makes an error, you can likely trace back to where the error occurred in the calculation. And if these are open source, then other scientists can look at the code instructions to see how the computer got the output from the input. But more often than not, AI algorithms are described as a “black box,” meaning that the researchers who created them don’t fully understand what’s going on inside and how the machine is arriving at the decisions it makes. And that can lead to harmful biases.

This is one of the core challenges that the field faces. “Algorithmic bias is a product of an age where we are using big data systems in ways that we do or sometimes don’t fully have control over, or fully know and understand the implications of,” Durant says. This is where making data and code open can help.  
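As a toy illustration of why open, simple code is easier to trust, here is a hedged sketch of an ordinary least-squares fit in which every intermediate value can be printed and traced back to the inputs, in contrast to an opaque model with millions of parameters. The numbers are invented for illustration only.

```python
# Sketch: a simple, inspectable statistical model (ordinary least squares).
# Every intermediate value can be printed and traced, unlike a "black box"
# model with millions of opaque parameters. The data here is invented.
import numpy as np

# Toy data: x = hours of daylight, y = moth sightings (made up).
x = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
y = np.array([3.0, 4.0, 6.0, 7.0, 9.0])

# Fit y ~ slope * x + intercept, keeping the full calculation visible.
A = np.column_stack([x, np.ones_like(x)])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coeffs

predictions = A @ coeffs
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print("per-point error:", y - predictions)  # easy to see where a miss came from
```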

[Related: Artificial intelligence is everywhere now. This report shows how we got here.]

Another problem that researchers have to consider is maintaining the quality of big datasets; poor-quality data can undermine the effectiveness of analytics tools. This is where the peer-review process plays an important role. Loarie has observed that the field of data and computer science moves incredibly fast with publishing and getting findings out on the internet, whether it’s through preprints, electronic conference papers, or some other form. “I do think that the one thing that the electronic version of science struggles with is how to scale the peer-review process,” which keeps misinformation at bay, he says. This kind of peer review is important in iNaturalist’s data processing, too. Loarie notes that although the quality of data from iNaturalist as a whole is very high, there’s still a small amount of misinformation the team has to check through community management. 

Lastly, having science that is open creates a whole set of questions around how funding and incentives might change—an issue that experts have been actively exploring. Storing huge amounts of data certainly is not free. 

“What people don’t think about, that for us is almost more important, is that to move data around the internet, there’s bandwidth charges,” Loarie says. “So, if someone were to download a million photos from the iNaturalist open data bucket, and wanted to do an analysis of it, just downloading that data incurs charges.” 
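A back-of-the-envelope calculation shows why that adds up. The photo size and per-gigabyte transfer rate in this sketch are assumptions chosen purely for illustration, not actual iNaturalist measurements or AWS pricing.

```python
# Back-of-the-envelope estimate of data-transfer ("egress") charges.
# All three numbers are illustrative assumptions, not real pricing or
# measurements from iNaturalist or AWS.
NUM_PHOTOS = 1_000_000
AVG_PHOTO_MB = 1.5          # assumed average photo size
EGRESS_COST_PER_GB = 0.09   # assumed cost in dollars per gigabyte

total_gb = NUM_PHOTOS * AVG_PHOTO_MB / 1024
print(f"~{total_gb:,.0f} GB transferred, "
      f"roughly ${total_gb * EGRESS_COST_PER_GB:,.0f} in bandwidth charges")
```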

The future of open data

iNaturalist is a small nonprofit that’s part of the California Academy of Sciences and National Geographic Society. That’s where Amazon is helping. The AWS Open Data Sponsorship Program, launched in 2009, covers the cost of storage and the bandwidth charges for datasets it deems “of high value to user communities,” Maggie Carter, global lead of AWS Global Social Impact, says in an email. The company also provides the code needed to access the data and sends out notifications when datasets are updated. Currently, the program sponsors around 300 datasets, ranging from audio recordings of rainforests and whales to satellite imagery, DNA sequences, and US Census data. 

At a time when big data centers are being closely scrutinized for their energy use, Amazon sees a centralized open data hub as more energy-efficient than everyone in the program hosting their own local storage infrastructure. “We see natural efficiencies with an open data model. The whole premise of the AWS Open Data program is to store the data once, and then have everyone work on top of that one authoritative dataset. This means less duplicate data that needs to be stored elsewhere,” Carter says, which she claims can result in a lower overall carbon footprint. Additionally, AWS aims to run its operations on 100 percent renewable energy by 2025.

Despite the challenges, Loarie thinks that useful and applicable data should be shared whenever possible. Many other scientists are on board with this idea. Another platform, Cornell University’s eBird, also relies on citizen science to accrue open data for the scientific community; eBird data has translated back into tools for its users, like bird-song identification, that aim to make interacting with wildlife in nature easier and more engaging. Outside of citizen science, some researchers, like those working to establish a Global Library of Underwater Biological Sound, are seeking to pool professionally collected data from several institutions and research groups into a massive open dataset. 

“A lot of people hold onto data, and they hold onto proprietary algorithms, because they think that’s the key to getting the revenue and the recognition that’s going to help their program be sustainable,” says Loarie. “I think all of us who are involved in the open data world, we’re kinda taking a leap of faith that the advantages of this outweigh the cost.”