October 14, 2025
New DNA Search Engine Brings Order to Biology’s Big Data
MetaGraph compresses vast data archives into a search engine for scientists, opening up new frontiers of biological discovery
The Internet has Google. Now biology has MetaGraph. Detailed today in Nature, the search engine can quickly sift through the staggering volumes of biological data housed in public repositories.
“It’s a huge achievement,” says Rayan Chikhi, a biocomputing researcher at the Pasteur Institute in Paris. “They set a new standard” for analysing raw biological data — including DNA, RNA and protein sequences — from databases that can contain millions of billions of DNA letters, amounting to ‘petabases’ of information, more entries than all the webpages in Google’s vast index.
Although MetaGraph is tagged as ‘Google for DNA’, Chikhi likens the tool to a search engine for YouTube, because the tasks are more computationally demanding. In the same way that YouTube searches can retrieve every video that features, say, red balloons even when those key words don’t appear in the title, tags or description, MetaGraph can uncover genetic patterns hidden deep within expansive sequencing data sets without needing those patterns to be explicitly annotated in advance.
“It enables things that cannot be done in any other way,” Chikhi says.
Indexing life’s library
The motivation behind MetaGraph was to address an accessibility problem in sequencing data sets. The size of these repositories has risen at a blistering pace in the past few decades, but this growth has presented challenges for the scientists using the data they contain. Raw sequencing reads are fragmented, noisy and too numerous to search directly. “The volume of the data, paradoxically, is the main inhibitor of us actually using the data,” says Artem Babaian, a computational biologist at the University of Toronto in Canada.
According to one of the study authors, André Kahles, a bioinformatician at the Swiss Federal Institute of Technology (ETH) Zurich in Switzerland, MetaGraph could help researchers to ask biological questions of repositories such as the Sequence Read Archive (SRA), a public database containing in excess of 100 million billion DNA letters.
Kahles and his colleagues tackled the problem using mathematical ‘graphs’ that link overlapping DNA fragments together, much as sentences that share the same words line up in a book index.
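To make the book-index analogy concrete, here is a toy sketch in Python. It is not MetaGraph's code or API. It assumes a simple scheme in which every read is cut into overlapping words of length k, consecutive words are linked into a graph, and each word is tagged with the samples it came from. The function names, the sample names and the tiny sequences are all invented for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return every overlapping length-k substring ("word") of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(samples, k):
    """Build a toy labelled graph from {sample_name: [reads]}.

    edges  maps each word to the words that directly follow it in some read.
    labels maps each word to the set of samples in which it occurs.
    """
    edges = defaultdict(set)
    labels = defaultdict(set)
    for name, reads in samples.items():
        for read in reads:
            words = kmers(read, k)
            for word in words:
                labels[word].add(name)
            for a, b in zip(words, words[1:]):
                edges[a].add(b)
    return edges, labels


def search(query, labels, k, min_fraction=1.0):
    """Return {sample: fraction of the query's words found in that sample},
    keeping only samples that reach min_fraction."""
    words = kmers(query, k)
    if not words:
        return {}
    counts = defaultdict(int)
    for word in words:
        for name in labels.get(word, ()):
            counts[name] += 1
    return {
        name: count / len(words)
        for name, count in counts.items()
        if count / len(words) >= min_fraction
    }


# Hypothetical data: two made-up samples with short made-up reads.
samples = {
    "gut_A": ["ACGTACGGA", "CGGATTC"],
    "gut_B": ["TTGACCA"],
}
edges, labels = build_index(samples, k=4)

# The query spans both gut_A reads, which overlap on the word "CGGA".
print(search("TACGGATT", labels, k=4))
```

In this sketch the query is found in full in gut_A even though no single read contains it, because the two reads share a word and so join up in the index. Real tools add heavy compression and far more careful data structures, which this example makes no attempt to reproduce.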
The researchers integrated data from seven publicly funded data repositories, creating 18.8 million unique DNA and RNA sequence sets and 210 billion amino-acid sequence sets across all clades of life: viruses, bacteria, fungi, plants and animals, humans included. They also developed a search engine for these sequences, in which users enter text prompts to search the integrated archives of raw data.
“It is a totally new way to interact with this body of data,” says Kahles. “It’s compressed, but accessible on the fly.”
To demonstrate the utility of MetaGraph, the study authors used it to scan 241,384 human gut microbiome samples for genetic indicators of antibiotic resistance around the world, building on work that used an earlier version of the tool to track drug-resistance genes in bacterial strains that live in subway systems across major urban centres. The authors say they performed the analysis in about an hour on a high-powered computer.
Open road to discovery
MetaGraph is not the only massive-scale sequence search tool now on offer.
Chikhi and Babaian, for example, have built a platform called Logan, which stitches together billions of short sequencing reads to make longer, organized stretches of DNA. This design architecture allows the system to spot whole genes and their variants across even larger collections of sequencing reads than is possible with MetaGraph, albeit with certain trade-offs. “We have less functionality but more performance,” Chikhi says.
The added reach of Logan helped the researchers to uncover more than 200 million naturally occurring versions of a plastic-eating enzyme found in a variety of bacteria, fungi and insects — including some versions that work even better than enzymes designed in the lab. Chikhi and Babaian reported their findings in a preprint posted last month.
They and others have also used an earlier, narrower search tool tailored to viral-DNA repositories to reveal reams of previously undocumented viruses and viral contaminants in engineered T-cell therapies for treating cancer.
According to Babaian, such discoveries would not have been possible without two things: open-source search tools, available at sites such as metagraph.ethz.ch and logan-search.org, and the public sequencing repositories they tap into. With funding cuts threatening other sorts of biological databases, Babaian stresses that these search innovations underscore the “critical importance of open data sharing”.
“These are resources to drive scientific progress across the world,” says Babaian. “They are opening up a completely new field of petabase-scale genomics” — and the most impactful applications are yet to come.
This article is reproduced with permission and was first published on October 8, 2025.