How to factor a genome for fun and profit: De Bruijn’s legacy and the mathematical models, data structures, and algorithms at the core of modern genomics

Talk
Rob Patro
Time: 09.10.2025 15:00 to 16:00

Modern sequencing experiments generate enormous amounts of data — an estimated 1,000 petabytes (roughly 1 exabyte) each year. These experiments support research ranging from tracking microbial food contamination to studying cancer evolution in patients. The sheer scale of data production has transformed modern biology into a data-intensive discipline and, in many cases, a computational one. Realizing the full potential of these experimental capabilities for advancing our understanding of biological systems and improving human health requires more than applying established computer science principles at scale. Rather, it requires the development of fundamentally new and efficient algorithms, data structures, and computational methods.
In this talk, I will discuss how the De Bruijn graph, a mathematical construct from graph theory introduced in 1946, has evolved from a relatively esoteric object into a foundational model and a powerful tool in modern genomics. I will focus on two main lines of work from our lab that advance the state of the art in applying the De Bruijn graph to sequencing data at the exabyte scale. First, I will describe our efforts to develop highly parallel and memory-efficient methods for constructing the compacted and colored De Bruijn graph. These graph variants are essential in practice, and we build them both from large collections of reference genomes and from raw sequencing measurements. I will emphasize how careful modeling and succinct data structures can be applied to this problem. Second, I will present our work on data structures for indexing compacted and colored De Bruijn graphs. These indexes enable efficient querying of large genomic datasets at massive scale. Finally, I will highlight downstream applications of this work. For example, our methods have been used to compress, to a first-order approximation, all publicly available sequencing data in the NCBI Sequence Read Archive (SRA). They have also been applied to develop tools for accurate and efficient estimation of gene expression at the single-cell level, which the Alex’s Lemonade Stand Foundation has used to build a pediatric single-cell cancer atlas.
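
For readers unfamiliar with the terminology, the following is a minimal sketch of what "compacted" and "colored" mean for a De Bruijn graph. It is a naive, in-memory illustration and not the parallel, memory-efficient methods the talk describes. The function names (`colored_kmers`, `compact`) and the toy inputs are assumptions made for this example. For simplicity it ignores reverse complements, which practical tools account for. Each k-mer is a node, an edge joins two k-mers that overlap by k-1 characters, the "color" of a k-mer is the set of input sequences containing it, and compaction merges every maximal non-branching path into a single string (a unitig).

```python
from collections import defaultdict

def colored_kmers(sequences, k):
    """Map each k-mer to the set of sequence names (its colors) that contain it."""
    colors = defaultdict(set)
    for name, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)

def successors(kmer, kmers):
    """k-mers in the set that overlap the last k-1 characters of kmer."""
    return [kmer[1:] + c for c in "ACGT" if kmer[1:] + c in kmers]

def predecessors(kmer, kmers):
    """k-mers in the set that overlap the first k-1 characters of kmer."""
    return [c + kmer[:-1] for c in "ACGT" if c + kmer[:-1] in kmers]

def compact(kmers):
    """Return the unitigs (maximal non-branching paths) spelled as strings."""
    def mergeable(u, v):
        # u and v can be merged only if v is u's sole successor
        # and u is v's sole predecessor.
        return successors(u, kmers) == [v] and predecessors(v, kmers) == [u]

    def is_start(x):
        preds = predecessors(x, kmers)
        return not (len(preds) == 1 and mergeable(preds[0], x))

    def walk(x, seen):
        path = x
        seen.add(x)
        while True:
            succ = successors(x, kmers)
            if len(succ) != 1 or not mergeable(x, succ[0]) or succ[0] in seen:
                return path
            x = succ[0]
            seen.add(x)
            path += x[-1]  # each step adds one new character

    seen = set()
    unitigs = [walk(x, seen) for x in sorted(kmers) if is_start(x)]
    # Any k-mers left over lie on isolated cycles, which have no natural start.
    unitigs += [walk(x, seen) for x in sorted(kmers) if x not in seen]
    return unitigs
```

With the toy input `{"ref1": "ACGTT", "ref2": "ACGTA"}` and `k = 3`, `colored_kmers` assigns both colors to the shared k-mers `ACG` and `CGT`, and one color each to `GTT` and `GTA`. Passing the result to `compact` yields the unitigs `ACGT`, `GTA`, and `GTT`: the shared prefix collapses into one node, and the graph branches where the two sequences diverge.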