Algorithms and Data Structures for Organizing Genomic Data at Massive Scales

Talk
Rob Patro
Talk Series: 
Time: 
11.18.2022 11:00 to 12:00

Technological advances and economies of scale in biotechnology have led to advances that allow us to measure ever more DNA and RNA, and to make more complete measurements of genes and genomes. However, our overwhelming success in data generation has led to monumental computational challenges. To this end, I will discuss some recent work from the lab on the topic of sequence indexing. Specifically, I will discuss two related but separate challenges. First, I will discuss recent work in large-scale "raw" data indexing, which seeks to build indexing data structures that allow for query over very large repositories of "raw" sequencing data (so-called sequencing reads). These tools allow researchers to comb through mountains of experimental data to search for sequences of interest, especially those that may be lost when subsequent standard pipelines are applied to process this data. Second, I will discuss recent work in scaling reference-based indexes. As we assemble ever more partial and complete genomes across the tree of life, we often seek to analyze new data in light of the genomes we have successfully assembled. Most such analyses start with mapping or aligning the new data against a collection of one or many of these references to identify what (and in what abundance) is in each new sample. This necessitates indexing data structures that can answer so-called "locate" queries, to speed up the process of sequence alignment. I will discuss recent work in developing indices based upon the compacted colored De Bruijn Graph, and how efficient sampling schemes can be derived that allow substantial reductions in index size while retaining asymptotically optimal query performance.