From single cells to thousands of genomes: computational challenges and algorithmic solutions in high-throughput genomics

Talk
Rob Patro
Stony Brook University
Time: 02.28.2019 11:00 to 12:00
Location: AVW 4172

The plummeting cost of high-throughput sequencing and the astounding variety of available sequencing assays have transformed much of biological research and enabled many fundamental discoveries. Unfortunately, they have also created a scientific regime in which the bottleneck in many experiments is no longer our ability to acquire data, but rather the difficulty of modeling and solving the computational challenges posed by these large, high-dimensional measurements. Simultaneously, we have been building sequencing data archives that hold immense potential and in which latent discoveries wait to be uncovered. However, these resources remain essentially inert because we cannot efficiently index and query "raw" experimental data.

In this talk, I will discuss some of the methods that my lab has been developing to address these challenges as they arise in different contexts. In particular, I will describe our work on Mantis, an indexing approach that enables sequence search over large collections of raw, unassembled read data. I will discuss recent progress that highlights how the colored de Bruijn graph can enable efficient neighborhood queries in the high-dimensional space of sequencing experiments, and how this leads to a new scheme for encoding k-mer membership across sets of experiments in vastly less space than previous state-of-the-art approaches.

I will also discuss our recent work on alevin, a novel method for quantifying gene abundance from tagged-end, single-cell sequencing experiments (e.g., scRNA-seq). Alevin introduces a new, graph-based model to describe how the evidence from tagged sequencing reads is related to expressed genes, and proposes a new, parsimony-based approach for resolving this evidence to arrive at accurate estimates of gene expression. Crucially, alevin is the first approach that resolves, rather than discards, gene-ambiguous reads in this type of scRNA-seq data.