PhD Proposal: Algorithms and Data Structures for Indexing, Querying and Analyzing Large Collections of Sequencing Data in the Presence or Absence of a Reference

Fatemeh Almodaresi
12.09.2019 11:00 to 13:00
IRB 3137

RNA sequencing is a popular assay for measuring transcriptomes and is useful for expression quantification, and sequence assembly of genes and transcripts. The vast amount of available raw RNA sequencing data, and its growth rate over the past decade brings us to an important computational challenge prior to any higher level analyses; The challenge of designing a memory- efficient data structure for indexing the underlying set of sequences that supports a time-efficient query. This challenge is not limited to only raw RNA sequences. By the continuing assembly of novel genomes and transcriptomes, specifically in metagenomics, having an efficient index over the list of assembled sequences is still an ongoing challenge. To be specific, there is still a high demand for an index providing a balance between memory and query time. We have developed new data structures (both reference-based and reference-free) to substantially reduce the memory used by these indices and still achieve similar query speed.We propose three separate indices, “Pufferfish”, an index over a set of genomes or transcriptomes, “Rainbowfish” and “Mantis” which are indices for indexing a set of raw sequencing data sets. Specifically we designed a new, succinct representation for the color information of colored de Bruijn graphs (cdbg) in “Rainbowfish”. This representation can be used in de novo assembly and variant detection to keep information about the sample of origin of each k-mer when combining many samples. For example, our data structure allows building a cdbg on a metagenomic data set in just a few gigabytes while the space taken by other structures is hundreds of gigabytes, showing an order-of-magnitude improvement. Moreover, the query time for searching a k-mer and fetching the color information in our representation is almost the same as the existing data structures.Using a similar representation as Rainbowfish to store the list of samples each k-mer appears in , combined with the use of the counting quotient filter to find the index associated to each k-mer, we’ve developed Mantis, a space-frugal index over thousands of samples with a fast query time up to 108 times faster than the sequence search tools at the time. Mantis is also a cdbg representation which supports fast graph traversal and other topological analyses in addition to large-scale sequence-level searches. Later, we update the color information representation by adopting a hierarchical encoding that exploits correlations among color vectors. Our results show significant improvements in the overall size and scalability of representation of the color information.In Pufferfish, however, we’ve developed a new data structure for indexing large collections of reference sequences rather than raw sequence samples by building an index for the compacted, colored de Bruijn graph (ccdbg) over the reference sequences. We have developed both a sparse and dense indexing scheme which allows one to trade off index space for query speed (though queries always remain asymptotically optimal). In addition to indexing the k-mers, we store informations such as position and orientation of each k-mer in the reference set which makes it useful for purposes such as alignment and quantification.For my thesis, I propose advancing the ideas explained earlier in two specific, but separate directions. For the reference-base index scheme, I focus on one of the applications of Pufferfish as the alignment tool used in estimating abundance of known genomes in metagenomic analysis. This work is along the lines of the tools such as Bracken and Karp and slightly different from metagenomic classification methods such as Kraken. In addition to that, I will continue working on the scalability limits of Mantis as a large-sequence-search index by investigating the potentials of merging two Mantises without the need to construct the color information in bit vector format and convert it to the hierarchical structure and by controlling the construction memory in a limited range with respect to the size of the input data.Examining Committee:

Chair: Dr. Rob Patro Dept rep: Dr. David Mount Members: Dr. Hector Corrada Bravo Dr. Mihai Pop