PhD Proposal: Exploring the implicit structure in RNA-seq data and its uses in efficient processing

Hirak Sarkar
12.10.2019 10:00 to 12:00
IRB 3137

The past decade has seen tremendous growth in high-throughput sequencing technology, which has in turn increased the storage and processing requirements of publicly available multi-omics datasets. This enormous amount of data also calls for better algorithms to process, extract, and filter useful knowledge from the data. In this proposal, we concentrate on the challenges and solutions related to the processing of bulk RNA-seq data. An RNA-seq dataset consists of raw nucleotide sequences, often sequenced from multiple samples. One of the most popular use cases of RNA-seq is to obtain transcript- or gene-level counts from the raw nucleotide read sequences and to use those counts for downstream analysis such as differential expression. A typical computational pipeline for such processing broadly involves two steps: assigning reads to the reference sequence through alignment or mapping, and subsequently quantifying those assignments to obtain the expression of the reference transcripts or genes. In practice, this two-step process poses many challenges, ranging from noise and experimental artifacts in the raw sequences to the disambiguation of multi-mapped reads. In this proposal, we describe these problems and present efficient, state-of-the-art solutions. The proposal describes an alternate representation of an RNA-seq experiment encoded in the form of equivalence classes, where, instead of treating each transcript individually, a group of transcripts is regarded as a unit. We use the equivalence classes for a number of applications, ranging from data-driven compression methods to clustering de novo transcriptome assemblies. Another challenge in dealing with large RNA-seq datasets is developing efficient data structures that store the reference while also enabling fast queries of the read sequences.
In this proposal, we describe a succinct data structure that stores the reference sequence in a space-efficient manner while enabling fast k-mer queries.
Although the amount of experimental RNA-seq data is vast, the absence of a ground truth complicates the validation of existing quantification tools. One way to bypass the problem is to use simulated data. With the ever-growing number of quantification tools, the use of simulations as a method of validation has also become commonplace. Unfortunately, simulation methodologies inherently start with a set of assumptions that make the simulated datasets differ, often substantially, from real ones. The proposal discusses problems related to different simulation techniques and their effect on the validation of tools. These problems are also prevalent in the relatively recent domain of single-cell RNA-seq. Here, we describe our simulation tool, Minnow, for generating tagged-end, droplet-based single-cell RNA-seq datasets, and we attempt to address some of the gaps between simulated and experimental datasets.

Examining Committee:

Chair: Dr. Rob Patro
Dept rep: Dr. Marine Carpuat
Members: Dr. Mihai Pop, Dr. Hector Corrada Bravo