PhD Proposal: Effective integration of genome-scale data across species and samples

Jason Fan
06.15.2022 13:00 to 15:00

IRB 3137

Recent advances in genome-scale assays and high-throughput sequencing have made measurements in model organisms both accessible and abundant. As a result, novel algorithms that exploit similarities across multiple samples and/or multiple organisms have been designed to improve analyses and gain new insights. However, these models can be difficult to optimize in practice because of the large number of interactions that must be modeled between multiple genes, across multiple samples, and across multiple organisms. Furthermore, simultaneous analysis of high-throughput sequencing data from multiple samples and organisms can be prohibitively costly in terms of space. This PhD proposal will present prior, ongoing, and future work that addresses these challenges, with emphasis on techniques that make analyses work well in practice.

First, I will discuss prior work that integrates data across model organisms. We present a novel matrix factorization framework for predicting synthetic-lethal genetic interactions that is orders of magnitude faster to train than the state-of-the-art deep-learning-based approach. Here, fast training and careful application of hyper-parameter tuning techniques are key to achieving state-of-the-art performance. Second, I will discuss a recently published metric and tool that is the first to enable model selection for transcript abundance estimation algorithms in experimental RNA-Seq data, where "ground truth" is rarely available. Finally, I will discuss ongoing and future work on a new tool that enables space-efficient indexing of huge reference sequence collections.

Examining Committee:

Chair: Dr. Rob Patro
Department Representative: Dr. Jordan Boyd-Graber
Dr. Erin Molloy
Dr. Mihai Pop
Dr. Max Leiserson