PhD Proposal: Applications of Graph Segmentation Algorithms in Quantitative Genomics Analysis

Talk
Mohamed Gunady
Time: 07.19.2018 12:00 to 14:00
Location: 

subproblems of genomic analysis. Since graphs provide a natural and efficient representation of sequence data in which structural relationships are observed, we study graph applications in the quantitative analysis of typical RNA-seq and Whole Genome Sequencing (WGS) pipelines.

Analysis of differential alternative splicing from RNA-seq data is complicated by the fact that many RNA-seq reads map to multiple transcripts; moreover, the annotated transcripts are often a small subset of the possible transcripts of a gene. This work describes Yanagi, a tool for segmenting transcriptomes to create a library of maximal L-disjoint segments from a complete transcriptome annotation. That segment library preserves transcriptome substrings and structural relationships between transcripts while eliminating unnecessary sequence duplications.

First, we formalize the concept of transcriptome segmentation and propose an efficient algorithm for generating segment libraries. The resulting segment sequences can be used with pseudo-alignment tools to quantify gene expression and alternative splicing at the segment level, and to provide gene-level visualization of the segments for better interpretability. The notion of transcript segmentation as introduced here and implemented in Yanagi opens the door for the application of lightweight, ultra-fast pseudo-alignment algorithms in a wide variety of RNA-seq analyses.

Another use case of our graph segmentation approach is representing the population reference genome graphs used in WGS, which can be crucial for genomic analyses of highly polymorphic genes, such as the HLA genes in the human genome. Graph-based aligners are usually slow and computationally demanding. Using segments gives any linear aligner the efficient graph representation of population variation while avoiding the expensive computational overhead of aligning over graphs.

Finally, as future work we propose using segment counts as statistics for estimating transcript expression levels, and building deep models that learn tissue-specific target features from segment counts alongside sequence and chromatin measurements.
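
As a rough illustration of the segmentation idea described in the abstract, the toy sketch below splits exons at every annotated boundary and then merges consecutive pieces that belong to exactly the same set of transcripts. This is not Yanagi's algorithm: it ignores the parameter L entirely, it does not handle reads that span junctions between segments, and the function name `segment_transcriptome`, the input layout, and the toy coordinates are all assumptions made for illustration.

```python
def segment_transcriptome(transcripts):
    """Toy segmentation (illustrative only, not Yanagi's method).

    transcripts: dict mapping a transcript name to a list of
    half-open (start, end) exon intervals on one strand of one gene.
    Returns a list of (parts, members) tuples, where parts is a list of
    genomic intervals and members is the frozenset of transcripts that
    contain all of those parts consecutively.
    """
    # Every exon start or end is a potential cut point.
    cuts = sorted({p for exons in transcripts.values()
                   for s, e in exons for p in (s, e)})
    segments = []
    for lo, hi in zip(cuts, cuts[1:]):
        # Transcripts with an exon that fully covers this piece.
        members = frozenset(
            name for name, exons in transcripts.items()
            if any(s <= lo and hi <= e for s, e in exons)
        )
        if not members:
            continue  # purely intronic or intergenic piece
        if segments and segments[-1][1] == members:
            # Same transcript set as the previous piece: extend that segment.
            segments[-1][0].append((lo, hi))
        else:
            segments.append(([(lo, hi)], members))
    return segments


# Hypothetical three-transcript gene.
toy = {
    "T1": [(0, 100), (200, 300)],
    "T2": [(0, 100), (200, 300), (400, 500)],
    "T3": [(400, 500)],
}
for parts, members in segment_transcriptome(toy):
    print(parts, sorted(members))
```

In this toy gene, the first two exons are shared by T1 and T2 only, so they collapse into a single segment stored once, while the last exon forms a second segment shared by T2 and T3. Counting reads per segment rather than per transcript is the kind of statistic the abstract proposes to build on.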

Examining Committee:

Chair: Dr. Hector Corrada Bravo
Dept. rep: Dr. James Reggia
Members: Dr. Mihai Pop, Dr. Steve Mount