PhD Proposal: Data-driven Algorithms for Characterizing Structural Variation in Metagenomic Data

Talk
Harihara Subrahmaniam Muralidharan
Time: 11.21.2022 12:00 to 14:00
Location: IRB 3137

High-throughput sequencing has revolutionized the field of microbiology. However, reconstructing complete genomes of organisms from whole metagenomic shotgun sequencing data remains a challenge. Recovered genomes are often highly fragmented due to uneven abundances of organisms, repeats within and across genomes, sequencing errors, and strain-level variation. Binning clusters contigs that are inferred to have originated from the same organism. Existing binning algorithms use oligonucleotide frequencies and contig abundance (coverage) within and across samples to group together contigs from the same organism. However, these algorithms often miss short contigs and contigs from regions with unusual coverage or DNA composition characteristics, such as mobile elements. Here we propose that information from assembly graphs can assist current strategies for metagenomic binning. We use MetaCarvel, a metagenomic scaffolding tool, to construct assembly graphs in which nodes represent contigs and edges are inferred from paired-end reads. We developed a tool, Binnacle (https://github.com/marbl/binnacle), that extracts information from the assembly graphs and clusters scaffolds into comprehensive bins. We show that binning graph-based scaffolds, rather than contigs, improves the contiguity and quality of the resulting bins and captures a broader set of the genes of the organisms being reconstructed.

In the second part of the proposal, we present, as a case study, our analysis of the variants of Synechococcus spp. present in the microbial mats of Mushroom Spring and Octopus Spring, two hot springs in Yellowstone National Park. The cyanobacterium Synechococcus is abundant in these mats along a stable temperature gradient from ~50 °C to ~70 °C and plays a key role in the carbon and nitrogen cycles. Previous studies have isolated two major Synechococcus spp., OS-A and OS-B’, and generated quality reference sequences for them. In this work, we propose a systematic approach to explore the genomic diversity of the Synechococcus spp. in 34 metagenomic samples from the two hot springs, comparing samples across time and temperature. We note that, despite its high abundance, Synechococcus does not assemble well. To address this, we also describe a reference-guided scaffolding approach to detect putative variant groups that have not been reported before.

Finally, coassembly is often performed to combine signals from multiple samples. Coassembly has been used to compare genomic features across samples and is often used in metagenomic binning and strain-level variant discovery. However, current approaches to coassembly are computationally intensive and do not scale to large sequencing depths. To address this, in this thesis we propose a framework that merges assemblies to form a coassembly graph. Preliminary results indicate that the proposed approach scales to many samples. Graph traversals and heuristics that have been used in conventional genome assembly can be extended to this framework to spell out longer contiguous segments. We also hypothesize that algorithms similar to the ones described in MetaCarvel can be extended to this coassembly graph for variant discovery. Since this framework merges information across multiple samples, it can also be used to perform sample-level comparisons.
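
As a rough illustration of the composition signal that the abstract says existing binning algorithms rely on, the sketch below computes normalized oligonucleotide (k-mer) frequencies for a contig and a Euclidean distance between two such profiles. It is a minimal sketch under stated assumptions, not the method used by Binnacle or by any particular binner: the function names `kmer_frequencies` and `composition_distance`, the default k = 4, and the choice of Euclidean distance are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    Hypothetical helper for illustration; windows containing N or other
    ambiguity codes are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every length-k window in the sequence.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Normalize only over windows made of unambiguous bases.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return {m: 0.0 for m in kmers}
    return {m: counts[m] / total for m in kmers}


def composition_distance(seq_a, seq_b, k=4):
    """Euclidean distance between two k-mer frequency profiles (assumed metric)."""
    fa = kmer_frequencies(seq_a, k)
    fb = kmer_frequencies(seq_b, k)
    return sum((fa[m] - fb[m]) ** 2 for m in fa) ** 0.5


if __name__ == "__main__":
    contig_a = "ACGTACGTACGTACGTTTGACA"
    contig_b = "GGGCGCGCGGGCCCGCGGGCGC"
    print(composition_distance(contig_a, contig_a))  # identical profiles
    print(composition_distance(contig_a, contig_b))  # differing composition
```

A short contig yields few k-mer windows, so its frequency profile is a noisy estimate. This is consistent with the observation in the abstract that composition-based binning tends to miss short contigs.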

Examining Committee

Chair:

Dr. Mihai Pop

Department Representative:

Dr. David Mount

Members:

Dr. Robert Patro

Dr. Erin Molloy