PhD Proposal: Clustering Algorithms for Characterizing Microbial Communities

Talk
Tu Luan
Time: 
04.25.2023 13:00 to 15:00
Location: 

IRB 3137

Genomic sequence clustering, particularly 16S rRNA gene sequence clustering, is an important step in characterizing the diversity of microbial communities through an amplicon-based approach. As 16S rRNA gene datasets grow in size, existing sequence clustering algorithms increasingly become an analytical bottleneck. Part of this bottleneck is due to the substantial computational cost expended on small clusters and singleton sequences. We present SCRAPT, an iterative sampling-based 16S rRNA gene sequence clustering approach that targets the largest clusters in the dataset, allowing users to stop the clustering process once enough clusters are available for the analysis at hand. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the process identifies the larger clusters in the dataset first. Using real datasets of 16S rRNA gene sequences, we show that the iterative algorithm, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while effectively capturing the large clusters in the dataset. The experiments also show that SCRAPT produces operational taxonomic units (OTUs) that are less fragmented than those produced by popular tools such as UCLUST, CD-HIT, and DNACLUST.

The emergence of long-read sequencing technologies, capable of producing reads of 10,000 base pairs or longer, provides opportunities in various areas of genomic studies. In the latter sections of this proposal, we outline our future plans for characterizing microbial communities using clustering algorithms that incorporate long-read sequencing technologies. Our first objective is to extend the SCRAPT algorithm to cluster full-length 16S rRNA gene sequences generated by long-read sequencing platforms.

Metagenomic scaffolding is the process of reconstructing the original genomic sequences of organisms from metagenomic sequencing data. It can be viewed as clustering metagenomic assembled contigs that originate from the same organism and creating a graph layout based on mate-pair or paired-end read information. Our second objective is to extend MetaCarvel, a specialized tool for metagenomic scaffolding, to perform hybrid metagenomic scaffolding, which would combine the strengths of both short- and long-read sequencing data to improve the contiguity and repeat resolution of metagenomic scaffolds.
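
The iterative, sampling-based clustering idea described in the abstract can be illustrated with a small toy sketch. This is not the SCRAPT implementation: the function names, parameters, and the position-wise identity measure are assumptions made for illustration (real tools use alignment-based similarity), and the adaptive sampling and mode-shifting steps are omitted. Each round draws a random sample of the unclustered sequences, picks cluster centers from the sample, and recruits every remaining sequence that is similar enough to a center. Because a random sample is more likely to contain members of large clusters, those clusters tend to be found in early rounds, and the round limit stands in for a user stopping early.

```python
import random


def identity(a, b):
    """Toy similarity (assumption): fraction of matching positions.

    Sequences of different lengths are treated as dissimilar; real tools
    would compute an alignment instead.
    """
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def iterative_sample_cluster(seqs, threshold=0.9, sample_size=5,
                             max_rounds=10, seed=0):
    """Hypothetical sketch of iterative sampling-based clustering.

    Returns (clusters, leftover), where clusters is a list of
    (center_index, member_indices) pairs and leftover holds the indices
    still unclustered when the round limit was reached.
    """
    rng = random.Random(seed)
    remaining = list(range(len(seqs)))
    clusters = []
    for _ in range(max_rounds):
        if not remaining:
            break
        # Draw a random sample of the sequences that are still unclustered.
        sample = rng.sample(remaining, min(sample_size, len(remaining)))
        # Greedily keep sampled sequences that are not similar to an
        # already chosen center; these become this round's centers.
        centers = []
        for i in sample:
            if all(identity(seqs[i], seqs[c]) < threshold for c in centers):
                centers.append(i)
        # Recruit every remaining sequence to its most similar center,
        # provided the similarity reaches the threshold.
        members = {c: [] for c in centers}
        unassigned = []
        for i in remaining:
            best = max(centers, key=lambda c: identity(seqs[i], seqs[c]))
            if identity(seqs[i], seqs[best]) >= threshold:
                members[best].append(i)
            else:
                unassigned.append(i)
        clusters.extend(members.items())
        remaining = unassigned
    return clusters, remaining
```

Calling `iterative_sample_cluster(seqs, max_rounds=1)` on a list of equal-length strings returns only the clusters found in the first round and leaves the rest of the sequences in `leftover`, which mirrors stopping once enough clusters are available.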

Examining Committee

Chair:

Dr. Mihai Pop

Department Representative:

Dr. Aravind Srinivasan

Members:

Dr. Brantley Hall