PhD Defense: Accurate and Scalable Phylogeny Estimation via Graph Cuts

Talk
Yunheng Han
Time: 
07.01.2025 12:00 to 14:00
Location: 

IRB 3137

An evolutionary tree, referred to as a phylogeny, depicts relationships among species. Phylogenetic trees are critical for addressing fundamental questions in evolution, ecology, and genetics. However, they can be difficult to estimate because of the complexity of evolutionary processes being modeled, the hardness of the related optimization problems, as well as incomplete and inaccurate data in practice. This dissertation advances species tree reconstruction from genome-scale data when the evolutionary histories of individual genomic regions across the genome (called gene trees) can differ from the species tree due to a biological process called incomplete lineage sorting (ILS), which is modeled by the Multi-Species Coalescent (MSC).
In Chapter 2, I introduce TREE-QMC, a new method for estimating a species tree from gene trees based on four-leaf (induced) subtrees, called quartets, which have desirable properties under the MSC. TREE-QMC is a heuristic for the NP-hard Maximum Quartet Support Super Tree problem based on the divide-and-conquer approach proposed by Snir and Rao (2010). My main contribution is showing that a mathematical object called “the quartet graph” can be built directly from gene trees, without explicitly enumerating quartets. This result enables TREE-QMC to run in cubic time, making it the first method to break the quartic time barrier without down-sampling the input quartets since the divide-and-conquer approach was proposed in 2010. Moreover, I show that normalizing the quartet weights to account for “artificial taxa”, introduced during the divide phase so subproblem solutions can be combined during the conquer phase, can improve accuracy.
In Chapter 3, I reformulate the TREE-QMC algorithm to incorporate branch lengths and support values from gene trees in weighting quartets, improving the robustness of TREE-QMC to error-ridden inputs, including systematic homology errors. Although the weighted algorithm has a small increase in time complexity compared to the unweighted algorithm; in practice, the increase in runtime is also small, behaving more like a constant factor. Moreover, weighted TREE-QMC is highly competitive with the leading weighted ASTRAL, even outperforming it in terms of species tree accuracy on some challenging model conditions, such as large numbers of taxa.
Lastly, in Chapter 4, I leverage the TREE-QMC approach for scenarios where the species history is a network rather than a tree, which can occur due to hybridization or admixture. Recent research shows that tree-like aspects of a network are identifiable under the Network MSC and can be represented as a non-binary tree, called the “tree of blobs.’’ This object is useful for divide-and-conquer network reconstruction; however, the leading method TINNIK does not scale to large numbers of taxa. I show that building a tree with TREE-QMC and then contracting the edges based on statistical testing enables greater scalability and accuracy than TINNIK.
Overall, the algorithms presented in this dissertation advance large-scale quartet-based estimation of species trees from gene trees and highlight directions for future research.