CBCB Researchers Present Four Papers at Prestigious Algorithms in Bioinformatics Workshop

Molloy and Rubel Present Key Genomic Tools at WABI 2023 in Houston, focusing on new algorithms that explore novel computational methods to study the human genome.
Descriptive image for CBCB Researchers Present Four Papers at Prestigious Algorithms in Bioinformatics Workshop

Erin Molloy, an assistant professor of computer science, and doctoral student Tobias Rubel, are part of a team of researchers from the University of Maryland’s Center for Bioinformatics and Computational Biology (CBCB) presenting their work at the Workshop on Algorithms in Bioinformatics (WABI 2023).

The annual conference—held this year from September 4–6 in Houston, Texas—brings together experts from around the world who are focused on discrete algorithms and machine-learning methods that address important problems in molecular biology.

The UMD researchers are discussing work that includes a new algorithm for the parallel construction of suffix arrays—one of the key tools for processing vast amounts of genomic data—and other research that explores novel computational methods to study the human genome and the evolution of tumors.

“WABI is a very important conference in our field, and has a rich history of impactful papers,” says Michael Cummings, a professor of biology with an appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS) who is the director of CBCB. “The four papers in this conference are evidence of the quality of research from our faculty and students.”

One of the papers, “Fast, Parallel, and Cache-friendly Suffix Array Construction,” offers a new approach for building suffix arrays that coalesce efficiently with modern CPUs.

The suffix array is a classic data structure used to “index” sequences. That is, given some very long sequence, like a genome, the suffix array allows scientists to search for a particular query fast. This data structure can tell them if the query occurs and how many times, as well as the precise location in the underlying sequence.

Due to its importance, algorithms for building the suffix array have been studied for decades, says Jamshed Khan, a fifth-year computer science doctoral student who is lead author on the paper.

In addition to answering questions about how much computation is necessary to construct the suffix array, the problem has also been examined from the perspective of parallel algorithms. Specifically, if researchers have access to processors with many cores, how can they make use of this capability of parallel computation to speed up the suffix array construction?

Khan says the algorithm developed by the CBCB team is conceptually simpler than several existing parallel algorithms. “We demonstrate that when parallel algorithms start outperforming the fastest serial algorithms, ours is the fastest, and that it scales better as more parallel cores are made available,” he says.

Joining Khan in this work are co-authors Rob Patro, an associate professor of computer science, Laxman Dhulipala and Erin Molloy, both assistant professors of computer science, and Tobias Rubel, a third-year doctoral student in computer science.

Patro, Dhulipala and Molloy all have joint appointments in UMIACS.

Another paper presented at the workshop, “Fulgor: A Fast and Compact k-mer Index for Large-Scale Matching and Color Queries,” defines and develops a new generation of efficient and modular indexes for k-mer queries—short sequences of DNA and RNA with a fixed length.

“Specifically, our k-mer color query tells us which reference sequences contain a queried k-mer, but not where said k-mer occurs,” says lead author Jason Fan, who just defended his Ph.D. dissertation in computer science.

This would be like rapidly searching for a particular word in a set of books and articles and knowing it’s there but not exactly where, Fan explains.

However, this approximation is sufficient for useful computational analyses—such as taxonomic classification—and, more importantly, enables the analysis and indexing of massive collections of reference genome sequences.

“We demonstrate in our experiments that it is at least two times smaller than the prior state-of-the-art methods and at least twice as fast,” Fan says. “Excitingly, we have already begun to improve and develop new and more efficient algorithms that better exploit the structure of indexed data.”

In addition to Fan, co-authors of the paper are Rob Patro, Jamshed Khan, Noor Pratap Singh, a fifth-year doctoral student in computer science, and Giulio Ermanno Pibiri, an assistant professor of computer science at Ca’Foscari University of Venice.

Molloy and her students are presenting two additional papers at WABI 2023. The first, “Quartets Enable Statistically Consistent Estimation of Cell Lineage Trees Under an Unbiased Error and Missingness Model,” expands upon what researchers know about models of cancer evolution, proposing a new methodology for reconstructing the evolutionary history of a tumor.

Tumor phylogenies, as they are called, reveal how cells in a tumor evolved from a common ancestor cell through cell division. Molloy and lead author Yunheng Han, a fifth-year doctoral student in computer science, say that accurate reconstruction is critical for studying cancer progression and response to treatment, and ultimately the development of more effective therapies. Their new method is a promising path forward given its strong statistical properties.

The second paper, “Leveraging Constraints Plus Dynamic Programming for the Large Dollo Parsimony Problem,” presents a new algorithmic approach to the Dollo parsimony problem, which has the goal of finding the simplest evolutionary tree (phylogeny) that explains the data, under the assumption that mutations are rare and thus are unlikely to occur multiple times across the tree.

In contrast, the loss of the mutation is more likely and as such could reasonably occur multiple times. These assumptions hold true for tumor evolution and some data types for species evolution. The researchers implemented and evaluated their algorithm in the context of the latter—using both simulated and real data sets—finding that it reliably improves the performance of prior methods.

Future work will evaluate the utility of the method in the context of cancer.

Junyan Dai, a senior double majoring in mathematics and computer science, was the lead author of the paper, teaming up with Rubel, Han and Molloy.

—Story by Melissa Brachfeld, UMIACS communications group

The Department welcomes comments, suggestions and corrections.  Send email to editor [-at-] cs [dot] umd [dot] edu.