Efficiently Processing Single-Cell and Single-Nucleus RNA-Sequencing Data

A multi-institutional team that includes Associate Professor Rob Patro (left in photo) and Ph.D. student Dongze He (right) has released a toolkit for the efficient processing of single-cell and single-nucleus RNA sequencing data.

March 29, 2022


Rapid improvements in cell sequencing technologies in the last decade have provided clinicians and scientists with many valuable insights—from better treatment options for patients with heart disease and cancer to a much deeper understanding of how certain pathogens can affect plants and animals.

In particular, the exponential growth of high-throughput single-cell and single-nucleus RNA-sequencing technologies (collectively, single-cell transcriptomics technologies) has produced a wealth of new data. In fact, single-cell transcriptomic data constitutes the most ubiquitous component of single-cell multi-omics data, a field that was selected as the “Method of the Year 2019” by the journal Nature Methods.

These technologies enable scientists to measure gene expression at the resolution of individual cells for tens or even hundreds of thousands of cells at a time. The measured gene expression can act as a crucial signal in understanding biological processes and disease progression, and even in informing potential patient treatment options.

The result of this unprecedented resolution is that one can infer gene expression changes in all kinds of interesting biological contexts: How does gene expression differ between cells that respond to a drug versus those that are treatment resistant? How does gene expression differ among closely related cell types that happen to inhabit the same tissues within the body?

Single-cell sequencing has been a revolutionary tool in answering these kinds of questions.

But scientists must first “pre-process” this RNA-sequencing data, a crucial step that involves going from the raw sequencing data to a count of how abundant each gene is within each cell. And while popular commercial software is available to accomplish this task, it is time-consuming, memory-intensive and closed source.
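To make the pre-processing step described above concrete, here is a minimal sketch of its final stage: turning per-read assignments into a count of each gene within each cell. It is an illustration only, not how alevin-fry or any commercial tool is implemented. It assumes each read has already been labeled with a cell barcode, a unique molecular identifier (UMI) and a gene; the function name `count_matrix` and the toy records are hypothetical.

```python
from collections import defaultdict


def count_matrix(records):
    """Collapse (cell_barcode, umi, gene) records into per-cell gene counts.

    Assumption for this sketch: a distinct (barcode, UMI, gene) triple
    represents one original molecule, so repeated triples are treated as
    duplicates and counted only once.
    """
    seen = set()
    counts = defaultdict(lambda: defaultdict(int))
    for barcode, umi, gene in records:
        key = (barcode, umi, gene)
        if key in seen:
            continue  # same molecule seen again; do not count it twice
        seen.add(key)
        counts[barcode][gene] += 1
    # Convert nested defaultdicts to plain dicts for easy inspection.
    return {cell: dict(genes) for cell, genes in counts.items()}


# Hypothetical toy input: five reads from two cells.
records = [
    ("AAAC", "TTG", "GeneA"),
    ("AAAC", "TTG", "GeneA"),  # duplicate of the first read
    ("AAAC", "CCA", "GeneA"),
    ("AAAC", "GGT", "GeneB"),
    ("TTTG", "TTG", "GeneA"),
]

print(count_matrix(records))
```

Real experiments involve hundreds of millions of reads and must also handle sequencing errors in barcodes and reads that match more than one gene, which is where most of the time and memory cost comes from.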

Now, a multi-institutional team of researchers—including four with ties to the University of Maryland—has developed an accurate, computationally efficient, and lightweight toolkit for processing large amounts of raw single-cell and single-nucleus RNA sequencing data.

Their free suite of tools, called alevin-fry, is detailed in a paper published March 10 in Nature Methods.

“As the number and scale of single-cell, including single-nucleus, RNA-sequencing experiments grow, so do the costs associated with the processing of this data,” says Dongze He, a third-year doctoral student in computational biology at UMD and lead author on the paper. “Alevin-fry provides researchers an accurate, flexible and convenient way to process a multitude of types of single-cell data, simplifying and speeding up analysis and reducing computational costs of various single-cell related scientific activities.”

The researchers say that, instead of taking hours of processing time and often requiring server-scale computers with large amounts of memory, their open-source toolset can process very large sets of single-cell data in only tens of minutes while retaining accurate results, using amounts of processing power and memory that are commonly available on commodity desktops and laptops.

This advancement is tied to a series of lightweight algorithms and efficient data structures, as well as a highly tuned implementation that can effectively make use of many processing threads at the same time. Applied in unison, these allow a large amount of reference sequence (critical parts of the underlying genome) to be indexed in a small amount of space, and the gene from which each sequencing read was generated to be inferred quickly and accurately.
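The general idea of indexing reference sequence and then looking up which gene a read came from can be sketched with a toy example. This is not alevin-fry's algorithm or data structure: it swaps in a plain Python dictionary of exact k-mers (short substrings of length k), which is far less compact than what a production tool would use. The sequences, the value of `K`, and the names `build_index` and `assign_read` are all hypothetical.

```python
from collections import Counter, defaultdict

K = 5  # toy k-mer length chosen for this example only


def build_index(references, k=K):
    """Map every k-mer in each reference sequence to the genes containing it."""
    index = defaultdict(set)
    for gene, seq in references.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(gene)
    return index


def assign_read(read, index, k=K):
    """Return the gene with the most k-mer hits, or None if none or tied."""
    votes = Counter()
    for i in range(len(read) - k + 1):
        # .get avoids inserting new keys into the defaultdict on a miss.
        for gene in index.get(read[i:i + k], ()):
            votes[gene] += 1
    if not votes:
        return None  # read shares no k-mer with any reference
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous: two genes are equally supported
    return ranked[0][0]


# Hypothetical reference sequences and reads.
references = {"GeneA": "ACGTACGGTCA", "GeneB": "TTGACCATGCA"}
index = build_index(references)

for read in ("TACGGTC", "GACCATG", "GGGGGGG"):
    print(read, assign_read(read, index))
```

The index is built once and reused for every read, so the per-read work is a handful of dictionary lookups. That separation between a one-time indexing cost and a cheap per-read query is the part of the sketch that carries over to real tools.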

The alevin-fry toolkit applies these lightweight approaches in a way that provides accurate results and should allow computational analysis to keep pace with the quickly advancing biotechnology, says Rob Patro, an associate professor of computer science with an appointment in the University of Maryland Institute for Advanced Computer Studies.

“We think our software provides a compelling option for scientists working with single-cell and single-nucleus RNA-seq data, by enabling accurate and flexible gene quantification at a low computational cost,” Patro says. “As we receive feedback and input from other scientists using this method, we expect our software suite to grow in capability and comprehensiveness. Ultimately, we want this to be available to anyone looking for a seamless, efficient tool for advancing new discoveries using single-cell data.”

Other researchers working on the project are Mohsen Zakeri, who earned his doctorate in computer science from UMD in 2021 and is now a postdoctoral researcher at Johns Hopkins University; Hirak Sarkar, who earned his doctorate in computer science from UMD in 2020 and is now a postdoctoral researcher at Harvard Medical School; Charlotte Soneson, a research associate at the Friedrich Miescher Institute for Biomedical Research in Switzerland; and Avi Srivastava, a postdoctoral researcher at the New York Genome Center.

—Story by Melissa Brachfeld
