Data-Efficient and Fault-Tolerant Exascale Computing

Talk
Yafan Huang
Time: 03.26.2026, 11:00 to 12:00

Modern high-performance computing (HPC) systems operate at massive scales, comprising thousands of nodes equipped with high-end CPUs and GPUs to support complex workloads such as large language model training, quantum simulation, and high-resolution scientific simulations. As these systems continue to scale, two major challenges identified by the U.S. Department of Energy (DOE) become increasingly critical: managing the growing volume of data and ensuring robust error resilience.
My research addresses both challenges by developing flexible, efficient, and broadly applicable software solutions. On the data-efficiency side, I design ultra-fast GPU-based compression frameworks, such as cuSZp, that achieve high compression ratios while preserving data fidelity across diverse applications. On the reliability side, I develop low-overhead fault-tolerance techniques that detect complex faults with minimal performance impact. Together, these contributions improve data efficiency and reliability in next-generation HPC and AI systems at scale.
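As background for the data-efficiency theme: error-bounded lossy compressors such as cuSZp guarantee that every reconstructed value stays within a user-specified absolute error bound of the original. A minimal sketch of that core guarantee, using a simple linear quantizer (illustrative only, not cuSZp's actual GPU pipeline; the function names and `err_bound` parameter are assumptions for this example):

```python
import numpy as np

def quantize(data, err_bound):
    # Map each value to an integer bin of width 2*err_bound; the nearest-bin
    # rounding guarantees reconstruction error of at most err_bound.
    return np.round(data / (2.0 * err_bound)).astype(np.int64)

def dequantize(codes, err_bound):
    # Reconstruct each value as the center of its quantization bin.
    return codes * (2.0 * err_bound)

# Smooth scientific-style data compresses well: nearby values fall into the
# same or adjacent bins, so the integer codes are highly compressible.
data = np.sin(np.linspace(0.0, 10.0, 100_000))
eb = 1e-3
codes = quantize(data, eb)
recon = dequantize(codes, eb)
assert np.max(np.abs(recon - data)) <= eb + 1e-12  # error bound holds
```

In a real compressor, the integer codes would then be entropy- or bitplane-encoded; the point of the sketch is only the pointwise error guarantee that distinguishes error-bounded lossy compression from generic lossy methods.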