Data-Efficient and Fault-Tolerant Exascale Computing

Talk
Yafan Huang
Time: 03.26.2026, 11:00 to 12:00

Modern high-performance computing (HPC) systems operate at massive scales, comprising thousands of nodes equipped with high-end CPUs and GPUs to support complex workloads such as large language model training, quantum simulation, and high-resolution scientific simulations. As these systems continue to scale, two major challenges identified by the U.S. Department of Energy (DOE) become increasingly critical: managing the growing volume of data and ensuring robust error resilience.
My research addresses both challenges by developing flexible, efficient, and broadly applicable software solutions. On the data-efficiency side, I design ultra-fast GPU-based compression frameworks, such as cuSZp, that achieve high compression ratios while preserving data fidelity across diverse applications. On the reliability side, I develop low-overhead fault-tolerance techniques that detect complex faults with minimal performance impact. Together, these contributions improve data efficiency and reliability in next-generation HPC and AI systems at scale.
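As background for the data-efficiency theme: error-bounded lossy compressors such as cuSZp guarantee that every reconstructed value stays within a user-specified absolute error bound of the original. A minimal sketch of that core guarantee, using a simple linear quantizer (illustrative only, not cuSZp's actual GPU pipeline; the function names and `err_bound` parameter are assumptions for this example):

```python
import numpy as np

def quantize(data, err_bound):
    # Map each value to an integer bin of width 2*err_bound; the nearest-bin
    # rounding guarantees reconstruction error of at most err_bound.
    return np.round(data / (2.0 * err_bound)).astype(np.int64)

def dequantize(codes, err_bound):
    # Reconstruct each value as the center of its quantization bin.
    return codes * (2.0 * err_bound)

# Smooth scientific-style data compresses well: nearby values fall into the
# same or adjacent bins, so the integer codes are highly compressible.
data = np.sin(np.linspace(0.0, 10.0, 100_000))
eb = 1e-3
codes = quantize(data, eb)
recon = dequantize(codes, eb)
assert np.max(np.abs(recon - data)) <= eb + 1e-12  # error bound holds
```

In a real compressor, the integer codes would then be entropy- or bitplane-encoded; the point of the sketch is only the pointwise error guarantee that distinguishes error-bounded lossy compression from generic lossy methods.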