Advancing Computational Science using Extreme-Scale Parallel Computing
Also on Zoom: https://umd.zoom.us/j/97114322433?pwd=TWw0OG8yV3ZTc1d2V0RlYXB6RkNWQT09

Parallel and high-performance computing (HPC) have been critical to advancing computational science disciplines for several decades. Ensuring efficient use of HPC resources is important but challenging because of the increasing complexity of parallel codes and the diversity of hardware platforms. In addition, factors beyond the control of programmers and end users, such as contention for shared resources, can also affect performance. In this talk, I will discuss several research directions that share a common goal: improving the performance of parallel software and systems. I will first describe the challenges in designing and implementing a highly scalable, parallel epidemic modeling code, and the benefits of using Charm++, an asynchronous, adaptive, task-based system. I will then discuss the phenomenon of performance variability on HPC systems and approaches to mitigating it. Finally, I will present a job scheduler based on a machine learning model that is trained on historical performance data and adapts scheduling decisions to reduce the performance variability of parallel codes.