Analyzing Large-scale Data in High Performance Computing using Machine Learning
In the last decade, the amount of data generated on parallel systems by computational science simulations and monitoring tools has grown exponentially, driven by increases in system sizes, the availability of additional hardware counters and sensors, and larger parallel storage capacities. Research on analyzing such data has been rapidly moving from manual, one-off tool efforts to statistical analysis and machine learning. Several HPC facilities have begun continuously monitoring their systems and user jobs to collect performance-related data for understanding performance and operational efficiency. Such data can be used to optimize the performance of individual jobs and of the overall system by creating data-driven models that predict the performance of pending jobs. In this talk, I will present our work on modeling the performance of representative control jobs using longitudinal system-wide monitoring data to explore the causes of performance variability. Using machine learning, we can predict the performance of unseen jobs before they are executed, based on the current system state. We analyze these prediction models in detail to identify the features that are the dominant predictors of performance. We demonstrate that such models can be application-agnostic and can predict the performance of applications not included in training. I will also briefly mention other research directions in my research group: parallel deep learning, analyzing large graphs, and modeling epidemic diffusion (more details at https://pssg.cs.umd.edu).