PhD Proposal: Telemetry-based Insights and Predictive Analytics for High-performance Computing Systems

Talk
Onur Cankur
Time: 05.13.2025 14:00 to 16:00
Location: IRB IRB-5165

Modern high-performance computing (HPC) systems stream massive volumes of telemetry data via always-on monitoring services such as the Lightweight Distributed Metric Service (LDMS). This longitudinal data enables post-mortem performance analysis, continuous system health monitoring, early anomaly detection, and predictive resource management for proactive scheduling. However, leveraging such telemetry effectively remains challenging because of its scale and heterogeneity and because HPC workloads and system behavior keep evolving. Without advanced analysis techniques, key patterns in the telemetry may go undetected, resulting in missed optimization opportunities, inaccurate diagnostics, inefficient resource management, and ultimately reduced system utilization. This proposal presents research efforts that use rich telemetry data from a leading supercomputer to improve performance, reliability, and resource management in HPC systems. First, I present our work on identifying spatial and temporal trends in GPU usage by analyzing previously under-explored hardware counters. Second, I propose an approach that uses job-aware clustering and graph-based system modeling to detect performance anomalies. Third, I propose a phase-aware online detection and forecasting framework that identifies and predicts application execution phases from system-level telemetry to enable proactive resource management.