PhD Proposal: Towards Enabling GPU Performance Portability
Supercomputers increasingly rely on a widening range of Graphics Processing Units (GPUs) to provide maximal computing power at minimal energy consumption and deployment cost. As a result, their users need performance portability: the ability for a single application to run with good performance (e.g., time to solution) across multiple supercomputers. Portable programming models allow a single piece of code to execute on multiple GPU types through an abstract interface, yet performance portability remains challenging to achieve. The extent to which each of the many available models actually enables performance portability is not well understood, and the process of converting an existing application codebase to a new model is time-consuming, making it difficult to test and compare multiple options.
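As a concrete illustration, the minimal sketch below uses Kokkos as one representative portable programming model (the proposal's study considers several others); the single source compiles unchanged against CUDA, HIP, SYCL, or OpenMP backends selected at build time.

#include <Kokkos_Core.hpp>

// Minimal AXPY kernel written once against Kokkos' abstract interface.
// The backend chosen at build time maps the parallel_for onto the target GPU.
int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    Kokkos::View<double*> x("x", n);
    Kokkos::View<double*> y("y", n);
    Kokkos::deep_copy(x, 1.0);
    Kokkos::deep_copy(y, 2.0);
    const double a = 3.0;

    // One abstract parallel loop; no CUDA- or HIP-specific code appears here.
    Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
      y(i) = a * x(i) + y(i);
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}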
Furthermore, when a programming model fails to deliver the desired performance on a new GPU architecture, identifying ways to improve performance requires expert-level understanding of both the hardware and the programming model. In this proposal, we focus on these key challenges, dividing the problem of performance portability into three stages: planning, implementation, and optimization. We present the results of a comparative study that provides a comprehensive overview of how well GPU programming models enable performance portability, along with a novel level of depth in the analysis of the results. To address the tedium and time cost of converting an application to a portable programming model, we also present a benchmarking effort that compares agentic and non-agentic translation methods across a range of open-source and commercial large language models (LLMs). We propose additional work to develop an advanced framework for token-efficient full-repository translation using agentic planning and feedback.
We further propose a method to automatically tune portable GPU kernels by deriving performance models from optimizations identified through correctness-preserving code mutations, enabling lightweight, low-overhead tuning of GPU kernels on new hardware platforms.
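To make the idea concrete, the hypothetical sketch below (illustrative only, not the proposed framework) treats different Kokkos team sizes as correctness-preserving mutations of the same kernel and simply times each variant; the proposed method would instead use such mutations to fit a performance model and select parameters with far fewer measurements.

#include <Kokkos_Core.hpp>
#include <cstdio>

// Illustrative sketch: each candidate team size yields a semantically
// identical (correctness-preserving) variant of the same AXPY kernel.
using Policy = Kokkos::TeamPolicy<>;

double time_variant(int team_size, Kokkos::View<double*> x, Kokkos::View<double*> y) {
  const int chunk = 256;                                   // elements per team
  const int leagues = static_cast<int>(x.extent(0)) / chunk;
  Kokkos::Timer timer;
  Kokkos::parallel_for("axpy_team", Policy(leagues, team_size),
    KOKKOS_LAMBDA(const Policy::member_type& team) {
      const int base = team.league_rank() * chunk;
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, chunk), [=](const int j) {
        y(base + j) = 3.0 * x(base + j) + y(base + j);
      });
    });
  Kokkos::fence();
  return timer.seconds();
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 22;
    Kokkos::View<double*> x("x", n), y("y", n);
    Kokkos::deep_copy(x, 1.0);
    Kokkos::deep_copy(y, 2.0);

    double best_time = 1e30;
    int best_team = 0;
    for (int team_size : {32, 64, 128, 256}) {             // candidate mutations
      const double t = time_variant(team_size, x, y);
      if (t < best_time) { best_time = t; best_team = team_size; }
    }
    std::printf("best team size: %d (%.3e s)\n", best_team, best_time);
  }
  Kokkos::finalize();
  return 0;
}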