PhD Defense: LLM-driven Development of Performant and Portable GPU Codes
IRB-5105 https://umd.zoom.us/my/jhdavis
Supercomputers increasingly rely on a widening range of Graphics Processing Units (GPUs) to provide maximal computing power at minimal energy consumption and deployment cost. As a result, their users need performance portability: the ability for a single application to run with good performance (e.g., time to solution) across multiple supercomputers. Portable programming models allow a single piece of code to execute on multiple GPU types through an abstract interface, but performance portability nevertheless remains challenging to achieve. The extent to which each of the many available models actually enables performance portability is not well understood, and converting an existing application codebase to a new model is time-consuming, making it difficult to test and compare multiple options. Furthermore, when a programming model fails to deliver the desired performance on a new GPU architecture, identifying ways to improve performance requires expert-level understanding of both the hardware and the programming model.

In this dissertation, we focus on these key challenges, dividing the problem of performance portability into three stages: planning, implementation, and optimization. We present the results of a comparative study that provides a comprehensive overview of how well GPU programming models enable performance portability, with a novel depth of analysis. To address the tedium and time cost of converting an application to a portable programming model, we also present a benchmarking effort that compares agentic and non-agentic translation methods using a range of open-source and commercial large language models (LLMs). Finally, to make it easier to optimize GPU kernels on new architectures, we present KEET, an LLM-based agentic framework for analyzing and explaining the performance of CUDA kernels using Nsight Compute profiles.