Training at Scale: GPU Systems Driving Advances in LLMs and GNNs

Talk
Abhinav Bhatele
Time: 09.26.2025, 11:00 to 12:00

Significant advances in computer architecture (the popularity of accelerators such as GPGPUs) and parallel computing (scalable libraries for dense and sparse linear algebra) have contributed to the ongoing AI revolution. In particular, distributed training of large language models (LLMs) relies on scalable matrix multiplication algorithms and efficient communication over high-speed interconnects. Pre-training and fine-tuning LLMs with hundreds of billions to trillions of parameters, as well as training graph neural networks (GNNs) on extremely large graphs, requires hundreds to tens of thousands of GPUs. However, such training often suffers from significant scaling bottlenecks, such as high communication overheads and load imbalance. In this talk, I will present several systems research directions that directly impact AI model training. First, I will describe my group's work on using a three-dimensional parallel algorithm for matrix multiplication in large-scale LLM training. We have implemented these techniques, along with additional performance optimizations, in a highly scalable, open-source framework called AxoNN. Second, I will demonstrate the application of the same algorithm to full-graph GNN training on extremely large graphs. Finally, I will discuss the need for scalable collective communication routines for large model training.
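
To illustrate the idea behind a three-dimensional decomposition of matrix multiplication mentioned in the abstract, the following serial NumPy sketch simulates how C = A @ B could be partitioned across a logical px x py x pz grid of GPUs, with partial products summed along the depth dimension. The grid sizes, block layout, and loop structure here are illustrative assumptions for exposition only, not AxoNN's actual implementation.

```python
# Conceptual, serial simulation of a 3D block decomposition of C = A @ B.
# On real hardware, each (i, j, l) position in the px x py x pz grid is a GPU
# that multiplies its local shards, and an all-reduce along the depth axis l
# sums the partial products. Here we emulate that with loops (illustrative only).
import numpy as np

px, py, pz = 2, 2, 2          # logical GPU grid (assumed sizes)
m, k, n = 8, 8, 8             # global matrix dimensions, divisible by grid dims

rng = np.random.default_rng(0)
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))

# Block sizes owned by each simulated "GPU".
mb, kb, nb = m // px, k // pz, n // py

C = np.zeros((m, n))
for i in range(px):           # block rows of the output
    for j in range(py):       # block columns of the output
        partial = np.zeros((mb, nb))
        for l in range(pz):   # depth dimension: partial products to be reduced
            A_blk = A[i*mb:(i+1)*mb, l*kb:(l+1)*kb]   # local shard of A
            B_blk = B[l*kb:(l+1)*kb, j*nb:(j+1)*nb]   # local shard of B
            partial += A_blk @ B_blk
        C[i*mb:(i+1)*mb, j*nb:(j+1)*nb] = partial

# The block-decomposed result matches the dense product.
assert np.allclose(C, A @ B)
```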