PhD Proposal: PhD Preliminary: Communication-efficient Hybrid Parallel Algorithms for Neural Network Training

Siddharth Singh
03.08.2024 09:30 to 11:30

IRB 5165

The trend toward larger neural networks for improved generalization in deep learning has created significant computational challenges, making parallel training across multiple GPUs a necessity. However, communication overheads limit the scalability of such training. This thesis proposes to address these challenges by developing AxoNN, a highly scalable parallel framework for training large neural networks. I propose a five-dimensional hybrid parallel algorithm designed to minimize communication costs while remaining easy to use. I also plan to develop a communication/performance model that guides users to configurations with minimal communication volume. The implementation of AxoNN will focus on maximizing the overlap between computation and communication, thereby reducing GPU idle time. Additionally, I plan to develop a user-friendly version of the framework that greatly simplifies the task of parallelizing neural network training for practitioners. By balancing usability and efficiency, AxoNN aims to advance parallel deep learning for large-scale neural networks.