On Mitigating Congestion in High Performance Networks

Talk
Abhinav Bhatele
Center for Applied Scientific Computing, Lawrence Livermore National Laboratory
Time: 02.07.2019 11:00 to 12:00
Location: AVW 4172

High performance networks enable fast communication between compute nodes on large clusters and supercomputers. Even so, many parallel programs spend a significant fraction of their execution time communicating over these networks (process-to-process messages, filesystem reads and writes, etc.). This is because network resources are shared among different traffic classes and among concurrently running programs (jobs). The sharing leads to network congestion and, as a result, to run-to-run performance variability and performance degradation of individual jobs. No satisfactory solutions yet exist for mitigating such performance degradation on systems that allow jobs to share the network. In this talk, I will present two novel algorithms that mitigate congestion on high performance networks by minimizing the sharing of network links among jobs. The first is a new resource allocation policy that the job scheduler uses on systems with fat-tree networks to assign "isolated" node partitions to individual jobs. These isolated partitions prevent multiple jobs from sharing the same network links and, as a result, completely eliminate inter-job network interference. The second is a new adaptive routing algorithm that considers link congestion arising from overlapping network flows of multiple jobs. Our adaptive flow-aware routing (AFAR) algorithm implements a greedy heuristic that migrates some flows from heavily congested network links to links with low network traffic. I will also present a brief overview of my research and plans for future research.
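
To make the idea of greedily migrating flows away from congested links concrete, here is a toy sketch in Python. It is not the AFAR algorithm presented in the talk, whose details the abstract does not give. The flow names, the per-flow demands, the precomputed `alternatives` paths, and the "lower the peak link load" acceptance rule are all assumptions made for illustration.

```python
from collections import defaultdict


def link_loads(flows, routes):
    """Sum the demand of every flow that crosses each link."""
    loads = defaultdict(float)
    for flow, demand in flows.items():
        for link in routes[flow]:
            loads[link] += demand
    return loads


def greedy_reroute(flows, routes, alternatives, max_moves=10):
    """Hypothetical greedy heuristic: repeatedly move one flow off the
    most heavily loaded link, as long as the move lowers the peak load.

    flows:        {flow_id: demand}
    routes:       {flow_id: tuple of link ids currently used}
    alternatives: {flow_id: list of candidate paths (tuples of link ids)}
    """
    routes = dict(routes)  # work on a copy; the caller's routes stay unchanged
    for _ in range(max_moves):
        loads = link_loads(flows, routes)
        if not loads:
            break
        hot = max(loads, key=loads.get)  # most congested link
        peak = loads[hot]
        best = None  # (new_peak, flow, path)
        for flow in flows:
            if hot not in routes[flow]:
                continue
            for path in alternatives.get(flow, []):
                trial = dict(routes)
                trial[flow] = path
                new_peak = max(link_loads(flows, trial).values())
                if new_peak < peak and (best is None or new_peak < best[0]):
                    best = (new_peak, flow, path)
        if best is None:
            break  # no single migration reduces the peak load
        routes[best[1]] = best[2]
    return routes


# Two heavy flows share link "L1"; link "L2" carries only a light flow.
flows = {"a": 5, "b": 5, "c": 1}
routes = {"a": ("L1",), "b": ("L1",), "c": ("L2",)}
alternatives = {"a": [("L2",)], "b": [("L2",)]}
new_routes = greedy_reroute(flows, routes, alternatives)
```

In this example one of the two heavy flows is moved from `L1` to `L2`, which lowers the peak link load from 10 to 6. A real adaptive routing scheme would work from measured link traffic and the network's actual path choices rather than a fixed table of demands and alternatives.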