Provably Efficient and Scalable Shared-Memory Graph Processing

Talk
Laxman Dhulipala
Talk Series: 
Time: 
03.02.2021 13:00 to 14:00

Graph processing is a fundamental tool in many computational disciplines due to the widespread availability of graph data. However, processing large graphs quickly and cost-effectively is a major challenge, and existing approaches capable of processing very large graphs often have high computational cost, only solve a limited set of problems, and possess poor theoretical guarantees. Similarly, existing approaches for dynamic or streaming graphs often employ ad-hoc algorithms with unknown theoretical costs and suboptimal performance. This talk will show how shared-memory algorithms can serve as the foundation for a graph processing toolkit for static and evolving graphs that is cost-effective, has strong theoretical guarantees, and achieves state-of-the-art performance. In the first part of this talk I will describe a shared-memory approach for static graph processing. I will describe a rich interface for parallel graph processing which extends the Ligra interface with provably-efficient primitives. This interface enables simple and provably-efficient shared-memory implementations of over 20 fundamental graph problems. The implementations, which are publicly available as part of the Graph Based Benchmark Suite, solve these problems on the largest publicly available graph, the WebDataCommons hyperlink graph, with over 200 billion edges, in a matter of seconds to minutes using a commodity multicore machine. Compared to existing large-scale graph processing results, our results apply to a much broader set of problems, use orders of magnitude fewer resources, and in many cases run an order of magnitude faster. In the second part of this talk I will describe a provably-efficient approach for streaming graph processing. I will describe a new compressed purely-functional tree data structure, called a $C$-tree, which admits efficient parallel batch updates and enables a dynamic graph representation based on nested trees. Using this representation, I will describe a serializable graph-streaming system called Aspen that can concurrently apply updates and queries. Compared to existing work, Aspen achieves orders of magnitude higher update rates, while using less memory. Aspen scales to the largest publicly available graphs, and can concurrently update and analyze the WebDataCommons hyperlink graph on a single commodity multicore machine with a terabyte of memory.