PhD Proposal: Attacking Big Data: Efficient Algorithms with Sparsification and Parallelism

Jacob Gilbert
09.26.2022 15:00 to 17:00

IRB 2143

In an increasingly dynamic, globally-connected world, computer scientists deal with larger data sets and more uncertainty now more than ever. In order to model such big and volatile networks, several practical computational models have recently been proposed and studied, such as the stochastic graph model and the Massively Parallel Computations (MPC) model. Both of these models provide methods of dealing with big data. In the stochastic graph model, a main goal is to find a sparse subgraph of a given input graph that successfully covers interesting properties of the original graph without needing to store many edges. In the MPC model, we actually have linear memory but it is split across machines with small sublinear memory to mimic big data processing in the real-world. Furthermore, solutions for problems in both models often rely on intricate probability analysis and randomized algorithms. These two models have very related significant theoretical implications and real-world applications, and in my future research, I propose to study and contribute to the literature of efficient algorithms in these models.The stochastic graph model was proposed in recent years to provide a lens into possible algorithmic solutions for today’s enormous, changing data sets. In the model, a graph G is known and a subgraph, the realization of G, is unknown. The realization graph will randomly include each edge and vertex of G, and queries to find out if an edge or vertex are included in the realization graph are considered expensive and should be minimized. This theoretical model accurately reflects many real-world networks that may change over time due to random circumstance. Many significant scenarios, one even life-saving, have been studied and are well-modeled from the viewpoint of stochastic graphs, including kidney exchange, flight routing, communication networks, and more ([5] [7]). Furthermore, by limiting queries, stochastic graph algorithms only compute solutions on a small subgraph of the original graph. Ergo, the main challenge of the stochastic model is utilizing a sub-quadratic number of edges while maintaining useful graph properties of the original graph. Thus, stochastic graph results explore the limitations of sparsification in large graphs as well as improve our understanding of randomness and probability in graph theory as a whole. We explore the stochastic matching problem and find that in the most general stochastic graph model, we can find a constant-degree sparse sub-graph with an expected matching size which is better than a half-appproximation for weighted graphs.The MPC model has exploded in popularity in the last ten years. Central to modern data processing frameworks like Hadoop, MapReduce, and Spark [1], fast space-efficient MPC algorithms can have immediate impacts on the data processes that occur in servers and computers everyday. In the MPC model, there are a number of computing machines each with a small amount of memory, sublinear in the size of the input for whatever problem that is being solved. As a result, efficient MPC algorithms provide insight into parallel computation with limited storage space. The MPC model has a natural structure for modern data processing workflows where data sets cannot fit in a single storage device and many processors are available for computation on chunks of the data set. Additionally, most fundamental computer science problems with significant merit are equally significant in the MPC model and require novel solutions. We consider the problem of constructing a suffix tree in MPC with constant communication rounds, and we give constant round MPC algorithms for not only suffix tree, but suffix tree application longest palindrome search and longest common subsequence.
Examining Committee:

Chair:Department Representative:Members:

Dr. MohammadTaghi Hajiaghayi Dr. Soheil Feizi Dr. William GasarchDr. Laxman Dhulipala