PhD Defense: Enabling Graph Analysis over Relational Databases
Complex interactions and systems can be modeled by analyzing the connections between underlying entities or objects described by a dataset. These relationships form networks (graphs), the analysis of which has been shown to provide tremendous value in areas ranging from retail to many scientific domains. This value is obtained by using various network science methodologies– a field which focuses on studying network representations in the real world. In particular “graph algorithms” are often leveraged, which iteratively traverse the graph’s connections. To take advantage of the opportunity presented by graph algorithms, there have been a variety of specialized graph data management systems, and analysis frameworks, proposed in recent years, which have made significant advances in efficiently storing and analyzing graph-structured data. Most datasets however currently do not reside in these specialized systems but rather in general-purpose relational database management systems (RDBMS). A relational or similarly structured system is typically governed by a schema of varying strictness that implements constraints and is meticulously designed for the specific enterprise. These structured datasets contain many relationships between entities there-in that can only however be traversed via conducting expensive JOINs using SQL. These relationships can be seen as latent or “hidden” graphs that exist inherently inside structured databases. In order thus for users to efficiently traverse these latent graphs to conduct analysis, data often needs to be transformed, and migrated to specialized systems. This creates barriers that hinder and discourage graph analysis; our vision is to break these barriers.In this dissertation we investigate the opportunities and challenges involved in efficiently leveraging relationships within data stored in structured databases. First, we present GraphGen, a lightweight software layer that is independent from the underlying database, and provides interfaces for graph analysis of data in RDBMSs. GraphGen introduces an intuitive high-level language for specifying graphs of interest, and utilizes in-memory graph representations to tackle the problems associated with analyzing graphs that are hidden inside structured datasets. We show GraphGen can analyze such graphs in orders of magnitude less memory, and often computation time, while eliminating manual Extract-Transform-Load (ETL) effort.Second, we examine how in-memory graph representations of RDBMS data can be used to enhance relational query processing. We present a novel, general framework for executing GROUP BY aggregation over conjunctive queries which avoids materialization of intermediate JOIN results, and wrap this framework inside a multi-way relational operator called Join-Agg. We observed that Join-Agg can compute aggregates over certain relational and graph queries using orders of magnitude less memory and computation time.
Chair: Dr. Amol Deshpande Dean's rep: Dr. Louiqa Raschid Members: Dr. Daniel Abadi Dr. Peter Keleher Dr. Leilani Battle