Declarative Graph Analytics and Querying over Very Large, Dynamic Information Networks

Over the last decade, information networks have become ubiquitous. These include social networks, communication networks, financial transaction networks, citation networks, gene regulatory networks, disease transmission networks, ecological food networks, sensor networks, social contact graphs, and many more. Network data arises even in mundane applications like phone call data, IP traffic data, or parcel shipment data. Social contact graphs are expected to become available for analysis in the near future, and can potentially be used to gain insights into various social phenomena as well as into disease outbreaks and their prevention. There is a growing need for data management systems that can support real-time ingest, storage, querying, and complex analytics over such network data. Network data is most naturally represented as a graph, with nodes representing the entities and edges denoting the interactions between them. However, despite much work on graph querying algorithms and graph programming frameworks in recent years, there is still a lack of established data management systems that provide declarative frameworks for querying and analyzing such graph-structured data, especially very large volumes of heterogeneous, complex-structured, and rapidly changing data. The raw observational network data is also often noisy and needs to be cleaned and annotated through the use of statistical models before querying and analysis. The increasing availability of historical traces of time-evolving graphs has also opened up opportunities in temporal evolutionary analysis as well as in data mining and comparative analytics over historical information. Similarly, there is increasing interest in continuous query processing and real-time analytics, especially anomaly or event detection, on streaming graph data.
Further, graph sizes and the number of operations that need to be supported are growing at an unprecedented pace, necessitating the use of parallel and distributed solutions, both for efficiency and for better fault tolerance; however, graph operations are notoriously hard to parallelize.

In this project, we are building a graph data management system and a suite of tools aimed at supporting real-time and historical querying and analytics over very large, dynamic, heterogeneous, and noisy graphs. Some of the key components of our overall system include:

I recently wrote a blog post for the ACM SIGMOD Blog on Graph Data Management.

Project Participants

Publications

Acknowledgments

This material is based upon work supported in part by the National Science Foundation under Grants 0916736 and 1319432. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.