PhD Proposal: Real-Time Querying and Analytics on Large Dynamic Graphs

Talk
Jayanta Mondal
Time: 
04.30.2014 11:30 to 13:00
Location: 

AVW 3450

In today's world, networks are everywhere. There are social networks, communication networks, financial transaction networks, traffic information networks, disease transmission networks, and more. Most of these networks produce a large volume of data at a rapid rate and are highly dynamic in nature. There is tremendous value in storing, managing, and querying these large dynamic networks efficiently and in detecting events or anomalies in real-time. In my dissertation research, I plan to develop scalable techniques to address the challenges that arise in managing and querying such evolving network data by developing a streaming graph data management system.

In the first part of my proposal, I describe my work on building an in-memory, distributed graph data management system for managing large, highly dynamic networks, and supporting low-latency query processing over it. The key challenge in a distributed graph database is to partition a graph across a set of machines in such a way that it minimizes the number of distributed traversals as well as balances the load across partitions. Due to the hardness of the partitioning problem I propose aggressive replication of the nodes to minimize communication and support low-latency querying. I develop a hybrid replication policy that monitors the read-write frequencies of the nodes to dynamically decide what data to replicate, and whether to do eager or lazy replication. I also propose a clustering-based approach to amortize the costs of making these replication decisions, and introduce a fairness criterion to control the trade-off between eager and lazy replication. I provide both theoretical analysis and efficient algorithms for the optimization problems that arise and validate my hypothesis through extensive experimentation.

In the second part, I describe my work on executing continuous ego-centric aggregates on these graphs, where each node in the graph represents an ego and the corresponding aggregate is computed by aggregating the information generated in a pre-defined neighborhood of each ego. Examples of such queries include computation of personalized, tailored trends in social networks, anomaly or event detection in communication or financial transaction networks, local search and alerts in spatio-temporal networks, and so on. The major data-management challenges here are simultaneous evaluation of a large number of complex queries, high data rates, and dynamic changes to the graph structure. Keeping such challenges in mind, I propose EAGr, a flexible, general-purpose, and user-extensible in-memory framework that is built around the notion of an aggregation overlay graph, a pre-compiled data structure that encodes the computations to be performed when an update or a query is received. The overlay graph enables sharing of partial aggregates across different ego-centric queries, and also allows partial pre-computation of the aggregates to minimize the query latencies. I present several highly scalable techniques for constructing and managing the overlay graphs and show that my system can easily handle graphs of size up to 320 million nodes and edges, and achieve update and query throughputs of over 500K/s using a single, powerful machine.

Finally, I conclude the proposal with the future work I plan to pursue as a continuation of that research, which is divided into two parts: (a) devising declarative methods to specify anomalies/interesting events on these networks, and (b) developing efficient evaluation techniques for such queries using search space pruning and approximate data structures.
Examining Committee:
Committee Chair: - Dr. Amol Deshpande
Dept’s. Rep: - Dr. Hector Corrada Bravo
Committee Members: - Dr. V.S. Subrahmanian