Over the last decade, information networks have become ubiquitous and widespread. These include social
networks, communication networks, financial transaction networks, citation networks, gene regulatory
networks, disease transmission networks, ecological food networks, sensor networks, social contact
graphs, and many more. Network data arises even in mundane applications like phone call data,
IP traffic data, or parcel shipment data. Social contact graphs are expected to be available for analysis in near future, and
can potentially be used to gain insights into various social phenomena as well as in disease
outbreak and prevention. There is a growing need for data management systems that can support
real-time ingest, storage, querying, and complex analytics over such network data.
Network data is most naturally represented as a
, with nodes representing the entities and edges denoting the
interactions between them. However, despite much work on graph querying algorithms and graph
programming frameworks in recent years, there is still a lack of established data management systems that
provide declarative frameworks for querying and analyzing such graph-structured data, especially
very large volumes of heterogeneous, complex-structured, and rapidly changing data.
The raw observational network data is also often noisy and needs to cleaned and annotated through use
of statistical models before querying and analysis.
The increasing availability of historical traces of time-evolving graphs
has also opened up opportunities in temporal
evolutionary analysis as well as in data mining and comparative analytics over historical
information. Similarly there is increasing interest in continuous query processing and real-time
analytics, especially anomaly or event detection, on streaming graph data. Further, the graph sizes and the
number of operations that need to be supported are growing at an unprecedented pace, necessitating
use of parallel and distributed solutions, both for efficiency and for better fault-tolerance;
however, graph operations are notoriously hard to parallelize.
In this project, we are building a graph data management system and a suite of tools aimed at supporting real-time and
historical querying and analytics over very large, dynamic, heterogeneous, and noisy graphs. Some of the key
components of our overall system include:
- NScale: Neighborhood-centric Large-Scale Graph Analytics in the Cloud;
Abdul Quamar, Amol Deshpande, Jimmy Lin;
CoRR Technical Report arXiv:1405.1499.
There is an increasing interest in executing rich and complex analysis tasks over large-scale graphs, many of which require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph. Examples of such tasks include ego
network analysis, motif counting, finding social circles, personalized recommendations, analyzing influence cascades, etc. These tasks are not well served by the existing vertex-centric graph processing frameworks, whose computation, execution models limit the
user program to directly access the state of a single vertex; this results in high communication, scheduling, and memory overheads in executing such tasks. Further, most existing graph processing frameworks typically ignore the challenges in extracting the
... [more]
There is an increasing interest in executing rich and complex analysis tasks over large-scale graphs, many of which require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph. Examples of such tasks include ego
network analysis, motif counting, finding social circles, personalized recommendations, analyzing influence cascades, etc. These tasks are not well served by the existing vertex-centric graph processing frameworks, whose computation, execution models limit the
user program to directly access the state of a single vertex; this results in high communication, scheduling, and memory overheads in executing such tasks. Further, most existing graph processing frameworks typically ignore the challenges in extracting the
relevant portion of the graph that an analysis task needs, and loading it onto distributed memory. In this paper, we describe NScale, a novel end-to-end graph processing framework that enables the distributed execution of complex subgraph-centric analytics over
large-scale graphs in the cloud. NScale enables users to write programs at the level of subgraphs, and to specify the subgraphs of interest declaratively. NScale uses Apache YARN, a state-of-the-art framework for efficient and fault-tolerant distribution of data
and computation. It features GEL, a novel graph extraction and loading phase, that extracts the relevant portions of the graph and utilizes a cost-based optimizer to partition and load the graph onto distributed memory using as few machines as possible to
minimize the communication cost. It utilizes novel techniques for the distributed execution of user computation that minimize memory consumption by exploiting overlap among the subgraphs of interest. Our experimental results show orders-of-magnitude improvements
in performance, and drastic reductions in the cost of analytics, over vertex-centric approaches.
- NScale: Neighborhood-centric Analytics on Large Graphs;
Abdul Quamar, Amol Deshpande, Jimmy Lin;
VLDB Demo 2014.
- VERTEXICA: Your Relational Friend for Graph Analytics!;
Alekh Jindal, Praynaa Rawlani, Eugene Wu, Samuel Madden, Amol Deshpande, Mike Stonebraker;
VLDB Demo 2014.
- EAGr: Supporting Continuous Ego-centric Aggregate Queries over Large Dynamic Graphs;
Jayanta Mondal and Amol Deshpande;
SIGMOD 2014 (also CoRR Technical Report arXiv:1404.6570).
In this work, we present EAGr, a system for supporting large numbers of
continuous neighborhood-based ("ego-centric") aggregate queries over large,
highly dynamic, and rapidly evolving graphs. Examples of such queries include
... [more]
In this work, we present EAGr, a system for supporting large numbers of
continuous neighborhood-based ("ego-centric") aggregate queries over large,
highly dynamic, and rapidly evolving graphs. Examples of such queries include
computation of personalized, tailored trends in social networks, anomaly/event
detection in financial transaction networks, local search and alerts in
spatio-temporal networks, to name a few. Key challenges in supporting such
continuous queries include high update rates typically seen in these
situations, large numbers of queries that need to be executed simultaneously,
and stringent low latency requirements. We propose a flexible, general, and
extensible in-memory framework for executing different types of ego-centric
aggregate queries over large dynamic graphs with low latencies. Our framework
is built around the notion of an aggregation overlay graph, a pre-compiled data
structure that encodes the computations to be performed when an update/query is
received. The overlay graph enables sharing of partial aggregates across
multiple ego-centric queries (corresponding to the nodes in the graph), and
also allows partial pre-computation of the aggregates to minimize the query
latencies. We present several highly scalable techniques for constructing an
overlay graph given an aggregation function, and also design incremental
algorithms for handling structural changes to the underlying graph. We also
present an optimal, polynomial-time algorithm for making the pre-computation
decisions given an overlay graph, and evaluate an approach to incrementally
adapt those decisions as the workload changes. Although our approach is
naturally parallelizable, we focus on a single-machine deployment and show that
our techniques can easily handle graphs of size up to 320 million nodes and
edges, and achieve update/query throughputs of over 500K/s using a single,
powerful machine.
- Stream Querying and Reasoning on Social Data;
Jayanta Mondal and Amol Deshpande;
Chapter in Encyclopedia of Social Network Analysis and Mining, Springer, 2014.
[pdf]
In this paper, we present an introduction to the new research area of "stream querying and reasoning" over social data.
This area combines aspects from several well-studied research
areas, chief among them, social network analysis, graph databases, and data streams. We provide a
... [more]
In this paper, we present an introduction to the new research area of "stream querying and reasoning" over social data.
This area combines aspects from several well-studied research
areas, chief among them, social network analysis, graph databases, and data streams. We provide a
formal definition of the problem, survey the related prior work, and discuss some of the key
research challenges that need to be addressed (and some of the solutions that have been proposed).
We note that we use the term "stream reasoning" in this paper to encompass
a broad range of tasks including various types of analytics, probabilistic reasoning, statistical
inference, and logical reasoning. We contrast our use of this term with the recent work where this term has been
used more specifically to
refer to integration of logical reasoning systems with data streams in the context of the Semantic Web.
Given the vast amount of work on this and related topics, it is not our intention to be
comprehensive in this brief overview. Rather we aim to cover some of the key ideas and
representative work.
- GrDB: A System for Declarative and Interactive Analysis of Noisy Information Networks;
Walaa Eldin Moustafa, Hui Miao, Amol Deshpande, Lise Getoor;
SIGMOD Demo 2013.
- HiNGE : Enabling Temporal Network Analytics at Scale;
Udayan Khurana, and Amol Deshpande;
SIGMOD Demo 2013.
- Efficient Snapshot Retrieval over Historical Graph Data;
Udayan Khurana, and Amol Deshpande;
ICDE 2013 (also CoRR Technical Report arXiv:1207.5777).
We address the problem of managing historical data for large evolving information networks like
social networks or citation networks, with the goal to enable temporal and evolutionary queries and
analysis. We present the design and architecture of a distributed graph database system that stores
... [more]
We address the problem of managing historical data for large evolving information networks like
social networks or citation networks, with the goal to enable temporal and evolutionary queries and
analysis. We present the design and architecture of a distributed graph database system that stores
the entire history of a network and provides support for efficient retrieval of multiple graphs from
arbitrary time points in the past, in addition to maintaining the current state for ongoing updates.
Our system exposes a general programmatic API to process and analyze the retrieved snapshots. We
introduce DeltaGraph, a novel, extensible, highly tunable, and distributed hierarchical index
structure that enables compactly recording the historical information, and that supports efficient
retrieval of historical graph snapshots for single-site or parallel processing. Along with the
original graph data, DeltaGraph can also maintain and index auxiliary information; this
functionality can be used to extend the structure to efficiently execute queries like subgraph
pattern matching over historical data. We develop analytical models for both the storage space
needed and the snapshot retrieval times to aid in choosing the right parameters for a specific
scenario. In addition, we present strategies for materializing portions of the historical graph
state in memory to further speed up the retrieval process. Secondly, we present an in-memory graph
data structure called GraphPool that can maintain hundreds of historical graph instances in main
memory in a non-redundant manner. We present a comprehensive experimental evaluation that
illustrates the effectiveness of our proposed techniques at managing historical graph information.
- Managing Large Dynamic Graphs Efficiently;
Jayanta Mondal, Amol Deshpande;
SIGMOD 2012.
[pdf]
There is an increasing need to ingest, manage, and query large volumes of
graph-structured data arising in applications like social networks,
communication networks, biological networks etc. Graph databases that can
... [more]
There is an increasing need to ingest, manage, and query large volumes of
graph-structured data arising in applications like social networks,
communication networks, biological networks etc. Graph databases that can
explicitly reason about the graphical nature of the data, that can support
flexible schemas and node/edge-centric analysis and querying, are ideal for
storing such data. However, although there is much work on single-site graph
databases and on efficiently executing specific types of queries over large
graphs, to date there is little work on understanding the challenges in
distributed dynamic graph data management, needed to handle the large scale of
such data. In this paper, we propose the design of an in-memory, distributed
graph data management system aimed at managing a large dynamically changing
graph and supporting low-latency query processing over it. The key challenge in
a distributed graph database is that, partitioning a graph across a set of
machines inherently results in a large number of distributed traversals across
partitions to answer even simple queries. We propose a suite of novel techniques
to minimize the communication bandwidth and the storage requirements. We have
implemented our framework as a middle-ware on top of an open-source key-value
store. We evaluate our system on a social graph, and show that our system is
able to handle very large graphs efficiently, and that it reduces the network
bandwidth consumption significantly.
- Ego-centric Graph Pattern Census;
Walaa Eldin Moustafa, Amol Deshpande, Lise Getoor;
ICDE 2012.
There is increasing interest in analyzing networks of all types including social, biological,
sensor, computer, and transportation networks. Broadly speaking, we may be interested in
global network-wide analysis (e.g., centrality analysis, community detection) where the
... [more]
There is increasing interest in analyzing networks of all types including social, biological,
sensor, computer, and transportation networks. Broadly speaking, we may be interested in
global network-wide analysis (e.g., centrality analysis, community detection) where the
properties of the entire network are of interest, or local ego-centric analysis where the
focus is on studying the properties of nodes (egos) by analyzing their neighborhood subgraphs.
In this paper we propose and study ego-centric pattern census queries, a new type of graph
analysis query, where a given structural pattern is searched in every node's neighborhood and
the counts are reported or used in further analysis. This kind of analysis is useful in many
domains in social network analysis including opinion leader identification, node
classification, link prediction, and role identification. We propose an SQL-based declarative
language to support this class of queries, and develop a series of efficient query evaluation
algorithms for it. We evaluate our algorithms on a variety of synthetically generated graphs.
We also show an application of our language in a real-world scenario for predicting future
collaborations from DBLP data.
- Declarative Analysis of Noisy Information Networks;
Walaa Eldin Moustafa, Galileo Namata, Amol Deshpande, Lise Getoor;
ICDE Workshop on Graph Data Management: Techniques and Applications (GDM 2011).
There is a growing interest in methods for analyzing data describing networks of
all types, including information, biological, physical, and social networks.
Typically the data describing these networks is observational, and thus noisy
... [more]
There is a growing interest in methods for analyzing data describing networks of
all types, including information, biological, physical, and social networks.
Typically the data describing these networks is observational, and thus noisy
and incomplete; it is often at the wrong level of fidelity and abstraction for
meaningful data analysis. This has resulted in a growing body of work on
extracting, cleaning, and annotating network data. Unfortunately, much of this
work is ad hoc and domain-specific. In this paper, we present the architecture
of a data management system that enables efficient, declarative analysis of
large-scale information networks. We identify a set of primitives to support the
extraction and inference of a network from observational data, and describe a
framework that enables a network analyst to easily implement and combine new
extraction and analysis techniques, and efficiently apply them to large
observation networks. The key insight behind our approach is to "decouple", to
the extent possible, (a) the operations that require traversing the graph
structure (typically the computationally expensive step), from (b) the
operations that do the modification and update of the extracted network. We
present an analysis language based on "Datalog", and show how to use it to
cleanly achieve such decoupling. We briefly describe our prototype system that
supports these abstractions. We include a preliminary performance evaluation of
the system and show that our approach scales well and can efficiently handle a
wide spectrum of data cleaning operations on network data.
This material is based upon work supported in part by the National Science Foundation
under Grants
.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the National Science Foundation.