- DataHub: Collaborative data science and dataset version management at scale;
Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, Aditya Parameswaran;
CIDR 2015.
[abstract]
Relational databases have limited support for data collaboration,
where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a)
a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform,
DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale.
- VERTEXICA: Your Relational Friend for Graph Analytics!;
Alekh Jindal, Praynaa Rawlani, Eugene Wu, Samuel Madden, Amol Deshpande, Mike Stonebraker;
VLDB Demo 2014.
- NScale: Neighborhood-centric Large-Scale Graph Analytics in the Cloud;
Abdul Quamar, Amol Deshpande, Jimmy Lin;
CoRR Technical Report arXiv:1405.1499.
[abstract]
There is an increasing interest in executing rich and complex analysis tasks over large-scale graphs, many of which require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph. Examples of such tasks include ego
network analysis, motif counting, finding social circles, personalized recommendations, analyzing influence cascades, etc. These tasks are not well served by the existing vertex-centric graph processing frameworks, whose computation, execution models limit the
user program to directly access the state of a single vertex; this results in high communication, scheduling, and memory overheads in executing such tasks. Further, most existing graph processing frameworks typically ignore the challenges in extracting the
relevant portion of the graph that an analysis task needs, and loading it onto distributed memory. In this paper, we describe NScale, a novel end-to-end graph processing framework that enables the distributed execution of complex subgraph-centric analytics over
large-scale graphs in the cloud. NScale enables users to write programs at the level of subgraphs, and to specify the subgraphs of interest declaratively. NScale uses Apache YARN, a state-of-the-art framework for efficient and fault-tolerant distribution of data
and computation. It features GEL, a novel graph extraction and loading phase, that extracts the relevant portions of the graph and utilizes a cost-based optimizer to partition and load the graph onto distributed memory using as few machines as possible to
minimize the communication cost. It utilizes novel techniques for the distributed execution of user computation that minimize memory consumption by exploiting overlap among the subgraphs of interest. Our experimental results show orders-of-magnitude improvements
in performance, and drastic reductions in the cost of analytics, over vertex-centric approaches.
- NScale: Neighborhood-centric Analytics on Large Graphs;
Abdul Quamar, Amol Deshpande, Jimmy Lin;
VLDB Demo 2014.
- SWORD: Workload-aware Data Placement and Replica Selection for Cloud Data Management Systems;
Ashwin Kumar Kayyoor, Abdul Quamar, Amol Deshpande, Samir Khuller;
VLDB Journal (Special Issue on Data-Intensive Cloud Infrastructure), 23(6): 845-870, 2014.
- PStore: An Efficient Storage Framework for Managing Scientific Data;
Souvik Bhattacherjee, Amol Deshpande, and Alan Sussman;
SSDBM 2014.
[abstract]
In this paper, we present the design, implementation, and evaluation of PStore, a no-overwrite storage framework for managing large volumes of array data
generated by scientific simulations. PStore comprises of two modules, a data ingestion module and a query processing module, that respectively address two of the
key challenges in scientific simulation data management. The data ingestion module is geared toward handling the high volumes of simulation data generated at a
very rapid rate, which often makes it impossible to offload the data onto storage devices; the module is responsible for selecting an appropriate compression
scheme for the data at hand, chunking the data, and then compressing it before sending it to the storage nodes. On the other hand, the query processing module is
in charge of efficiently executing different types of queries over the stored data; in this paper, we specifically focus on slice (also called range) queries.
PStore provides a suite of compression schemes that leverage existing techniques while extending some of them to provide support for diverse scientific
simulation data. To efficiently execute queries over such compressed data, PStore adopts and extends a two-level chunking scheme by incorporating the effect of
compression, and hides expensive disk latencies for long running range queries by exploiting chunk prefetching. In addition, we also parallelize the query
processing module to further speed up execution. We evaluate PStore on a 140 GB dataset obtained from real-world simulations using the regional climate model
CWRF. In this paper, we use both 3D and 4D datasets and demonstrate high performance through extensive experiments.
- EAGr: Supporting Continuous Ego-centric Aggregate Queries over Large Dynamic Graphs;
Jayanta Mondal and Amol Deshpande;
SIGMOD 2014 (also CoRR Technical Report arXiv:1404.6570).
[abstract]
In this work, we present EAGr, a system for supporting large numbers of
continuous neighborhood-based ("ego-centric") aggregate queries over large,
highly dynamic, and rapidly evolving graphs. Examples of such queries include
computation of personalized, tailored trends in social networks, anomaly/event
detection in financial transaction networks, local search and alerts in
spatio-temporal networks, to name a few. Key challenges in supporting such
continuous queries include high update rates typically seen in these
situations, large numbers of queries that need to be executed simultaneously,
and stringent low latency requirements. We propose a flexible, general, and
extensible in-memory framework for executing different types of ego-centric
aggregate queries over large dynamic graphs with low latencies. Our framework
is built around the notion of an aggregation overlay graph, a pre-compiled data
structure that encodes the computations to be performed when an update/query is
received. The overlay graph enables sharing of partial aggregates across
multiple ego-centric queries (corresponding to the nodes in the graph), and
also allows partial pre-computation of the aggregates to minimize the query
latencies. We present several highly scalable techniques for constructing an
overlay graph given an aggregation function, and also design incremental
algorithms for handling structural changes to the underlying graph. We also
present an optimal, polynomial-time algorithm for making the pre-computation
decisions given an overlay graph, and evaluate an approach to incrementally
adapt those decisions as the workload changes. Although our approach is
naturally parallelizable, we focus on a single-machine deployment and show that
our techniques can easily handle graphs of size up to 320 million nodes and
edges, and achieve update/query throughputs of over 500K/s using a single,
powerful machine.
- Optimization Techniques for "Scaling Down" Hadoop on Multi-Core, Shared-Memory Systems;
K. Ashwin Kumar, Jonathan Gluck, Amol Deshpande, Jimmy Lin;
EDBT 2014.
[abstract]
The underlying assumption behind Hadoop and, more generally, the need for distributed processing is that the data to be analyzed cannot be held in memory on a
single machine. Today, this assumption needs to be re-evaluated. Although petabyte-scale datastores are increasingly common, it is unclear whether ``typical''
analytics tasks require more than a single high-end server. Additionally, we are seeing increased sophistication in analytics, e.g., machine learning, which
generally operate over smaller and more refined datasets. To address these trends, we propose ``scaling down'' Hadoop to run on shared-memory machines. This
paper presents a
prototype runtime called Hone, intended to be both API and binary compatible with standard (distributed) Hadoop. That is, Hone can take an existing Hadoop jar
and run it, without modification, on a multi-core shared-memory machine. This allows us to take existing Hadoop algorithms and find the most suitable runtime
environment for execution on datasets of varying sizes. Our experiments show that Hone order of magnitude faster than Hadoop pseudo-distributed mode (PDM); on
dataset sizes that fit into memory, Hone outperforms a fully-distributed 15-node Hadoop cluster in some cases as well.
- Subgraph Pattern Matching over Uncertain Graphs with Identity Linkage Uncertainty;
Walaa Eldin Moustafa, Angelika Kimmig, Amol Deshpande, Lise Getoor;
ICDE 2014 (also CoRR Technical Report arXiv:1305.7006).
[abstract]
There is a growing need for methods which can capture uncertainties and answer queries over graph-structured data. Two common types of uncertainty are uncertainty over the attribute values of nodes and uncertainty over the existence of edges. In this paper, we
combine those with identity uncertainty. Identity uncertainty represents uncertainty over the mapping from objects mentioned in the data, or references, to the underlying real-world entities. We propose the notion of a probabilistic entity graph (PEG), a
probabilistic graph model that defines a distribution over possible graphs at the entity level. The model takes into account node attribute uncertainty, edge existence uncertainty, and identity uncertainty, and thus enables us to systematically reason about all
three types of uncertainties in a uniform manner. We introduce a general framework for constructing a PEG given uncertain data at the reference level and develop highly efficient algorithms to answer subgraph pattern matching queries in this setting. Our
algorithms are based on two novel ideas: context-aware path indexing and reduction by join-candidates, which drastically reduce the query search space. A comprehensive experimental evaluation shows that our approach outperforms baseline implementations by orders
of magnitude.
- Stream Querying and Reasoning on Social Data;
Jayanta Mondal and Amol Deshpande;
Chapter in Encyclopedia of Social Network Analysis and Mining, Springer, 2014.
[pdf] [abstract]
In this paper, we present an introduction to the new research area of "stream querying and reasoning" over social data.
This area combines aspects from several well-studied research
areas, chief among them, social network analysis, graph databases, and data streams. We provide a
formal definition of the problem, survey the related prior work, and discuss some of the key
research challenges that need to be addressed (and some of the solutions that have been proposed).
We note that we use the term "stream reasoning" in this paper to encompass
a broad range of tasks including various types of analytics, probabilistic reasoning, statistical
inference, and logical reasoning. We contrast our use of this term with the recent work where this term has been
used more specifically to
refer to integration of logical reasoning systems with data streams in the context of the Semantic Web.
Given the vast amount of work on this and related topics, it is not our intention to be
comprehensive in this brief overview. Rather we aim to cover some of the key ideas and
representative work.
- Approximation Algorithms for Stochastic Boolean Function Evaluation and Stochastic Submodular Set Cover;
Amol Deshpande, Lisa Hellerstein, and Devorah Kletenik;
SODA 2014 (also CoRR Technical Report arXiv:1303.0726).
[abstract]
Stochastic Boolean Function Evaluation is the problem of determining the value of a given Boolean function f on an unknown input x, when each bit of
x_i of x can only be determined by paying an associated cost c_i. The assumption is that x is drawn from a given product distribution, and the goal
is to minimize the expected cost. This problem has been studied in Operations Research, where it is known as "sequential testing" of Boolean
functions. It has also been studied in learning theory in the context of learning with attribute costs. We consider the general problem of developing
approximation algorithms for Stochastic Boolean Function Evaluation. We give a 3-approximation algorithm for evaluating Boolean linear threshold
formulas. We also present an approximation algorithm for evaluating CDNF formulas (and decision trees) achieving a factor of O(log kd), where k is
the number of terms in the DNF formula, and d is the number of clauses in the CNF formula. In addition, we present approximation algorithms for
simultaneous evaluation of linear threshold functions, and for ranking of linear functions.
Our function evaluation algorithms are based on reductions to the Stochastic Submodular Set Cover (SSSC) problem. This problem was introduced by
Golovin and Krause. They presented an approximation algorithm for the problem, called Adaptive Greedy. Our main technical contribution is a new
approximation algorithm for the SSSC problem, which we call Adaptive Dual Greedy. It is an extension of the Dual Greedy algorithm for Submodular Set
Cover due to Fujito, which is a generalization of Hochbaum's algorithm for the classical Set Cover Problem. We also give a new bound on the
approximation achieved by the Adaptive Greedy algorithm of Golovin and Krause.
- SPARSI: Partitioning Sensitive Data amongst Multiple Adversaries;
Theodoros Rekatsinas, Amol Deshpande, and Ashwin Machanavajjhala;
PVLDB 2013, to be presented at VLDB 2014 (also CoRR Technical Report arXiv:1302.6556).
[abstract]
We present SPARSI, a theoretical framework for partitioning sensitive data across multiple non-colluding adversaries. Most work in privacy-aware
data sharing has considered disclosing summaries where the aggregate information about the data is preserved, but sensitive user information is
protected. Nonetheless, there are applications, including online advertising, cloud computing and crowdsourcing markets, where detailed and
fine-grained user-data must be disclosed. We consider a new data sharing paradigm and introduce the problem of privacy-aware data partitioning,
where a sensitive dataset must be partitioned among k untrusted parties (adversaries). The goal is to maximize the utility derived by partitioning
and distributing the dataset, while minimizing the amount of sensitive information disclosed. The data should be distributed so that an adversary,
without colluding with other adversaries, cannot draw additional inferences about the private information, by linking together multiple pieces of
information released to her. The assumption of no collusion is both reasonable and necessary in the above application domains that require release
of private user information. SPARSI enables us to formally define privacy-aware data partitioning using the notion of sensitive properties for
modeling private information and a hypergraph representation for describing the interdependencies between data entries and private information. We
show that solving privacy-aware partitioning is, in general, NP-hard, but for specific information disclosure functions, good approximate solutions
can be found using relaxation techniques. Finally, we present a local search algorithm applicable to generic information disclosure functions. We
apply SPARSI together with the proposed algorithms on data from a real advertising scenario and show that we can partition data with no disclosure
to any single advertiser.
- GrDB: A System for Declarative and Interactive Analysis of Noisy Information Networks;
Walaa Eldin Moustafa, Hui Miao, Amol Deshpande, Lise Getoor;
SIGMOD Demo 2013.
- HiNGE : Enabling Temporal Network Analytics at Scale;
Udayan Khurana, and Amol Deshpande;
SIGMOD Demo 2013.
- Algorithms for the Thermal Scheduling Problem;
Koyel Mukherjee, Samir Khuller, and Amol Deshpande;
IPDPS 2013.
[abstract]
The energy costs for cooling a data center constitute a significant portion of the overall running costs. Thermal imbalance and hot spots that arise due to
imbalanced workloads lead to significant wasted cooling effort -- in order to ensure that no equipment is operating above a certain temperature, the data center
may be cooled more than necessary. Therefore it is desirable to schedule the workload in a data center in a "thermally aware" manner, assigning jobs to machines
not just based on local load of the machines, but based on the overall thermal profile of the data center. This is challenging because of the spatial
cross-interference between machines, where a job assigned to a machine may impact not only that machine's temperature, but also nearby machines. Here, we
continue formal analysis of the "thermal scheduling" problem that we initiated recently~\cite{sig-abstract}. There we introduced the notion of "effective load of
a machine" which is a function of the local load on the machine as well as the load on nearby machines, and presented optimal scheduling policies for a simple
model (where cross-effects are restricted within a rack) under the assumption that jobs can be split among different machines. Here we consider the more
realistic problem of "integral" assignment of jobs, and allow for cross-interference among different machines in adjacent racks in the data center. The integral
assignment problem with cross-interference is NP-hard, even for a simple two machine model. We consider three different heat flow models, and give constant
factor approximation algorithms for maximizing the number (or total profit) of jobs assigned in each model, without violating thermal constraints. We also
consider the problem of minimizing the maximum temperature on any machine when all jobs need to be assigned, and give constant factor algorithms for this
problem.
- SWORD: Scalable Workload-Aware Data Placement for Transactional Workloads;
Abdul Quamar, K.Ashwin Kumar and Amol Deshpande;
EDBT 2013.
[pdf] [abstract]
In this paper, we address the problem of transparently scaling out transactional (OLTP)
workloads on relational databases, to support "database-as-a-service" in cloud computing
environment. The primary challenges in supporting such workloads include choosing how to
"partition" the data across a large number of machines, minimizing the number of
"distributed transactions", providing high data availability, and tolerating failures gracefully.
Capturing and modeling the transactional workload over a period of time, and then exploiting
that information for data placement and replication has been shown to provide
significant benefits in performance, both in terms of transaction latencies and overall
throughput. However, such workload-aware data placement approaches can incur
very high overheads, and further, may perform worse than naive approaches if the workload changes.
In this work, we propose SWORD, a scalable workload-aware data partitioning and placement approach for
OLTP workloads, that incorporates a suite of novel techniques to significantly reduce the overheads
incurred both during the initial placement, and during query execution at runtime.
We model the workload as a hypergraph over the data items, and propose using a "hypergraph compression"
technique to reduce the overheads of partitioning. To deal with workload changes,
we propose an incremental data repartitioning technique that modifies data placement in small steps without
resorting to complete workload repartitioning.
We have built a workload-aware "active replication" mechanism in
SWORD to increase availability and enable load balancing. We propose the use of "fine-grained quorums" defined at the level of
groups of tuples to control the cost of distributed updates, improve throughput, and provide adaptability to different workloads.
To our knowledge, SWORD is the first system that uses fine-grained quorums in this context. The results of our experimental evaluation on
SWORD deployed on an Amazon EC2 cluster show that our techniques result in orders-of-magnitude reductions in the partitioning and
book-keeping overheads, and improve
tolerance to failures and workload changes; we also show that choosing quorums based on the query access patterns enables us to better handle query workloads
with different read and write access patterns.
- Efficient Snapshot Retrieval over Historical Graph Data;
Udayan Khurana, and Amol Deshpande;
ICDE 2013 (also CoRR Technical Report arXiv:1207.5777).
[abstract]
We address the problem of managing historical data for large evolving information networks like
social networks or citation networks, with the goal to enable temporal and evolutionary queries and
analysis. We present the design and architecture of a distributed graph database system that stores
the entire history of a network and provides support for efficient retrieval of multiple graphs from
arbitrary time points in the past, in addition to maintaining the current state for ongoing updates.
Our system exposes a general programmatic API to process and analyze the retrieved snapshots. We
introduce DeltaGraph, a novel, extensible, highly tunable, and distributed hierarchical index
structure that enables compactly recording the historical information, and that supports efficient
retrieval of historical graph snapshots for single-site or parallel processing. Along with the
original graph data, DeltaGraph can also maintain and index auxiliary information; this
functionality can be used to extend the structure to efficiently execute queries like subgraph
pattern matching over historical data. We develop analytical models for both the storage space
needed and the snapshot retrieval times to aid in choosing the right parameters for a specific
scenario. In addition, we present strategies for materializing portions of the historical graph
state in memory to further speed up the retrieval process. Secondly, we present an in-memory graph
data structure called GraphPool that can maintain hundreds of historical graph instances in main
memory in a non-redundant manner. We present a comprehensive experimental evaluation that
illustrates the effectiveness of our proposed techniques at managing historical graph information.
- Managing Large Dynamic Graphs Efficiently;
Jayanta Mondal, Amol Deshpande;
SIGMOD 2012.
[pdf] [abstract]
There is an increasing need to ingest, manage, and query large volumes of
graph-structured data arising in applications like social networks,
communication networks, biological networks etc. Graph databases that can
explicitly reason about the graphical nature of the data, that can support
flexible schemas and node/edge-centric analysis and querying, are ideal for
storing such data. However, although there is much work on single-site graph
databases and on efficiently executing specific types of queries over large
graphs, to date there is little work on understanding the challenges in
distributed dynamic graph data management, needed to handle the large scale of
such data. In this paper, we propose the design of an in-memory, distributed
graph data management system aimed at managing a large dynamically changing
graph and supporting low-latency query processing over it. The key challenge in
a distributed graph database is that, partitioning a graph across a set of
machines inherently results in a large number of distributed traversals across
partitions to answer even simple queries. We propose a suite of novel techniques
to minimize the communication bandwidth and the storage requirements. We have
implemented our framework as a middle-ware on top of an open-source key-value
store. We evaluate our system on a social graph, and show that our system is
able to handle very large graphs efficiently, and that it reduces the network
bandwidth consumption significantly.
- Maximizing Expected Utility for Stochastic Combinatorial Optimization Problems;
Jian Li, and Amol Deshpande;
FOCS 2011 (also CoRR Technical Report arXiv:1012.3189).
[abstract]
We study the stochastic versions of a broad class of combinatorial problems
where the weights of the elements in the input dataset are uncertain. The class
of problems that we study includes shortest paths, minimum weight spanning
trees, and minimum weight matchings over probabilistic graphs, and other
combinatorial problems like knapsack. We observe that the expected value is
inadequate in capturing different types of "risk-averse" or "risk-prone"
behaviors, and instead we consider a more general objective which is to maximize
the "expected utility" of the solution for some given utility function, rather
than the expected weight (expected weight becomes a special case). We show that
we can obtain a polynomial time approximation algorithm with "additive error"
$\epsilon$ for any $\epsilon>0$, if there is a pseudopolynomial time algorithm
for the "exact" version of the problem. (This is true for the problems mentioned
above). Our result generalizes several prior results on stochastic shortest
path, stochastic spanning tree, and stochastic knapsack. Our algorithm for
utility maximization makes use of the separability of exponential utility and a
technique to decompose a general utility function into exponential utility
functions, which may be useful in other stochastic optimization problems.
- Lightweight Graphical Models for Selectivity Estimation Without Independence Assumptions;
Kostas Tzoumas, Amol Deshpande, and Christian Jensen;
VLDB 2011.
[abstract]
As a result of decades of research and industrial development, modern query
optimizers are complex software artifacts. However, the quality of the query
plan chosen by an optimizer is largely determined by the quality of the
underlying statistical summaries. Small selectivity estimation errors,
propagated exponentially, can lead to severely sub-optimal plans. Modern
optimizers typically maintain one-dimensional statistical summaries and make the
attribute value independence and join uniformity assumptions for efficiently
estimating selectivities. Therefore, selectivity estimation errors in today's
optimizers are frequently caused by missed correlations between attributes. We
present a selectivity estimation approach that does not make the independence
assumptions. By carefully using concepts from the field of graphical models, we
are able to factor the joint probability distribution of all the attributes in
the database into small, usually two dimensional distributions. We describe
several optimizations that can make selectivity estimation highly efficient, and
we present a complete implementation inside PostgreSQL's query optimizer.
Experimental results indicate an order of magnitude better selectivity
estimates, while keeping optimization time in the range of tens of milliseconds.
- A Unified Approach to Ranking in Probabilistic Databases;
Jian Li and Barna Saha and Amol Deshpande;
VLDB 2009 (also CoRR Technical Report arXiv:0904.1366).
[pdf] [talk] [abstract] Recipient of the best paper award.
The dramatic growth in the number of application domains that naturally generate
probabilistic, uncertain data has resulted in a need for efficiently supporting
complex querying and decision-making over such data. In this paper, we present a
unified approach to ranking and top-k query processing in probabilistic
databases by viewing it as a multi-criteria optimization problem, and by
deriving a set of features that capture the key properties of a probabilistic
dataset that dictate the ranked result. We contend that a single, specific
ranking function may not suffice for probabilistic databases, and we instead
propose two parameterized ranking functions, called PRF-w and PRF-e, that
generalize or can approximate many of the previously proposed ranking functions.
We present novel generating functions-based algorithms for efficiently ranking
large datasets according to these ranking functions, even if the datasets
exhibit complex correlations modeled using probabilistic and/xor trees or Markov
networks. We further propose that the parameters of the ranking function be
learned from user preferences, and we develop an approach to learn those
parameters. Finally, we present a comprehensive experimental study that
illustrates the effectiveness of our parameterized ranking functions, especially
PRF-e, at approximating other ranking functions and the scalability of our
proposed algorithms for exact or approximate ranking.