- Maximizing Expected Utility for Stochastic Combinatorial Optimization Problems;
Jian Li, and Amol Deshpande;
FOCS 2011 (also CoRR Technical Report arXiv:1012.3189).
[abstract]
We study the stochastic versions of a broad class of combinatorial problems
where the weights of the elements in the input dataset are uncertain. The class
of problems that we study includes shortest paths, minimum weight spanning
trees, and minimum weight matchings over probabilistic graphs, and other
combinatorial problems like knapsack. We observe that the expected value is
inadequate in capturing different types of "risk-averse" or "risk-prone"
behaviors, and instead we consider a more general objective which is to maximize
the "expected utility" of the solution for some given utility function, rather
than the expected weight (expected weight becomes a special case). We show that
we can obtain a polynomial time approximation algorithm with "additive error"
$\epsilon$ for any $\epsilon>0$, if there is a pseudopolynomial time algorithm
for the "exact" version of the problem. (This is true for the problems mentioned
above). Our result generalizes several prior results on stochastic shortest
path, stochastic spanning tree, and stochastic knapsack. Our algorithm for
utility maximization makes use of the separability of exponential utility and a
technique to decompose a general utility function into exponential utility
functions, which may be useful in other stochastic optimization problems.
- Lightweight Graphical Models for Selectivity Estimation Without Independence Assumptions;
Kostas Tzoumas, Amol Deshpande, and Christian Jensen;
VLDB 2011.
[abstract]
As a result of decades of research and industrial development, modern query
optimizers are complex software artifacts. However, the quality of the query
plan chosen by an optimizer is largely determined by the quality of the
underlying statistical summaries. Small selectivity estimation errors,
propagated exponentially, can lead to severely sub-optimal plans. Modern
optimizers typically maintain one-dimensional statistical summaries and make the
attribute value independence and join uniformity assumptions for efficiently
estimating selectivities. Therefore, selectivity estimation errors in today's
optimizers are frequently caused by missed correlations between attributes. We
present a selectivity estimation approach that does not make the independence
assumptions. By carefully using concepts from the field of graphical models, we
are able to factor the joint probability distribution of all the attributes in
the database into small, usually two dimensional distributions. We describe
several optimizations that can make selectivity estimation highly efficient, and
we present a complete implementation inside PostgreSQL's query optimizer.
Experimental results indicate an order of magnitude better selectivity
estimates, while keeping optimization time in the range of tens of milliseconds.
- Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases;
Bhargav Kanagal, Jian Li, Amol Deshpande;
SIGMOD 2011.
[abstract]
Probabilistic database systems have successfully established themselves as a tool for
managing uncertain data. However, much of the research in this area has
focused on efficient query evaluation and has largely ignored two key
issues that commonly arise in uncertain data management:
First, how to provide "explanations" for query
results, e.g., ``Why is this tuple in my result?'' or ``Why does this
output tuple have such high probability?''. Second, the problem of
determining the "sensitive" input tuples for the given query, e.g., users
are interested to know the input tuples that can substantially alter the output,
when their probabilities are modified (since they may be unsure about the input
probability values).
Existing systems provide the "lineage/provenance" of each of the output tuples in
addition to the output probabilities, which is a boolean formula indicating the
dependence of the output tuple on the input tuples. However, lineage does not
immediately provide a quantitative relationship and it is not informative when
we have multiple output tuples.
In this paper, we propose a unified framework
that can handle both the issues mentioned above to facilitate robust query
processing. We formally define the notions of "influence" and "explanations" and provide algorithms to determine the top-$\ell$ influential set of
variables and the top-$\ell$ set of explanations for a variety of queries,
including "conjunctive" queries, "probabilistic threshold" queries,
"top-k" queries and "aggregation" queries.
Further, our framework naturally enables highly efficient incremental
evaluation when input probabilities are modified (e.g., if uncertainty is
resolved).
Our preliminary experimental results demonstrate the benefits of our framework
for performing robust query processing over probabilistic databases.
- A Unified Approach to Ranking in Probabilistic Databases;
Jian Li and Barna Saha and Amol Deshpande;
VLDB Journal, 20(2): 249-275, 2011.
[pdf] [abstract]
Ranking is a fundamental operation in data analysis and decision support, and
plays an even more crucial role if the dataset being explored exhibits
uncertainty. This has led to much work in understanding how to rank the tuples
in a probabilistic dataset in recent years. In this article, we present a
unified approach to ranking and top-$k$ query processing in probabilistic
databases by viewing it as a multi-criteria optimization problem, and by
deriving a set of "features" that capture the key properties of a probabilistic
dataset that dictate the ranked result. We contend that a single, specific
ranking function may not suffice for probabilistic databases, and we instead
propose two "parameterized ranking functions", called \PRFs\ and \PRFe, that
generalize or
can approximate many of the previously proposed ranking functions. We present novel
"generating functions"-based algorithms for
efficiently ranking large datasets according to these ranking functions, even if the
datasets exhibit complex correlations modeled using "probabilistic and/xor trees"
or "Markov networks". We further propose that the parameters of the ranking function be
"learned" from user preferences, and we develop an approach to learn those parameters.
Finally, we present a comprehensive experimental study that illustrates the effectiveness of
our parameterized ranking functions, especially \PRFe, at approximating other ranking functions and
the scalability of our proposed algorithms for exact or approximate ranking.
- Declarative Analysis of Noisy Information Networks;
Walaa Eldin Moustafa, Galileo Namata, Amol Deshpande, Lise Getoor;
ICDE Workshop on Graph Data Management: Techniques and Applications (GDM 2011).
[abstract]
There is a growing interest in methods for analyzing data describing networks of
all types, including information, biological, physical, and social networks.
Typically the data describing these networks is observational, and thus noisy
and incomplete; it is often at the wrong level of fidelity and abstraction for
meaningful data analysis. This has resulted in a growing body of work on
extracting, cleaning, and annotating network data. Unfortunately, much of this
work is ad hoc and domain-specific. In this paper, we present the architecture
of a data management system that enables efficient, declarative analysis of
large-scale information networks. We identify a set of primitives to support the
extraction and inference of a network from observational data, and describe a
framework that enables a network analyst to easily implement and combine new
extraction and analysis techniques, and efficiently apply them to large
observation networks. The key insight behind our approach is to "decouple", to
the extent possible, (a) the operations that require traversing the graph
structure (typically the computationally expensive step), from (b) the
operations that do the modification and update of the extracted network. We
present an analysis language based on "Datalog", and show how to use it to
cleanly achieve such decoupling. We briefly describe our prototype system that
supports these abstractions. We include a preliminary performance evaluation of
the system and show that our approach scales well and can efficiently handle a
wide spectrum of data cleaning operations on network data.
- Energy Efficient Monitoring in Sensor Networks;
Amol Deshpande and Samir Khuller and Azarakhsh Malekian and Mohammed Toossi;
Algorithmica, Volume 59, Number 1, 94-114, 2011.
[pdf] [abstract]
We study a set of problems related to efficient battery energy utilization for
monitoring applications in a wireless sensor network with the goal to increase
the sensor network lifetime. We study several generalizations of a basic problem
called Set k-Cover. The problem can be described as follows: we are given a set
of sensors, and a set of targets to be monitored. Each target can be monitored
by a subset of the sensors. To increase the lifetime of the sensor network, we
would like to partition the sensors into k sets (or time-slots), and activate
each set of sensors in a different time-slot, thus extending the battery life of
the sensors by a factor of k. The goal is to find a partitioning that maximizes
the total coverage of the targets for a given k. This problem is known to be
NP-hard. We develop an improved approximation algorithm for this problem using a
reduction to Max k-Cut. Moreover, we are able to demonstrate that this algorithm
is efficient, and yields almost optimal solutions in practice. We also consider
generalizations of this problem in several different directions. First, we allow
each sensor to be active in α different sets (time-slots). This means that the
battery life is extended by a factor of k/\alpha, and allows for a richer space
of solutions. We also consider different coverage requirements, such as
requiring that all targets, or at least a certain number of targets, be covered
in each time slot. In the Set k-Cover formulation, there is no requirement that
a target be monitored at all, or in any number of time slots. We develop a
randomized rounding algorithm for this problem. We also consider extensions
where each sensor can monitor only a bounded number of targets in any time-slot,
and not all the targets adjacent to it. This kind of problem may arise when a
sensor has a directional camera, or some other physical constraint might prevent
it from monitoring all adjacent targets even when it is active. We develop the
first approximation algorithms for this problem.
- Parallel Pipelined Filter Ordering with Precedence Constraints;
Amol Deshpande, Lisa Hellerstein;
Accepted to appear in Transactions on Algorithms..
[abstract]
In the parallel pipelined filter ordering problem, we are given a set of `n'
filters that run in parallel. The filters need to be applied to a stream of
elements, to determine which elements pass all filters. Each filter has a
"rate limit" r_i on the number of elements it can process per unit time, and a
"selectivity" p_i, which is the probability that a random element will pass the
filter. The goal is to maximize throughput. This problem appears naturally in
a variety of settings, including parallel query optimization in databases and
query processing over Web services. We present an O(n^3) algorithm for the
above problem, given tree-structured precedence constraints on the filters.
This extends work of Condon et al. and Kodialam, who presented algorithms for
solving the problem without precedence constraints. Our algorithm is
combinatorial and produces a sparse solution. Motivated by join operators in
database queries, we also give algorithms for versions of the problem in which
"filter" selectivities may be greater than or equal to 1. We prove a strong
connection between the more classical problem of minimizing total work in
sequential filter ordering (A), and the parallel pipelined filter ordering
problem (B). More precisely, we prove that A is solvable in polynomial
time for a given class of precedence constraints if and only if B is as
well. This equivalence allows us to show that B is NP-Hard in the
presence of arbitrary precedence constraints (since A known to be NP-Hard
in that setting).
- A Temporal Pattern Search Algorithm for Personal History Event Visualization;
Taowei Wang, Amol Deshpande, Ben Shneiderman;
IEEE TKDE 2010.
[pdf] [abstract]
We present Temporal Pattern Search (TPS), a novel algorithm for searching for
temporal patterns of events in historical personal histories. The traditional
method of searching for such patterns uses an automaton-based approach over a
single array of events, sorted by time stamps. Instead, TPS operates on a set of
arrays, where each array contains all events of the same type, sorted by time
stamps. TPS searches for a particular item in the pattern using a binary search
over the appropriate arrays. Although binary search is considerably more
expensive per item, it allows TPS to skip many unnecessary events in personal
histories. We show that TPS's running time is bounded by O(m^2 n lg(n)), where m
is the number of items in a search pattern, and n is the number of events in a
history. Although the asymptotic running time of TPS is inferior to that of a
non-deterministic finite automaton (NFA) approach (O(mn)), TPS performs better
than NFA under our experimental conditions. We also show TPS is very competitive
to Shift-And, a bit-parallelism approach, with real data. Since the experimental
conditions we describe here subsume the conditions under which analysts would
typically use TPS (i.e. within an interactive visualization program), we argue
that TPS is an appropriate design choice for us.
- Ranking Continuous Probabilistic Datasets;
Jian Li, and Amol Deshpande;
VLDB 2010.
[pdf] [abstract]
Ranking is a fundamental operation in data analysis and decision support, and
plays an even more crucial role if the dataset being explored exhibits
uncertainty. This has led to much work in understanding how to rank uncertain
datasets in recent years. In this paper, we address the problem of ranking when
the tuple scores are uncertain, and the uncertainty is captured using
"continuous" probability distributions (e.g. Gaussian distributions). We present
a comprehensive solution to compute the values of a "parameterized ranking
function" (PRF) for arbitrary continuous probability distributions (and thus
rank the uncertain dataset); PRF can be used to simulate or approximate many
other ranking functions proposed in prior work. We develop exact polynomial time
algorithms for some continuous probability distribution classes, and efficient
approximation schemes with provable guarantees for arbitrary probability
distributions. Our algorithms can also be used for exact or approximate
evaluation of k-nearest neighbor queries over uncertain objects, whose positions
are modeled using continuous probability distributions. Our experimental
evaluation over several datasets illustrates the effectiveness of our approach
at efficiently ranking uncertain datasets with continuous attribute uncertainty.
- Sharing-Aware Horizontal Partitioning for Exploiting Correlations during Query Processing;
Kostas Tzoumas, Amol Deshpande, and Christian Jensen;
VLDB 2010.
[pdf] [abstract]
Optimization of join queries based on average selectivities is sub-optimal in
highly correlated databases. In such databases, relations are naturally divided
into partitions, each partition having substantially different statistical
characteristics. It is very compelling to discover such data partitions during
query optimization and create multiple plans for a given query, one plan being
optimal for a particular combination of data partitions. This scenario calls for
the sharing of state among plans, so that common intermediate results are not
recomputed. We study this problem in a setting with a routing-based query
execution engine based on eddies. Eddies naturally encapsulate horizontal
partitioning and maximal state sharing across multiple plans. We define the
notion of a conditional join plan, a novel representation of the search space
that enables us to address the problem in a principled way. We present a
low-overhead greedy algorithm that uses statistical summaries based on graphical
models. Experimental results suggest an one order of magnitude faster execution
time over traditional optimization for high correlations, while maintaining the
same performance for low correlations.
- Read-Once Functions and Query Evaluation in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
VLDB 2010.
[pdf] [abstract]
Probabilistic databases hold promise of being a viable means for large-scale
uncertainty management, increasingly needed in a number of real world
applications domains. However, query evaluation in probabilistic databases
remains a computational challenge. Prior work on efficient "exact" query
evaluation in probabilistic databases has largely concentrated on query-centric
formulations (e.g., "safe plans", "hierarchical queries"), in that, they only
consider characteristics of the query and not the data in the database. It is
easy to construct examples where a supposedly hard query run on an appropriate
database gives rise to a tractable query evaluation problem. In this paper, we
develop efficient query evaluation techniques that leverage characteristics of
both the query and the data in the database. We focus on tuple-independent
databases where the query evaluation problem is equivalent to computing marginal
probabilities of Boolean formulas associated with the result tuples. This latter
task is easy if the Boolean formulas can be factorized into a form that has
every variable appearing at most once (called "read-once"). However, a naive
approach that directly uses previously developed Boolean formula factorization
algorithms is inefficient, because those algorithms require the input formulas
to be in the disjunctive normal form (DNF). We instead develop novel, more
efficient factorization algorithms that directly construct the read-once
expression for a result tuple Boolean formula (if one exists), for a large
subclass of queries (specifically, conjunctive queries without self-joins). We
empirically demonstrate that (1) our proposed techniques are orders of magnitude
faster than generic inference algorithms for queries where the result Boolean
formulas can be factorized into read-once expressions, and (2) for the special
case of hierarchical queries, they rival the efficiency of prior techniques
specifically designed to handle such queries.
- Lineage Processing on Correlated Probabilistic Databases;
Bhargav Kanagal, Amol Deshpande;
SIGMOD 2010.
[pdf] [abstract]
In this paper, we address the problem of scalably evaluating conjunctive queries
over correlated probabilistic databases containing tuple or attribute
uncertainties. Like previous work, we adopt a two-phase approach where we first
compute "lineages" of the output tuples, and then compute the probabilities of
the lineage formulas. However unlike previous work, we allow for arbitrary and
complex correlations to be present in the data, captured via a forest of
"junction trees". We observe that evaluating even read-once (tree structured)
lineages (e.g., those generated by "hierarchical" conjunctive queries),
polynomially computable over tuple independent probabilistic databases, is
\#P-complete for lightly correlated probabilistic databases like "Markov
sequences". We characterize the complexity of exact computation of the
probability of the lineage formula on a correlated database using a parameter
called "lwidth" (analogous to the notion of "treewidth"). For lineages that
result in low lwidth, we compute exact probabilities using a novel message
passing algorithm, and for lineages that induce large lwidths, we develop
approximate Monte Carlo algorithms to estimate the result probabilities. We
scale our algorithms to very large correlated probabilistic databases using the
previously proposed INDSEP data structure. To mitigate the complexity of lineage
evaluation, we develop optimization techniques to process a batch of lineages by
sharing computation across formulas, and to exploit any independence
relationships that may exist in the data. Our experimental study illustrates the
benefits of using our algorithms for processing lineage formulas over correlated
probabilistic databases.
- On Computing Compression Trees for Data Collection in Wireless Sensor Networks;
Jian Li, Amol Deshpande, and Samir Khuller;
INFOCOM 2010 (also CoRR Technical Report arXiv:0907.5442).
[pdf] [abstract]
We address the problem of efficiently gathering correlated data from a wired or
a wireless sensor network, with the aim of designing algorithms with provable
optimality guarantees, and understanding how close we can get to the known
theoretical lower bounds. Our proposed approach is based on finding an optimal
or a near-optimal "compression tree" for a given sensor network: a compression
tree is a directed tree over the sensor network nodes such that the value of a
node is compressed using the value of its parent. We consider this problem under
different communication models, including the "broadcast communication" model
that enables many new opportunities for energy-efficient data collection. We
draw connections between the data collection problem and a previously studied
graph concept, called "weakly connected dominating sets", and we use this to
develop novel approximation algorithms for the problem. We present comparative
results on several synthetic and real-world datasets showing that our algorithms
construct near-optimal compression trees that yield a significant reduction in
the data collection cost.
- PrDB: Managing and Exploiting Rich Correlations in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
VLDB Journal Special Issue on Uncertain and Probabilistic Databases, 18(6): 1065-1090, 2009.
[pdf] [abstract]
Due to numerous applications producing noisy data, e.g., sensor data,
experimental data, data from uncurated sources, information extraction,
etc., there has been a surge of interest in the development of
probabilistic databases. Most probabilistic database models proposed to
date, however, fail to meet the challenges of real-world
applications on two counts: (1) they often restrict the kinds of
uncertainty that the user can represent; and (2) the query processing
algorithms often cannot scale up to the needs of the application. In
this work, we define a probabilistic database model, "PrDB", that
uses graphical models, a state-of-the-art probabilistic modeling
technique developed within the statistics and machine learning
community, to model uncertain data. We show how this results in a rich, complex
yet compact probabilistic database model, which can capture the commonly
occurring uncertainty models (tuple uncertainty, attribute uncertainty),
more complex models (correlated tuples and attributes) and allows
compact representation (shared and schema-level correlations). In addition, we show
how query evaluation in \PrDB\ translates into inference in an appropriately
augmented graphical model. This allows us to easily use any of a myriad of
exact and approximate inference algorithms developed within the
graphical modeling community. While probabilistic inference provides a generic approach
to solving queries, we show how the use of shared correlations, together
with a novel inference algorithm that we developed based on bisimulation,
can speed query processing significantly.
We present a comprehensive experimental evaluation of
the proposed techniques and show that even with a few shared
correlations, significant speedups are possible.
- A Unified Approach to Ranking in Probabilistic Databases;
Jian Li and Barna Saha and Amol Deshpande;
VLDB 2009 (also CoRR Technical Report arXiv:0904.1366).
[pdf] [talk] [abstract] Recipient of the best paper award.
The dramatic growth in the number of application domains that naturally generate
probabilistic, uncertain data has resulted in a need for efficiently supporting
complex querying and decision-making over such data. In this paper, we present a
unified approach to ranking and top-k query processing in probabilistic
databases by viewing it as a multi-criteria optimization problem, and by
deriving a set of features that capture the key properties of a probabilistic
dataset that dictate the ranked result. We contend that a single, specific
ranking function may not suffice for probabilistic databases, and we instead
propose two parameterized ranking functions, called PRF-w and PRF-e, that
generalize or can approximate many of the previously proposed ranking functions.
We present novel generating functions-based algorithms for efficiently ranking
large datasets according to these ranking functions, even if the datasets
exhibit complex correlations modeled using probabilistic and/xor trees or Markov
networks. We further propose that the parameters of the ranking function be
learned from user preferences, and we develop an approach to learn those
parameters. Finally, we present a comprehensive experimental study that
illustrates the effectiveness of our parameterized ranking functions, especially
PRF-e, at approximating other ranking functions and the scalability of our
proposed algorithms for exact or approximate ranking.
- Bisimulation-based Approximate Lifted Inference;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
UAI 2009.
[pdf] [abstract]
There has been a great deal of recent interest in methods for performing
lifted inference, however most of this work assumes that the first-order
model is given as input to the system. Here, we describe lifted inference
algorithms that determine symmetries and automatically "lift" the
probabilistic model to speedup inference. In particular, we describe
approximate lifted inference techniques that allow the user to trade
off inference accuracy for computational efficiency by using a
handful of tunable parameters, while keeping the error bounded. Our
algorithms are closely related to the graph-theoretic concept of
bisimulation. We report experiments on both synthetic and real data to
show that in the presence of symmetries, run-times for inference can
be improved significantly with approximate lifted inference providing
speedups of upto 2 orders of magnitude over ground inference.
- Consensus Answers for Queries over Probabilistic Databases;
Jian Li and Amol Deshpande;
PODS 2009 (also CoRR Technical Report arXiv:0812.2049v1).
[pdf] [talk] [abstract]
We address the problem of finding a "best" deterministic query answer to a query
over a probabilistic database. For this purpose, we propose the notion of a
consensus world (or a consensus answer) which is a deterministic world (answer)
that minimizes the expected distance to the possible worlds (answers). This
problem can be seen as a generalization of the well-studied inconsistent
information aggregation problems (e.g. rank aggregation) to probabilistic
databases. We consider this problem for various types of queries including SPJ
queries, Top-k queries, group-by aggregate queries, and clustering. For
different distance metrics, we obtain polynomial time optimal or approximation
algorithms for computing the consensus answers (or prove NP-hardness). Most of
our results are for a general probabilistic database model, called "and/xor tree
model", which significantly generalizes previous probabilistic database models
like x-tuples and block-independent disjoint models, and is of independent
interest.
- Indexing Correlated Probabilistic Databases;
Bhargav Kanagal, Amol Deshpande;
SIGMOD 2009.
[pdf] [abstract]
With large amounts of correlated probabilistic data being generated in a wide
range of application domains including sensor networks, information extraction,
event detection etc., effectively managing and querying them has become an
important research challenge. While there is an exhaustive body of literature on
querying independent probabilistic data, supporting efficient queries over
large-scale, correlated databases remains a challenge. In this paper, we develop
efficient data structures and indexes for supporting inference and decision
support queries over such databases. Our proposed
hierarchical data structure is suitable both for in-memory and disk-resident
databases. We represent the correlations in the probabilistic database using
a "junction tree" over the tuple-existence or attribute-value random
variables,
and use "tree partitioning" techniques to build an index structure over it.
We show how to efficiently answer inference and aggregation queries using such
an index, resulting in orders of magnitude performance benefits in most cases.
In addition, we develop novel algorithms for efficiently keeping the index
structure up-to-date as changes (inserts, updates) are made to the probabilistic database.
We present a comprehensive experimental study illustrating the benefits of
our approach to query processing in probabilistic databases.
- Minimizing Communication Cost in Distributed Multi-query Processing;
Jian Li, Amol Deshpande, and Samir Khuller;
ICDE 2009.
[pdf] [abstract]
Increasing prevalence of large-scale distributed monitoring and computing
environments such as sensor networks, scientific federations, Grids etc., has
led to a renewed interest in the area of distributed query processing and
optimization. In this paper we address a general, distributed multi-query
processing problem motivated by the need to minimize the communication cost in
these environments. Specifically we address the problem of optimally sharing
data movement across the communication edges in the communication network given
a set of overlapping queries and query plans for them (specifying the join
orders to be used to execute the queries). Most of the problem variations of our
general problem can be shown to be NP-Hard from a reduction from the Steiner
tree problem. However, we show that the problem can be solved optimally if the
communication network is a tree, and present a novel algorithm for finding the
optimal sharing plan. For general communication networks, we present efficient
approximation algorithms for several variations of the problem. Finally, we
present a preliminary experimental study over synthetic datasets showing both
the need for exploiting sharing of data movement and the effectiveness of our
algorithms at finding such plans.
- Efficient Query Evaluation over Temporally Correlated Probabilistic Streams;
Bhargav Kanagal, Amol Deshpande;
ICDE 2009 (short paper).
[pdf] [abstract]
Many real world applications such as sensor networks and other monitoring
applications naturally generate probabilistic streams that are highly correlated
in both time and space. Query processing over such streaming data must be
cognizant of these correlations, since they can significantly alter the final
query results. Several prior works have suggested approaches to handling
correlations in probabilistic databases. However those approaches are either
unable to represent the types of correlations that probabilistic streams
exhibit, or can not be applied directly to our problem because of their
complexity. In this paper, we develop a system for managing and querying such
streams by exploiting the fact that most real-world probabilistic streams exhibit
highly structured Markovian correlations. Our approach is based on the
previously proposed framework of viewing probabilistic query evaluation as
inference over graphical models; we show how to efficiently construct graphical
models for the common stream processing operators, and how to efficiently
perform inference over them in an incremental fashion. Our extensive
experimental evaluation illustrates the advantages of exploiting the structured
nature of correlations in probabilistic streams.
- Graphical Models for Uncertain Data;
Amol Deshpande, Lise Getoor, and Prithviraj Sen;
Book Chapter. Managing and Mining Uncertain Data, ed. C. Aggarwal, Springer, 2009..
[pdf] [abstract]
Graphical models are a popular and well-studied framework for compact
representation of a joint probability distribution over a large number of
interdependent variables, and for efficient reasoning about such a distribution.
They have been proven useful in a wide range of domains from natural language
processing to computer vision to bioinformatics. In this chapter, we present an
approach to using graphical models for managing and querying large-scale
uncertain databases. We present a unified framework based on the concepts from
graphical models that can model not only tuple-level and attribute-level
uncertainties, but can also handle arbitrary correlations that may be present
among the data; our framework can also naturally capture "shared correlations"
where the same uncertainties and correlations occur repeatedly in the data. We
develop an efficient strategy for query evaluation over such probabilistic
databases by casting the query processing problem as an "inference" problem in
an appropriately constructed graphical model, and present optimizations specific
to probabilistic databases that enable efficient query evaluation. We conclude
the chapter with a discussion of related and future work on these topics.
- Algorithms for Distributional and Adversarial Pipelined Filter Ordering Problems;
Anne Condon, Amol Deshpande, Lisa Hellerstein, and Ning Wu;
ACM Transactions of Algorithms, 5(2), Article 24, March 2009..
[abstract]
Pipelined filter ordering is a central problem in database query optimization. The problem is to
determine the optimal order in which to apply a given set of commutative filters (predicates) to a
set of elements (the tuples of a relation), so as to find, as efficiently as possible, the tuples
that satisfy all of the filters. Optimization of pipelined filter ordering has recently received
renewed attention in the context of environments such as the web, continuous high-speed data
streams, and sensor networks. Pipelined filter ordering problems are also studied in areas such as
fault detection and machine learning under names such as learning with attribute costs, minimum-sum
set cover, and satisficing search. We present algorithms for two natural extensions of the
classical pipelined filter ordering problem: (1) a "distributional type" problem where the filters
run in parallel and the goal is to maximize throughput, and (2) an "adversarial type" problem where
the goal is to minimize the expected value of "multiplicative regret". We present two related
algorithms for solving (1), both running in time O(n^2), which improve on the O(n^3 log(n))
algorithm of Kodialam. We use techniques from our algorithms for (1) to obtain an algorithm for (2).
- Model-based Querying in Sensor Networks;
Amol Deshpande, Carlos Guestrin, Samuel Madden;
Chapter in Encyclopedia of Database Systems. Ling Liu and M. Tamer Ozsu, ed. 2009. .
[pdf]
- Data Compression in Sensor Networks;
Amol Deshpande;
Chapter in Encyclopedia of Database Systems. Ling Liu and M. Tamer Ozsu, ed. 2009. .
[pdf]
- Exploiting Shared Correlations in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
VLDB 2008.
[pdf] [abstract]
There has been a recent surge in work in probabilistic databases, propelled in
large part by the huge increase in noisy data sources --- from sensor data,
experimental data, data from uncurated sources, and many others. There is a
growing need for database management systems that can efficiently represent and
query such data. In this work, we show how data characteristics can be leveraged
to make the query evaluation process more efficient. In particular, we exploit
what we refer to as "shared correlations" where the same uncertainties and
correlations occur repeatedly in the data. Shared correlations occur mainly due
to two reasons: (1) Uncertainty and correlations usually come from general
statistics and rarely vary on a tuple-to-tuple basis; (2) The query evaluation
procedure itself tends to re-introduce the same correlations. Prior work has
shown that the query evaluation problem on probabilistic databases is equivalent
to a probabilistic inference problem on an appropriately constructed
probabilistic graphical model (PGM). We leverage this by introducing a new data
structure, called the "random variable elimination graph" (rv-elim graph) that
can be built from the PGM obtained from query evaluation. We develop techniques
based on bisimulation that can be used to compress the rv-elim graph exploiting
the presence of shared correlations in the PGM, the compressed rv-elim graph can
then be used to run inference. We validate our methods by evaluating them
empirically and show that even with a few shared correlations significant
speed-ups are possible.
- Energy Efficient Monitoring in Sensor Networks;
Amol Deshpande and Samir Khuller and Azarakhsh Malekian and Mohammed Toossi;
LATIN 2008.
[pdf] [abstract]
In this paper we study a set of problems related to efficient energy management
for monitoring applications in wireless sensor networks. We study several
generalizations of a basic problem called Set "k"-Cover. The problem can be
described as follows: we are given a set of sensors, and a set of regions to be
monitored. Each region can be monitored by a subset of sensors. The goal is to
partition the sensors into "k" sets (or time-slots), so that by activating the
set of sensors in a time-slot, we can maximize coverage of the regions. By
activating each sensor in only one of the "k" time slots, we decrease its
battery consumption and extend its battery life significantly (by a factor of
"k"). This problem is known to be "NP"-hard. Our goal is to develop improved
approximation algorithms for this problem. Moreover, we are able to demonstrate
that this algorithm is practical, and yields almost optimal solutions in
practice.
We also consider generalizations of this problem in several
different directions. First, we allow each sensor to be active in "\alpha"
different sets (time-slots). This means that the battery life is extended by a
factor of "\frac{k}{\alpha}", and allows for a richer space of solutions. We
also consider different coverage requirements, such as requiring that the
regions be covered in each time slot, or a certain number of regions be
monitored in each time slot. In the Set "k"-Cover formulation, there is no
requirement that a region be monitored at all, or in any number of time slots.
We develop a randomized rounding algorithm for this problem.
We also
consider extensions where each sensor can monitor only a bounded number of
regions, and not all the regions adjacent to it. This kind of problem may arise
when a sensor has a directional camera, or some other physical constraint might
prevent it from monitoring all adjacent regions even when it is active. We
develop the first approximation algorithms for this problem.
- Predictive Modeling-based Data Collection in Sensor Networks;
Lidan Wang and Amol Deshpande;
EWSN 2008.
[pdf] [abstract] Selected as one of the best papers.
We address the problem of designing practical, energy-efficient protocols for
data collection in wireless sensor networks using predictive modeling. Prior
work has suggested several approaches to capture and exploit the rich spatio-temporal
correlations prevalent in WSNs during data collection.
Although shown to be effective in reducing the data collection cost, those
approaches use simplistic corelation models and further, ignore many
idiosyncrasies of WSNs, in particular the broadcast nature of communication.
Our proposed approach is based on approximating the joint probability distribution
over the sensors using "undirected graphical models", ideally suited to
exploit both the spatial correlations and the broadcast nature of communication.
We present algorithms for optimally using such a model for data collection under
different communication models, and for identifying an appropriate model to use for a given
sensor network. Experiments over synthetic and real-world datasets show that our
approach significantly reduces the data collection cost.
- Flow Algorithms for Parallel Query Optimization;
Amol Deshpande, Lisa Hellerstein;
ICDE 2008.
[pdf] [talk] [abstract]
In this paper we address the problem of minimizing the response time of a
multi-way join query using pipelined (inter-operator) parallelism, in a parallel
or a distributed environment. We observe that in order to fully exploit the
parallelism in the system, we must consider a new class of "interleaving" plans,
where multiple query plans are used simultaneously to minimize the response time
of a query (or maximize the tuple-throughput of the system). We cast the query
planning problem in this environment as a "flow maximization problem", and
present polynomial-time algorithms that (statically) find the optimal set of
plans to use for a large class of multi-way join queries. Our proposed
algorithms also naturally extend to query optimization over web services.
Finally we present an extensive experimental evaluation that demonstrates both
the need to consider such plans in parallel query processing and the
effectiveness of our proposed algorithms.
- Online Filtering, Smoothing and Probabilistic Modeling of Streaming data;
Bhargav Kanagal, Amol Deshpande;
ICDE 2008.
[pdf] [abstract] (Extended version)
In this paper, we address the problem of extending a relational database system
to facilitate efficient real-time application of dynamic probabilistic models to
streaming data. We use the recently proposed abstraction of model-based views
for this purpose, by allowing users to declaratively specify the model to be
applied, and by presenting the output of the models to the user as a
probabilistic database view. We support declarative querying over such views
using an extended version of SQL that allows for querying probabilistic data.
Underneath we use particle filters, a class of sequential Monte Carlo algorithms
commonly used to implement dynamic probabilistic models, to represent the
present and historical states of the model as sets of weighted samples
(particles) that are kept up-to-date as new readings arrive. We develop novel
techniques to convert the queries on the model-based view directly into queries
over particle tables, enabling highly efficient query processing. Finally, we
present exmodel to be applied, and by presenting the output of the models to the
user as a probabilistic database view. We support declarative querying over such
views using an extended version of SQL that allows for querying probabilistic
data. Underneath we use particle filters, a class of sequential Monte C
- Network-Aware Join Processing in Global-Scale Database Federations;
Xiaodon Wang, Randal Burns, Andreas Terzis, Amol Deshpande ;
ICDE 2008.
[pdf] [abstract]
We introduce join scheduling algorithms that employ a balanced network
utilization metric to optimize the use of all network paths in a global-scale
database federation. This metric allows algorithms to exploit excess capacity in
the network, while avoiding narrow, long-haul paths. We give a two-approximate,
polynomial-time algorithm for serial (left-deep) join schedules. We also present
extensions to this algorithm that explore parallel schedules, reduce resource
usage, and define trade-offs between computation and network utilization. We
evaluate these techniques within the SkyQuery federation of Astronomy databases
using spatial-join queries submitted by SkyQuery's users. Experiments show that
our algorithms realize near-optimal network utilization with minor computational
overhead.
- MauveDB: Supporting Model-based User Views in Database Systems;
Amol Deshpande, Sam Madden;
SIGMOD 2006.
[pdf] [talk] [abstract]
Real-world data --- especially when generated by distributed measurement
infrastructures such as sensor networks --- tends to be incomplete, imprecise,
and erroneous, making it impossible to present it to users or feed it directly
into applications. The traditional approach to dealing with this problem is to
first process the data using statistical or probabilistic "models" that can
provide more robust interpretations of the data. Current database systems,
however, do not provide adequate support for applying models to such data,
especially when those models need to be frequently updated as new data arrives
in the system. Hence, most scientists and engineers, who depend on models for
managing their data, do not use database systems for archival or querying at
all; at best, databases serve as a persistent raw data store.
In this
paper we define a new abstraction called "model-based views" and present the
architecture of "MauveDB", the system we are building to support such views.
Just as traditional database views provide logical data independence,
model-based views provide independence from the details of the underlying data
generating mechanism and hide the irregularities of the data by using models to
present a consistent view to the users. MauveDB supports a declarative language
for defining model-based views, allows declarative querying over such views
using SQL, and supports several different materialization strategies and
techniques to efficiently maintain them in the face of frequent updates. We have
implemented a prototype system that currently supports views based on regression
and interpolation, in the Apache Derby open source DBMS, and we present results
that show the utility and performance benefits that can be obtained by
supporting several different types of model-based views in a database system.
- Flow Algorithms for Two Pipelined Filter Ordering Problems;
Anne Condon, Amol Deshpande, Lisa Hellerstein, and Ning Wu;
PODS 2006.
[pdf] [talk] [abstract]
Pipelined filter ordering is a central problem in database query optimization,
and has received renewed attention in recent years in the context of
environments such as the web, continuous high-speed data streams and sensor
networks. In this paper we present efficient algorithms for two natural
extensions of the classical pipelined filter ordering problem: (1) a
"distributional type" problem in a parallel environment where the filters run in
parallel and the goal is to maximize the throughput of the system, and (2) an
"adversarial type" problem where the goal is to minimize the expected value of
"multiplicative regret". We show that both problems can be solved using similar
flow algorithms, which find an optimal ordering scheme in time "O(n^2)", where
"n" is the number of filters. Our algorithm for (1) improves on an earlier
"O(n^3 \log n)" algorithm of Kodialam.
- Approximate Data Collection in Sensor Networks using Probabilistic Models ;
David Chu, Amol Deshpande, Joseph M. Hellerstein, Wei Hong;
ICDE 2006.
[pdf] [talk] [abstract]
Wireless sensor networks are proving to be useful in a variety of settings. A
core challenge in these networks is to minimize energy consumption. Prior
database research has proposed to achieve this by pushing data-reducing
operators like aggregation and selection down into the network. This approach
has proven unpopular with early adopters of sensor network technology, who
typically want to extract complete "dumps" of the sensor readings, i.e., to run
"SELECT *" queries. Unfortunately, because these queries do no data reduction,
they consume significant energy in current sensornet query processors.
In this paper we attack the "SELECT *" problem for sensor networks. We propose a
robust approximate technique called "Ken" that uses "replicated dynamic
probabilistic models" to minimize communication from sensor nodes to the
network's PC base station. In addition to data collection, we show that Ken is
well suited to
anomaly- and event-detection applications.
A key challenge in this work is to intelligently exploit spatial
correlations "across" sensor nodes without imposing undue sensor-to-sensor communication burdens to maintain the models.
Using traces from two real-world sensor network deployments, we demonstrate that relatively simple models can provide
significant communication (and hence energy) savings without undue sacrifice in result quality or frequency. Choosing
optimally among even our simple models is NP-hard, but our experiments show that a greedy heuristic performs nearly as
well as an exhaustive algorithm.
- Model-based Approximate Querying in Sensor Networks ;
Amol Deshpande, Carlos Guestrin, Sam Madden, Joseph M. Hellerstein, Wei Hong;
International Journal on Very Large Data Bases (VLDB Journal), 2005.
[pdf] [abstract]
Declarative queries are proving to be an attractive paradigm for interacting
with networks of wireless sensors. The metaphor that "the sensornet is a
database" is problematic, however, because sensors do not exhaustively represent
the data in the real world. In order to map the raw sensor readings onto
physical reality, a "model" of that reality is required to complement the
readings. In this article, we enrich interactive sensor querying with
statistical modeling techniques. We demonstrate that such models can help
provide answers that are both more meaningful, and, by introducing
approximations with probabilistic confidences, significantly more efficient to
compute in both time and energy. Utilizing the combination of a model and live
data acquisition raises the challenging optimization problem of selecting the
best sensor readings to acquire, balancing the increase in the confidence of our
answer against the communication
and data acquisition costs in the network. We describe an exponential time algorithm for finding the optimal solution
to this optimization problem, and a polynomial-time heuristic for identifying solutions that perform well in practice.
We evaluate our approach on several real-world sensor-network data sets, taking into account the real measured data and
communication quality, demonstrating that our model-based approach provides a high-fidelity representation of the real
phenomena and leads to significant performance gains versus traditional data acquisition techniques.
- Toward On-line Schema Evolution for Non-stop Systems ;
Amol Deshpande, Michael Hicks;
HPTS 2005.
[talk]
- Resource-Aware Wireless Sensor-Actuator Networks;
Amol Deshpande, Carlos Guestrin, Sam Madden;
IEEE Data Engineering Bulletin March 2005.
[pdf] [abstract]
Innovations in wireless sensor networks (WSNs) have dramatically expanded
the applicability of control technology in day-to-day life, by enabling
the cost-effective deployment of large scale sensor-actuator systems.
In this paper, we discuss the issues and challenges involved in deploying control-oriented
applications over unreliable, resource-constrained WSNs, and describe the design
of our planned Sensor Control System (SCS) that can enable the rapid development and
deployment of such applications.
- Exploiting Correlated Attributes in Acquisitional Query Processing;
Amol Deshpande, Carlos Guestrin, Sam Madden, Wei Hong;
ICDE 2005.
[pdf] [talk] [abstract]
Sensor networks and other distributed information systems (such as the Web) must
frequently access data that has a high per-attribute "acquisition" cost, in
terms of energy, latency, or computational resources. When executing queries
that contain several predicates over such expensive attributes, we observe that
it can be beneficial to use correlations to automatically introduce low-cost
attributes whose observation will allow the query processor to better estimate
the selectivity of these expensive predicates. In particular, we show
how to build "conditional plans" that branch into one or more
sub-plans, each with a different ordering for the expensive query
predicates, based on the runtime observation of low-cost
attributes. We frame the problem of constructing the optimal
conditional plan for a given user query and set of candidate low-cost
attributes as an optimization problem. We describe an exponential
time algorithm for finding such optimal plans, and describe a
polynomial-time heuristic for identifying conditional plans that
perform well in practice. We also show how to compactly model
conditional probability distributions needed to identify correlations
and build these plans. We evaluate our algorithms against several
real-world sensor-network data sets, showing several-times performance
increases for a variety of queries versus traditional optimization techniques.
- Using Probabilistic Models for Data Management in Acquisitional Environments;
Amol Deshpande, Carlos Guestrin, Sam Madden;
CIDR 2005.
[pdf] [abstract]
Traditional database systems, particularly those focused on capturing
and managing data from the real world, are poorly equipped to deal
with the noise, loss, and uncertainty in data. We discuss a suite of
techniques based on probabilistic models that are designed to allow
database to tolerate noise and loss. These techniques are based on
exploiting correlations to predict missing values and identify outliers.
Interestingly, correlations also provide a way to give
approximate answers to users at a significantly lower cost and enable
a range of new types of queries over the correlation structure itself.
We illustrate a host of applications for our new techniques and
queries, ranging from sensor networks to network monitoring to data
stream management. We also present a unified architecture for
integrating such models into database systems, focusing in particular
on "acquisitional systems" where the cost of capturing data (e.g.,
from sensors) is itself a significant part of the query processing
cost.
- Lifting the Burden of History from Adaptive Query Processing;
Amol Deshpande, Joseph M. Hellerstein ;
VLDB 2004 .
[pdf] [abstract]
Adaptive query processing schemes attempt to reoptimize query plans
during the course of query execution. A variety of techniques for
adaptive query processing have been proposed, varying in the
granularity at which they can make decisions. The eddy
is the most aggressive of these techniques,
with the flexibility to choose tuple-by-tuple how to
order the application of operators.
In this paper we identify and address a fundamental limitation of the
original eddies proposal: the "burden of history" in routing. We
observe that routing decisions have long-term effects on the state of
operators in the query, and can severely constrain the ability of the
eddy to adapt over time.
We then propose a
mechanism we call STAIR that allows the query engine to manipulate
the state stored inside the operators and undo the
effects of past routing decisions. We demonstrate that eddies with
STAIR achieve both high adaptivity and good performance in the
face of uncertainty, outperforming prior eddy proposals by orders of magnitude.
- Model-Driven Data Acquisition in Sensor Networks;
Amol Deshpande, Carlos Guestrin, Sam Madden, Joseph M. Hellerstein, Wei Hong;
VLDB 2004.
[pdf] [abstract] Recipient of the best paper award.
Declarative queries are proving to be an attractive paradigm for interacting
with networks of wireless sensors. The metaphor that "the sensornet is a
database" is problematic, however, because sensors do not exhaustively represent
the data in the real world. In order to map the raw sensor readings onto
physical reality, a "model" of that reality is required to complement the
readings. In this paper, we enrich interactive sensor querying with statistical
modeling techniques. We demonstrate that such models can help provide
answers that are both more meaningful, and, by introducing approximations
with probabilistic confidences,
significantly more efficient to compute in both time and energy. Utilizing the combination of
a model and live data acquisition raises the challenging optimization
problem of selecting the best sensor readings to acquire, balancing the increase
in the confidence of our answer against the communication and data acquisition costs in the network.
We describe an exponential
time algorithm for finding the optimal solution to this optimization problem, and a
polynomial-time heuristic for identifying solutions that
perform well in practice. We evaluate our approach on several
real-world sensor-network data sets, taking into account the real measured data and communication quality, demonstrating that our model-based approach
provides a high-fidelity representation of the real phenomena and leads to significant performance gains versus traditional data acquisition
techniques.
- An Initial Study of Overheads of Eddies;
Amol Deshpande;
SIGMOD Record, March 2004.
[pdf] [abstract]
An eddy is a highly adaptive query processing operator that continuously
reoptimizes a query in respose to changing runtime conditions. It does this by
treating query processing as routing of tuples through operators and making
per-tuple routing decisions. The benefits of such adaptivity can be significant,
especially in highly dynamic environments such as data streams, sensor query
processing, web querying, etc. Various parties have asserted that the cost of
making per-tuple routing decisions is prohibitive. We have implemented eddies in
the PostgreSQL open source database system in the context of the TelegraphCQ
project. In this paper, we present an "apples-to-apples" comparison of
PostgreSQL query processing overhead with and without eddies. Our results show
that with some minor tuning, the overhead of the eddy mechanism is negligible.
- Cache-and-Query for Wide Area Sensor Databases;
Amol Deshpande, Suman Nath, Phil Gibbons, Srini Seshan;
SIGMOD 2003.
[pdf] [talk] [abstract] (IrisNet Project Web Page)
Webcams, microphones, pressure gauges and other sensors provide exciting new
opportunities for querying and monitoring the physical world. In this paper we
focus on querying wide area sensor databases, containing (XML) data derived from
sensors spread over tens to thousands of miles. We present the first scalable
system for executing XPATH queries on such databases. The system maintains the
logical view of the data as a single XML document, while physically the data is
fragmented across any number of host nodes. For scalability, sensor data is
stored close to the sensors, but can be cached elsewhere as dictated by the
queries. Our design enables self-starting distributed queries that jump directly
to the lowest common ancestor of the query result, dramatically reducing query
response times. We present a novel query-evaluategather technique (using XSLT)
for detecting (1) which data in a local database fragment is part of the query
result, and (2) how to gather the missing parts. We define partitioning and
cache invariants that ensure that even partial matches on cached data are
exploited and that correct answers are returned, despite our dynamic
query-driven caching. Experimental results demonstrate that our techniques dramatically increase query throughputs and decrease query response
times in wide area sensor databases.
- Using State Modules for Adaptive Query Processing;
Vijayshankar Raman, Amol Deshpande, Joe Hellerstein;
ICDE 2003.
[pdf] [abstract]
We present a query architecture in which join operators are decomposed into
their constituent data structures (State Modules, or SteMs), and data¤ow among
these SteMs is managed adaptively by an Eddy routing operator. Breaking the
encapsulation of joins serves two purposes. First, it allows the Eddy to observe
multiple physical operations embedded in a join algorithm, allowing for better
calibration and control of these operations. Second, the SteM on a relation
serves as a shared materialization point, enabling multiple competing access
methods to share results, which can be leveraged by multiple competing join
algorithms. Our architecture extends prior work significantly, allowing
continuously adaptive decisions for most major aspects of traditional query
optimization: choice of access methods and join algorithms, ordering of
operators, and choice of a query spanning tree. SteMs introduce significant
routing flexibility to the Eddy, enabling more opportunities for adaptation, but
also introducing the possibility of incorrect query results. We present
constraints on Eddy routing through SteMs that ensure correctness while
preserving a great deal of flexibility. We also
demonstrate the benefits of our architecture via experiments in the Telegraph dataflow system. We show that even a simple routing policy allows
significant ¤exibility in adaptation, including novel effects like the automatc "hybridization" of multiple algorithms for a single join.
- TelegraphCQ: An Architectural Status Report ;
Sailesh Krishnamurthy, Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Samuel Madden, Frederick Reiss, Mehul A. Shah;
IEEE Data Engineering Bulletin 26(1): 11-18 (2003).
- TelegraphCQ: Continuous Dataflow Processing for an Uncertain World ;
Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Mike Franklin, Joe Hellerstein, Wei Hong, Sailesh Krishnamurthy, Sam Madden, Vijayshankar Raman, Fred Reiss, Mehul Shah ;
CIDR 2003.