- Lightweight Graphical Models for Selectivity Estimation Without Independence Assumptions;
Kostas Tzoumas, Amol Deshpande, and Christian Jensen;
VLDB 2011.
As a result of decades of research and industrial development, modern query
optimizers are complex software artifacts. However, the quality of the query
plan chosen by an optimizer is largely determined by the quality of the
... [more]
As a result of decades of research and industrial development, modern query
optimizers are complex software artifacts. However, the quality of the query
plan chosen by an optimizer is largely determined by the quality of the
underlying statistical summaries. Small selectivity estimation errors,
propagated exponentially, can lead to severely sub-optimal plans. Modern
optimizers typically maintain one-dimensional statistical summaries and make the
attribute value independence and join uniformity assumptions for efficiently
estimating selectivities. Therefore, selectivity estimation errors in today's
optimizers are frequently caused by missed correlations between attributes. We
present a selectivity estimation approach that does not make the independence
assumptions. By carefully using concepts from the field of graphical models, we
are able to factor the joint probability distribution of all the attributes in
the database into small, usually two dimensional distributions. We describe
several optimizations that can make selectivity estimation highly efficient, and
we present a complete implementation inside PostgreSQL's query optimizer.
Experimental results indicate an order of magnitude better selectivity
estimates, while keeping optimization time in the range of tens of milliseconds.
- Parallel Pipelined Filter Ordering with Precedence Constraints;
Amol Deshpande, Lisa Hellerstein;
Accepted to appear in Transactions on Algorithms..
In the parallel pipelined filter ordering problem, we are given a set of `n'
filters that run in parallel. The filters need to be applied to a stream of
elements, to determine which elements pass all filters. Each filter has a
... [more]
In the parallel pipelined filter ordering problem, we are given a set of `n'
filters that run in parallel. The filters need to be applied to a stream of
elements, to determine which elements pass all filters. Each filter has a
"rate limit" r_i on the number of elements it can process per unit time, and a
"selectivity" p_i, which is the probability that a random element will pass the
filter. The goal is to maximize throughput. This problem appears naturally in
a variety of settings, including parallel query optimization in databases and
query processing over Web services. We present an O(n^3) algorithm for the
above problem, given tree-structured precedence constraints on the filters.
This extends work of Condon et al. and Kodialam, who presented algorithms for
solving the problem without precedence constraints. Our algorithm is
combinatorial and produces a sparse solution. Motivated by join operators in
database queries, we also give algorithms for versions of the problem in which
"filter" selectivities may be greater than or equal to 1. We prove a strong
connection between the more classical problem of minimizing total work in
sequential filter ordering (A), and the parallel pipelined filter ordering
problem (B). More precisely, we prove that A is solvable in polynomial
time for a given class of precedence constraints if and only if B is as
well. This equivalence allows us to show that B is NP-Hard in the
presence of arbitrary precedence constraints (since A known to be NP-Hard
in that setting).
- Sharing-Aware Horizontal Partitioning for Exploiting Correlations during Query Processing;
Kostas Tzoumas, Amol Deshpande, and Christian Jensen;
VLDB 2010.
[pdf]
Optimization of join queries based on average selectivities is sub-optimal in
highly correlated databases. In such databases, relations are naturally divided
into partitions, each partition having substantially different statistical
... [more]
Optimization of join queries based on average selectivities is sub-optimal in
highly correlated databases. In such databases, relations are naturally divided
into partitions, each partition having substantially different statistical
characteristics. It is very compelling to discover such data partitions during
query optimization and create multiple plans for a given query, one plan being
optimal for a particular combination of data partitions. This scenario calls for
the sharing of state among plans, so that common intermediate results are not
recomputed. We study this problem in a setting with a routing-based query
execution engine based on eddies. Eddies naturally encapsulate horizontal
partitioning and maximal state sharing across multiple plans. We define the
notion of a conditional join plan, a novel representation of the search space
that enables us to address the problem in a principled way. We present a
low-overhead greedy algorithm that uses statistical summaries based on graphical
models. Experimental results suggest an one order of magnitude faster execution
time over traditional optimization for high correlations, while maintaining the
same performance for low correlations.
- Minimizing Communication Cost in Distributed Multi-query Processing;
Jian Li, Amol Deshpande, and Samir Khuller;
ICDE 2009.
[pdf]
Increasing prevalence of large-scale distributed monitoring and computing
environments such as sensor networks, scientific federations, Grids etc., has
led to a renewed interest in the area of distributed query processing and
... [more]
Increasing prevalence of large-scale distributed monitoring and computing
environments such as sensor networks, scientific federations, Grids etc., has
led to a renewed interest in the area of distributed query processing and
optimization. In this paper we address a general, distributed multi-query
processing problem motivated by the need to minimize the communication cost in
these environments. Specifically we address the problem of optimally sharing
data movement across the communication edges in the communication network given
a set of overlapping queries and query plans for them (specifying the join
orders to be used to execute the queries). Most of the problem variations of our
general problem can be shown to be NP-Hard from a reduction from the Steiner
tree problem. However, we show that the problem can be solved optimally if the
communication network is a tree, and present a novel algorithm for finding the
optimal sharing plan. For general communication networks, we present efficient
approximation algorithms for several variations of the problem. Finally, we
present a preliminary experimental study over synthetic datasets showing both
the need for exploiting sharing of data movement and the effectiveness of our
algorithms at finding such plans.
- Algorithms for Distributional and Adversarial Pipelined Filter Ordering Problems;
Anne Condon, Amol Deshpande, Lisa Hellerstein, and Ning Wu;
ACM Transactions of Algorithms, 5(2), Article 24, March 2009..
Pipelined filter ordering is a central problem in database query optimization. The problem is to
determine the optimal order in which to apply a given set of commutative filters (predicates) to a
set of elements (the tuples of a relation), so as to find, as efficiently as possible, the tuples
... [more]
Pipelined filter ordering is a central problem in database query optimization. The problem is to
determine the optimal order in which to apply a given set of commutative filters (predicates) to a
set of elements (the tuples of a relation), so as to find, as efficiently as possible, the tuples
that satisfy all of the filters. Optimization of pipelined filter ordering has recently received
renewed attention in the context of environments such as the web, continuous high-speed data
streams, and sensor networks. Pipelined filter ordering problems are also studied in areas such as
fault detection and machine learning under names such as learning with attribute costs, minimum-sum
set cover, and satisficing search. We present algorithms for two natural extensions of the
classical pipelined filter ordering problem: (1) a "distributional type" problem where the filters
run in parallel and the goal is to maximize throughput, and (2) an "adversarial type" problem where
the goal is to minimize the expected value of "multiplicative regret". We present two related
algorithms for solving (1), both running in time O(n^2), which improve on the O(n^3 log(n))
algorithm of Kodialam. We use techniques from our algorithms for (1) to obtain an algorithm for (2).
- Flow Algorithms for Parallel Query Optimization;
Amol Deshpande, Lisa Hellerstein;
ICDE 2008.
[pdf] [talk]
In this paper we address the problem of minimizing the response time of a
multi-way join query using pipelined (inter-operator) parallelism, in a parallel
or a distributed environment. We observe that in order to fully exploit the
... [more]
In this paper we address the problem of minimizing the response time of a
multi-way join query using pipelined (inter-operator) parallelism, in a parallel
or a distributed environment. We observe that in order to fully exploit the
parallelism in the system, we must consider a new class of "interleaving" plans,
where multiple query plans are used simultaneously to minimize the response time
of a query (or maximize the tuple-throughput of the system). We cast the query
planning problem in this environment as a "flow maximization problem", and
present polynomial-time algorithms that (statically) find the optimal set of
plans to use for a large class of multi-way join queries. Our proposed
algorithms also naturally extend to query optimization over web services.
Finally we present an extensive experimental evaluation that demonstrates both
the need to consider such plans in parallel query processing and the
effectiveness of our proposed algorithms.
- Network-Aware Join Processing in Global-Scale Database Federations;
Xiaodon Wang, Randal Burns, Andreas Terzis, Amol Deshpande ;
ICDE 2008.
[pdf]
We introduce join scheduling algorithms that employ a balanced network
utilization metric to optimize the use of all network paths in a global-scale
database federation. This metric allows algorithms to exploit excess capacity in
... [more]
We introduce join scheduling algorithms that employ a balanced network
utilization metric to optimize the use of all network paths in a global-scale
database federation. This metric allows algorithms to exploit excess capacity in
the network, while avoiding narrow, long-haul paths. We give a two-approximate,
polynomial-time algorithm for serial (left-deep) join schedules. We also present
extensions to this algorithm that explore parallel schedules, reduce resource
usage, and define trade-offs between computation and network utilization. We
evaluate these techniques within the SkyQuery federation of Astronomy databases
using spatial-join queries submitted by SkyQuery's users. Experiments show that
our algorithms realize near-optimal network utilization with minor computational
overhead.
- Adaptive Query Processing;
Amol Deshpande, Zack Ives, Vijayshankar Raman;
Foundations and Trends in Databases, 1(1), 2007.
(A "greener" version of the pdf.)
As the data management field has diversified to consider settings in which
queries are increasingly complex, statistics are less available, or data is
stored remotely, there has been an acknowledgment that the traditional
... [more]
As the data management field has diversified to consider settings in which
queries are increasingly complex, statistics are less available, or data is
stored remotely, there has been an acknowledgment that the traditional
optimize-then-execute paradigm is insufficient. This has led to a plethora of
new techniques, generally placed under the common banner of adaptive query
processing, that focus on using runtime feedback to modify query processing in a
way that provides better response time or more efficient CPU
utilization.
In this survey paper, we identify many of the common
issues, themes, and approaches that pervade this work, and the settings in which
each piece of work is most appropriate. Our goal with this paper is to be a
"value-add" over the existing papers on the material, providing not only a brief
overview of each technique, but also a basic framework for understanding the
field of adaptive query processing in general. We focus primarily on intra-query
adaptivity of long-running, but not full-fledged streaming, queries. We conclude
with a discussion of open research problems that are of high importance.
- Flow Algorithms for Two Pipelined Filter Ordering Problems;
Anne Condon, Amol Deshpande, Lisa Hellerstein, and Ning Wu;
PODS 2006.
[pdf] [talk]
Pipelined filter ordering is a central problem in database query optimization,
and has received renewed attention in recent years in the context of
environments such as the web, continuous high-speed data streams and sensor
... [more]
Pipelined filter ordering is a central problem in database query optimization,
and has received renewed attention in recent years in the context of
environments such as the web, continuous high-speed data streams and sensor
networks. In this paper we present efficient algorithms for two natural
extensions of the classical pipelined filter ordering problem: (1) a
"distributional type" problem in a parallel environment where the filters run in
parallel and the goal is to maximize the throughput of the system, and (2) an
"adversarial type" problem where the goal is to minimize the expected value of
"multiplicative regret". We show that both problems can be solved using similar
flow algorithms, which find an optimal ordering scheme in time "O(n^2)", where
"n" is the number of filters. Our algorithm for (1) improves on an earlier
"O(n^3 \log n)" algorithm of Kodialam.
- Exploiting Correlated Attributes in Acquisitional Query Processing;
Amol Deshpande, Carlos Guestrin, Sam Madden, Wei Hong;
ICDE 2005.
[pdf] [talk]
Sensor networks and other distributed information systems (such as the Web) must
frequently access data that has a high per-attribute "acquisition" cost, in
terms of energy, latency, or computational resources. When executing queries
... [more]
Sensor networks and other distributed information systems (such as the Web) must
frequently access data that has a high per-attribute "acquisition" cost, in
terms of energy, latency, or computational resources. When executing queries
that contain several predicates over such expensive attributes, we observe that
it can be beneficial to use correlations to automatically introduce low-cost
attributes whose observation will allow the query processor to better estimate
the selectivity of these expensive predicates. In particular, we show
how to build "conditional plans" that branch into one or more
sub-plans, each with a different ordering for the expensive query
predicates, based on the runtime observation of low-cost
attributes. We frame the problem of constructing the optimal
conditional plan for a given user query and set of candidate low-cost
attributes as an optimization problem. We describe an exponential
time algorithm for finding such optimal plans, and describe a
polynomial-time heuristic for identifying conditional plans that
perform well in practice. We also show how to compactly model
conditional probability distributions needed to identify correlations
and build these plans. We evaluate our algorithms against several
real-world sensor-network data sets, showing several-times performance
increases for a variety of queries versus traditional optimization techniques.
- Lifting the Burden of History from Adaptive Query Processing;
Amol Deshpande, Joseph M. Hellerstein ;
VLDB 2004 .
[pdf]
Adaptive query processing schemes attempt to reoptimize query plans
during the course of query execution. A variety of techniques for
adaptive query processing have been proposed, varying in the
... [more]
Adaptive query processing schemes attempt to reoptimize query plans
during the course of query execution. A variety of techniques for
adaptive query processing have been proposed, varying in the
granularity at which they can make decisions. The eddy
is the most aggressive of these techniques,
with the flexibility to choose tuple-by-tuple how to
order the application of operators.
In this paper we identify and address a fundamental limitation of the
original eddies proposal: the "burden of history" in routing. We
observe that routing decisions have long-term effects on the state of
operators in the query, and can severely constrain the ability of the
eddy to adapt over time.
We then propose a
mechanism we call STAIR that allows the query engine to manipulate
the state stored inside the operators and undo the
effects of past routing decisions. We demonstrate that eddies with
STAIR achieve both high adaptivity and good performance in the
face of uncertainty, outperforming prior eddy proposals by orders of magnitude.
- An Initial Study of Overheads of Eddies;
Amol Deshpande;
SIGMOD Record, March 2004.
[pdf]
An eddy is a highly adaptive query processing operator that continuously
reoptimizes a query in respose to changing runtime conditions. It does this by
treating query processing as routing of tuples through operators and making
... [more]
An eddy is a highly adaptive query processing operator that continuously
reoptimizes a query in respose to changing runtime conditions. It does this by
treating query processing as routing of tuples through operators and making
per-tuple routing decisions. The benefits of such adaptivity can be significant,
especially in highly dynamic environments such as data streams, sensor query
processing, web querying, etc. Various parties have asserted that the cost of
making per-tuple routing decisions is prohibitive. We have implemented eddies in
the PostgreSQL open source database system in the context of the TelegraphCQ
project. In this paper, we present an "apples-to-apples" comparison of
PostgreSQL query processing overhead with and without eddies. Our results show
that with some minor tuning, the overhead of the eddy mechanism is negligible.
- Cache-and-Query for Wide Area Sensor Databases;
Amol Deshpande, Suman Nath, Phil Gibbons, Srini Seshan;
SIGMOD 2003.
[pdf] [talk] (IrisNet Project Web Page)
Webcams, microphones, pressure gauges and other sensors provide exciting new
opportunities for querying and monitoring the physical world. In this paper we
focus on querying wide area sensor databases, containing (XML) data derived from
... [more]
Webcams, microphones, pressure gauges and other sensors provide exciting new
opportunities for querying and monitoring the physical world. In this paper we
focus on querying wide area sensor databases, containing (XML) data derived from
sensors spread over tens to thousands of miles. We present the first scalable
system for executing XPATH queries on such databases. The system maintains the
logical view of the data as a single XML document, while physically the data is
fragmented across any number of host nodes. For scalability, sensor data is
stored close to the sensors, but can be cached elsewhere as dictated by the
queries. Our design enables self-starting distributed queries that jump directly
to the lowest common ancestor of the query result, dramatically reducing query
response times. We present a novel query-evaluategather technique (using XSLT)
for detecting (1) which data in a local database fragment is part of the query
result, and (2) how to gather the missing parts. We define partitioning and
cache invariants that ensure that even partial matches on cached data are
exploited and that correct answers are returned, despite our dynamic
query-driven caching. Experimental results demonstrate that our techniques dramatically increase query throughputs and decrease query response
times in wide area sensor databases.
- Using State Modules for Adaptive Query Processing;
Vijayshankar Raman, Amol Deshpande, Joe Hellerstein;
ICDE 2003.
[pdf]
We present a query architecture in which join operators are decomposed into
their constituent data structures (State Modules, or SteMs), and data¤ow among
these SteMs is managed adaptively by an Eddy routing operator. Breaking the
... [more]
We present a query architecture in which join operators are decomposed into
their constituent data structures (State Modules, or SteMs), and data¤ow among
these SteMs is managed adaptively by an Eddy routing operator. Breaking the
encapsulation of joins serves two purposes. First, it allows the Eddy to observe
multiple physical operations embedded in a join algorithm, allowing for better
calibration and control of these operations. Second, the SteM on a relation
serves as a shared materialization point, enabling multiple competing access
methods to share results, which can be leveraged by multiple competing join
algorithms. Our architecture extends prior work significantly, allowing
continuously adaptive decisions for most major aspects of traditional query
optimization: choice of access methods and join algorithms, ordering of
operators, and choice of a query spanning tree. SteMs introduce significant
routing flexibility to the Eddy, enabling more opportunities for adaptation, but
also introducing the possibility of incorrect query results. We present
constraints on Eddy routing through SteMs that ensure correctness while
preserving a great deal of flexibility. We also
demonstrate the benefits of our architecture via experiments in the Telegraph dataflow system. We show that even a simple routing policy allows
significant ¤exibility in adaptation, including novel effects like the automatc "hybridization" of multiple algorithms for a single join.
- TelegraphCQ: An Architectural Status Report ;
Sailesh Krishnamurthy, Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Samuel Madden, Frederick Reiss, Mehul A. Shah;
IEEE Data Engineering Bulletin 26(1): 11-18 (2003).
- TelegraphCQ: Continuous Dataflow Processing for an Uncertain World ;
Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Mike Franklin, Joe Hellerstein, Wei Hong, Sailesh Krishnamurthy, Sam Madden, Vijayshankar Raman, Fred Reiss, Mehul Shah ;
CIDR 2003.
- Decoupled Query Optimization in Federated Databases;
Amol Deshpande, Joe Hellerstein;
ICDE 2002.
[pdf] [talk]
We study the problem of query optimization in federated relational database
systems. The nature of federated databases explicitly decouples many aspects of
the optimization process, often making it imperative for the optimizer to
... [more]
We study the problem of query optimization in federated relational database
systems. The nature of federated databases explicitly decouples many aspects of
the optimization process, often making it imperative for the optimizer to
consult underlying data sources while doing costbased optimization. This not
only increases the cost of optimization, but also changes the trade-offs
involved in the optimization process significantly. The dominant cost in the
decoupled optimization process is the "cost of costing" that traditionally has
been considered insignificant. The optimizer can only afford a few rounds of
messages to the underlying data sources and hence the optimization techniques in
this environment must be geared toward gathering all the required cost
information with minimal communication. In this paper, we explore the design
space for a query optimizer in this environment and demonstrate the need for
decoupling various aspects of the optimization process. We present
minimum-communication decoupled variants of various query optimization
techniques, and discuss tradeoffs in their performance in this scenario. We have
implemented these techniques in
the Cohera federated database system and our experimental results, somewhat surprisingly, indicate that a simple two-phase optimization scheme performs
fairly well as long as the physical database design is known to the optimizer, though more aggressive algorithms are required otherwise.
- Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data;
Amol Deshpande, Minos Garofalakis, Rajeev Rastogi;
SIGMOD 2001.
[pdf] [talk]
Approximating the joint data distribution of a multi-dimensional data set
through a compact and accurate histogram synopsis is a fundamental problem
arising in numerous practical scenarios, including query optimization and
... [more]
Approximating the joint data distribution of a multi-dimensional data set
through a compact and accurate histogram synopsis is a fundamental problem
arising in numerous practical scenarios, including query optimization and
approximate query answering. Existing solutions either rely on simplistic
independence assumptions or try to directly approximate the full joint data
distribution over the complete set of attributes. Unfortunately, both approaches
are doomed to fail for high-dimensional data sets with complex correlation
patterns between attributes. In this paper, we propose a novel approach to
histogram-based synopses that employs the solid foundation of statistical
interaction models to explicitly identify and exploit the statistical
characteristics of the data. Abstractly, our key idea is to break the synopsis
into (1) a statistical interaction model that accurately captures significant
correlation and independence patterns in data, and (2) a collection of
histograms on low-dimensional marginals that, based on the model, can provide
accurate approximations of the overall joint data distribution. Extensive
experimental results with several real-life data sets
verify the effectiveness of our approach. An important aspect of our general, model-based methodology is that it can be used to enhance the performance
of other synopsis techniques that are based on data-space partitioning (e.g., wavelets) by providing an effective tool to deal with the
"dimensionality curse".
- Adaptive Query Processing: Technology in Evolution;
Joe Hellerstein, Mike Franklin, Sirish Chandrasekaran, Amol Deshpande, Kris Hildrum, Sam Madden, Vijayshankar Raman, Mehul Shah;
IEEE Data Engineering Bulletin 23(2): 7-18 (2000) .