There has been a recent surge in work in probabilistic databases, propelled in large part by the huge increase in noisy data sources --- from sensor data, experimental data, data from uncurated sources, and many others. There is a growing need for database management systems that can efficiently represent and query such data. In this work, we show how data characteristics can be leveraged to make the query evaluation process more efficient. In particular, we exploit what we refer to as "shared correlations" where the same uncertainties and correlations occur repeatedly in the data. Shared correlations occur mainly due to two reasons: (1) Uncertainty and correlations usually come from general statistics and rarely vary on a tuple-to-tuple basis; (2) The query evaluation procedure itself tends to re-introduce the same correlations. Prior work has shown that the query evaluation problem on probabilistic databases is equivalent to a probabilistic inference problem on an appropriately constructed probabilistic graphical model (PGM). We leverage this by introducing a new data structure, called the "random variable elimination graph" (rv-elim graph) that can be built from the PGM obtained from query evaluation. We develop techniques based on bisimulation that can be used to compress the rv-elim graph exploiting the presence of shared correlations in the PGM, the compressed rv-elim graph can then be used to run inference. We validate our methods by evaluating them empirically and show that even with a few shared correlations significant speed-ups are possible.
In this paper we study a set of problems related to efficient energy management for monitoring applications in wireless
sensor networks. We study several generalizations of a basic problem called Set "k"-Cover. The problem can be described
as follows: we are given a set of sensors, and a set of regions to be monitored. Each region can be monitored by a
subset of sensors. The goal is to partition the sensors into "k" sets (or time-slots), so that by activating the set of
sensors in a time-slot, we can maximize coverage of the regions. By activating each sensor in only one of the "k" time
slots, we decrease its battery consumption and extend its battery life significantly (by a factor of "k"). This problem
is known to be "NP"-hard. Our goal is to develop improved approximation algorithms for this problem. Moreover, we are
able to demonstrate that this algorithm is practical, and yields almost optimal solutions in practice.
We also consider generalizations of this problem in several different directions. First, we allow each sensor to be
active in "\alpha" different sets (time-slots). This means that the battery life is extended by a factor of
"\frac{k}{\alpha}", and allows for a richer space of solutions. We also consider different coverage requirements, such
as requiring that the regions be covered in each time slot, or a certain number of regions be monitored in each time
slot. In the Set "k"-Cover formulation, there is no requirement that a region be monitored at all, or in any number of
time slots. We develop a randomized rounding algorithm for this problem.
We also consider extensions where each sensor can monitor only a bounded number of regions, and not all the regions
adjacent to it. This kind of problem may arise when a sensor has a directional camera, or some other physical
constraint might prevent it from monitoring all adjacent regions even when it is active. We develop the first
approximation algorithms for this problem.
We address the problem of designing practical, energy-efficient protocols for data collection in wireless sensor networks using predictive modeling. Prior work has suggested several approaches to capture and exploit the rich spatio-temporal correlations prevalent in WSNs during data collection. Although shown to be effective in reducing the data collection cost, those approaches use simplistic corelation models and further, ignore many idiosyncrasies of WSNs, in particular the broadcast nature of communication. Our proposed approach is based on approximating the joint probability distribution over the sensors using "undirected graphical models", ideally suited to exploit both the spatial correlations and the broadcast nature of communication. We present algorithms for optimally using such a model for data collection under different communication models, and for identifying an appropriate model to use for a given sensor network. Experiments over synthetic and real-world datasets show that our approach significantly reduces the data collection cost.
In this paper we address the problem of minimizing the response time of a multi-way join query using pipelined (inter-operator) parallelism, in a parallel or a distributed environment. We observe that in order to fully exploit the parallelism in the system, we must consider a new class of "interleaving" plans, where multiple query plans are used simultaneously to minimize the response time of a query (or maximize the tuple-throughput of the system). We cast the query planning problem in this environment as a "flow maximization problem", and present polynomial-time algorithms that (statically) find the optimal set of plans to use for a large class of multi-way join queries. Our proposed algorithms also naturally extend to query optimization over web services. Finally we present an extensive experimental evaluation that demonstrates both the need to consider such plans in parallel query processing and the effectiveness of our proposed algorithms.
In this paper, we address the problem of extending a relational database system to facilitate efficient real-time application of dynamic probabilistic models to streaming data. We use the recently proposed abstraction of model-based views for this purpose, by allowing users to declaratively specify the model to be applied, and by presenting the output of the models to the user as a probabilistic database view. We support declarative querying over such views using an extended version of SQL that allows for querying probabilistic data. Underneath we use particle filters, a class of sequential Monte Carlo algorithms commonly used to implement dynamic probabilistic models, to represent the present and historical states of the model as sets of weighted samples (particles) that are kept up-to-date as new readings arrive. We develop novel techniques to convert the queries on the model-based view directly into queries over particle tables, enabling highly efficient query processing. Finally, we present exmodel to be applied, and by presenting the output of the models to the user as a probabilistic database view. We support declarative querying over such views using an extended version of SQL that allows for querying probabilistic data. Underneath we use particle filters, a class of sequential Monte C
We introduce join scheduling algorithms that employ a balanced network utilization metric to optimize the use of all network paths in a global-scale database federation. This metric allows algorithms to exploit excess capacity in the network, while avoiding narrow, long-haul paths. We give a two-approximate, polynomial-time algorithm for serial (left-deep) join schedules. We also present extensions to this algorithm that explore parallel schedules, reduce resource usage, and define trade-offs between computation and network utilization. We evaluate these techniques within the SkyQuery federation of Astronomy databases using spatial-join queries submitted by SkyQuery's users. Experiments show that our algorithms realize near-optimal network utilization with minor computational overhead.
As the data management field has diversified to consider settings in which queries are increasingly complex, statistics
are less available, or data is stored remotely, there has been an acknowledgment that the traditional
optimize-then-execute paradigm is insufficient. This has led to a plethora of new techniques, generally placed under the
common banner of adaptive query processing, that focus on using runtime feedback to modify query processing in a way
that provides better response time or more efficient CPU utilization.
In this survey paper, we identify many of the common issues, themes, and approaches that pervade this work, and the
settings in which each piece of work is most appropriate. Our goal with this paper is to be a "value-add" over the
existing papers on the material, providing not only a brief overview of each technique, but also a basic framework for
understanding the field of adaptive query processing in general. We focus primarily on intra-query adaptivity of
long-running, but not full-fledged streaming, queries. We conclude with a discussion of open research problems that are
of high importance.
Probabilistic databases have received considerable attention recently due to the need for storing uncertain data produced by many real world applications. The widespread use of probabilistic databases is hampered by two limitations: (1) current probabilistic databases make simplistic assumptions about the data (e.g., complete independence among tuples) that make it difficult to use them in applications that naturally produce correlated data, and (2) most probabilistic databases can only answer a restricted subset of the queries that can be expressed using traditional query languages. We address both these limitations by proposing a framework that can represent not only probabilistic tuples, but also correlations that may be present among them. Our proposed framework naturally lends itself to the possible world semantics thus preserving the precise query semantics extant in current probabilistic databases. We develop an efficient strategy for query evaluation over such probabilistic databases by casting the query processing problem as an "inference" problem in an appropriately constructed "probabilistic graphical model". We present several optimizations specific to probabilistic databases that enable efficient query evaluation. We validate our approach by presenting an experimental evaluation that illustrates the effectiveness of our techniques at answering various queries using real and synthetic datasets.
Real-world data --- especially when generated by distributed measurement infrastructures such as sensor networks ---
tends to be incomplete, imprecise, and erroneous, making it impossible to present it to users or feed it directly into
applications. The traditional approach to dealing with this problem is to first process the data using statistical or
probabilistic "models" that can provide more robust interpretations of the data. Current database systems, however, do
not provide adequate support for applying models to such data, especially when those models need to be frequently
updated as new data arrives in the system. Hence, most scientists and engineers, who depend on models for managing
their data, do not use database systems for archival or querying at all; at best, databases serve as a persistent raw
data store.
In this paper we define a new abstraction called "model-based views" and present the architecture of "MauveDB",
the system we are building to support such views. Just as traditional database views provide logical data independence,
model-based views provide independence from the details of the underlying data generating mechanism and hide the
irregularities of the data by using models to present a consistent view to the users. MauveDB supports a declarative
language for defining model-based views, allows declarative querying over such views using SQL, and supports several
different materialization strategies and techniques to efficiently maintain them in the face of frequent updates. We
have implemented a prototype system that currently supports views based on regression and interpolation, in the Apache
Derby open source DBMS, and we present results that show the utility and performance benefits that can be obtained by
supporting several different types of model-based views in a database system.
Sensor networks and other distributed information systems (such as the Web) must frequently access data that has a high per-attribute "acquisition" cost, in terms of energy, latency, or computational resources. When executing queries that contain several predicates over such expensive attributes, we observe that it can be beneficial to use correlations to automatically introduce low-cost attributes whose observation will allow the query processor to better estimate the selectivity of these expensive predicates. In particular, we show how to build "conditional plans" that branch into one or more sub-plans, each with a different ordering for the expensive query predicates, based on the runtime observation of low-cost attributes. We frame the problem of constructing the optimal conditional plan for a given user query and set of candidate low-cost attributes as an optimization problem. We describe an exponential time algorithm for finding such optimal plans, and describe a polynomial-time heuristic for identifying conditional plans that perform well in practice. We also show how to compactly model conditional probability distributions needed to identify correlations and build these plans. We evaluate our algorithms against several real-world sensor-network data sets, showing several-times performance increases for a variety of queries versus traditional optimization techniques.
Declarative queries are proving to be an attractive paradigm for interacting with networks of wireless sensors. The metaphor that "the sensornet is a database" is problematic, however, because sensors do not exhaustively represent the data in the real world. In order to map the raw sensor readings onto physical reality, a "model" of that reality is required to complement the readings. In this paper, we enrich interactive sensor querying with statistical modeling techniques. We demonstrate that such models can help provide answers that are both more meaningful, and, by introducing approximations with probabilistic confidences, significantly more efficient to compute in both time and energy. Utilizing the combination of a model and live data acquisition raises the challenging optimization problem of selecting the best sensor readings to acquire, balancing the increase in the confidence of our answer against the communication and data acquisition costs in the network. We describe an exponential time algorithm for finding the optimal solution to this optimization problem, and a polynomial-time heuristic for identifying solutions that perform well in practice. We evaluate our approach on several real-world sensor-network data sets, taking into account the real measured data and communication quality, demonstrating that our model-based approach provides a high-fidelity representation of the real phenomena and leads to significant performance gains versus traditional data acquisition techniques.
Webcams, microphones, pressure gauges and other sensors provide exciting new opportunities for querying and monitoring the physical world. In this paper we focus on querying wide area sensor databases, containing (XML) data derived from sensors spread over tens to thousands of miles. We present the first scalable system for executing XPATH queries on such databases. The system maintains the logical view of the data as a single XML document, while physically the data is fragmented across any number of host nodes. For scalability, sensor data is stored close to the sensors, but can be cached elsewhere as dictated by the queries. Our design enables self-starting distributed queries that jump directly to the lowest common ancestor of the query result, dramatically reducing query response times. We present a novel query-evaluategather technique (using XSLT) for detecting (1) which data in a local database fragment is part of the query result, and (2) how to gather the missing parts. We define partitioning and cache invariants that ensure that even partial matches on cached data are exploited and that correct answers are returned, despite our dynamic query-driven caching. Experimental results demonstrate that our techniques dramatically increase query throughputs and decrease query response times in wide area sensor databases.