- GrDB: A System for Declarative and Interactive Analysis of Noisy Information Networks;
Walaa Eldin Moustafa, Hui Miao, Amol Deshpande, Lise Getoor;
SIGMOD Demo 2013.
- HiNGE : Enabling Temporal Network Analytics at Scale;
Udayan Khurana, and Amol Deshpande;
SIGMOD Demo 2013.
- Algorithms for the Thermal Scheduling Problem;
Koyel Mukherjee, Samir Khuller, and Amol Deshpande;
IPDPS 2013.
[abstract]
The energy costs for cooling a data center constitute a significant portion of the overall running costs. Thermal imbalance and hot spots that arise due to
imbalanced workloads lead to significant wasted cooling effort -- in order to ensure that no equipment is operating above a certain temperature, the data center
may be cooled more than necessary. Therefore it is desirable to schedule the workload in a data center in a "thermally aware" manner, assigning jobs to machines
not just based on local load of the machines, but based on the overall thermal profile of the data center. This is challenging because of the spatial
cross-interference between machines, where a job assigned to a machine may impact not only that machine's temperature, but also nearby machines. Here, we
continue formal analysis of the "thermal scheduling" problem that we initiated recently~\cite{sig-abstract}. There we introduced the notion of "effective load of
a machine" which is a function of the local load on the machine as well as the load on nearby machines, and presented optimal scheduling policies for a simple
model (where cross-effects are restricted within a rack) under the assumption that jobs can be split among different machines. Here we consider the more
realistic problem of "integral" assignment of jobs, and allow for cross-interference among different machines in adjacent racks in the data center. The integral
assignment problem with cross-interference is NP-hard, even for a simple two machine model. We consider three different heat flow models, and give constant
factor approximation algorithms for maximizing the number (or total profit) of jobs assigned in each model, without violating thermal constraints. We also
consider the problem of minimizing the maximum temperature on any machine when all jobs need to be assigned, and give constant factor algorithms for this
problem.
- SWORD: Scalable Workload-Aware Data Placement for Transactional Workloads;
Abdul Quamar, K.Ashwin Kumar and Amol Deshpande;
EDBT 2013.
[pdf] [abstract]
In this paper, we address the problem of transparently scaling out transactional (OLTP) workloads on relational databases, to support "database-as-a-service"
in cloud computing environment. The primary challenges in supporting such workloads include choosing how to "partition" the data across a large number of
machines, minimizing the number of "distributed transactions", providing high data "availability", and tolerating "failures" gracefully. Capturing
and modeling the transactional workload over a period of time, and then exploiting that information for data placement and replication has been shown to provide
significant benefits in performance, both in terms of transaction latencies and overall throughput. However, such workload-aware data placement approaches can
incur very high overheads, and further, may perform worse than naive approaches if the workload changes.
In this work, we propose SWORD, a Scalable WORkload-aware Data partitioning and placement approach for OLTP workloads, that
incorporates a suite of novel techniques to significantly reduce the overheads incurred both during the initial placement, and during query execution at runtime.
We model the workload as a hypergraph over the data items, and propose using a "hypergraph compression" technique to reduce the overheads of partitioning. To
deal with workload changes, we propose an incremental data repartitioning technique that modifies data placement in small steps without resorting to complete
workload repartitioning. % and data migration. We have built a workload-aware "active replication" mechanism in SWORD to increase availability and enable
load balancing. We propose the use of "fine-grained quorums" defined at the level of "groups of tuples" to control the cost of distributed updates,
improve throughput, and provide adaptability to different workloads. To our knowledge, SWORD is the first system that uses fine-grained quorums in this
context. The results of our experimental evaluation on SWORD deployed on an Amazon EC2 cluster show that our techniques result in orders-of-magnitude
reductions in the partitioning and book-keeping overheads, and improve tolerance to failures and workload changes; we also show that choosing quorums based
on the query access patterns enables us to better handle query workloads with different read and write access patterns.
- Efficient Snapshot Retrieval over Historical Graph Data;
Udayan Khurana, and Amol Deshpande;
ICDE 2013 (also CoRR Technical Report arXiv:1207.5777).
[abstract]
We address the problem of managing historical data for large evolving information networks like
social networks or citation networks, with the goal to enable temporal and evolutionary queries and
analysis. We present the design and architecture of a distributed graph database system that stores
the entire history of a network and provides support for efficient retrieval of multiple graphs from
arbitrary time points in the past, in addition to maintaining the current state for ongoing updates.
Our system exposes a general programmatic API to process and analyze the retrieved snapshots. We
introduce DeltaGraph, a novel, extensible, highly tunable, and distributed hierarchical index
structure that enables compactly recording the historical information, and that supports efficient
retrieval of historical graph snapshots for single-site or parallel processing. Along with the
original graph data, DeltaGraph can also maintain and index auxiliary information; this
functionality can be used to extend the structure to efficiently execute queries like subgraph
pattern matching over historical data. We develop analytical models for both the storage space
needed and the snapshot retrieval times to aid in choosing the right parameters for a specific
scenario. In addition, we present strategies for materializing portions of the historical graph
state in memory to further speed up the retrieval process. Secondly, we present an in-memory graph
data structure called GraphPool that can maintain hundreds of historical graph instances in main
memory in a non-redundant manner. We present a comprehensive experimental evaluation that
illustrates the effectiveness of our proposed techniques at managing historical graph information.
- Managing Large Dynamic Graphs Efficiently;
Jayanta Mondal, Amol Deshpande;
SIGMOD 2012.
[pdf] [abstract]
There is an increasing need to ingest, manage, and query large volumes of
graph-structured data arising in applications like social networks,
communication networks, biological networks etc. Graph databases that can
explicitly reason about the graphical nature of the data, that can support
flexible schemas and node/edge-centric analysis and querying, are ideal for
storing such data. However, although there is much work on single-site graph
databases and on efficiently executing specific types of queries over large
graphs, to date there is little work on understanding the challenges in
distributed dynamic graph data management, needed to handle the large scale of
such data. In this paper, we propose the design of an in-memory, distributed
graph data management system aimed at managing a large dynamically changing
graph and supporting low-latency query processing over it. The key challenge in
a distributed graph database is that, partitioning a graph across a set of
machines inherently results in a large number of distributed traversals across
partitions to answer even simple queries. We propose a suite of novel techniques
to minimize the communication bandwidth and the storage requirements. We have
implemented our framework as a middle-ware on top of an open-source key-value
store. We evaluate our system on a social graph, and show that our system is
able to handle very large graphs efficiently, and that it reduces the network
bandwidth consumption significantly.
- Local Structure and Determinism in Probabilistic Databases;
Theodoros Rekatsinas, Amol Deshpande, Lise Getoor;
SIGMOD 2012.
[pdf] [abstract]
While extensive work has been done on evaluating queries under the
tuple-independence assumption, query evaluation over correlated data has
received much less attention even though the support for correlations is
essential for many natural applications of probabilistic databases (e.g.,
information extraction, data integration, computer vision, etc.). In this paper,
we develop a novel approach for efficiently evaluating probabilistic queries
over correlated databases where correlations are represented using a "factor
graph", a class of graphical models widely used for capturing correlations and
performing statistical inference. Our approach exploits the specific values of
the factor parameters and determinism in the correlations, collectively called
"local structure", to reduce the complexity of query evaluation. Our framework
is based on "arithmetic circuits", factorized representations of probability
distributions that can exploit such local structure. Traditionally, arithmetic
circuits are generated following a compilation process and can not be updated
directly. We introduce a generalization of arithmetic circuits, called
"annotated arithmetic circuits", and a novel algorithm for updating them, which
enables us to answer probabilistic queries efficiently. We present a
comprehensive experimental analysis and show speed-ups of at least one order of
magnitude in many cases.
- Saving on Cooling: The Thermal Scheduling Problem;
Koyel Mukherjee, Samir Khuller, and Amol Deshpande;
SigMetrics 2012 (Poster paper).
- Ego-centric Graph Pattern Census;
Walaa Eldin Moustafa, Amol Deshpande, Lise Getoor;
ICDE 2012.
[abstract]
There is increasing interest in analyzing networks of all types including social, biological,
sensor, computer, and transportation networks. Broadly speaking, we may be interested in
global network-wide analysis (e.g., centrality analysis, community detection) where the
properties of the entire network are of interest, or local ego-centric analysis where the
focus is on studying the properties of nodes (egos) by analyzing their neighborhood subgraphs.
In this paper we propose and study ego-centric pattern census queries, a new type of graph
analysis query, where a given structural pattern is searched in every node's neighborhood and
the counts are reported or used in further analysis. This kind of analysis is useful in many
domains in social network analysis including opinion leader identification, node
classification, link prediction, and role identification. We propose an SQL-based declarative
language to support this class of queries, and develop a series of efficient query evaluation
algorithms for it. We evaluate our algorithms on a variety of synthetically generated graphs.
We also show an application of our language in a real-world scenario for predicting future
collaborations from DBLP data.
- Maximizing Expected Utility for Stochastic Combinatorial Optimization Problems;
Jian Li, and Amol Deshpande;
FOCS 2011 (also CoRR Technical Report arXiv:1012.3189).
[abstract]
We study the stochastic versions of a broad class of combinatorial problems
where the weights of the elements in the input dataset are uncertain. The class
of problems that we study includes shortest paths, minimum weight spanning
trees, and minimum weight matchings over probabilistic graphs, and other
combinatorial problems like knapsack. We observe that the expected value is
inadequate in capturing different types of "risk-averse" or "risk-prone"
behaviors, and instead we consider a more general objective which is to maximize
the "expected utility" of the solution for some given utility function, rather
than the expected weight (expected weight becomes a special case). We show that
we can obtain a polynomial time approximation algorithm with "additive error"
$\epsilon$ for any $\epsilon>0$, if there is a pseudopolynomial time algorithm
for the "exact" version of the problem. (This is true for the problems mentioned
above). Our result generalizes several prior results on stochastic shortest
path, stochastic spanning tree, and stochastic knapsack. Our algorithm for
utility maximization makes use of the separability of exponential utility and a
technique to decompose a general utility function into exponential utility
functions, which may be useful in other stochastic optimization problems.
- Lightweight Graphical Models for Selectivity Estimation Without Independence Assumptions;
Kostas Tzoumas, Amol Deshpande, and Christian Jensen;
VLDB 2011.
[abstract]
As a result of decades of research and industrial development, modern query
optimizers are complex software artifacts. However, the quality of the query
plan chosen by an optimizer is largely determined by the quality of the
underlying statistical summaries. Small selectivity estimation errors,
propagated exponentially, can lead to severely sub-optimal plans. Modern
optimizers typically maintain one-dimensional statistical summaries and make the
attribute value independence and join uniformity assumptions for efficiently
estimating selectivities. Therefore, selectivity estimation errors in today's
optimizers are frequently caused by missed correlations between attributes. We
present a selectivity estimation approach that does not make the independence
assumptions. By carefully using concepts from the field of graphical models, we
are able to factor the joint probability distribution of all the attributes in
the database into small, usually two dimensional distributions. We describe
several optimizations that can make selectivity estimation highly efficient, and
we present a complete implementation inside PostgreSQL's query optimizer.
Experimental results indicate an order of magnitude better selectivity
estimates, while keeping optimization time in the range of tens of milliseconds.
- Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases;
Bhargav Kanagal, Jian Li, Amol Deshpande;
SIGMOD 2011.
[abstract]
Probabilistic database systems have successfully established themselves as a tool for
managing uncertain data. However, much of the research in this area has
focused on efficient query evaluation and has largely ignored two key
issues that commonly arise in uncertain data management:
First, how to provide "explanations" for query
results, e.g., ``Why is this tuple in my result?'' or ``Why does this
output tuple have such high probability?''. Second, the problem of
determining the "sensitive" input tuples for the given query, e.g., users
are interested to know the input tuples that can substantially alter the output,
when their probabilities are modified (since they may be unsure about the input
probability values).
Existing systems provide the "lineage/provenance" of each of the output tuples in
addition to the output probabilities, which is a boolean formula indicating the
dependence of the output tuple on the input tuples. However, lineage does not
immediately provide a quantitative relationship and it is not informative when
we have multiple output tuples.
In this paper, we propose a unified framework
that can handle both the issues mentioned above to facilitate robust query
processing. We formally define the notions of "influence" and "explanations" and provide algorithms to determine the top-$\ell$ influential set of
variables and the top-$\ell$ set of explanations for a variety of queries,
including "conjunctive" queries, "probabilistic threshold" queries,
"top-k" queries and "aggregation" queries.
Further, our framework naturally enables highly efficient incremental
evaluation when input probabilities are modified (e.g., if uncertainty is
resolved).
Our preliminary experimental results demonstrate the benefits of our framework
for performing robust query processing over probabilistic databases.
- Declarative Analysis of Noisy Information Networks;
Walaa Eldin Moustafa, Galileo Namata, Amol Deshpande, Lise Getoor;
ICDE Workshop on Graph Data Management: Techniques and Applications (GDM 2011).
[abstract]
There is a growing interest in methods for analyzing data describing networks of
all types, including information, biological, physical, and social networks.
Typically the data describing these networks is observational, and thus noisy
and incomplete; it is often at the wrong level of fidelity and abstraction for
meaningful data analysis. This has resulted in a growing body of work on
extracting, cleaning, and annotating network data. Unfortunately, much of this
work is ad hoc and domain-specific. In this paper, we present the architecture
of a data management system that enables efficient, declarative analysis of
large-scale information networks. We identify a set of primitives to support the
extraction and inference of a network from observational data, and describe a
framework that enables a network analyst to easily implement and combine new
extraction and analysis techniques, and efficiently apply them to large
observation networks. The key insight behind our approach is to "decouple", to
the extent possible, (a) the operations that require traversing the graph
structure (typically the computationally expensive step), from (b) the
operations that do the modification and update of the extracted network. We
present an analysis language based on "Datalog", and show how to use it to
cleanly achieve such decoupling. We briefly describe our prototype system that
supports these abstractions. We include a preliminary performance evaluation of
the system and show that our approach scales well and can efficiently handle a
wide spectrum of data cleaning operations on network data.
- Sharing-Aware Horizontal Partitioning for Exploiting Correlations during Query Processing;
Kostas Tzoumas, Amol Deshpande, and Christian Jensen;
VLDB 2010.
[pdf] [abstract]
Optimization of join queries based on average selectivities is sub-optimal in
highly correlated databases. In such databases, relations are naturally divided
into partitions, each partition having substantially different statistical
characteristics. It is very compelling to discover such data partitions during
query optimization and create multiple plans for a given query, one plan being
optimal for a particular combination of data partitions. This scenario calls for
the sharing of state among plans, so that common intermediate results are not
recomputed. We study this problem in a setting with a routing-based query
execution engine based on eddies. Eddies naturally encapsulate horizontal
partitioning and maximal state sharing across multiple plans. We define the
notion of a conditional join plan, a novel representation of the search space
that enables us to address the problem in a principled way. We present a
low-overhead greedy algorithm that uses statistical summaries based on graphical
models. Experimental results suggest an one order of magnitude faster execution
time over traditional optimization for high correlations, while maintaining the
same performance for low correlations.
- Read-Once Functions and Query Evaluation in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
VLDB 2010.
[pdf] [abstract]
Probabilistic databases hold promise of being a viable means for large-scale
uncertainty management, increasingly needed in a number of real world
applications domains. However, query evaluation in probabilistic databases
remains a computational challenge. Prior work on efficient "exact" query
evaluation in probabilistic databases has largely concentrated on query-centric
formulations (e.g., "safe plans", "hierarchical queries"), in that, they only
consider characteristics of the query and not the data in the database. It is
easy to construct examples where a supposedly hard query run on an appropriate
database gives rise to a tractable query evaluation problem. In this paper, we
develop efficient query evaluation techniques that leverage characteristics of
both the query and the data in the database. We focus on tuple-independent
databases where the query evaluation problem is equivalent to computing marginal
probabilities of Boolean formulas associated with the result tuples. This latter
task is easy if the Boolean formulas can be factorized into a form that has
every variable appearing at most once (called "read-once"). However, a naive
approach that directly uses previously developed Boolean formula factorization
algorithms is inefficient, because those algorithms require the input formulas
to be in the disjunctive normal form (DNF). We instead develop novel, more
efficient factorization algorithms that directly construct the read-once
expression for a result tuple Boolean formula (if one exists), for a large
subclass of queries (specifically, conjunctive queries without self-joins). We
empirically demonstrate that (1) our proposed techniques are orders of magnitude
faster than generic inference algorithms for queries where the result Boolean
formulas can be factorized into read-once expressions, and (2) for the special
case of hierarchical queries, they rival the efficiency of prior techniques
specifically designed to handle such queries.
- Lineage Processing on Correlated Probabilistic Databases;
Bhargav Kanagal, Amol Deshpande;
SIGMOD 2010.
[pdf] [abstract]
In this paper, we address the problem of scalably evaluating conjunctive queries
over correlated probabilistic databases containing tuple or attribute
uncertainties. Like previous work, we adopt a two-phase approach where we first
compute "lineages" of the output tuples, and then compute the probabilities of
the lineage formulas. However unlike previous work, we allow for arbitrary and
complex correlations to be present in the data, captured via a forest of
"junction trees". We observe that evaluating even read-once (tree structured)
lineages (e.g., those generated by "hierarchical" conjunctive queries),
polynomially computable over tuple independent probabilistic databases, is
\#P-complete for lightly correlated probabilistic databases like "Markov
sequences". We characterize the complexity of exact computation of the
probability of the lineage formula on a correlated database using a parameter
called "lwidth" (analogous to the notion of "treewidth"). For lineages that
result in low lwidth, we compute exact probabilities using a novel message
passing algorithm, and for lineages that induce large lwidths, we develop
approximate Monte Carlo algorithms to estimate the result probabilities. We
scale our algorithms to very large correlated probabilistic databases using the
previously proposed INDSEP data structure. To mitigate the complexity of lineage
evaluation, we develop optimization techniques to process a batch of lineages by
sharing computation across formulas, and to exploit any independence
relationships that may exist in the data. Our experimental study illustrates the
benefits of using our algorithms for processing lineage formulas over correlated
probabilistic databases.
- PrDB: Managing and Exploiting Rich Correlations in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
VLDB Journal Special Issue on Uncertain and Probabilistic Databases, 18(6): 1065-1090, 2009.
[pdf] [abstract]
Due to numerous applications producing noisy data, e.g., sensor data,
experimental data, data from uncurated sources, information extraction,
etc., there has been a surge of interest in the development of
probabilistic databases. Most probabilistic database models proposed to
date, however, fail to meet the challenges of real-world
applications on two counts: (1) they often restrict the kinds of
uncertainty that the user can represent; and (2) the query processing
algorithms often cannot scale up to the needs of the application. In
this work, we define a probabilistic database model, "PrDB", that
uses graphical models, a state-of-the-art probabilistic modeling
technique developed within the statistics and machine learning
community, to model uncertain data. We show how this results in a rich, complex
yet compact probabilistic database model, which can capture the commonly
occurring uncertainty models (tuple uncertainty, attribute uncertainty),
more complex models (correlated tuples and attributes) and allows
compact representation (shared and schema-level correlations). In addition, we show
how query evaluation in \PrDB\ translates into inference in an appropriately
augmented graphical model. This allows us to easily use any of a myriad of
exact and approximate inference algorithms developed within the
graphical modeling community. While probabilistic inference provides a generic approach
to solving queries, we show how the use of shared correlations, together
with a novel inference algorithm that we developed based on bisimulation,
can speed query processing significantly.
We present a comprehensive experimental evaluation of
the proposed techniques and show that even with a few shared
correlations, significant speedups are possible.
- A Unified Approach to Ranking in Probabilistic Databases;
Jian Li and Barna Saha and Amol Deshpande;
VLDB 2009 (also CoRR Technical Report arXiv:0904.1366).
[pdf] [talk] [abstract] Recipient of the best paper award.
The dramatic growth in the number of application domains that naturally generate
probabilistic, uncertain data has resulted in a need for efficiently supporting
complex querying and decision-making over such data. In this paper, we present a
unified approach to ranking and top-k query processing in probabilistic
databases by viewing it as a multi-criteria optimization problem, and by
deriving a set of features that capture the key properties of a probabilistic
dataset that dictate the ranked result. We contend that a single, specific
ranking function may not suffice for probabilistic databases, and we instead
propose two parameterized ranking functions, called PRF-w and PRF-e, that
generalize or can approximate many of the previously proposed ranking functions.
We present novel generating functions-based algorithms for efficiently ranking
large datasets according to these ranking functions, even if the datasets
exhibit complex correlations modeled using probabilistic and/xor trees or Markov
networks. We further propose that the parameters of the ranking function be
learned from user preferences, and we develop an approach to learn those
parameters. Finally, we present a comprehensive experimental study that
illustrates the effectiveness of our parameterized ranking functions, especially
PRF-e, at approximating other ranking functions and the scalability of our
proposed algorithms for exact or approximate ranking.
- Bisimulation-based Approximate Lifted Inference;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
UAI 2009.
[pdf] [abstract]
There has been a great deal of recent interest in methods for performing
lifted inference, however most of this work assumes that the first-order
model is given as input to the system. Here, we describe lifted inference
algorithms that determine symmetries and automatically "lift" the
probabilistic model to speedup inference. In particular, we describe
approximate lifted inference techniques that allow the user to trade
off inference accuracy for computational efficiency by using a
handful of tunable parameters, while keeping the error bounded. Our
algorithms are closely related to the graph-theoretic concept of
bisimulation. We report experiments on both synthetic and real data to
show that in the presence of symmetries, run-times for inference can
be improved significantly with approximate lifted inference providing
speedups of upto 2 orders of magnitude over ground inference.
- Graphical Models for Uncertain Data;
Amol Deshpande, Lise Getoor, and Prithviraj Sen;
Book Chapter. Managing and Mining Uncertain Data, ed. C. Aggarwal, Springer, 2009..
[pdf] [abstract]
Graphical models are a popular and well-studied framework for compact
representation of a joint probability distribution over a large number of
interdependent variables, and for efficient reasoning about such a distribution.
They have been proven useful in a wide range of domains from natural language
processing to computer vision to bioinformatics. In this chapter, we present an
approach to using graphical models for managing and querying large-scale
uncertain databases. We present a unified framework based on the concepts from
graphical models that can model not only tuple-level and attribute-level
uncertainties, but can also handle arbitrary correlations that may be present
among the data; our framework can also naturally capture "shared correlations"
where the same uncertainties and correlations occur repeatedly in the data. We
develop an efficient strategy for query evaluation over such probabilistic
databases by casting the query processing problem as an "inference" problem in
an appropriately constructed graphical model, and present optimizations specific
to probabilistic databases that enable efficient query evaluation. We conclude
the chapter with a discussion of related and future work on these topics.
- Predictive Modeling-based Data Collection in Sensor Networks;
Lidan Wang and Amol Deshpande;
EWSN 2008.
[pdf] [abstract] Selected as one of the best papers.
We address the problem of designing practical, energy-efficient protocols for
data collection in wireless sensor networks using predictive modeling. Prior
work has suggested several approaches to capture and exploit the rich spatio-temporal
correlations prevalent in WSNs during data collection.
Although shown to be effective in reducing the data collection cost, those
approaches use simplistic corelation models and further, ignore many
idiosyncrasies of WSNs, in particular the broadcast nature of communication.
Our proposed approach is based on approximating the joint probability distribution
over the sensors using "undirected graphical models", ideally suited to
exploit both the spatial correlations and the broadcast nature of communication.
We present algorithms for optimally using such a model for data collection under
different communication models, and for identifying an appropriate model to use for a given
sensor network. Experiments over synthetic and real-world datasets show that our
approach significantly reduces the data collection cost.
- Adaptive Query Processing;
Amol Deshpande, Zack Ives, Vijayshankar Raman;
Foundations and Trends in Databases, 1(1), 2007.
[abstract] (A "greener" version of the pdf.)
As the data management field has diversified to consider settings in which
queries are increasingly complex, statistics are less available, or data is
stored remotely, there has been an acknowledgment that the traditional
optimize-then-execute paradigm is insufficient. This has led to a plethora of
new techniques, generally placed under the common banner of adaptive query
processing, that focus on using runtime feedback to modify query processing in a
way that provides better response time or more efficient CPU
utilization.
In this survey paper, we identify many of the common
issues, themes, and approaches that pervade this work, and the settings in which
each piece of work is most appropriate. Our goal with this paper is to be a
"value-add" over the existing papers on the material, providing not only a brief
overview of each technique, but also a basic framework for understanding the
field of adaptive query processing in general. We focus primarily on intra-query
adaptivity of long-running, but not full-fledged streaming, queries. We conclude
with a discussion of open research problems that are of high importance.
- MauveDB: Supporting Model-based User Views in Database Systems;
Amol Deshpande, Sam Madden;
SIGMOD 2006.
[pdf] [talk] [abstract]
Real-world data --- especially when generated by distributed measurement
infrastructures such as sensor networks --- tends to be incomplete, imprecise,
and erroneous, making it impossible to present it to users or feed it directly
into applications. The traditional approach to dealing with this problem is to
first process the data using statistical or probabilistic "models" that can
provide more robust interpretations of the data. Current database systems,
however, do not provide adequate support for applying models to such data,
especially when those models need to be frequently updated as new data arrives
in the system. Hence, most scientists and engineers, who depend on models for
managing their data, do not use database systems for archival or querying at
all; at best, databases serve as a persistent raw data store.
In this
paper we define a new abstraction called "model-based views" and present the
architecture of "MauveDB", the system we are building to support such views.
Just as traditional database views provide logical data independence,
model-based views provide independence from the details of the underlying data
generating mechanism and hide the irregularities of the data by using models to
present a consistent view to the users. MauveDB supports a declarative language
for defining model-based views, allows declarative querying over such views
using SQL, and supports several different materialization strategies and
techniques to efficiently maintain them in the face of frequent updates. We have
implemented a prototype system that currently supports views based on regression
and interpolation, in the Apache Derby open source DBMS, and we present results
that show the utility and performance benefits that can be obtained by
supporting several different types of model-based views in a database system.
- Model-Driven Data Acquisition in Sensor Networks;
Amol Deshpande, Carlos Guestrin, Sam Madden, Joseph M. Hellerstein, Wei Hong;
VLDB 2004.
[pdf] [abstract] Recipient of the best paper award.
Declarative queries are proving to be an attractive paradigm for interacting
with networks of wireless sensors. The metaphor that "the sensornet is a
database" is problematic, however, because sensors do not exhaustively represent
the data in the real world. In order to map the raw sensor readings onto
physical reality, a "model" of that reality is required to complement the
readings. In this paper, we enrich interactive sensor querying with statistical
modeling techniques. We demonstrate that such models can help provide
answers that are both more meaningful, and, by introducing approximations
with probabilistic confidences,
significantly more efficient to compute in both time and energy. Utilizing the combination of
a model and live data acquisition raises the challenging optimization
problem of selecting the best sensor readings to acquire, balancing the increase
in the confidence of our answer against the communication and data acquisition costs in the network.
We describe an exponential
time algorithm for finding the optimal solution to this optimization problem, and a
polynomial-time heuristic for identifying solutions that
perform well in practice. We evaluate our approach on several
real-world sensor-network data sets, taking into account the real measured data and communication quality, demonstrating that our model-based approach
provides a high-fidelity representation of the real phenomena and leads to significant performance gains versus traditional data acquisition
techniques.