Because many applications produce noisy data (e.g., sensor data,
experimental data, data from uncurated sources, and the output of information
extraction), there has been a surge of interest in the development of
probabilistic databases. Most probabilistic database models proposed to
date, however, fail to meet the challenges of real-world
applications on two counts: (1) they often restrict the kinds of
uncertainty that the user can represent; and (2) the query processing
algorithms often cannot scale up to the needs of the application. In
this work, we define a probabilistic database model, "PrDB", that
uses graphical models, a state-of-the-art probabilistic modeling
technique developed within the statistics and machine learning
community, to model uncertain data. We show how this results in a rich, complex,
yet compact probabilistic database model that captures the commonly
occurring uncertainty models (tuple uncertainty, attribute uncertainty) as well as
more complex models (correlated tuples and attributes), and that allows
compact representation (shared and schema-level correlations). In addition, we show
how query evaluation in PrDB translates into inference in an appropriately
augmented graphical model. This allows us to easily use any of a myriad of
exact and approximate inference algorithms developed within the
graphical modeling community. While probabilistic inference provides a generic approach
to solving queries, we show how the use of shared correlations, together
with a novel inference algorithm that we developed based on bisimulation,
can speed up query processing significantly.
We present a comprehensive experimental evaluation of
the proposed techniques and show that even with a few shared
correlations, significant speedups are possible.
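
To make the idea of query evaluation as inference concrete, the following is a minimal sketch and not the PrDB implementation. It assumes two hypothetical uncertain tuples, s1 and t1, whose existence is governed by two made-up factors (a prior on s1 and a pairwise correlation factor), and it answers a query by brute-force enumeration of possible worlds rather than by the bisimulation-based algorithm mentioned above. All names and numbers are illustrative assumptions.

```python
from itertools import product

# Hypothetical factors over Boolean tuple-existence variables.
# Their product defines an (unnormalized) joint distribution.
def f_s1(s1):
    # Prior factor: tuple s1 exists with weight 0.6.
    return 0.6 if s1 else 0.4

def f_s1_t1(s1, t1):
    # Correlation factor: s1 and t1 tend to exist (or not) together.
    return 0.9 if s1 == t1 else 0.1

def joint_weight(s1, t1):
    return f_s1(s1) * f_s1_t1(s1, t1)

def query_probability(query):
    """Probability that query(s1, t1) holds, summed over all possible worlds."""
    total = 0.0
    satisfied = 0.0
    for s1, t1 in product([False, True], repeat=2):
        w = joint_weight(s1, t1)
        total += w
        if query(s1, t1):
            satisfied += w
    return satisfied / total  # normalize by the partition function

# Query: does a join of s1 and t1 produce a result (both tuples exist)?
p_join = query_probability(lambda s1, t1: s1 and t1)

# Marginals, to compare against what an independence assumption would give.
p_s1 = query_probability(lambda s1, t1: s1)
p_t1 = query_probability(lambda s1, t1: t1)
p_independent = p_s1 * p_t1

print(f"P(join result) with correlation: {p_join:.3f}")
print(f"P(join result) if independent:   {p_independent:.3f}")
```

With these assumed factors the correlated answer is noticeably higher than the product of the marginals, which is why a model restricted to independent tuples would misreport the result. Enumeration is exponential in the number of variables, so it only serves to illustrate the semantics.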