Managing and Querying Large-scale Uncertain Databases

A growing number of real-world application domains generate data that is inherently noisy, incomplete, and probabilistic in nature. Examples include measurement data collected by sensor networks, observation data from social networks and from scientific and biological databases, and data collected from various online cyber-sources. The uncertainty may result from fundamental limitations of the underlying measurement infrastructure or from inherent ambiguity in the domain, or it may be a side effect of the rich probabilistic modeling typically performed to extract high-level events from sensor and cyber data. Similarly, when integrating heterogeneous data sources ("data integration") or extracting structured information from text ("information extraction"), the results are approximate and uncertain at best. However, there is currently a lack of data management tools that can reason about large volumes of uncertain data, and hence the information about the uncertainty is often either discarded or reasoned about only superficially.

The goal of this project is to build a complete probabilistic data management system, called PrDB, that can manage, store, and process large-scale repositories of uncertain data. PrDB unifies ideas from "large-scale structured graphical models" such as probabilistic relational models (PRMs), developed in the machine learning literature, and "probabilistic query processing", studied in the database literature. The PrDB framework is based on the notion of "shared factors", which not only allows us to express and manipulate uncertainties at various levels of abstraction, but also supports capturing rich correlations among the uncertain data. PrDB supports a declarative SQL-like language for specifying uncertain data and the correlations among them. PrDB also supports exact and approximate evaluation of a wide range of queries, including inference queries, SQL queries, and decision-support queries.
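To make the idea of "shared factors" concrete, the following is a minimal sketch in Python and not PrDB's actual interface. It assumes three hypothetical uncertain tuples (t1, t2, t3), each modeled as a binary existence variable, and two hypothetical factor tables (prior and agree) that are each defined once and reused over several sets of variables. The inference query is answered by brute-force enumeration, which is only feasible for toy inputs.

```python
from itertools import product

# Hypothetical shared factors: each table is defined once and reused.
# Keys are tuples of 0/1 values (0 = tuple absent, 1 = tuple present).
prior = {(0,): 0.4, (1,): 0.6}                              # per-tuple belief
agree = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}  # positive correlation


def joint(variables, factor_instances):
    """Return the normalized joint distribution over all 0/1 assignments.

    factor_instances is a list of (table, scope) pairs, where scope names
    the variables the shared table is applied to.
    """
    weights = {}
    for values in product([0, 1], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        w = 1.0
        for table, scope in factor_instances:
            w *= table[tuple(assignment[v] for v in scope)]
        weights[values] = w
    z = sum(weights.values())  # normalization constant
    return {values: w / z for values, w in weights.items()}


def marginal(variables, dist, var):
    """Probability that the tuple named var exists."""
    i = variables.index(var)
    return sum(p for values, p in dist.items() if values[i] == 1)


variables = ["t1", "t2", "t3"]

# The same two tables are shared across several scopes.
instances = [
    (prior, ("t1",)), (prior, ("t2",)), (prior, ("t3",)),
    (agree, ("t1", "t2")), (agree, ("t2", "t3")),
]

dist = joint(variables, instances)
p_t2 = marginal(variables, dist, "t2")

# Without the correlation factors the tuples are independent.
independent = joint(variables, instances[:3])
p_t2_independent = marginal(variables, independent, "t2")

print(p_t2, p_t2_independent)
```

In this sketch, adding the correlation factors changes the marginal of t2 relative to the independent case, which is the kind of effect a correlation-aware system has to account for. A real system would replace the enumeration with exact or approximate inference that scales beyond a handful of variables.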

Project Participants

Publications

Acknowledgments

This material is based upon work supported in part by the National Science Foundation under Grants 0546136, 0438866, and 0916736. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.