Managing and Querying Large-scale Uncertain Databases

Increasing numbers of real-world application domains are generating data that is inherently noisy, incomplete, and probabilistic in nature. Examples of such data include measurement data collected by sensor networks, observation data in the context of social networks and scientific and biological databases, and data collected by various online cyber-sources. The data uncertainties may be a result of the fundamental limitations of the underlying measurement infrastructures, the inherent ambiguity in the domain, or they may be a side-effect of the rich probabilistic modeling typically performed to extract high-level events from sensor and cyber data. Similarly, when attempting to integrate heterogeneous data sources ("data integration") or extracting structured information from text ("information extraction"), the results are approximate and uncertain at best. However, there is currently a lack of data management tools that can reason about large volumes of uncertain data, and hence the information about the uncertainty is often either discarded or reasoned about only superficially.

The goal of this project is to build a complete probabilistic data management system, called PrDB, that can manage, store, and process large-scale repositories of uncertain data. PrDB unifies ideas from "large-scale structured graphical models" like probabilistic relational models (PRMs), developed in the machine learning literature, and "probabilistic query processing", studied in the database literature. PrDB framework is based on the notion of "shared factors", which not only allows us to express and manipulate uncertainties at various levels of abstractions, but also supports capturing rich correlations among the uncertain data. PrDB supports a declarative SQL-like language for specifying uncertain data and the correlations among them. PrDB also supports exact and approximate evaluation of a wide range of queries including inference queries, SQL queries, and decision-support queries.

Project Participants

Publications

  • Approximation algorithms for stochastic submodular set cover with applications to Boolean function evaluation and min-knapsack;
    Amol Deshpande, Lisa Hellerstein, and Devorah Kletenik;
    ACM Transacations on Algorithms 12:3 (42), 2016.
  • Approximation Algorithms for Stochastic Boolean Function Evaluation and Stochastic Submodular Set Cover;
    Amol Deshpande, Lisa Hellerstein, and Devorah Kletenik;
    SODA 2014 (also CoRR Technical Report arXiv:1303.0726). [pdf] [abstract]
  • Local Structure and Determinism in Probabilistic Databases;
    Theodoros Rekatsinas, Amol Deshpande, Lise Getoor;
    SIGMOD 2012. [pdf] [abstract]
  • Maximizing Expected Utility for Stochastic Combinatorial Optimization Problems;
    Jian Li, and Amol Deshpande;
    FOCS 2011 (also CoRR Technical Report arXiv:1012.3189). [pdf] [abstract]
  • Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases;
    Bhargav Kanagal, Jian Li, Amol Deshpande;
    SIGMOD 2011. [pdf] [abstract]
  • A Unified Approach to Ranking in Probabilistic Databases;
    Jian Li and Barna Saha and Amol Deshpande;
    VLDB Journal, 20(2): 249-275, 2011. [pdf] [abstract]
  • Ranking Continuous Probabilistic Datasets;
    Jian Li, and Amol Deshpande;
    VLDB 2010. [pdf] [abstract]
  • Read-Once Functions and Query Evaluation in Probabilistic Databases;
    Prithviraj Sen, Amol Deshpande, and Lise Getoor;
    VLDB 2010. [pdf] [abstract]
  • Lineage Processing on Correlated Probabilistic Databases;
    Bhargav Kanagal, Amol Deshpande;
    SIGMOD 2010. [pdf] [abstract]
  • PrDB: Managing and Exploiting Rich Correlations in Probabilistic Databases;
    Prithviraj Sen, Amol Deshpande, and Lise Getoor;
    VLDB Journal Special Issue on Uncertain and Probabilistic Databases, 18(6): 1065-1090, 2009. [pdf] [abstract]
  • A Unified Approach to Ranking in Probabilistic Databases;
    Jian Li and Barna Saha and Amol Deshpande;
    VLDB 2009 (also CoRR Technical Report arXiv:0904.1366). [pdf] [abstract]
  • Bisimulation-based Approximate Lifted Inference;
    Prithviraj Sen, Amol Deshpande, and Lise Getoor;
    UAI 2009. [pdf] [abstract]
  • Consensus Answers for Queries over Probabilistic Databases;
    Jian Li and Amol Deshpande;
    PODS 2009 (also CoRR Technical Report arXiv:0812.2049v1). [pdf] [abstract]
  • Indexing Correlated Probabilistic Databases;
    Bhargav Kanagal, Amol Deshpande;
    SIGMOD 2009. [pdf] [abstract]
  • Efficient Query Evaluation over Temporally Correlated Probabilistic Streams;
    Bhargav Kanagal, Amol Deshpande;
    ICDE 2009 (short paper). [pdf] [abstract]
  • Graphical Models for Uncertain Data;
    Amol Deshpande, Lise Getoor, and Prithviraj Sen;
    Book Chapter. Managing and Mining Uncertain Data, ed. C. Aggarwal, Springer, 2009.. [pdf] [abstract]
  • Exploiting Shared Correlations in Probabilistic Databases;
    Prithviraj Sen, Amol Deshpande, and Lise Getoor;
    VLDB 2008. [pdf] [abstract]
  • Representing and Querying Correlated Tuples in Probabilistic Databases;
    Prithviraj Sen, Amol Deshpande;
    ICDE 2007. [pdf] [abstract]
  • Representing Tuple and Attribute Uncertainty in Probabilistic Databases;
    Prithviraj Sen, Amol Deshpande, and Lise Getoor;
    The 1st Workshop on Data Mining of Uncertain Data (DUNE 2007), in conjunction ICDM 2007.
  • MauveDB: Supporting Model-based User Views in Database Systems;
    Amol Deshpande, Sam Madden;
    SIGMOD 2006. [pdf] [abstract]
  • Using Probabilistic Models for Data Management in Acquisitional Environments;
    Amol Deshpande, Carlos Guestrin, Sam Madden;
    CIDR 2005. [pdf] [abstract]

Acknowledgments

This material is based upon work supported in part by the National Science Foundation under Grants 0546136, 0438866, and 0916736. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.