DataHub: Enabling Collaborative Data Analytics

The rise of the Internet, smartphones, and wireless sensors has produced a vast trove of data about all aspects of our lives, from our social interactions to our personal preferences to our vital signs and medical records. Increasingly, "data science" teams want to collect, clean, structure, store, and collaboratively analyze these datasets in order to understand trends and extract actionable business or social insights. Unfortunately, while tools exist to support data analysis, much of the underlying infrastructure and data management capability is missing. Most systems focus either on performance or on supporting ever more sophisticated analyses, rather than on simplifying and automating the fundamental bookkeeping operations that are a prerequisite for data science: data cleaning and ingestion, data collaboration and versioning, and in-situ integration and search.

In this project, we are building tools that simplify collaborative data analytics. Specifically, we are building a hosted data platform, called DataHub, that hosts large numbers of datasets on behalf of different users and supports a range of functionality aimed at reducing the effort data scientists spend preparing, analyzing, and managing data. DataHub is being developed jointly with researchers at MIT and UIUC; https://datahub.csail.mit.edu/www/ serves as the canonical project website. Key features of DataHub include: (1) a flexible, source-code-control-like versioning system for data that efficiently branches, merges, and diffs datasets; (2) new data ingest, cleaning, and wrangling tools designed to automate the cleaning process as much as possible; and (3) the ability to search for "related" tables and integrate them into the analysis process. This webpage describes in more detail the research activities we are undertaking at the University of Maryland.
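To make the first key feature more concrete, the following is a minimal sketch of what branching, diffing, and merging a keyed table could look like. It is not DataHub's API or storage design: the class `ToyVersionStore`, its method names, and the sample rows are all hypothetical, and the sketch stores a full snapshot per version, whereas a real system would have to weigh storage cost against recreation cost.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Table = Dict[str, tuple]  # primary key -> row (assumed layout, for illustration)


@dataclass
class ToyVersionStore:
    """Hypothetical in-memory version store; every version is a full snapshot."""
    versions: Dict[int, Table] = field(default_factory=dict)
    parents: Dict[int, Tuple[int, ...]] = field(default_factory=dict)
    branches: Dict[str, int] = field(default_factory=dict)
    next_id: int = 0

    def commit(self, branch: str, table: Table,
               parents: Optional[Tuple[int, ...]] = None) -> int:
        """Record a new version and move the branch head to it."""
        if parents is None:
            head = self.branches.get(branch)
            parents = () if head is None else (head,)
        vid = self.next_id
        self.next_id += 1
        self.versions[vid] = dict(table)
        self.parents[vid] = tuple(parents)
        self.branches[branch] = vid
        return vid

    def branch(self, new_branch: str, from_branch: str) -> None:
        """Create a branch pointing at the head of an existing branch."""
        self.branches[new_branch] = self.branches[from_branch]

    def checkout(self, branch: str) -> Table:
        """Return a copy of the table at the head of a branch."""
        return dict(self.versions[self.branches[branch]])

    def diff(self, old: int, new: int) -> Dict[str, dict]:
        """Key-level difference between two versions."""
        a, b = self.versions[old], self.versions[new]
        return {
            "added": {k: b[k] for k in b.keys() - a.keys()},
            "removed": {k: a[k] for k in a.keys() - b.keys()},
            "changed": {k: (a[k], b[k])
                        for k in a.keys() & b.keys() if a[k] != b[k]},
        }

    def merge(self, target: str, source: str, base: int) -> int:
        """Three-way merge by key; a missing row is treated as None."""
        t_head, s_head = self.branches[target], self.branches[source]
        base_t = self.versions[base]
        t, s = self.versions[t_head], self.versions[s_head]
        merged: Table = {}
        for k in base_t.keys() | t.keys() | s.keys():
            b, x, y = base_t.get(k), t.get(k), s.get(k)
            if x == y or y == b:
                value = x          # same on both sides, or only target changed
            elif x == b:
                value = y          # only source changed
            else:
                raise ValueError(f"conflicting edits for key {k!r}")
            if value is not None:
                merged[k] = value
        return self.commit(target, merged, parents=(t_head, s_head))


store = ToyVersionStore()
v0 = store.commit("master", {"p1": ("Ann", 61), "p2": ("Bob", 72)})
store.branch("cleaning", "master")
v1 = store.commit("cleaning", {"p1": ("Ann", 61), "p2": ("Bob", 70)})
v2 = store.commit("master", {"p1": ("Ann", 61), "p2": ("Bob", 72),
                             "p3": ("Cy", 66)})
v3 = store.merge("master", "cleaning", base=v0)
```

In this toy run, the correction made on the `cleaning` branch and the row added on `master` both survive the merge, because each key was changed on only one side relative to the common base `v0`. If both branches had edited the same key differently, the sketch raises an error instead of choosing a winner.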

Project Participants

Publications (Most Recent First)

  • Decibel: The Relational Dataset Branching System;
    Michael Maddox, David Goehring, Aaron Elmore, Samuel Madden, Aditya Parameswaran, Amol Deshpande;
    VLDB 2016.
  • Towards a unified query language for provenance and versioning;
    Amit Chavan, Silu Huang, Amol Deshpande, Aaron J. Elmore, Samuel Madden, Aditya Parameswaran;
    USENIX TAPP 2015.
  • Collaborative Data Analytics with DataHub (Demonstration Proposal);
    Anant Bhardwaj, Amol Deshpande, Aaron Elmore, David Karger, Sam Madden, Aditya Parameswaran, Harihar Subramanyam, Eugene Wu, Rebecca Zhang;
    VLDB 2015.
  • Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff;
    Souvik Bhattacherjee, Amit Chavan, Silu Huang, Amol Deshpande, Aditya Parameswaran;
    PVLDB 2015.
  • DataHub: Collaborative data science and dataset version management at scale;
    Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, Aditya Parameswaran;
    CIDR 2015.

Acknowledgments

This material is based upon work supported in part by the National Science Foundation under Grants 1513972, 1513407, and 1513443. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.