DataHub: Enabling Collaborative Data Analytics

The rise of the Internet, smartphones, and wireless sensors has resulted in a vast trove of data about all aspects of our lives, from our social interactions to our personal preferences to our vital signs and medical records. Increasingly, "data science" teams want to collect, clean, structure, store, and collaboratively analyze these datasets, to understand trends and to extract actionable business or social insights. Unfortunately, while there exist tools to support data analysis, much-needed underlying infrastructure and data management capabilities are missing. Distressingly, most systems focus either on performance or on supporting even more sophisticated analyses, instead of simplifying and automating the fundamental book-keeping operations that are a necessary prerequisite for data science, including data cleaning and ingestion, data collaboration and versioning, and in-situ integration and search. In this project, we are building tools for simplifying collaborative data analytics. Specifically, we are building a hosted data platform, called DataHub, that hosts large numbers of datasets on behalf of different users and supports a range of functionality aimed at reducing the effort data scientists spend preparing, analyzing, and managing data. DataHub is being developed jointly with researchers at MIT and UIUC; https://datahub.csail.mit.edu/www/ serves as the canonical project website. Key features of DataHub include: (1) a flexible, source code control-like versioning system for data that efficiently branches, merges, and differences datasets; (2) new data ingest, cleaning, and wrangling tools designed to automate the cleaning process as much as possible; and (3) the ability to search for "related" tables and integrate them into the analysis process. This webpage describes the research activities we are undertaking at the University of Maryland in more detail.

Project Participants

Faculty: Amol Deshpande, Sam Madden, Aditya Parameswaran
Students: Souvik Bhattacherjee, Amit Chavan, Hui Miao

Publications (Reverse Chronological Order)

Decibel: The Relational Dataset Branching System;
Michael Maddox, David Goehring, Aaron Elmore, Samuel Madden, Aditya Parameswaran, Amol Deshpande;
VLDB 2016.
As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This not only wastes storage, but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs.
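The idea of integrating modifications made by different users on separate branches can be illustrated with a small three-way merge over keyed rows. This is only a sketch under stated assumptions: tables are modeled as in-memory Python dictionaries from primary key to row, and the function name `three_way_merge` and the "keep ours on conflict" policy are hypothetical choices made for illustration. It does not reflect Decibel's storage engines or its actual merge semantics.

```python
from typing import Any, Dict, Set, Tuple

Row = Tuple[Any, ...]
Table = Dict[Any, Row]  # primary key -> row (assumed in-memory model)


def three_way_merge(base: Table, ours: Table, theirs: Table) -> Tuple[Table, Set[Any]]:
    """Merge two branches that diverged from `base`.

    Returns the merged table and the set of keys changed differently
    on both branches. A missing key is treated as a deleted row.
    """
    merged: Table = {}
    conflicts: Set[Any] = set()
    for key in base.keys() | ours.keys() | theirs.keys():
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            winner = o      # both branches agree (including both deleting)
        elif o == b:
            winner = t      # only "theirs" changed this row
        elif t == b:
            winner = o      # only "ours" changed this row
        else:
            conflicts.add(key)
            winner = o      # hypothetical policy: keep ours, flag for review
        if winner is not None:
            merged[key] = winner
    return merged, conflicts


base = {1: ("a", 1), 2: ("b", 2), 3: ("c", 3)}
ours = {1: ("a", 10), 2: ("b", 20), 3: ("c", 3)}    # edits rows 1 and 2
theirs = {1: ("a", 1), 2: ("b", 30), 4: ("d", 4)}   # edits row 2, deletes 3, adds 4

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged)
print(conflicts)
```

Row 2 is edited differently on both branches, so it is reported as a conflict rather than silently overwritten; the other changes merge cleanly.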
Towards a unified query language for provenance and versioning;
Amit Chavan, Silu Huang, Amol Deshpande, Aaron J. Elmore, Samuel Madden, Aditya Parameswaran;
USENIX TAPP 2015.
Organizations and teams collect and acquire data from various sources, such as social interactions, financial transactions, sensor data, and genome sequencers. Different teams in an organization as well as different data scientists within a team are interested in extracting a variety of insights which require combining and collaboratively analyzing datasets in diverse ways. DataHub is a system that aims to provide robust version control and provenance management for such a scenario. To be truly useful for collaborative data science, one also needs the ability to specify queries and analysis tasks over the versioning and the provenance information in a unified manner. In this paper, we present an initial design of our query language, called VQuel, that aims to support such unified querying over both types of information, as well as the intermediate and final results of analyses. We also discuss some of the key language design and implementation challenges moving forward.
Collaborative Data Analytics with DataHub (Demonstration Proposal);
Anant Bhardwaj, Amol Deshpande, Aaron Elmore, David Karger, Sam Madden, Aditya Parameswaran, Harihar Subramanyam, Eugene Wu, Rebecca Zhang;
VLDB 2015.
While many solutions have been proposed for storing and analyzing large volumes of data, all of them have limited support for collaborative data analytics, especially given that many individuals and teams are simultaneously analyzing, modifying, and exchanging datasets, employing a number of heterogeneous tools or languages for data analysis, and writing scripts to clean, preprocess, or query data. We demonstrate DataHub, a unified platform with the ability to load, store, query, collaboratively analyze, interactively visualize, interface with external applications, and share datasets. We will demonstrate the following aspects of the DataHub platform: (a) flexible data storage, sharing, and native versioning capabilities: multiple conference attendees can concurrently update the database, browse the different versions, and inspect conflicts; (b) an app ecosystem that hosts apps for various data-processing activities: conference attendees will be able to effortlessly ingest, query, and visualize data using our existing apps; (c) Thrift-based data serialization that permits data analysis in any combination of 20+ languages, with DataHub as the common data store: conference attendees will be able to analyze datasets in R, Python, and Matlab, while the inputs and the results are still stored in DataHub. In particular, conference attendees will be able to use the DataHub notebook, an IPython-based notebook for analyzing data and storing the results of data analysis.
Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff;
Souvik Bhattacherjee, Amit Chavan, Silu Huang, Amol Deshpande, Aditya Parameswaran;
PVLDB 2015.
The relative ease of collaborative data science and analysis has led to a proliferation of many thousands or millions of versions of the same datasets in many scientific and commercial domains, acquired or constructed at various stages of data analysis across many users, and often over long periods of time. Managing, storing, and recreating these dataset versions is a non-trivial task. The fundamental challenge here is the storage-recreation trade-off: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions. Despite the fundamental nature of this problem, there has been surprisingly little work on it. In this paper, we study this trade-off in a principled manner: we formulate six problems under various settings, trading off these quantities in various ways, demonstrate that most of the problems are intractable, and propose a suite of inexpensive heuristics, drawing on techniques from the delay-constrained scheduling and spanning tree literature, to solve these problems. We have built a prototype version management system that aims to serve as a foundation for our DataHub system for facilitating collaborative data science. We demonstrate, via extensive experiments, that our proposed heuristics provide efficient solutions in practical dataset versioning scenarios.
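The storage-recreation trade-off described above can be made concrete with a toy model. The sketch below is an illustration under stated assumptions, not one of the paper's six problem formulations or its heuristics: the version names, the sizes, and the helper names `storage_cost` and `recreation_cost` are all hypothetical, and recreation cost is simplified to the total size of the objects that must be read to rebuild a version. A storage plan maps each version either to `None` (stored in full) or to the parent version it is stored as a delta against.

```python
from typing import Dict, Optional, Tuple

# Hypothetical full sizes (arbitrary units) of four versions of one dataset.
FULL_SIZE: Dict[str, int] = {"v1": 100, "v2": 105, "v3": 110, "v4": 120}

# Hypothetical delta sizes: DELTA_SIZE[(a, b)] is the cost of storing b
# as a delta against a.
DELTA_SIZE: Dict[Tuple[str, str], int] = {
    ("v1", "v2"): 10,
    ("v2", "v3"): 8,
    ("v3", "v4"): 15,
}

# A plan maps each version to its delta parent, or None if stored in full.
Plan = Dict[str, Optional[str]]


def storage_cost(plan: Plan) -> int:
    """Total size of everything the plan keeps on disk."""
    total = 0
    for version, parent in plan.items():
        if parent is None:
            total += FULL_SIZE[version]
        else:
            total += DELTA_SIZE[(parent, version)]
    return total


def recreation_cost(plan: Plan, version: str) -> int:
    """Size read to rebuild `version`: its delta chain plus the full root."""
    cost = 0
    current = version
    while plan[current] is not None:
        parent = plan[current]
        cost += DELTA_SIZE[(parent, current)]
        current = parent
    return cost + FULL_SIZE[current]


materialize_all: Plan = {v: None for v in FULL_SIZE}
delta_chain: Plan = {"v1": None, "v2": "v1", "v3": "v2", "v4": "v3"}
hybrid: Plan = {"v1": None, "v2": "v1", "v3": None, "v4": "v3"}

for name, plan in [("all", materialize_all), ("chain", delta_chain), ("hybrid", hybrid)]:
    print(name, storage_cost(plan), recreation_cost(plan, "v4"))
```

In this toy setting, storing every version in full gives the cheapest recreation of `v4` but the largest footprint, a single delta chain gives the smallest footprint but the most expensive recreation, and materializing one extra version in the middle lands between the two.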
DataHub: Collaborative data science and dataset version management at scale;
Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, Aditya Parameswaran;
CIDR 2015.
Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale.
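As a rough illustration of what "differencing" two versions of a dataset might mean, the sketch below compares two versions of a table keyed by primary key and reports added, removed, and changed rows. The in-memory dictionary model and the function name `diff_tables` are assumptions made for this example; they are not DataHub's API, and a system operating at scale would not load whole versions into memory this way.

```python
from typing import Any, Dict, Tuple

Row = Tuple[Any, ...]
Table = Dict[Any, Row]  # primary key -> row (assumed in-memory model)


def diff_tables(old: Table, new: Table) -> Dict[str, Dict[Any, Any]]:
    """Return rows added, removed, and changed between two versions."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])          # (row before, row after)
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


version_1 = {1: ("alice", 30), 2: ("bob", 25)}
version_2 = {1: ("alice", 31), 3: ("carol", 40)}

print(diff_tables(version_1, version_2))
```

Keying the comparison on a primary key keeps the result independent of row order, which is one reason a record-level diff is a more natural fit for relational data than a line-based text diff.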

Acknowledgments

This material is based upon work supported in part by the National Science Foundation under Grants 1513972, 1513407, and 1513443. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.