PhD Proposal: Delta-Oriented Storage and Querying for Versioned Datasets

Talk
Amit Chavan
Time: 
05.23.2016 10:00 to 11:30
Location: 

AVW 3258

Data-driven methods and products are becoming increasingly common in a variety of communities, including science, bioinformatics, education, economics, social and web analytics. Once people start doing data analysis, they need sustainable and scalable tools that allow them to spend little to no effort on maintaining, sharing, and tracking changes to datasets, and instead focus on scientific and knowledge discovery. An easy availability of datasets, both large and small, coupled with ad hoc nature of data analysis tasks often leads to a proliferation of many thousands or millions of versions of the same datasets acquired or constructed at various stages of the analysis pipeline. In this proposal, we address several challenges in designing and developing tools for dataset management and finding systematic answers to some of the fundamental problems in this space.
The first problem we consider is understanding the storage--recreation tradeoff. The large number of overlapping datasets makes it imperative to exploit the overlaps to compactly store those. However, the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions. We formulate several problems that permit a thorough and principled study of this tradeoff. We show that most of the problems are NP-Hard. We introduce a graph-based formulation of our problems, that allows us to adopt and modify algorithms from well-studied problems such as minimum spanning tree construction and delay-constrained scheduling. We provide two light-weight heuristics: one, where there is a constraint on the average recreation cost, and one where there is a constraint on the maximum recreation cost, and present an extensive experimental evaluation of these heuristics over several synthetic and real-world workloads demonstrating their effectiveness.
Data analysts also need richer retrieval and querying functionality, since there is often a need to cross-analyze the data in multiple files, possibly across multiple versions. Towards this goal, we study the problem of executing checkout, intersection, union and threshold queries inside a delta-oriented storage engine. At a high level, given thousands of versions of files, a delta-oriented storage engine stores a select few files completely (these are called materialized files), and stores others as a network of deltas, or changes, from these files. Many such engines of today require users to completely "check out" all files before manipulating them in any fashion. We introduce a novel framework that, to a large extent, takes advantage of the already computed deltas between the files during query processing. We present an extensive experimental evaluation on synthetic data with diverse characteristics, that shows that our methods perform exceedingly well compared to the "checkout followed by query execution'" approach.
Finally, we conclude the proposal with the future work we plan to pursue as a continuation of this research, which is divided into three parts: (a) devising declarative methods to specify complex queries over versioning and provenance information in a unified manner, (b) developing efficient evaluation techniques for the richer queries such as joins and aggregates, and (c) designing a distributed architecture for supporting such tasks.
Examining Committee:
Chair: Dr. Amol Deshpande
Dept. rep: Dr. Mihai Pop
Member: Dr. Samir Khuller