Storage-Compression-Querying Tradeoffs in Dataset Versioning

Talk
Amol Deshpande
Time: 10.08.2021 11:00 to 12:00
Location: IRB 0318

Also on Zoom - https://umd.zoom.us/j/96718034173?pwd=clNJRks5SzNUcGVxYmxkcVJGNDB4dz09

Our ability to collect data continues to grow at an exponential rate; combine this with the abundance of local compute and storage capacities, increasingly decentralized teams of data analysts, and the almost-innate fear of ever deleting anything, and the result is a proliferation of many thousands or millions of versions of almost-similar datasets in most enterprises. This not only leads to increased storage and network costs, but also quickly grows unmanageable due to the difficulty in maintaining sufficient context like dataset provenance. Data compression is typically not sufficient by itself to address these challenges, in part because we often need to retrieve or query specific datasets or portions thereof, and in part because the data is usually stored in distributed cloud-based (semi-)structured data management systems. In this talk, I will discuss our work over the last decade on systematically understanding the storage/retrieval/query tradeoffs in this context, and describe how different use cases, computing environments, and data types lead to different solutions. I will also discuss how we can enable new types of introspective analyses of data evolution and data processing pipelines, and future research directions.