Large-Scale Array Data Management for Science Applications

Talk
Jennie Duggan
MIT CSAIL
Time: 04.03.2014 11:00 to 12:00
Location: AVW 4172

Science applications are becoming increasingly data-driven. Researchers are collecting new data at an unprecedented scale, and much of it is stored in multidimensional arrays. Workloads over this data consist of complex transformations, many of which query the data spatially. The established relational model of data management cannot support this new class of applications. At the same time, scientists are increasingly conducting their experiments on large, shared-nothing clusters in lieu of purpose-built platforms. As a result, processor time is becoming more plentiful and network bandwidth is becoming the scarcer resource.
In this talk, I will describe my research on efficiently distributing arrays for scientific workloads. This work is done in the context of SciDB, an open-source array database system built for applications with complex analytics. I will first present our optimization of data-intensive queries to minimize their use of network resources. Our approach uses integer programming to assign segments of a distributed query to individual database nodes. The second part of my talk will present research on data placement for elastic array databases. Our partitioning approach minimizes the time needed to reorganize the database when the hardware configuration changes, while optimizing the layout of multidimensional data structures for spatial queries.