Declarative Abstractions and Scalable Platforms for Big Data Analytics
CSI 2117
For several decades now, the amount of data available to us has been growing far faster than our ability to process it. This trend, popularly referred to as "big data", has accelerated manyfold in recent years with the emergence of efficient, mass-produced scientific instruments, the increasing ease of generating and publishing data, and the proliferation of Internet-connected devices. The overarching goal of my research is to understand and address the limitations of existing data management formalisms and systems, especially for new types of data or applications, by designing intuitive, formal, and declarative abstractions, and by developing algorithms and systems to support those abstractions over large volumes of data.
In this talk, I will present an overview of two recent projects from my group. First, I will discuss our work on building a graph data management system that supports a wide range of queries and analytics over graph (network) data. This work is motivated by the growing realization that querying and reasoning about the structure of the interconnections between entities can lead to interesting and deep insights into a variety of phenomena. The application domains where graph or network analytics are regularly applied include social media, finance, communication networks, biological networks, the Internet, citation data, and many others. I will talk about some of the key data management challenges in supporting graph analytics over large volumes of data, and our work on addressing those challenges.
Second, I will discuss our initial work on building a platform for collaborative data science, where teams of data scientists want to simultaneously analyze, modify, and share datasets in order to understand trends and to extract actionable business, scientific, or social insights. While numerous solutions exist for specific data analysis tasks, the underlying infrastructure and data management capabilities needed to support the overall ad hoc collaboration pipeline are still missing. I will present our vision for a unified platform with the ability to load, store, query, collaboratively analyze, and share datasets, and discuss our recent work on managing and querying a large number of versions of datasets.