DeepDive: A Data Management System for Machine Learning Workloads

Talk
Ce Zhang
Stanford University
Time: 02.19.2016 11:00 to 12:00
Location: AVW 4172

Many pressing questions in science are macroscopic: they require scientists to consult information expressed in a wide range of resources, many of which are not organized in a structured relational form. Knowledge base construction (KBC) is the process of populating a knowledge base, i.e., a relational database storing factual information, from unstructured inputs. KBC holds the promise of facilitating a range of macroscopic sciences by making information accessible to scientists. One key challenge in building a high-quality KBC system is that developers must often deal with data that are both diverse in type and large in size. Further complicating the scenario is that these data need to be manipulated by both relational operations and state-of-the-art machine-learning techniques.

My research focuses on building a data management system for machine learning workloads, with the goal of easing the complex process of building KBC systems. The system, called DeepDive, ultimately aims to let scientists build KBC systems, and machine learning systems in general, by declaratively specifying domain knowledge, without worrying about algorithmic, performance, or scalability issues. DeepDive has been used by people without machine learning expertise in a number of domains, from paleobiology to genomics to anti-human trafficking. In this talk, I will describe the DeepDive framework, its applications, and the underlying techniques we developed to speed up a range of machine learning workloads by up to two orders of magnitude.