When the Rubber Meets the Road: Data Science on Track

Talk
Lei Cao
Time: 03.03.2021 13:00 to 14:00

Many data scientists prefer high-level, end-to-end interfaces, such as SQL databases, to make sense of data, since these abstract away low-level, time-consuming engineering details. However, except for SQL databases, few tools for data scientists today offer such high-level interfaces. The goal of my research is to bridge this gap by developing systems and algorithms that automatically address low-level performance and scaling bottlenecks at every step of the data science pipeline, while still making it easy to incorporate domain-specific requirements. My talk will cover two systems we have built, an anomaly discovery system and a labeling system, which solve fundamental problems in both unsupervised and supervised machine learning.

First, AutoAD, the self-tuning component of our anomaly discovery system, aims to free data scientists from manually determining which of the many unsupervised anomaly detection techniques is best suited to a given task, and from tuning the parameters of each candidate method. This is particularly challenging in the unsupervised setting, where no labels are available for cross-validation. AutoAD solves this problem with a fundamentally new strategy that unifies the merits of unsupervised anomaly detection and supervised classification.

Second, our LANCET approach solves the labeling problem, a key bottleneck that limits the success of cutting-edge machine learning techniques in enterprise deployments. These techniques often require millions of labeled data objects to train a robust model. Because relying on humans to supply such a huge number of labels is rarely practical, automated methods for label generation are needed. Unfortunately, critical challenges in auto-labeling remain unsolved, including the following questions: (1) which objects to ask humans to label, (2) how to automatically propagate labels to other objects, and (3) when to stop labeling. LANCET addresses all three challenges in an integrated framework built on a solid theoretical foundation that characterizes the properties a labeled dataset must satisfy to train an effective prediction model.