PhD Proposal: Supporting Independent Learning and Rapid Experimentation with Data Science Recommendation Engine

Deepthi Raghunandan
12.10.2021 12:00 to 14:00


Data science is the practice of discovering knowledge from data and facilitating decision-making with that knowledge. Knowledge derived from this practice must be provable and reproducible by a community of experts and non-experts. The practice of data science involves three main steps: data wrangling, sensemaking, and data interpretation. Data wrangling refers to collecting, consolidating, and cleaning data. Sensemaking is "the process of searching for a representation and encoding data in that representation to answer task-specific questions" (Russell 1993). Sensemaking is performed iteratively in a sensemaking loop (Pirolli 2005). Each sensemaking iteration refines and builds on previous insights, ultimately enabling the analyst to address less specialized audiences. The final step, data interpretation, involves providing context for, validating, and modeling the knowledge. Data interpretation is often collaborative, involving other data scientists or stakeholders. When knowledge is actionable, interpretations can facilitate decision-making by the team. In combination, these steps make up a data science workflow.

To successfully practice data science, scientists must have access to tools that help them iterate and communicate. For data science programmers, computational notebooks are the most popular platforms for developing the data science workflow. Notebooks enable iteration and communication because they are, most notably, interactive and literate development environments. Interactive development environments enable users to "manage" the state of their program by dictating the lines of code they wish to execute and the order in which to execute them. Each execution provides users with feedback on the program's state, which they use to evaluate their next steps. This iterative interactivity parallels how data scientists "make sense" of their data within the sensemaking loop, and it enables them to track and evaluate their iterations within that loop. Notebooks are literate because they encapsulate code, execution results, visualizations, and insights in one document. Literate environments enable authors to use all the components of their data science workflow to form a computational narrative: a storytelling device to communicate and reproduce their results.

The popularity of computational notebooks and, in turn, the need to teach real-world practices have driven the creation of notebook-based data science tutorials. Tutorials built using notebooks enable the audience to discover and explore new material. Their multi-functional interfaces can be particularly beneficial for data science, where learners must marry data science concepts with programming techniques to derive insights. However, while templates and tutorials remain static, best practices, libraries, and versions evolve. Keeping up with these trends is becoming increasingly complex, especially for fledgling data scientists. A data science recommendation system that uses current, real-world examples embedded directly into the computational notebook interface can overcome these limitations. To this end, we present Lodestar: an interactive computational notebook sandbox that allows users to quickly explore and construct new data science workflows by selecting from a list of analysis recommendations.

Lodestar derives recommendations from directed graphs (workflows) of known analysis steps, with two input sources: one manually curated from online data science tutorials and another extracted through semi-automatic analysis of a corpus of Jupyter notebooks. Using this Jupyter notebook corpus, we develop, leverage, and validate methods to identify how data scientists construct data science workflows within a computational notebook in real life.
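As a rough illustration of deriving recommendations from a directed graph of analysis steps, the sketch below counts transitions between consecutive steps in a few example workflows and suggests the most frequent successors of the current step. This is not Lodestar's implementation: the step labels, the function names `build_step_graph` and `recommend_next`, and the frequency-based ranking are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_step_graph(workflows):
    """Build a directed graph whose edge weights count how often one
    analysis step is immediately followed by another."""
    graph = defaultdict(Counter)
    for steps in workflows:
        for current, following in zip(steps, steps[1:]):
            graph[current][following] += 1
    return graph


def recommend_next(graph, current_step, k=3):
    """Return up to k successors of current_step, most frequent first.

    Frequency ranking is an assumed heuristic, not the proposal's method.
    """
    if current_step not in graph:
        return []
    return [step for step, _ in graph[current_step].most_common(k)]


# Hypothetical workflows, e.g. as labeled from tutorials or notebooks.
workflows = [
    ["load_data", "drop_missing", "plot_histogram", "fit_model"],
    ["load_data", "drop_missing", "fit_model"],
    ["load_data", "plot_histogram", "fit_model"],
]

graph = build_step_graph(workflows)
print(recommend_next(graph, "load_data"))
```

Here `load_data` is followed by `drop_missing` twice and by `plot_histogram` once, so `drop_missing` is ranked first. A step with no recorded successors, or one absent from the graph, yields an empty list.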
We use these and related findings to develop a novel design for a mixed-initiative recommendation system on the computational notebook sandbox interface. To do this, we identify and label analysis steps, develop and test a recommendation engine, iteratively develop and evaluate the user interface, and qualitatively evaluate the system to ensure that it meets the needs of fledgling data scientists.

Examining Committee:

Chair: Dr. Niklas Elmqvist
Department Representative: Dr. David Jacobs
Members: Dr. Leilani Battle