PhD Proposal: A Visual Analytics Approach to Comparing Cohorts of Event Sequences

Talk
Sana Malik
Time: 12.03.2014 13:00 to 14:30
Location: AVW 3450

Sequences of timestamped events are being generated across nearly every domain of data analytics, from e-commerce web logging used by business analysts to electronic health records used by doctors and medical researchers. Every day, this data type is reviewed by humans who apply statistical tests, hoping to learn everything they can about how these processes work, why they break, and how they can be improved. To understand why these processes work the way they do, analysts often compare two groups, or cohorts, of event sequences to find differences and similarities between outcomes and processes. This task is complex with temporal event sequence data because single events and sequences of events can differ between the two cohorts of records in many ways: in the structure of the event sequences (e.g., event order, co-occurring events, or frequencies of events), in the attributes of the events and records (e.g., gender of a patient), or in the metrics on the timestamps themselves (e.g., duration of an event).
Current tools for comparing groups of event sequences emphasize either a purely visual or purely statistical approach. Visual analytics tools leverage humans' abilities to see unexpected patterns and anomalies, but do not offer ways to substantiate findings. Statistical tools emphasize finding significant differences in the data, but often require analysts to have a concrete question and don't facilitate more general exploration of the data.
Often, these two approaches are taken in sequence using separate tools and sometimes by separate people. I propose an approach that combines visual analytics with statistics to amplify the benefits of both types of tools, thereby enabling analysts to conduct dramatically quicker and easier data exploration, hypothesis generation, and insight discovery.
Combining statistics and visual analytics presents considerable challenges on the front end (e.g., presenting the large result set concisely, providing interactions for parsing results) and on the back end (e.g., scalability of running multiple metrics on multi-dimensional data at once). I begin by describing a taxonomy of metrics for comparing cohorts of temporal event sequences, which covers all aspects of event sequence differences including structure, event and record attributes, and time. I will implement these metrics as part of a visual analytics framework using existing statistical and machine learning techniques, and introduce novel methods for progressive visual analytics to run these metrics efficiently. I will develop a family of visualizations and interaction techniques that facilitate understanding large amounts of uncertain data, specifically when dealing with event sequences. The visualizations will highlight key differences between the datasets and guide users towards the most meaningful results, while interaction techniques will allow users to methodically sort, annotate, and filter based on specific questions. Lastly, I will demonstrate the utility and impact of these methods with a series of multi-dimensional long-term case studies.
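As a rough illustration of two of the metric families named above (structure and time), the following Python sketch compares two tiny cohorts. The record layout, the event names, and the helper functions `prevalence`, `mean_gap`, and `two_proportion_p` are assumptions made for this example and are not part of the proposed framework; the pooled two-proportion z-test merely stands in for whichever statistical tests the framework would actually apply.

```python
import math
from statistics import mean

# Assumed record layout: each record is a list of (event_type, time_in_hours)
# pairs sorted by time. The event names used below are hypothetical.

def prevalence(cohort, event):
    """Structure metric: fraction of records containing `event` at least once."""
    hits = sum(1 for rec in cohort if any(e == event for e, _ in rec))
    return hits / len(cohort)

def mean_gap(cohort, first, second):
    """Time metric: mean hours from the first `first` event to the next `second`."""
    gaps = []
    for rec in cohort:
        start = next((t for e, t in rec if e == first), None)
        if start is None:
            continue  # record never contains the starting event
        end = next((t for e, t in rec if e == second and t >= start), None)
        if end is not None:
            gaps.append(end - start)
    return mean(gaps) if gaps else None

def two_proportion_p(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (normal approximation)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both cohorts are all-or-nothing, so no difference to test
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Two toy cohorts of hypothetical patient records.
cohort_a = [
    [("ER", 0), ("ICU", 2)],
    [("ER", 0), ("ICU", 4)],
    [("ER", 0), ("Floor", 1)],
]
cohort_b = [
    [("ER", 0), ("Floor", 1)],
    [("ER", 0), ("ICU", 6)],
]

if __name__ == "__main__":
    print("ICU prevalence:", prevalence(cohort_a, "ICU"), prevalence(cohort_b, "ICU"))
    print("ER to ICU gap:", mean_gap(cohort_a, "ER", "ICU"), mean_gap(cohort_b, "ER", "ICU"))
    print("p-value:", two_proportion_p(2, len(cohort_a), 1, len(cohort_b)))
```

With cohorts this small the normal approximation is unreliable; the example only shows the shape of a metric-plus-test pipeline, in which each metric yields one comparable number per cohort and a test indicates whether the difference is worth an analyst's attention.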
My dissertation will contribute an understanding of how cohorts of temporal event sequences are commonly compared and the difficulties associated with applying and parsing the results of these metrics. It will also contribute a set of visualizations, algorithms, and design guidelines for balancing automated statistics with user-driven analysis to guide users to significant, distinguishing features between cohorts. This work will open avenues for future research in comparing two or more groups of temporal event sequences, and the principles can be extended to new data types.
Examining Committee:
Committee Chair: Dr. Ben Shneiderman
Dept's Representative: Dr. Alan Sussman
Committee Member(s): Dr. Catherine Plaisant, Dr. Hector Corrada-Bravo