

Similan: Finding Similar Records from Temporal Categorical Data


---- SEE OUR NEWER WORK on EventFlow ----
Similan only handles point event data (i.e. with a single timestamp) while EventFlow also handles interval data, and provides advanced query capabilities - including specifying the absence of events.

Similan is an interactive data analysis tool that helps users find records similar to a target record in temporal categorical data. Similan implements a customizable similarity measure that computes how similar or dissimilar two records are. Its user interface lets users select a record from the database as the target or create a custom target record, and customize the search parameters. When users perform a search, a similarity score against the target record is computed for every record. Inspired by the rank-by-feature framework, Similan ranks the records by their similarity scores; a higher score represents higher similarity to the target record. The results are visualized on the screen with additional filters that allow users to explore them.

Project Description

The initial goal of this project is to enable the discovery and exploration of similar records in temporal categorical datasets. An increasing number of temporal categorical databases are being collected by various institutions. Electronic Health Records (EHRs), traffic incident logs in transportation systems and student records in academic institutions are three examples. EHRs are being collected by leading health organizations and contain millions of records with patient histories. Challenges arise when a practitioner would like to find records of patients with symptoms similar to those of a target patient in order to guide that patient's treatment. Finding similar patients among millions of such records is a challenging problem.

The main challenge is how to define "similarity". We are designing a customizable similarity measure that is flexible enough to capture different definitions of similarity according to users' needs and that users can customize in their own ways.

In addition, we built Similan, an interactive tool for finding similar records in temporal categorical data. Similan employs the similarity measure, allows users to specify the target and customize the parameters of the similarity measure, and provides visualizations that help users understand and explore the search results.

The first version of Similan only allowed users to select an existing record from the database as the target. We extended Similan into a query-by-example tool that also allows users to draw an example of what they are looking for, instead of just selecting from existing records. Query-by-example is more flexible than query-by-filters because it allows uncertainty. In query-by-filters, users need fairly detailed knowledge of the record they are looking for in order to formulate their queries. Queries that are too specific or wrong can return an empty set of answers, frustrating casual users. In contrast, our approach displays all the results, ranked (or sorted) by similarity.

Although this project was first motivated by EHRs, applications of Similan and the similarity measure are not limited to the medical domain. Moreover, the similarity measure may be applied in other ways.

Similan Features

Align by Sentinel Category

Similan also adapts an idea from LifeLines2 by allowing users to align temporal categorical events by a sentinel category. Time is recomputed using the sentinel event as the reference point: time before the sentinel event becomes negative and time after the sentinel event becomes positive. Since there can be more than one candidate for the sentinel event in the sentinel category (e.g. Patient 0000006 has two radiology contrasts), all possible alignments are grouped together and the alignment with the maximum similarity score is selected for ranking.
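The alignment step can be illustrated with a minimal Python sketch. It is not Similan's actual implementation: the (category, time) event representation, the function names, and the toy scoring function are assumptions made for illustration, and the real similarity measure is passed in as an arbitrary function.

```python
from typing import Callable, List, Optional, Tuple

# Assumed representation: an event is (category, time), e.g. time in hours.
Event = Tuple[str, float]


def align_by_sentinel(events: List[Event], sentinel_category: str) -> List[List[Event]]:
    """Return one re-timed copy of the record per candidate sentinel event."""
    alignments = []
    for category, sentinel_time in events:
        if category == sentinel_category:
            # Times before the sentinel become negative, times after become positive.
            alignments.append([(c, t - sentinel_time) for c, t in events])
    return alignments


def best_alignment(
    events: List[Event],
    sentinel_category: str,
    score: Callable[[List[Event]], float],
) -> Optional[List[Event]]:
    """Pick the alignment with the maximum score (None if no sentinel exists)."""
    candidates = align_by_sentinel(events, sentinel_category)
    if not candidates:
        return None
    return max(candidates, key=score)


# Hypothetical record with two candidate sentinel events.
record = [
    ("Admission", 0.0),
    ("Radiology contrast", 2.0),
    ("ICU", 5.0),
    ("Radiology contrast", 6.0),
]


def toy_score(aligned: List[Event]) -> float:
    """Stand-in for the similarity measure: prefers ICU near 3 hours after the sentinel."""
    icu_time = next(t for c, t in aligned if c == "ICU")
    return -abs(icu_time - 3.0)


best = best_alignment(record, "Radiology contrast", toy_score)
```

With this toy score, the alignment on the first radiology contrast is selected, because it places the ICU event exactly three hours after the sentinel.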

Select target from database or create custom target

Users can select any record from the database as a target record by dragging and dropping it into the target panel, or create a custom record by placing events on the timeline.

Customize search parameters

Users can select a range of interest (red box) to specify the search range. In this example, the user selected the period from the admission time until three days after admission. Users can also select which categories they want to include in the search.
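As a rough illustration of these two search parameters, the sketch below restricts a record to a time range and a set of categories before any comparison takes place. The event representation (category, days relative to admission), the function name, and the sample categories are assumptions for this example and are not taken from Similan itself.

```python
from typing import List, Set, Tuple

# Assumed representation: (category, days relative to admission).
Event = Tuple[str, float]


def restrict_to_search_range(
    events: List[Event], start: float, end: float, categories: Set[str]
) -> List[Event]:
    """Keep only events in the selected categories that fall inside [start, end]."""
    return [(c, t) for c, t in events if c in categories and start <= t <= end]


# Hypothetical record.
record = [
    ("Admission", 0.0),
    ("Emergency", 0.5),
    ("ICU", 2.0),
    ("Floor", 4.0),
    ("Exit", 7.0),
]

# From admission until three days after admission, with three categories included.
selected = restrict_to_search_range(record, 0.0, 3.0, {"Admission", "ICU", "Floor"})
```

Here the "Emergency" event is dropped because its category is not included, and the "Floor" event is dropped because it falls outside the three-day range.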

Rank-by-Similarity

Similan is inspired by the idea of rank-by-feature from the Hierarchical Clustering Explorer (HCE). In Similan, the ranking criterion is the total similarity score from the similarity measure. A higher score represents higher similarity to the target record.

After the user clicks Search, the records are sorted by their similarity scores, with the highest-scoring records at the top. Each record now shows its total score and a score indicator, a rectangle with four sections of different colors (as shown in the figure below), inspired by ValueCharts, a visualization that supports decision-makers in inspecting linear models. The length of a score indicator represents the total score. It is divided into four colored parts, which represent the four decision criteria. The length of each part corresponds to the product of the criterion's weight and score. Placing the cursor over the score indicator brings up an explanatory tooltip.
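The relationship between the weights, the per-criterion scores, and the ranking can be sketched as follows. This is a simplified illustration, not the published measure: it assumes each criterion already has a score between 0 and 1, uses the four criterion abbreviations from the "Adjust weights" section (AT, AM, AE, AS), and assumes that equal weighting means 0.25 per criterion. The function names and sample scores are hypothetical.

```python
from typing import Dict, List, Tuple

CRITERIA = ("AT", "AM", "AE", "AS")

# Assumption: equal default weights that sum to 1.
EQUAL_WEIGHTS = {c: 0.25 for c in CRITERIA}


def indicator_segments(scores: Dict[str, float], weights: Dict[str, float]) -> Dict[str, float]:
    """Length of each colored part: the product of weight and score."""
    return {c: weights[c] * scores[c] for c in CRITERIA}


def total_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Length of the whole indicator: the sum of its four parts."""
    return sum(indicator_segments(scores, weights).values())


def rank_by_similarity(
    records: List[Tuple[str, Dict[str, float]]], weights: Dict[str, float]
) -> List[str]:
    """Sort record ids by total score, highest first."""
    ordered = sorted(records, key=lambda r: total_score(r[1], weights), reverse=True)
    return [record_id for record_id, _ in ordered]


# Hypothetical per-criterion scores for three records.
records = [
    ("A", {"AT": 1.0, "AM": 0.5, "AE": 0.5, "AS": 1.0}),
    ("B", {"AT": 0.5, "AM": 0.5, "AE": 0.5, "AS": 0.5}),
    ("C", {"AT": 1.0, "AM": 1.0, "AE": 1.0, "AS": 1.0}),
]

ranking = rank_by_similarity(records, EQUAL_WEIGHTS)
```

Because the total is a weighted sum, changing a weight rescales the corresponding part of every indicator and can change the order of the ranking.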

Adjust weights & Apply filters

Adjust weights: (left) Similan allows users to adjust the importance of the four decision criteria in the similarity measure: avoid time difference (AT), avoid missing events (AM), avoid extra events (AE) and avoid swapping (AS). Similan also provides users with some weight presets. By default, all the decision criteria are equally weighted (as shown in the figure below).

Apply filters: (right) Sometimes users want to apply certain rules to the dataset. We therefore allow users to specify filtering rules, so that they can benefit from both the precision of rules and the flexibility of similarity search. We developed a prototype of this idea by adding filters based on the number of occurrences to Similan.
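A filter on the number of occurrences could look like the sketch below. The rule format, which maps a category to an inclusive (minimum, maximum) count, is a hypothetical choice for this example, as are the function name and the sample data; the source does not describe how Similan represents its filters.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed representation: an event is (category, time).
Event = Tuple[str, float]

# Hypothetical rule format: category -> inclusive (minimum, maximum) count.
Rules = Dict[str, Tuple[int, int]]


def passes_filters(events: List[Event], rules: Rules) -> bool:
    """True if every rule's category occurs an allowed number of times."""
    counts = Counter(category for category, _ in events)
    # Counter returns 0 for categories that never occur in the record.
    return all(low <= counts[cat] <= high for cat, (low, high) in rules.items())


# Hypothetical records keyed by record id.
records = {
    "p1": [("ICU", 1.0), ("ICU", 3.0)],
    "p2": [("Admission", 0.0), ("ICU", 2.0)],
    "p3": [("Floor", 1.0)],
}

# Keep only records with exactly one ICU event.
rules = {"ICU": (1, 1)}
kept = [record_id for record_id, events in records.items() if passes_filters(events, rules)]
```

Records that pass such a filter would then be ranked by similarity as usual, which combines an exact rule with a flexible search.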

Show comparison

The comparison panel is designed to show the similarities and differences between the selected record and the target record. Lines are drawn between pairs of events to show how they match or differ. Events are separated by category. The events in the target record are displayed above, while the events in the compared record are displayed below.

Video Demonstration

Title Screen Description Available Formats



Summary
This video introduces Similan and demonstrates how users can use it to find patients who are similar to a selected patient in Electronic Health Records.

It also shows how users can define a custom target to query for patients who have specified medical events.

Length: 5 minutes 46 seconds

Shockwave Flash (.swf)
720x540
View in browser

Flash Video (.flv)
720x540
Download (25.7MB)

Apple Video File (.m4v)
720x540
Download (17.9MB)




Summary
This video introduces Similan, a query-by-example interface for event sequences, and demonstrates how it can be used to query for a particular event sequence. At the end, the video also demonstrates how to use LifeLines2, a query-by-filters interface, to perform the same task.

Length: 5 minutes 10 seconds

Shockwave Flash (.swf)
720x540
View in browser

Flash Video (.flv)
720x540
Download (43.6MB)

Apple Video File (.m4v)
720x540
Download (17.5MB)

Participants

Sponsors

This project is supported by the Washington Hospital Center and the National Institutes of Health (NIH).

Publications

Main paper on Similan
Wongsuphasawat, K. and Shneiderman, B.
Finding Comparable Temporal Categorical Records: A Similarity Measure with an Interactive Visualization
in Proceedings of IEEE Symposium on Visual Analytics Science and Technology (IEEE VAST), 2009.

Experiment comparing LifeLines2 and Similan search interfaces
Wongsuphasawat, K., Plaisant, C. and Shneiderman, B.
Querying Timestamped Event Sequences by Exact Search or Similarity-based Search: Design and Empirical Evaluation
Technical Report, University of Maryland, 2010. (Under review for journal publication)
Revised version appears in Interacting with Computers (2012).

Survey of related techniques

Rind, A., Aigner, W., Miksch, S., Wang, T.D., Wongsuphasawat, K., Plaisant, C., and Shneiderman, B.
Interactive Information Visualization for exploring and querying electronic health records: A systematic review
Technical Report, University of Maryland, 2010. (Work in progress)

Short paper for local workshop
Wang, T.D., Wongsuphasawat, K., Plaisant, C., and Shneiderman, B.
Exploratory search over temporal event sequences: Novel requirements, operations, and a process model
in Proceedings of the 3rd Workshop on Human-Computer Information Retrieval, pages 102-105, 2009.

Related Projects from HCIL

LifeLines2: Discovering Temporal Categorical Patterns Across Multiple Records

Summary of HCIL Projects in Temporal Visualizations:   LifeLines, LifeLines2, PatternFinder, etc.

Hierarchical Clustering Explorer (HCE):  A Rank-by-feature Framework for Interactive Exploration of Multidimensional Data

Related Workshops from HCIL

Personal Medical Devices Workshop: Increasing Patient Healthcare Participation (June 3, 2004)

Interactive Visual Exploration of Electronic Health Records (May 30, 2008)