

Similan: Finding Similar Records from Temporal Categorical Data

 


Similan is a temporal categorical data analysis tool that helps users find similar records in temporal categorical data. By implementing a similarity metric and adopting ideas from the rank-by-feature framework to rank records by similarity, Similan provides an interactive interface for customizing and visualizing similarity search results.

Project Description


Electronic Health Records (EHRs) are being collected by leading health organizations. These EHRs contain millions of records with patient histories. A practitioner may want to find records of patients with symptoms similar to those of a target patient in order to guide that patient's treatment. Finding similar patients among millions of records with patient histories is a challenging problem.

Similan is an interactive tool for finding similar records in temporal categorical data. Similan allows users to customize parameters in the similarity metric computation and provides visualization techniques to help users understand the search results. The goal of the project is to enable discovery and exploration of similar records in temporal categorical datasets. Although Similan was first motivated by EHRs, its applications are not limited to the medical domain.

Features

Similan adopts the idea of rank-by-feature from the Hierarchical Clustering Explorer (HCE). Ranking criteria are derived from the similarity metric. Whenever a target record is selected, the similarity metric is calculated for each record. The main panel then allows users to sort records according to these ranking criteria.
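The ranking step can be sketched as follows. This is a minimal illustration, not Similan's implementation: the record layout (a list of `(timestamp, category)` pairs), the function names, and the `category_overlap` criterion are all assumptions made for the example. In particular, the criterion is a simple Jaccard overlap standing in for the actual similarity metric, which this page does not define.

```python
from typing import Callable, Dict, List, Tuple

# Assumed layout: a record is a list of (timestamp, category) events.
Event = Tuple[float, str]
Record = List[Event]


def category_overlap(target: Record, other: Record) -> float:
    """Placeholder criterion (an assumption, not Similan's metric):
    Jaccard overlap of the event categories in the two records."""
    a = {cat for _, cat in target}
    b = {cat for _, cat in other}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rank_records(
    target: Record,
    records: Dict[str, Record],
    criterion: Callable[[Record, Record], float],
) -> List[Tuple[str, float]]:
    """Score every record against the target, then sort best-first."""
    scored = [(rid, criterion(target, rec)) for rid, rec in records.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical data for illustration only.
target = [(0, "Admit"), (2, "ICU"), (5, "Discharge")]
records = {
    "r1": [(0, "Admit"), (3, "ICU"), (6, "Discharge")],
    "r2": [(0, "Admit"), (1, "Discharge")],
    "r3": [(0, "Emergency")],
}
ranking = rank_records(target, records, category_overlap)
```

Passing the criterion as a function mirrors the rank-by-feature idea: the same scored list can be re-sorted by whichever criterion the user picks.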

In addition to displaying results as a list in the main panel, Similan also visualizes the results as a scatterplot in the plot panel. The plot panel shows an overview of the records as characterized by the target record and provides a draw-a-selection filtering mechanism.

The comparison panel is designed to show the similarities and differences between the selected record and the target record. Lines are drawn between pairs of events to show similarities and differences. Line style is used to show relevance.

Similan also adapts an idea from LifeLines2 by allowing users to align temporal categorical events by a sentinel category. Since there can be more than one candidate for the sentinel event in the sentinel category, the candidate event that produces the maximum score according to the TC similarity metric is selected.
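A rough sketch of this candidate selection is shown below. It rests on several assumptions: records are lists of `(timestamp, category)` pairs, the target record is already aligned so that its sentinel event sits at time 0, and `timing_score` is a hypothetical stand-in for the TC similarity metric, which this page does not specify. All names are illustrative.

```python
from typing import Callable, List, Optional, Tuple

# Assumed layout: a record is a list of (timestamp, category) events.
Event = Tuple[float, str]
Record = List[Event]


def shift(record: Record, origin: float) -> Record:
    """Re-express event times relative to the chosen origin."""
    return [(t - origin, cat) for t, cat in record]


def timing_score(target: Record, other: Record) -> float:
    """Stand-in score (an assumption, not the TC metric); higher is better.
    Same-category events are paired in time order and their time gaps are
    summed; each unpaired event adds a fixed penalty."""
    penalty = 10.0
    total = 0.0
    for cat in {c for _, c in target} | {c for _, c in other}:
        a = sorted(t for t, c in target if c == cat)
        b = sorted(t for t, c in other if c == cat)
        total += sum(abs(x - y) for x, y in zip(a, b))
        total += penalty * abs(len(a) - len(b))
    return -total


def align_by_sentinel(
    target_aligned: Record,
    record: Record,
    sentinel: str,
    score: Callable[[Record, Record], float],
) -> Optional[Tuple[Record, float]]:
    """Try each sentinel-category event as time 0 and keep the best one.
    Returns None when the record has no event in the sentinel category."""
    candidates = [t for t, cat in record if cat == sentinel]
    if not candidates:
        return None
    best = max(candidates, key=lambda t: score(target_aligned, shift(record, t)))
    aligned = shift(record, best)
    return aligned, score(target_aligned, aligned)


# Hypothetical data: the record has two "ICU" candidates (t=11 and t=12).
target_aligned = [(-2, "Admit"), (0, "ICU"), (3, "Discharge")]
record = [(10, "Admit"), (11, "ICU"), (12, "ICU"), (15, "Discharge")]
result = align_by_sentinel(target_aligned, record, "ICU", timing_score)
```

With this stand-in score, the second "ICU" event is chosen as the sentinel because aligning on it places the "Admit" and "Discharge" events exactly where they fall in the target.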

Participants

Related HCIL Pages

Hierarchical Clustering Explorer (HCE): A Rank-by-feature Framework for Interactive Exploration of Multidimensional Data
LifeLines2: Discovering Temporal Categorical Patterns Across Multiple Records