Human-Computer Interaction Lab / University of Maryland

Hierarchical Clustering Explorer 3.0

Abstract

Interactive exploration of multidimensional data sets is challenging because (1) it is difficult to comprehend patterns in more than three dimensions, and (2) current systems are often a patchwork of graphical and statistical methods, leaving many researchers uncertain about how to explore their data in an orderly manner. We offer a set of principles and a novel rank-by-feature framework that could enable users to better understand distributions in one (1D) or two dimensions (2D), and then discover relationships, clusters, gaps, outliers, and other features. Users of our framework can view graphical presentations (histograms, boxplots, and scatterplots), and then choose a feature detection criterion to rank 1D or 2D axis-parallel projections. By combining information visualization techniques (overview, coordination, and dynamic query) with summaries and statistical methods, users can systematically examine the most important 1D and 2D axis-parallel projections. We summarize our Graphics, Ranking, and Interaction for Discovery (GRID) principles as: (1) 1D, 2D, then features; (2) graphics, ranking, summaries, then statistics. We implemented the rank-by-feature framework in the Hierarchical Clustering Explorer, but the same data exploration principles could enable users to organize their discovery process so as to produce more thorough analyses and extract deeper insights in any multidimensional data application, such as spreadsheets, statistical packages, or information visualization tools.

GRID Principles for Exploratory Analysis of Multidimensional Data Sets

A playful analogy may help clarify our goals. Imagine you are dropped by parachute into an unfamiliar place – it could be a forest, prairie, or mountainous area.  You could set out in a random direction to see what is nearby and then decide where to turn next. Or you might go towards peaks or valleys. You might notice interesting rocks, turbulent streams, scented flowers, tall trees, attractive ferns, colorful birds, graceful impalas, and so on.  Wandering around might be greatly satisfying if you had no specific goals, but if you needed to survey the land to find your way to safety, catalog the plants to locate candidate pharmaceuticals, or develop a wildlife management strategy, you would need to be more systematic.  Of course, each profession that deals with the multi-faceted richness of natural landscapes has developed orderly strategies to guide novices, to ensure thorough analyses, to promote comprehensive and consistent reporting, and to facilitate cooperation among professionals. 

 Our principles for exploratory analysis of multidimensional data sets have similar goals. Instead of wandering, analysts should clarify their goals and use appropriate techniques to ensure a comprehensive analysis. A good starting point is the set of principles put forth by Moore and McCabe, who recommended that statistical tools should (1) enable users to examine each dimension first and then explore relationships among dimensions, and (2) offer graphical displays first and then provide numerical summaries.  We extend Moore and McCabe’s principles to include ranking the projections to guide discovery of desired features, and realize this idea with overviews to see the range of possibilities and coordination to see multiple presentations.  An orderly process of exploration is vital, even though there will inevitably be excursions, iterations, and shifts of attention from details to overviews and back.

Detecting interesting features in low dimensions (1D or 2D) by utilizing powerful human perceptual abilities is crucial to understanding the original multidimensional data set. Familiar graphical displays such as histograms, scatterplots, and other well-known 2D plots are effective at revealing features, including basic summary statistics and even unexpected features in the data set. There are also many algorithmic or statistical techniques that are especially effective in low-dimensional spaces. While many approaches have used such visual displays and low-dimensional techniques, most of them lack a systematic framework that organizes these functions to help analysts in their feature detection tasks.

Our Graphics, Ranking, and Interaction for Discovery (GRID) principles are designed to enable users to better understand distributions in one (1D) or two dimensions (2D), and then discover relationships, clusters, gaps, outliers, and other features. Users view graphical presentations (histograms, boxplots, and scatterplots), and then choose a feature detection criterion to rank 1D or 2D axis-parallel projections. By combining information visualization techniques (overview, coordination, and dynamic query) with summaries and statistical methods, users can systematically examine the most important 1D and 2D axis-parallel projections. We summarize the GRID principles as:

(1) study 1D, study 2D, then find features

(2) ranking guides insight, statistics confirm.

Rank-by-Feature Framework

Abiding by these principles, the rank-by-feature framework has an interface for 1D projections and a separate one for 2D projections.  Users can begin their exploration with the main graphical display - histograms for 1D and scatterplots for 2D - and they can also study numerical summaries for more detail.

The rank-by-feature framework helps users systematically examine low-dimensional (1D or 2D) projections to maximize the benefit of exploratory tools. In this framework, users select a ranking criterion of interest and then rank the low-dimensional projections of the multidimensional data set according to the strength of the selected feature in each projection. When there are many dimensions, the number of possible projections is too large to inspect one by one, in no particular order, in the hope of finding interesting features. The rank-by-feature framework relieves users of this burden by recommending projections in the order defined by the ranking criterion they selected. This framework has been implemented in our interactive visualization tool, HCE.
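The ranking step can be illustrated with a minimal Python sketch. It is not HCE's code: the function names, the toy data, and the choice of absolute Pearson correlation as the ranking criterion are all assumptions made for illustration. The sketch scores every axis-parallel 2D projection (every pair of dimensions) with a user-supplied criterion and returns the pairs from strongest to weakest.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant dimension has no linear relationship to score
    return cov / sqrt(var_x * var_y)


def rank_2d_projections(columns, criterion):
    """Score every pair of dimensions and sort the pairs by descending score."""
    scored = [
        ((a, b), criterion(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: three dimensions measured on four items.
columns = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 1, 3, 2]}

# Assumed criterion: strength of the linear relationship, ignoring its sign.
ranked = rank_2d_projections(columns, lambda xs, ys: abs(pearson(xs, ys)))
for (a, b), score in ranked:
    print(f"{a} vs {b}: {score:.2f}")
```

With n dimensions there are n(n-1)/2 unordered pairs of axes, so a data set with 17 attributes already gives 136 candidate scatterplots. Passing the criterion as a function keeps the ranking loop unchanged when a different feature is of interest.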

- Histogram Ordering

All 1D histograms are ordered according to the current ordering criterion (A) in the ordered list (C). The score overview (B) shows an overview of the scores of all histograms. A mouseover event activates a cell in the score overview, highlights the corresponding item in the ordered list (C), and simultaneously shows the corresponding histogram in the histogram browser (D). A click on a cell selects the cell, and the selection stays fixed until another click event occurs. A selected histogram is shown in the histogram browser (D), where users can easily traverse histogram space by changing the dimension for the histogram using the item slider. A boxplot is also displayed above the histogram to give a graphical summary of the distribution of the dimension. (Data shown is from a gene expression data set from a melanoma study: 3614 genes x 38 samples.)
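As a rough sketch of what lies behind the ordered list and the score overview, the Python fragment below computes one score per dimension and sorts the dimensions by that score. The criterion used here, the size of the biggest gap between neighboring sorted values, is a stand-in chosen for illustration; this page does not list HCE's criteria, and the function names and sample data are hypothetical.

```python
def biggest_gap(values):
    """Largest difference between neighboring values after sorting."""
    s = sorted(values)
    return max((b - a for a, b in zip(s, s[1:])), default=0.0)


def order_histograms(columns, criterion):
    """Return dimension names sorted by descending score, plus the scores.

    The name list plays the role of the ordered list; the score dictionary
    holds the values that a score overview would display.
    """
    scores = {name: criterion(values) for name, values in columns.items()}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered, scores


# Hypothetical toy data: three dimensions with four values each.
columns = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 10], "c": [5, 5, 5, 5]}
ordered, scores = order_histograms(columns, biggest_gap)
print(ordered)
```

Any other function that maps one dimension's values to a number could be passed in place of `biggest_gap` without changing `order_histograms`.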

More ranking criteria will be added, but the currently available criteria for 1D histogram ordering are:

- Scatterplot Ordering

All 2D scatterplots are ordered according to the current ordering criterion (A) in the ordered list (C). Users can select multiple scatterplots at the same time and open each in a separate scatterplot window to compare them on one screen. The score overview (B) shows an overview of the scores of all scatterplots. A mouseover event activates a cell in the score overview, highlights the corresponding item in the ordered list (C), and simultaneously shows the corresponding scatterplot in the scatterplot browser (D). A click on a cell selects the cell, and the selection stays fixed until another click event occurs. A selected scatterplot is shown in the scatterplot browser (D), where it is also easy to traverse scatterplot space by changing the X or Y axis using the item sliders on the horizontal or vertical axis. (Data shown is a set of demographic and health-related statistics for 3138 U.S. counties with 17 attributes.)

More ranking criteria will be added, but the currently available criteria for 2D scatterplot ordering are:

Signal/Noise Optimization with HCE 3.0 for Affymetrix GeneChip projects

In HCE 3.0 (HCE2W), we implemented a novel method for choosing the most appropriate probe set signal algorithm for your Affymetrix project, using unsupervised clustering and the F-measure.
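This page does not say how the F-measure is computed. The Python sketch below uses one common definition of a clustering F-measure, which is an assumption here and may differ from what HCE does: for each known class, take the best harmonic mean of precision and recall over all clusters, then average those best values weighted by class size. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def clustering_f_measure(classes, clusters):
    """Class-size-weighted best F-measure of a clustering against known classes."""
    if len(classes) != len(clusters):
        raise ValueError("classes and clusters must have the same length")
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    overlap = Counter(zip(classes, clusters))  # items shared by a class and a cluster

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            shared = overlap[(cls, clu)]
            if shared == 0:
                continue
            precision = shared / n_clu
            recall = shared / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


# Hypothetical example: four samples in two known groups.
known_groups = ["A", "A", "B", "B"]
perfect = clustering_f_measure(known_groups, [0, 0, 1, 1])
merged = clustering_f_measure(known_groups, [0, 0, 0, 0])
print(perfect, merged)
```

Under this definition, a clustering that reproduces the known groups scores 1.0, so the clustering results obtained from different probe set signal algorithms could be compared by their scores, with the highest score suggesting the algorithm that best separates the known groups.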

Other new features in HCE 3.0

New components: Histogram Ordering, Table View, ...

Improved interactions: Continuous zooming in the dendrogram view, Minimum Similarity Bar for Column Clustering Results, ...

More functionality: handling Excel files, clustering rows (or genes) is no longer mandatory, coordination with gene ontology (molecular function, biological process, cellular component), annotation with Affymetrix NetAffx Annotation Files, ...

Papers

For more information, please refer to the following papers.

Download

HCE is a standalone Windows® application that runs on a standard PC. It is freely downloadable for academic and/or research purposes. Commercial licenses can be negotiated with the UM Office of Technology Commercialization (Gayatri Varma, gayatri@umd.edu).

Register and Download HCE 3.0 (released on Dec. 29, 2004)

Register and Download HCE 3.0 test version (released on March 29, 2004) - outdated!

System requirements
Intel® Pentium® processor
Microsoft® Windows 2000®, Windows XP


Last updated 01/25/2005