Human-Computer Interaction Lab / University of Maryland

Hierarchical Clustering Explorer 1.0

Abstract

Hierarchical clustering is widely used to find patterns in multi-dimensional datasets, especially for genomic microarray data.  Finding groups of genes with similar expression patterns can lead to better understanding of the functions of genes.  Early software tools produced only printed results, while newer ones enabled some online exploration.  We developed four general techniques that could be used in interactive explorations of clustering results.

The current version of HCE can be downloaded from this page.

Hierarchical Clustering and Dendrogram

The hierarchical agglomerative clustering algorithm can be summarized as follows. Assume that we want to cluster m data points, and that we have m(m-1)/2 similarity values, one for each possible pair of the m data points.

  1. Initially, each data point occupies a cluster by itself. So there are m clusters at the beginning.
  2. Find one pair of clusters whose similarity value is the highest, and make the pair a new cluster.
  3. Update the similarity values between the new cluster and the remaining clusters. 
  4. Repeat steps 2 and 3 a total of m-1 times, until only one cluster of size m remains.

There are many possible ways to update the similarity values in step 3. The most common are (1) complete-link, (2) average-link, and (3) single-link. Complete-link sets the similarity between the new cluster and each remaining cluster to the minimum of the similarities between the members of the new cluster and that remaining cluster. Average-link uses the average of those similarities as the new value. Single-link takes the maximum. A good explanation of hierarchical agglomerative clustering can be found at http://www.analytictech.com/networks/hiclus.htm.
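
The four steps and the three update rules can be sketched in a few lines of Python. This is only an illustration, not HCE's implementation: the function name agglomerative and the merge-list format are made up here, cluster similarities are recomputed from all member pairs rather than updated incrementally, and average-link is assumed to mean the average over all pairs of members.

```python
from itertools import combinations

def agglomerative(sim, linkage="average"):
    """Cluster m points given a symmetric m x m similarity matrix `sim`.

    Returns the merges in order as (cluster_a, cluster_b, similarity),
    where each cluster is a tuple of original point indices.
    """
    m = len(sim)
    clusters = [(i,) for i in range(m)]  # step 1: one cluster per point

    def cluster_sim(a, b):
        # Similarities between every member of a and every member of b.
        values = [sim[i][j] for i in a for j in b]
        if linkage == "complete":
            return min(values)  # least similar pair decides
        if linkage == "single":
            return max(values)  # most similar pair decides
        if linkage == "average":
            return sum(values) / len(values)
        raise ValueError(f"unknown linkage: {linkage}")

    merges = []
    while len(clusters) > 1:
        # Step 2: pick the pair of clusters with the highest similarity.
        a, b = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        s = cluster_sim(a, b)
        # Step 3: the pair becomes one new cluster; similarities to the
        # remaining clusters follow from cluster_sim on the next pass.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, s))
    return merges  # step 4: exactly m-1 merges
```

For m points this returns m-1 merges. The last merge joins the final two clusters, and its similarity value is where the three rules differ most visibly.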

Hierarchical clustering results are usually represented as dendrograms. A dendrogram is a binary tree in which each data point corresponds to a terminal node, and the distance from the root to the point where two subtrees join indicates their similarity: highly similar nodes or subtrees have joining points that are farther from the root.


A dendrogram for a part of Yeast cDNA microarray data set.

Overview in a Limited Screen Space

Overviews are important because they enable researchers to identify hot spots and understand the distribution of data. However, screen space is a significant limitation when visualizing large data sets on commonly used displays, which are 1600 pixels wide. For data sets larger than 1600 points, the corresponding dendrogram (and color mosaic) does not fit on a single screen even when each item is limited to a single pixel. To accommodate large data sets, HCE provides a compressed overview that replaces leaves with the average values of adjacent leaves. This view shows the entire hierarchy on one screen, at the cost of some lost detail at the leaves. Detailed information about a selected cluster (yellow highlight at upper left) is shown below the overview, together with the gene names and a second dendrogram (at lower right) produced by clustering the 38 samples (conditions).


A compressed overview for melanoma gene expression profile data (3614 genes, 38 samples).
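
As a rough illustration of the compression idea, the sketch below shrinks one row of leaf values to a fixed number of pixels by averaging runs of adjacent leaves. The function name compress_columns and the even-sized grouping are assumptions made for this example; how HCE actually chooses which leaves to merge is not described here.

```python
def compress_columns(values, max_width):
    """Shrink a row of leaf values to at most `max_width` entries by
    averaging runs of adjacent leaves."""
    n = len(values)
    if n <= max_width:
        return list(values)  # already fits, nothing to compress
    compressed = []
    for k in range(max_width):
        # Integer arithmetic spreads the n leaves evenly over the pixels.
        start = k * n // max_width
        end = (k + 1) * n // max_width
        group = values[start:end]
        compressed.append(sum(group) / len(group))
    return compressed
```

Applied to each of the 38 sample rows of a 3614-gene data set with max_width=1600, every pixel column would stand for two or three adjacent genes.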

Another possible overview allocates two pixels per item. This overview requires scrolling to view all items. Users can adjust the level of detail shown in the overview by moving the Bar Width slider (marked by an orange ellipse) to change the item widths from 2 to 10 pixels.


Another overview with two pixels per gene. Melanoma gene expression profile data (3614 genes, 38 samples).

Histogram control and adjustment of color mapping

A histogram of the log ratio values of the dataset is presented at the upper right. Users can change the color mapping for the color mosaic display by adjusting the range of the color stripe displayed over the histogram. Users can instantly see the result of using an adjusted color mapping on the mosaic display. By shifting the mapping of colors to expression levels, the adjusted display provides a clearer depiction of the differences between samples.


Histogram control and the initial color mapping.


Histogram control and the adjusted color mapping.
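
The effect of narrowing the color range can be illustrated with a small mapping function. This is a sketch under assumptions: the name value_to_rgb, the green-black-red scale, and the linear interpolation are chosen for the example and are not taken from HCE.

```python
def value_to_rgb(value, low, high):
    """Map a log-ratio value to an (r, g, b) tuple.

    Values at or below `low` are full green, values at or above `high`
    are full red, and the midpoint of the range is black.
    """
    if low >= high:
        raise ValueError("low must be smaller than high")
    mid = (low + high) / 2
    half = (high - low) / 2
    # Position within the range, clamped to [-1, 1].
    t = max(-1.0, min(1.0, (value - mid) / half))
    if t >= 0:
        return (round(255 * t), 0, 0)
    return (0, round(255 * -t), 0)
```

Narrowing the range from (-2, 2) to (-1, 1) makes a value of 1 fully saturated instead of half-bright, which is why an adjusted mapping can bring out differences that the initial mapping hides.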

Highlighting a cluster

Each cluster is easily identified by the alternating colored lines (blue and red, just below the Minimum Similarity Bar) and the one-pixel white gaps placed between clusters. Users can select a cluster by clicking on it, which highlights it with a yellow rectangle. The corresponding gene names are also highlighted in the detailed color mosaic, together with the other dendrogram produced by clustering the data in the transposed dimension (on the lower left side).

Dynamic Query Controls - minimum similarity bar

HCE provides a dynamic query on the dendrogram in the form of a filtering bar whose y coordinate determines the minimum similarity value. As users pull the minimum similarity bar down, the mosaic display splits into two, three, four, or more groups. As the bar moves further down, items that are distant from a cluster center are removed from the mosaic display, but users can still see the overall dendrogram structure. As more and more items are removed, the tighter clusters can be seen more easily. Users' understanding of the domain guides them in determining how far to go and how many clusters to examine.

 
Using the Minimum Similarity Bar. In this example, the minimum similarity value was changed from 0.13 to 0.89, separating 2 large clusters into 8 small clusters.
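
Splitting the display at a given bar position amounts to cutting the dendrogram at a similarity threshold. The sketch below assumes a hypothetical merge list of (members_a, members_b, similarity) tuples and keeps only the merges at or above the threshold. It shows the grouping step only; the removal of items that end up far from a cluster center is not modeled.

```python
def clusters_at(merges, n_items, min_similarity):
    """Return the groups of item indices that are still joined when only
    merges with similarity >= `min_similarity` are kept."""
    parent = list(range(n_items))  # union-find forest, one tree per group

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for members_a, members_b, similarity in merges:
        if similarity >= min_similarity:
            # Joining one representative from each side joins the clusters.
            parent[find(members_a[0])] = find(members_b[0])

    groups = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Raising min_similarity can only split groups, never join them, which matches the behavior of the display as the bar is pulled down.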

Dynamic Query Controls - detail cutoff bar

Users can adjust the level of detail by dragging the Detail Cutoff Bar up. All subtrees below the bar are rendered using the average of the leaf node values belonging to the subtree. This makes it possible to concentrate on more global structures.


Using the Detail Cutoff Bar.
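
One way to picture the averaging is with a toy dendrogram of nested tuples. In the sketch below, a leaf is a single number and an inner node is a hypothetical (similarity, left, right) tuple. A subtree is treated as lying below the bar when its joining similarity is at or above the cutoff, which is an assumption about how the bar position maps to similarity.

```python
def leaves(node):
    """Collect the leaf values of a subtree, left to right."""
    if not isinstance(node, tuple):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def render(node, cutoff):
    """Return one display value per leaf, replacing every subtree joined
    at similarity >= `cutoff` by the average of its leaf values."""
    if not isinstance(node, tuple):
        return [node]
    similarity, left, right = node
    if similarity >= cutoff:
        values = leaves(node)
        average = sum(values) / len(values)
        return [average] * len(values)  # whole subtree drawn in one shade
    return render(left, cutoff) + render(right, cutoff)
```

Lowering the cutoff (dragging the bar up toward the root) flattens larger and larger subtrees, leaving only the global structure visible.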

Coordinated Displays

Users can select a group of items by sweeping out a rectangular area on the scattergram.  The selected items will be highlighted with orange triangles in the scattergram and the related items will be simultaneously highlighted just below the overview color mosaic, also with orange triangles.


Two-dimensional scattergram and the coordination with other displays. 
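
The selection itself is a simple range test, sketched below with a hypothetical function name. The point is that the result is a list of item indices rather than screen positions, so the same indices can drive the highlighting in every coordinated view.

```python
def select_in_rectangle(points, x_min, x_max, y_min, y_max):
    """Return the indices of the (x, y) points inside the swept rectangle,
    borders included."""
    return [
        i
        for i, (x, y) in enumerate(points)
        if x_min <= x <= x_max and y_min <= y <= y_max
    ]
```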

Verification of Clustering - Cluster Comparisons

Users can see how each gene maps between two different clustering results by double-clicking a specific cluster. The selected cluster is highlighted in yellow, and a line is drawn from each item in that cluster to its position in the second clustering result.


Cluster Comparisons. Average-linkage vs. Shneiderman's 1-by-1 linkage
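
The lines drawn between the two results can be thought of as pairs of leaf positions. The sketch below assumes each clustering result is available as a left-to-right list of item names; the function name and data layout are illustrative only.

```python
def comparison_lines(cluster_items, order_a, order_b):
    """For each item of the selected cluster, return its leaf position in
    the first result and in the second result as a (pos_a, pos_b) pair."""
    position_a = {item: i for i, item in enumerate(order_a)}
    position_b = {item: i for i, item in enumerate(order_b)}
    return [(position_a[item], position_b[item]) for item in cluster_items]
```

If the two results agree on a cluster, its lines stay close together and roughly parallel; widely scattered endpoints suggest the cluster depends on the linkage method.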

Verification of Clustering - Clustering in a Reduced Dimension

Users can select a subset of the conditions (samples) and cluster on that subset only to verify the clustering results. The horizontal white line between conditions separates the 4 selected conditions from the 3 others. Users can concentrate their inspection on the selected (upper) part and see the clusters more clearly in the scattergram.
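
Clustering on a subset of conditions only changes the input to the clustering step: similarities are computed from the selected columns. The sketch below uses the Pearson correlation coefficient as the similarity measure, which is an assumption for illustration, and the function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences.
    Raises ZeroDivisionError if either sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def similarity_matrix(rows, selected):
    """Pairwise gene similarities using only the `selected` condition columns."""
    reduced = [[row[j] for j in selected] for row in rows]
    return [[pearson(a, b) for b in reduced] for a in reduced]
```

Comparing the matrix for the selected conditions with the matrix for all conditions shows how much a cluster depends on the conditions that were left out.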

Papers

For more information, please refer to the following paper.

Download

HCE is a standalone Windows® application that runs on an ordinary PC. It is freely downloadable for academic and/or research purposes. Commercial licenses can be negotiated with the UM Office of Technology Commercialization (Gayatri Varma, gayatri@umd.edu).

Register and Download HCE

A Short User's Guide for HCE

Check whether there is a newer version (go to the Download section at the main project page).

System requirements
Intel® Pentium® processor
Microsoft® Windows 2000®, Windows XP


Last updated 11/22/2004