Hierarchical Clustering Explorer 2.0

User's Guide

Version 2.0 beta (April 30, 2003)



Author: Jinwook Seo
If you have any comments or questions, send an email to the author.

Project Webpage: http://www.cs.umd.edu/hcil/hce/
Human-Computer Interaction Lab (http://www.cs.umd.edu/hcil/)
University of Maryland, College Park

Table of Contents

0. Overall Layout of HCE2
1. Input data file format
2. Load an input data file
3. Determine clustering parameters
4. Utilizing clustering results
5. Overview and Zoom
6. Switching between filter-out mode and gray-out mode
7. Using minimum similarity bar and detail cutoff bar
8. Show/Hide a part of dendrogram
9. Comparison of two clustering results
10. Dynamic Control Dialog Bar (Vertical Dialog Bar) - Control, Detail, and Evaluation Tabs
11. Information Dialog Bar (Horizontal Dialog Bar) - Color Mosaic, Scatterplot Ordering, Gene Ontology, Profile Search, and K-means Tabs

0. Overall Layout of HCE2

 

 
 

[Figure: overall layout of HCE2, showing the Dendrogram View, the Dynamic Control Dialog Bar, and the Information Dialog Bar]


1. Input data file format

The default extension for an input data file is ".exp". Your data file should be a tab-delimited text file. The following table shows a sample input file format for a microarray data set.

EXPERIMENT NAME | NAME or TITLE | WEIGHT | GO or GOID           | DESCRIPTION or DESC | SAMPLE1 | SAMPLE2 | SAMPLE3 | ...
WEIGHT          |               |        |                      |                     |         |         |         |
GENE1           |               |        | GO:0015385           |                     |         |         |         |
GENE2           |               |        | GO:0004725GO:0005001 |                     |         |         |         |
GENE3           |               |        |                      |                     |         |         |         |
...             |               |        |                      |                     |         |         |         |

The columns and rows in gray shadow are optional; the others are mandatory. The names of the columns in gray shadow are keywords, which means they have a special meaning in HCE2. The actual numeric values go in the cells with yellow shadow. It is best to prepare your data file in a narrow and long form (# of rows > # of columns) to have better control over the dendrogram visualization in HCE2. Generally, for microarray data, rows are genes and columns are experimental conditions, samples, or time points.

You can generate an input file in MS Excel by selecting "Text (Tab delimited) (*.txt)" in the "Save As" dialog box. Please take a look at the following sample input files for examples.

Sample input files : cereal.exp, melanoma_go_small.exp, yeast.exp
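
To make the tab-delimited layout concrete, here is a minimal Python sketch that reads such a table. It is only an illustration, not HCE2's own loader: the function name read_exp, the set KEYWORD_COLUMNS, and the choice to skip the WEIGHT row are assumptions made for this example.

```python
import csv
import io

# Keyword column names taken from the sample table; which of them appear
# in a given file is an assumption of this sketch.
KEYWORD_COLUMNS = {"NAME", "TITLE", "WEIGHT", "GO", "GOID", "DESCRIPTION", "DESC"}


def read_exp(handle):
    """Parse a tab-delimited table into (sample_names, {row_id: [values]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Every column after the first that is not a keyword is treated as a sample.
    sample_idx = [i for i, name in enumerate(header)
                  if i > 0 and name.strip().upper() not in KEYWORD_COLUMNS]
    samples = [header[i] for i in sample_idx]
    rows = {}
    for record in reader:
        if not record or not record[0].strip():
            continue  # skip blank lines
        if record[0].strip().upper() == "WEIGHT":
            continue  # optional weight row; ignored in this sketch
        rows[record[0]] = [float(record[i]) for i in sample_idx]
    return samples, rows


# A tiny in-memory example with one keyword column (GO) and two samples.
example = (
    "EXPERIMENT NAME\tGO\tSAMPLE1\tSAMPLE2\n"
    "GENE1\tGO:0015385\t1.5\t2.0\n"
    "GENE2\t\t0.5\t4.0\n"
)
samples, rows = read_exp(io.StringIO(example))
print(samples)
print(rows)
```

Reading the header first and classifying columns by name keeps the sketch independent of how many optional keyword columns a particular file contains.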


2. Load an input data file

File => Open

The following dialog box comes up. Click the "Select Data File" button to select your data file in the standard file open dialog. An overview of your file then appears in the list box.

You can choose the data transformations that you want to perform on your data file. If you choose both transformations (log and normalization), keep in mind that the log transformation is applied first and the normalization second. For normalization, you can choose 'Row-by-Row' or 'Column-by-Column' normalization. You can also choose one of three normalization formulas. In the first formula, the control is the set of values in the first column or in the first row. This might be useful when you are dealing with time-series data and the up-down patterns are more important than the magnitude of each value.
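
The order of the two transformations can be sketched in Python as follows. The three normalization formulas are not reproduced in this text, so the sketch assumes a common one (rescaling each row or column to mean 0 and standard deviation 1) and a base-2 logarithm; both choices, and all function names, are assumptions made for illustration only.

```python
import math


def log_transform(matrix, base=2.0):
    """Apply a logarithm to every value (values must be positive)."""
    return [[math.log(v, base) for v in row] for row in matrix]


def standardize(values):
    """Rescale one row or column to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    if std == 0:
        return [0.0 for _ in values]  # a flat profile carries no pattern
    return [(v - mean) / std for v in values]


def normalize(matrix, by="row"):
    """Normalize row-by-row or column-by-column."""
    if by == "row":
        return [standardize(row) for row in matrix]
    # Column-by-column: transpose, standardize each column, transpose back.
    columns = [standardize(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


data = [[1.0, 2.0, 4.0], [2.0, 2.0, 2.0]]
# Log transformation first, normalization second.
result = normalize(log_transform(data), by="row")
print(result)
```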


3. Determine clustering parameters

Clustering => Hierarchical Clustering

Note: if you are not familiar with agglomerative hierarchical clustering, read the section on the main page that explains the hierarchical clustering algorithm before you continue reading this manual.

You can see a dialog box like the following. 

In this dialog box, you can do the following:

1. Choose a linkage method 

When the hierarchical clustering algorithm merges two clusters into a new, bigger cluster, it must calculate the distances between the new cluster and the remaining clusters. There are many different ways to do that, and we call them 'linkage methods.' HCE2 implements 5 different linkage methods. Let Cn be a new cluster, a merge of Ci and Cj. Let Ck be a remaining cluster.

Instead of picking the two closest clusters anew, this linkage method tries to grow the cluster that was newly merged in the previous iteration. Let Cn-1 be the newly merged cluster in the previous iteration. Let Cm be the closest cluster to Cn-1, and Cp be the closest cluster to Cm.

If |DIST(Cn-1,Cm) - DIST(Cm,Cp)| < THRESHOLD, merge Cn-1 and Cm instead of searching globally for the two closest clusters.

2. Uncheck the samples that should not take part in the clustering process.

You can exclude some of the columns (samples or experimental conditions) from clustering.

3. Specify whether you want to cluster columns (samples or experimental conditions)

If you want to cluster columns (samples or experimental conditions) in addition to rows (genes), check the "Cluster Columns" check box.

4. Choose a node arrangement method

When merging two nodes (subtrees):

"Increasing Average" puts the subtree with the higher average on the right side, so that the average increases from left to right.

"Keep Right Child Small" puts the smaller subtree (the one with fewer nodes) on the right side, so that the subtrees form a stair shape.

5. Choose a distance/similarity measure 

You can choose either the Pearson correlation coefficient or the Euclidean distance as the distance/similarity measure.
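
For reference, the two measures can be written in Python as below. These are the standard textbook definitions; the function names are made up for this sketch, and the handling of flat profiles (returning 0) is an assumption rather than documented HCE2 behavior.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient, in the range [-1, 1]."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a flat profile as uncorrelated
    return cov / (norm_a * norm_b)


gene1 = [1.0, 2.0, 3.0]
gene2 = [10.0, 20.0, 30.0]
# Same up-down pattern, very different magnitude.
print(pearson_correlation(gene1, gene2))
print(euclidean_distance(gene1, gene2))
```

The example illustrates the practical difference: the two profiles are far apart in Euclidean terms but almost perfectly correlated, so the Pearson measure groups items by pattern rather than by magnitude.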

If you want to try a different combination of clustering parameters (linkage method, distance measure, and/or node arrangement method) without reloading the data set, click the clustering icon on the main toolbar and select a different combination in the dialog box.
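
To see why the linkage method changes the result, the sketch below shows how the distance from a newly merged cluster Cn (a merge of Ci and Cj) to a remaining cluster Ck can be derived from the distances already known. Only three standard textbook linkage rules are shown; this manual does not confirm that they match HCE2's five methods exactly. The second function restates the threshold test from item 1. All names are illustrative.

```python
def linkage_distance(method, d_ik, d_jk, size_i, size_j):
    """Distance between Cn (merge of Ci and Cj) and a remaining cluster Ck.

    d_ik and d_jk are DIST(Ci, Ck) and DIST(Cj, Ck);
    size_i and size_j are the numbers of items in Ci and Cj.
    """
    if method == "single":
        return min(d_ik, d_jk)  # nearest members decide
    if method == "complete":
        return max(d_ik, d_jk)  # farthest members decide
    if method == "average":
        # Size-weighted mean, i.e. the average over all item pairs.
        return (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
    raise ValueError("unknown linkage method: " + method)


def should_grow_previous(d_prev_m, d_m_p, threshold):
    """Threshold test: |DIST(Cn-1, Cm) - DIST(Cm, Cp)| < THRESHOLD."""
    return abs(d_prev_m - d_m_p) < threshold


for method in ("single", "complete", "average"):
    print(method, linkage_distance(method, 2.0, 4.0, 1, 3))
```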


4. Utilizing clustering results

4.1 Import/Export Clustering Results

File=>Import Clustering Result

File=>Export Clustering Result


You can export a clustering result to tab-delimited text files. It is saved as two or three text files: the row clustering result (.rcr), the column clustering result (.ccr), and the actual raw data file (.dat). The second file is not generated if you do not cluster columns. You can import these files later for further analysis.

4.2 Save Current Dendrogram View

File=>Save Dendrogram
File=>Save Dendrogram As...

You can save the current dendrogram view to a true-color BMP file, which is the standard image file format of the Windows operating system.

4.3 Print Current Dendrogram View

File=>Print

You can print the current dendrogram view.

4.4 Copy Current Dendrogram View to Clipboard

Edit=>Copy

You can copy the current dendrogram view to the clipboard so that you can paste it into your document.


5. Overview and Zoom

If your data set is so large that the resulting dendrogram does not fit on one screen, it is hard to get an overview of the entire result. In HCE2, you can zoom in and out by using the horizontal double-sided slider bar.

Initial Overview Mode

Initially, HCE2 always shows the entire dendrogram and color mosaic in the dendrogram view by averaging values of leaf nodes mapped into the same pixel.
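
The averaging idea can be sketched in Python as follows. How HCE2 actually assigns leaf nodes to pixels is not described here, so the proportional mapping below and the function name are assumptions for illustration.

```python
def average_into_pixels(values, n_pixels):
    """Average consecutive leaf values that map to the same pixel column."""
    buckets = [[] for _ in range(n_pixels)]
    for index, value in enumerate(values):
        # Assumed mapping: leaves are spread proportionally over the pixels.
        pixel = index * n_pixels // len(values)
        buckets[pixel].append(value)
    # A pixel with no leaf mapped to it has no value to show.
    return [sum(b) / len(b) if b else None for b in buckets]


leaf_values = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
print(average_into_pixels(leaf_values, 3))
```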

Zoom-in

You can adjust the double-sided slider bar to zoom in on a certain part of the dendrogram and color mosaic.

Note: A double click on the slider bar returns the view to the initial overview mode.


6. Switching between filter-out mode and gray-out mode

Items filtered out by the 'Minimum Similarity Bar' can be displayed in filter-out mode or gray-out mode. You can switch between the two modes by toggling the button on the toolbar (indicated by a red arrow).

Filter-out Mode 

- hide items filtered out

Gray-out Mode 

- desaturate the color bars of items filtered out


7. Using minimum similarity bar and detail cutoff bar

There are two bars in the main view of HCE2: the minimum similarity bar and the detail cutoff bar. You can drag the minimum similarity bar to change the minimum similarity threshold so that only the subtrees satisfying the threshold are shown. This bar makes it easy to determine a proper number of clusters. You can also drag the detail cutoff bar up to ignore the detailed expression levels and see the global pattern of the clustering results. Double-clicking the detail cutoff bar moves it to the bottom of the tree, which means no detail cutoff.
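
The effect of the minimum similarity bar can be sketched as a cut through the tree. The tree encoding below (a leaf is a string, an internal node is a tuple of similarity, left child, right child) and the function names are assumptions made for this example, not HCE2's internal data structure.

```python
def leaves(node):
    """Collect the item names under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def cut_by_min_similarity(node, threshold):
    """Return (clusters, leftover_items) for a minimum similarity threshold.

    A subtree becomes one cluster as soon as its merge similarity meets the
    threshold; single items that never reach it are returned as leftovers.
    """
    if isinstance(node, str):
        return [], [node]
    similarity, left, right = node
    if similarity >= threshold:
        return [leaves(node)], []
    left_clusters, left_rest = cut_by_min_similarity(left, threshold)
    right_clusters, right_rest = cut_by_min_similarity(right, threshold)
    return left_clusters + right_clusters, left_rest + right_rest


tree = (0.2, (0.9, "g1", "g2"), (0.5, "g3", (0.8, "g4", "g5")))
for threshold in (0.1, 0.4, 0.7):
    clusters, rest = cut_by_min_similarity(tree, threshold)
    print(threshold, clusters, rest)
```

Raising the threshold splits the tree into more and tighter clusters, which is why dragging the bar is a quick way to explore the number of clusters.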


8. Show/Hide a part of dendrogram

You can select a cluster with a right mouse click. You can also select an arbitrary part of the dendrogram by dragging the mouse with the right button pressed. The selected part is highlighted with a yellow bounding rectangle. A popup menu then appears, from which you can do the following:

- Hide : hide the selected part. This is useful for filtering out uninteresting or distracting parts.

- Show Only This : show only the selected part, so that you can concentrate on it without distraction. You can enlarge the image by using the 'Color Bar Width' slider on the 'Control' tab in the dynamic control dialog bar.

- Show All : show all items.

- Save Only This : save the raw data of the selected items for further analysis.


9. Comparison of two clustering results

Clustering => Compare Results

You can compare two different clustering results. After loading a file, you can choose two different settings in the following dialog box.

You can see the mapping of each item between the two clustering results by double-clicking a specific cluster on the first dendrogram. You can horizontally shrink and enlarge the yellow mapping window on the first dendrogram by dragging its left or right border, so that you can see the mapping of the items within the mapping window. Keep in mind that hierarchical clustering is computationally too expensive to use this comparison method on a large dataset. I recommend a small dataset that fits on one screen in the original view mode.

Note: this comparison method has not been improved since the previous version. Don't try this function with a large data file.


10. Dynamic Control Dialog Bar (Vertical Dialog Bar)

10.1 Control Tab

- Change between 3-color mode and 1-color mode.  Default color mapping is green, black, and red.

- See the histogram for the whole data set. 

- By dragging the green or red vertical lines, you can change the color mapping.

- A right mouse click on a color boundary will pop up the color selection dialog, so you can customize the corresponding color.

You can change the bar height of the terminal nodes of the dendrogram from 2 to 10 pixels.
If you would like to hide the 'Minimum Similarity Bar', 'Detail Cutoff Bar', 'Clustering Information', or 'Color Scale Bar' (at the top right corner of the dendrogram view) from the dendrogram view, just uncheck the corresponding check box. This can be useful for capturing a clean picture of the dendrogram.

You can also select your favorite color for the selection markers (triangle markers), which are used to highlight the selected items in the dendrogram view and the scatterplot views.

 

10.2 Detail Tab

List control A shows the selected items. The highlighted item is the one under the cursor in the main dendrogram view, whose detail is also shown in list control B. You can save the row data of these selected items in a text file by clicking the 'save' button at the top left corner of this tab. The number of selected items is shown right next to the 'save' button.

List control B shows all values of the item currently under the cursor in the main dendrogram view. Every mouse move on the color mosaic updates this list control with the information of the item under the cursor.

10.3 Evaluation Tab

HCE2 implements a very naive evaluation method that shows the average within-cluster distance and the average between-cluster distance of the current clustering results. A clustering is better when the within-cluster distance is smaller and the between-cluster distance is larger.
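
One simple way to compute these two numbers is sketched below. The exact formula HCE2 uses is not given in this manual, so the sketch assumes plain averages over item pairs with Euclidean distance; the function names are illustrative.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_within(clusters):
    """Mean distance over all pairs of items that share a cluster."""
    dists = [euclidean(a, b)
             for cluster in clusters
             for a, b in combinations(cluster, 2)]
    return sum(dists) / len(dists) if dists else 0.0


def average_between(clusters):
    """Mean distance over all pairs of items from different clusters."""
    dists = [euclidean(a, b)
             for c1, c2 in combinations(clusters, 2)
             for a in c1
             for b in c2]
    return sum(dists) / len(dists) if dists else 0.0


clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 2.0)],
]
within = average_within(clusters)
between = average_between(clusters)
print(within, between)
```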

You can compare the hierarchical clustering result with a K-means clustering result. When you perform K-means clustering, you can randomly generate the initial clusters, or you can use the hierarchical clustering result as the initial cluster set. The latter is known to produce better clustering results.
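
Seeding K-means with an existing clustering can be sketched as follows. This is the standard Lloyd iteration started from the centroids of the given clusters; it is not HCE2's implementation, and the function names and the iteration limit are assumptions.

```python
def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_clusters, max_iter=100):
    """K-means seeded with the centroids of initial_clusters."""
    centers = [centroid(c) for c in initial_clusters]
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_assignment = [
            min(range(len(centers)),
                key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move every center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = centroid(members)
    return assignment, centers


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
# Initial clusters as they might come from cutting a dendrogram.
hierarchical = [[points[0], points[1]], [points[2], points[3]]]
assignment, centers = kmeans(points, hierarchical)
print(assignment, centers)
```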


11. Information Dialog Bar (Horizontal Dialog Bar)

11.1 Color Mosaic Tab

This tab shows the color mosaic of the hierarchical clustering result without the filtering applied by the minimum similarity bar. You can scroll horizontally to see the entire color mosaic, and scroll vertically to see long item names. If you clustered columns (experimental conditions or samples) by checking the 'Cluster Columns' check box in the clustering dialog box, you can also see another dendrogram, as shown in the left example. As you can see in this example, HCE2 highlights the selected items by showing their names on a yellow background.

When you click on a cluster in the dendrogram view, the names of the corresponding items are highlighted on a yellow background, and at the same time the sample names and the small dendrogram attach to the right or left side of the selected group. You can also drag the combination of sample names and small dendrogram next to any item so that you can easily read the column name of each value.

A right mouse button click opens a popup menu with 3 menu items (Print Current View, Copy Current View, and Save Current View). You can print the current view, copy it to the clipboard to paste it into your document, or save it to a BMP file.

11.2 Scatterplot Ordering Tab

Low-dimensional projections are very useful when analyzing a multidimensional dataset. Since a computer screen is intrinsically a 2-dimensional space and 2D projections are readily understood by most users without the distraction of navigation controls, we chose 2D scatterplots as the low-dimensional projections. You can choose 2 columns (experimental conditions or samples) for the X and Y axes respectively, and see the corresponding scatterplot, where each item is drawn as a rectangular point at (x,y); x is the value of the item at the sample chosen for the X axis, and y is the value of the item at the sample chosen for the Y axis.

The problem is that there is often a large number of possible scatterplots, so some mechanism is needed to traverse them sensibly. We suggest a way to help users find the scatterplots that are interesting to them. We implemented 5 different criteria by which scatterplots are ordered: Pearson's correlation coefficient, least square error of a straight line fit, least square error of a quadratic curve fit, number of items in a region of interest, and how uniform the distribution is. In the 'Scatterplot Ordering' tab, users can choose an ordering method and see the resulting ordered list of scatterplots and scores. Bidirectional and interactive coordination with the other views is of course supported.
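
As an illustration of the first criterion, the Python sketch below scores every pair of columns by Pearson's correlation coefficient and sorts the pairs by score. With m columns there are m(m-1)/2 possible scatterplots. The function names and the input layout (a dictionary from column name to values) are assumptions for this example only.

```python
import math
from itertools import combinations


def pearson(a, b):
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a constant column scores 0
    return cov / (norm_a * norm_b)


def rank_scatterplots(columns):
    """Score every column pair; return (score, x_name, y_name), best first."""
    scored = [(pearson(columns[x], columns[y]), x, y)
              for x, y in combinations(columns, 2)]
    return sorted(scored, reverse=True)


columns = {
    "S1": [1.0, 2.0, 3.0, 4.0],
    "S2": [2.0, 4.0, 6.0, 8.5],
    "S3": [4.0, 3.0, 2.0, 1.0],
}
ranked = rank_scatterplots(columns)
for score, x_name, y_name in ranked:
    print(round(score, 3), x_name, y_name)
```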

This tab consists of 5 parts (Ordering Criteria, Score Table, Score List, Scatterplot Browser, and Histogram).

Ordering Criteria
You can choose an ordering method. If you choose '# of items in ROI', a popup menu comes up and you can choose a type of region from 'Rectangle', 'Ellipse', and 'Free Form'. Then, specify a region on the scatterplot view by dragging with the left mouse button.

Note: Uniformness ordering takes a large amount of time to calculate. It takes 2 to 3 minutes on a P4 2383MHz machine to order the scatterplots for a data set with 3600 rows and 38 samples.

Score Table
A new visualization component, the Score Table, shows a lower triangular matrix where each cell represents a scatterplot. Each cell is color-coded by its score, and the color mapping is shown at the top right corner of the Score Table. As users move the mouse over a cell, the scatterplot corresponding to the cell is shown in the Scatterplot Browser, and the corresponding item is highlighted in the Score List at the same time. Users can easily find the variable that is the least (or most) correlated to the other variables by scanning a row or column for the darkest (or brightest) cell. It is also possible to find an outlying scatterplot whose cell has a distinctive color intensity compared to the rest of the same row or column. After locating an interesting cell, users can double click on the cell to select and enlarge it, and then scrutinize it in the Scatterplot Browser and in the other tightly coordinated views.

Score List
This list control shows the resulting order of scatterplots. You can see the corresponding scatterplot by clicking an entry in the list. If you double click an entry, the corresponding scatterplot is shown as a separate MDI child window.

You can select multiple plots at the same time and click the 'Make Views' button; each selected plot is then shown in a separate child window. In this way, you can see multiple projections at the same time.

Click the 'Score' column heading to toggle the sort order between ascending and descending.

Scatterplot Browser

You can use a rubber-band rectangle to sweep out an area on the scatterplot. The items in the swept area are highlighted with triangles, and the related items are simultaneously highlighted in the other views, also with triangles. The vertical and horizontal item sliders help users traverse scatterplots quickly. You can change the X (or Y) axis simply by dragging the corresponding item slider.

 

Histogram
The last view in 'Scatterplot Ordering' tab shows the histogram of the selected items.

11.3 Gene Ontology Tab

If the input file contains microarray experiment data and gene ontology information is available for the genes, you can use this tab. GOIDs (Gene Ontology IDs) can be entered as in the sample file shown in the first section of this manual. If a gene has more than one GOID, the GOID cell should contain the concatenation of all its GOIDs. For example, enter GO:0004725GO:0005001 if a gene has the two GOIDs GO:0004725 and GO:0005001.
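
If you process such files with your own scripts, a concatenated GOID cell can be split back into individual IDs with a regular expression. The pattern below assumes that every ID has the form "GO:" followed by digits, as in the example above; the function name is illustrative.

```python
import re

# Assumption: each GOID is "GO:" followed by one or more digits.
GOID_PATTERN = re.compile(r"GO:\d+")


def split_goids(cell):
    """Split a concatenated GOID cell into a list of individual GOIDs."""
    return GOID_PATTERN.findall(cell)


print(split_goids("GO:0004725GO:0005001"))
print(split_goids("GO:0015385"))
print(split_goids(""))
```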

If you select a cluster in the dendrogram view by a left click, all genes of the cluster are shown in the list view together with their GOIDs as in the following example.

Ontology Tree Control

This tree control shows all or a part of the molecular function ontology.

Gene List Control

This list control shows all selected genes with their Gene Ontology IDs. If you click a GOID, it is shown in the tree control on the left with its complete paths from the root. All other paths are hidden. If you click a gene name, all of its GOIDs are shown in the tree control on the left.

11.4 Profile Search Tab

HCE2 implements a kind of parallel coordinate view to help you compare the patterns of clusters and find more interesting patterns interactively. 'Brushing' and 'Dynamic Query' are the fundamental techniques of Profile Search, which means you can specify search patterns on the information space itself by dragging the mouse, and see the query results instantaneously. The following figure shows the overall layout of the Profile Search tab.

You see solid lines in the Information Space, each of which represents the profile of an item (gene). In this space, you can submit a query just by dragging the mouse, and the result of your query is shown interactively in the same space. You can modify your query easily by moving a point vertically or by moving a line segment vertically or horizontally. You can delete a part of the model pattern by dragging the mouse with the left control key pressed, or after pressing the 'Delete' button. The 'Clear ALL' button lets you return to the initial state. The Profile Search tab supports sequential query refinement. After you get a search result, you can keep it as a new, narrowed information space by clicking the 'Pin This Result' button, and then refine your query by submitting a new query over the pinned result set. If 'Show Silhouette' is checked, you always see the range of the profiles of all items in the form of a gray shadowed polygon. You can reset your information space to the original full set by clicking the 'Consider All Profiles' button. The following two kinds of queries are possible in this tab.

You can specify a model pattern simply by dragging the mouse with the left button pressed, as shown in the above figure. You can use 3 different distance measures and assign threshold values. All profiles satisfying the threshold range are shown interactively in the Information Space. For example, the previous figure shows the profiles of items that are 97 percent or more similar to the red model pattern in terms of the Pearson correlation coefficient. You can move the entire model pattern by dragging a line segment, or move a control point by dragging it. The 3 measures are the Pearson correlation coefficient, the Euclidean distance, and the absolute distance from each control point. Assume you select the last measure and the threshold values are 0 and 5. If the distance between each point of a profile and the corresponding control point of the model pattern is between 0 and 5, the profile is selected as a result. It is like selecting the profiles that flow through an equal-width (5) pipe whose center line is the model pattern.
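
The "pipe" query with the absolute distance measure can be sketched in Python as follows. The sketch assumes that the model pattern has one control point for every column of a profile and that both threshold bounds are inclusive; these assumptions and the function names are not taken from HCE2 itself.

```python
def within_pipe(profile, model, low, high):
    """True if every point lies between low and high away from the model."""
    return all(low <= abs(p - m) <= high for p, m in zip(profile, model))


def search_by_pipe(profiles, model, low=0.0, high=5.0):
    """Return the names of all profiles that stay inside the pipe."""
    return [name for name, values in profiles.items()
            if within_pipe(values, model, low, high)]


profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.0, 9.0, 3.0],
    "g3": [4.0, 4.0, 4.0],
}
model_pattern = [2.0, 3.0, 2.0]
print(search_by_pipe(profiles, model_pattern, low=0.0, high=3.0))
```

A single point outside the pipe is enough to reject a profile, which is why this measure is stricter than a correlation threshold for profiles with one outlying value.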

It is possible to define ceilings and floors in the Information Space so that only the profiles below the ceilings and above the floors are shown as a result. The following figure shows a simple example of a ceiling-and-floor query. You can specify a ceiling with the left mouse button, and a floor with the right mouse button. You can move each individual line segment or control point to change the ceiling and the floor.
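
A ceiling-and-floor query can be sketched as a per-column range check. In this sketch a ceiling or floor may be missing for some columns (represented by None), and values exactly on a bound are accepted; both choices and the function name are assumptions made for illustration.

```python
def between_floor_and_ceiling(profile, floor, ceiling):
    """True if the profile stays above the floor and below the ceiling.

    floor and ceiling hold one bound per column; None means no bound
    was drawn for that column.
    """
    for value, lo, hi in zip(profile, floor, ceiling):
        if lo is not None and value < lo:
            return False
        if hi is not None and value > hi:
            return False
    return True


floor = [0.0, None, 1.0]
ceiling = [5.0, 5.0, None]
print(between_floor_and_ceiling([1.0, 4.0, 9.0], floor, ceiling))
print(between_floor_and_ceiling([1.0, 6.0, 9.0], floor, ceiling))
```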

You can type in a string to find the items whose name (description) contains the string. Searches are done incrementally. For example, if you want to find the items whose name contains the substring "EST", then when you type 'E' only the items containing 'E' in their name are shown. As you type 'S', the result is updated to show only the items whose name contains the substring "ES".
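
The incremental behavior can be sketched in Python as follows. The sketch assumes a case-sensitive substring match, which this manual does not specify; the function name is illustrative.

```python
def incremental_search(names, typed):
    """Return [(typed_so_far, matches)] for each character typed."""
    candidates = list(names)
    steps = []
    for end in range(1, len(typed) + 1):
        so_far = typed[:end]
        # A name that contains the longer string also contains the shorter
        # one, so each step only needs to filter the previous result.
        candidates = [name for name in candidates if so_far in name]
        steps.append((so_far, candidates))
    return steps


names = ["EST1", "GENE2", "ESR1", "ACTB"]
steps = incremental_search(names, "EST")
for so_far, matches in steps:
    print(so_far, matches)
```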

11.5 K-means Tab

This tab shows the K-means clustering results in a form similar to the 'Color Mosaic' tab. The clusters are separated by one-pixel gaps. Selected items are simultaneously highlighted by showing their names on a yellow background.

 


Last updated 11/22/2004
