Hierarchical Clustering Explorer 3.0

User's Guide

Last updated on 05/06/2005


Author : Jinwook Seo
If you have any comments or questions, send an email to the author.

Download HCE 3.0
Project Webpage : http://www.cs.umd.edu/hcil/hce/
Human-Computer Interaction Lab ( http://www.cs.umd.edu/hcil/)
University of Maryland, College Park


Table of Contents

0. Introduction
1. New input file format
2. Preview dialog box
3. Hierarchical Clustering and Dendrogram Display
4. Color Mosaic Displays for Multidimensional Data Sets
5. Determine clustering parameters
6. Utilizing clustering results
7. Overview and Zoom
8. Switching between filter-out mode and gray-out mode
9. Using minimum similarity bar and detail cutoff bar
10. Show/Hide/Select items on the dendrogram view
11. Switch views between the main dendrogram view and color mosaic tab
12. Comparison of two clustering results
13. Dynamic Control Dialog Bar (Vertical Dialog Bar)
     13.1 Control Tab
     13.2 Detail Tab
     13.3 Evaluation Tab
14. Information Dialog Bar (Horizontal Dialog Bar)
     14.1 Color Mosaic Tab
     14.2 Table View Tab
     14.3 Histogram Ordering Tab
     14.4 Scatterplot Ordering Tab
     14.5 Profile Search Tab
     14.6 Gene Ontology Tab
     14.7 K-means Tab

0. Introduction

HCE (Hierarchical Clustering Explorer) is a visualization tool for interactive exploration of multidimensional datasets. One of the goals of HCE is to help users explore and understand multidimensional datasets by maximizing the human perceptual skills that have been underutilized. HCE is a telescope with which users can systematically scrutinize multidimensional datasets in order to identify unexpected interesting features hidden in the multidimensional space.

A guiding principle in the usage of the multidimensional telescope is the GRID (Graphics, Ranking, and Interaction for Discovery) principle that could enable users to better understand distributions in one (1D) or two dimensions (2D), and then discover relationships, clusters, gaps, outliers, and other features in multidimensional data sets. By combining information visualization techniques (overview, coordination, and dynamic query) with summaries and statistical methods, users can systematically examine the most important 1D and 2D axis-parallel projections.

In this manual, we explain how HCE works in detail and how to use visualization components in HCE.

1. New Input File Format

HCE 3.0 also accepts the file format used by previous HCE versions. An important requirement is that the very first column contain unique identifiers.  These can be item names, or users can fill the column with integer values from 1 to n.  In the new format, users can add one special row that holds field type information, as shown in the previous figure. The first column of this row should be "fieldtype", and each remaining column can be one of the types STRING, CATEGORICAL, INTEGER, and REAL.  Integer values should be between -2147483648 and 2147483647. Real values should be between -3.402823466e+38 and 3.402823466e+38, and the decimal exponent should be between -37 and 38.  Here is a sample input file in the new file format: cereal-new.txt

Please note that users may not use any of the following reserved words as a column heading unless the column is used for its designated purpose.

Reserved words in HCE input file: name, title, desc (or description), go (or goid), weight

In input files that use the previous file format (without a "fieldtype" row), all columns except meta columns are assumed to hold REAL-type values (floating point numbers), as in previous HCE versions.
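As a rough illustration of the two formats, the following Python sketch reads tab-delimited text and detects the optional "fieldtype" row. It assumes that the file is tab-delimited, that the first row holds the column headings, and that the "fieldtype" row comes directly after it; the guide does not state these details, and the column names and values in the sample are made up. For brevity, the fallback treats every column after the identifier column as REAL and does not model meta columns.

```python
import csv
import io

FIELD_TYPES = {"STRING", "CATEGORICAL", "INTEGER", "REAL"}

def parse_hce_text(text):
    """Parse tab-delimited text and return (header, types, data_rows)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    # Assumption: the optional "fieldtype" row directly follows the header row.
    if body and body[0][0].strip().lower() == "fieldtype":
        types = [t.strip().upper() for t in body[0][1:]]
        body = body[1:]
    else:
        # Previous format: columns are treated as REAL (meta columns not modeled).
        types = ["REAL"] * (len(header) - 1)
    unknown = set(types) - FIELD_TYPES
    if unknown:
        raise ValueError(f"unknown field types: {sorted(unknown)}")
    ids = [row[0] for row in body]
    if len(set(ids)) != len(ids):
        raise ValueError("the first column must hold unique identifiers")
    return header, types, body

# Hypothetical sample data, not taken from cereal-new.txt.
sample = (
    "id\tmanufacturer\tcalories\tfiber\n"
    "fieldtype\tCATEGORICAL\tINTEGER\tREAL\n"
    "Cereal A\tNabisco\t70\t10.0\n"
    "Cereal B\tKelloggs\t110\t1.5\n"
)
header, types, rows = parse_hce_text(sample)
print(types)
```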

Several test data sets are included in the download for users to study. Please refer to the application examples page for more information.

1.1 Categorical variables

Categorical variables can be either character strings or integers. Internally, string categorical values are handled as integers; in other words, each string value has an integer encoding from 1 to the number of unique categorical values. In the data shown above, for example, 1 stands for "Nabisco", 2 for "Quaker Oats", 3 for "Kelloggs", and so on. In the current version (version 3.0), categorical variables are treated in the same way as other variables, except that they don't take part in the clustering. Later versions will use categorical variables for stratification as well as for clustering.
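The integer encoding can be pictured with a small Python sketch. It assumes that codes are handed out in order of first appearance, which is consistent with the example above but is not stated as the rule HCE follows; the function name is hypothetical.

```python
def encode_categorical(values):
    """Map each distinct string to an integer code from 1 to k."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            # Assumption: codes follow the order of first appearance.
            codes[value] = len(codes) + 1
        encoded.append(codes[value])
    return encoded, codes

encoded, codes = encode_categorical(["Nabisco", "Quaker Oats", "Kelloggs", "Nabisco"])
print(encoded)  # [1, 2, 3, 1]
```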

1.2 File size vs. Clustering time

The time complexity of the current implementation of HCE is O(n^2 m), where n is the number of items (rows) and m is the number of dimensions (columns).  The space complexity is O(n^2).

The following table shows the measured time (in seconds) to complete the clustering of rows on a PC with a 2.53GHz Pentium 4 and 1GB of memory. If the number of rows (or columns) is larger than 40000, the clustering completion time becomes indefinite with the current implementation of HCE on this machine because of the memory overhead for maintaining the intermediate distance matrix.

number of rows        3138   3614   6211   12422   22690   22283   38305
number of columns       17     38     27      27      40     105       6
clustering time (s)   3.75   7.75  16.15   65.25  226.21  452.29  659.63
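The measurements can be checked against the stated O(n^2 m) growth. The Python sketch below compares the two runs that share the same column count (6211 and 12422 rows, 27 columns each); doubling n should roughly quadruple the time. The helper names are hypothetical, and the numbers are copied from the table.

```python
# (rows, columns, seconds) copied from the table above
runs = [
    (3138, 17, 3.75), (3614, 38, 7.75), (6211, 27, 16.15), (12422, 27, 65.25),
    (22690, 40, 226.21), (22283, 105, 452.29), (38305, 6, 659.63),
]

def predicted_ratio(a, b):
    """Ratio of n^2 * m between run b and run a."""
    (n1, m1, _), (n2, m2, _) = a, b
    return (n2 * n2 * m2) / (n1 * n1 * m1)

def observed_ratio(a, b):
    """Ratio of the measured clustering times of run b and run a."""
    return b[2] / a[2]

a, b = runs[2], runs[3]  # same column count, twice as many rows
print(round(predicted_ratio(a, b), 2), round(observed_ratio(a, b), 2))  # 4.0 4.04
```

Other pairs in the table agree less closely, which is expected because constant factors and memory effects are not captured by the asymptotic bound.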

 

2. Preview Dialog Box

File => Open

After selecting an input file, the following preview dialog box will show up.  Users can see the first 10 rows of the input file and check if the file is in the right format.  Users can also perform some data filtering and transformation in this dialog box.

2.1 Data Filtering

Present call filtering (This is only for Affymetrix GeneChip Experiment data.)

There are two outputs from the Affymetrix noise calculations: one is the continuous p value assignment, and the other is a simple “present/absent” threshold.  When the probe set detection p value reaches a certain level of significance (less than 0.04 in the default setting as shown in the above figure), the probe set is assigned a “present” call, while all probe sets with less robust signal/noise ratios are assigned an “absent” call.  This enables the use of a “present call” threshold noise filter.  The default setting is a “10% present call” noise filter.  For example, in a project with 25 microarrays, a probe set is required to show at least 3 “present” assignments (>10% “present” calls).  All profiles that don't satisfy the requirement will be filtered out when users click the "Filter it!" button.
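The arithmetic behind the default filter can be sketched in Python as follows. The function and parameter names are hypothetical, and the strict "more than 10%" comparison follows the ">10%" wording above rather than any documented HCE internals.

```python
def passes_present_call_filter(p_values, p_cutoff=0.04, min_fraction=0.10):
    """Return True if more than min_fraction of the arrays give a present call."""
    present = sum(1 for p in p_values if p < p_cutoff)
    return present > min_fraction * len(p_values)

# 25 arrays: 3 present calls (12%) pass, 2 present calls (8%) do not.
three_present = [0.01, 0.02, 0.03] + [0.5] * 22
two_present = [0.01, 0.02] + [0.5] * 23
print(passes_present_call_filter(three_present))  # True
print(passes_present_call_filter(two_present))    # False
```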

Standard deviation filtering

Users can filter out rows based on the standard deviation.  The idea is to filter out data items (or genes) that change little over the samples or time points. Rows (or genes) will be filtered out if their standard deviations over all columns (or samples) are less than a threshold.  The default threshold is 1.
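A minimal Python sketch of this filter is shown below. It assumes the sample standard deviation (n-1 in the denominator); the guide does not say which form HCE uses, and the row names and values are made up.

```python
import statistics

def filter_by_std(rows, threshold=1.0):
    """Keep rows whose standard deviation across columns is at least threshold."""
    # Assumption: sample standard deviation; HCE may use the population form.
    return {name: values for name, values in rows.items()
            if statistics.stdev(values) >= threshold}

rows = {
    "flat": [5.0, 5.1, 4.9],     # barely changes, so it is filtered out
    "varying": [1.0, 4.0, 9.0],  # changes a lot, so it is kept
}
kept = filter_by_std(rows)
print(list(kept))  # ['varying']
```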

2.2 Data Transformation

Log transformation (Natural log)

Users sometimes want to transform the variable to get a better result.  For example, log transformations convert exponential relationships to linear relationships, straighten skewed distributions, and reduce the variance. This transformation is sometimes useful when the dataset is ratio data, for example, the ratio of red/green intensities for cDNA array.

Normalization

Users can normalize the input data either row-by-row or column-by-column, and four normalization methods are available in HCE3.

- Standardize the values, i.e., calculate the deviation from the mean and then divide the deviation by the standard deviation.  After standardization, each row (or column) will have the same mean (0) and the same standard deviation (1).
- Divide values by the value in the first column or row. In other words, the control is always the first column or row for HCE3.
- Divide values by the median.
- Rescale to a new range: linearly transform each row or each column to a new range of values. For example, after rescaling to the range 0 to 1, the minimum value becomes 0, the maximum value becomes 1, and values in between are linearly transformed to values between 0 and 1.
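The four methods can be written down as short Python functions that act on one row (or one column) at a time. This is a sketch under stated assumptions: the function names are hypothetical, the standardization uses the population standard deviation, and degenerate inputs (a constant row, a zero first value, or a zero median) are not handled.

```python
import statistics

def standardize(values):
    """Subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population form
    return [(v - mean) / sd for v in values]

def divide_by_first(values):
    """Divide every value by the first one (the control)."""
    return [v / values[0] for v in values]

def divide_by_median(values):
    """Divide every value by the median."""
    median = statistics.median(values)
    return [v / median for v in values]

def rescale(values, new_min=0.0, new_max=1.0):
    """Linearly map the values onto the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

row = [2.0, 4.0, 6.0, 8.0]
print(divide_by_first(row))  # [1.0, 2.0, 3.0, 4.0]
print(rescale(row)[0], rescale(row)[-1])  # 0.0 1.0
```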

If columns in a user's input file (like the cereal data file shown in section 1) have different ranges of values, column-by-column normalization is recommended. This normalization makes the color mapping more reasonable and makes columns comparable to each other. If values in all columns are already directly comparable, row-by-row normalization is recommended. For example, in Affymetrix projects each column (chip or sample) is usually normalized by probe set signal algorithms, so values in different columns are directly comparable. In such cases, row-by-row normalization improves the color mapping and accelerates the row clustering process. Please note that the choice of normalization direction (column-by-column or row-by-row) strongly influences the clustering results and other results. Please also be aware that most visualization components in HCE, such as Table View, Histogram Ordering, and Scatterplot Ordering, offer an option to use either normalized values or original values.


3. Hierarchical Clustering and Dendrogram Display

One of the requirements of good clustering algorithms is the ability to determine the number of natural clusters in the data set. However, most existing clustering algorithms ask users to specify the number of clusters that they want to generate. This requirement makes clustering algorithms perform unnecessary merges or splits, which produce unnatural clusters. Furthermore, the natural number of clusters is mostly dependent on users’ preferences or applications. A possible solution to this problem is to use the hierarchical agglomerative clustering (HAC) algorithm and allow users to control parameters to determine the proper number of clusters. Unlike most clustering algorithms, HAC generates a hierarchical structure of clusters instead of sets of clusters.

The HAC algorithm is summarized as follows. Let's assume that we want to cluster n data items, and we have n*(n-1)/2 similarity (or distance) values between every possible pair of n data items:

1. Initially, each data item occupies a cluster by itself. So there are n clusters at the beginning.

2. Find one pair of clusters whose similarity value is the highest, and make the pair a new cluster.

3. Update the similarity values between the new cluster and the remaining clusters.

4. Repeat steps 2 and 3 a total of n-1 times, until only one cluster of size n remains.

There are many possible choices for updating the similarity values in step 3. Among them, the most common ones are complete-linkage, average-linkage, and single-linkage. Complete-linkage sets the similarity between the new cluster and each remaining cluster to the minimum of the similarities between the members of the new cluster and the remaining cluster. Average-linkage uses the average similarity value as the new similarity value. Single-linkage takes the maximum.
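The four steps and the three update rules can be sketched in Python as follows. This is a naive illustration, not HCE's implementation: it works on a full similarity matrix, the function name is hypothetical, and the average rule is assumed to be weighted by cluster size.

```python
def hac(sim, linkage="average"):
    """Naive agglomerative clustering on a symmetric similarity matrix.

    Returns the merges in order as (members_a, members_b, similarity).
    """
    n = len(sim)
    clusters = {i: [i] for i in range(n)}  # step 1: n singleton clusters
    pair = {(i, j): sim[i][j] for i in range(n) for j in range(i + 1, n)}
    merges, next_id = [], n
    while len(clusters) > 1:  # step 4: n-1 merges in total
        # Step 2: pick the pair of clusters with the highest similarity.
        (a, b), best = max(pair.items(), key=lambda kv: kv[1])
        merges.append((sorted(clusters[a]), sorted(clusters[b]), best))
        na, nb = len(clusters[a]), len(clusters[b])
        # Step 3: similarity between the new cluster and each remaining one.
        for k in clusters:
            if k in (a, b):
                continue
            s_ak = pair[tuple(sorted((a, k)))]
            s_bk = pair[tuple(sorted((b, k)))]
            if linkage == "complete":
                new = min(s_ak, s_bk)
            elif linkage == "single":
                new = max(s_ak, s_bk)
            else:  # assumption: size-weighted average
                new = (na * s_ak + nb * s_bk) / (na + nb)
            pair[(k, next_id)] = new
        pair = {p: v for p, v in pair.items() if a not in p and b not in p}
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

# Hypothetical similarities between four items.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
for step in hac(sim, linkage="single"):
    print(step)
```

With these values, all three rules merge items 0 and 1 first and items 2 and 3 second; only the similarity assigned to the final merge differs.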

Hierarchical clustering results are usually represented as dendrograms. A dendrogram is a binary tree in which each data item corresponds to a terminal node, and the distance from the root to a subtree indicates the similarity of the subtree: highly similar nodes or subtrees have joining points that are farther from the root. For example, in the following figure, the Euclidean distance between A and D is the smallest among all possible pairs, so they are merged together as a subtree, and the height of the subtree is very short because they are very similar in terms of the similarity/distance measure. On the other hand, B and E are not so close to each other, so the height of the corresponding subtree is much taller.

Hierarchical agglomerative clustering and dendrogram. Five data points (A, B, C, D, E) on a 2D plane are clustered, and the dendrogram (a binary tree) on the right side shows the clustering result by using Single-linkage and Euclidean distance. The height of each subtree represents the distance between the two children.


4. Color Mosaic Displays for Multidimensional Data Sets

Multidimensional data sets are usually represented in a table where a row represents an item and a column represents a variable (or a dimension). For example, (a) of the following figure shows a small multidimensional data set (77 rows and 13 columns) about nutrition information of breakfast cereals. Each row is a cereal, and each column is a nutrition component. A graphical representation of this data set is to color-code each value in the table according to a color mapping scheme. This graphical representation of a table is called “Color Mosaic.” There are other names for the representation such as heat map and patchgrid. A usual way to show a color mosaic is to maintain the same layout of the original table and just color-code each cell (b). Even though this vertical layout is a natural representation, HCE uses a transposed layout (c) by default to show more items in a limited screen space. Since the width of a computer screen is usually bigger than the height and multidimensional data sets usually have many more rows than columns, the horizontal layout can accommodate more items on a screen.


(a) cereal data set

(b) vertical color mosaic

(c) horizontal color mosaic

Color mosaic displays for a multidimensional data set. In (a) and (b), each row is a cereal while each column is a cereal in (c). The default layout in HCE is (c).

When researchers want to identify hot spots and understand the distribution of data, they can examine the color mosaic. In general, a dendrogram is displayed with a color mosaic at the leaves (see (a) of the following figure). The arrangement of rows and columns of the color mosaic display is changed according to the clustering result. The graphical pattern of the underlying data is shown by coloring each tile on the basis of the numerical value corresponding to the tile. The color mapping is specified by a color mapping control using a histogram for all numerical values in the data set (b). By default, in HCE, a high value has a bright red color, a low value has a bright green color, and the middle value has a black color. The vertical red line specifies the value above which all values are mapped to the brightest red color, and the vertical green line specifies the value below which all values are mapped to the brightest green color. As a value gets closer to the middle value between the green and the red lines, the color becomes darker. A right click on a vertical color line shows a color-selection dialog box to allow users to use a different set of colors for color mapping.

User controls over the color mapping are necessary to enable users to see subtle differences in the ranges of interest. For skewed data distributions, this is essential to avoid a situation where a large part of the screen is filled with all green or all red, indicating that most of the values are near the extremes. Users can change the color mapping for the color mosaic display by dragging the red and green vertical lines over the histogram to adjust the range of the color stripe displayed (b). Users can instantly see the result of the new color mapping on the color mosaic display, so that they can identify the proper color mapping for the data set.
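The default green-black-red mapping can be approximated with the Python sketch below. It assumes a linear ramp between the two lines and 8-bit RGB output; the guide only states that colors get darker toward the middle value, so the exact ramp used by HCE may differ. The function and parameter names are hypothetical, and red_line must be greater than green_line.

```python
def value_to_rgb(value, green_line, red_line):
    """Map a value to an (r, g, b) tuple on a green-black-red scale."""
    mid = (green_line + red_line) / 2.0
    half = (red_line - green_line) / 2.0
    # Position in [-1, 1]: -1 at or below the green line, +1 at or above the red line.
    t = max(-1.0, min(1.0, (value - mid) / half))
    if t >= 0:
        return (round(255 * t), 0, 0)
    return (0, round(255 * -t), 0)

print(value_to_rgb(0.0, -2.0, 2.0))   # (0, 0, 0): the middle value is black
print(value_to_rgb(5.0, -2.0, 2.0))   # (255, 0, 0): clamped to the brightest red
print(value_to_rgb(-5.0, -2.0, 2.0))  # (0, 255, 0): clamped to the brightest green
```

Dragging the lines closer together corresponds to a smaller gap between green_line and red_line, which spreads the color range over a narrower band of values.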


(a) color mosaic attached to dendrogram

(b) color mapping

A color mosaic display attached to a dendrogram visualizes a hierarchical clustering result of the cereal data set. The arrangements of rows and columns are changed according to the clustering result. Users can change the color mapping for the color mosaic by dragging vertical color lines (green or red) on a histogram.


5. Determining Clustering Parameters

Clustering => Hierarchical Clustering

After clicking OK in the preview dialog box, the following dialog box will open.

In this dialog box, users can do the following tasks.

5.1 Choose a linkage method

When the hierarchical clustering algorithm merges two clusters to generate a new bigger cluster, it should calculate the distances between the new cluster and the remaining clusters.  There are many different ways to do that, and we call them 'linkage methods.'  HCE3 implements 5 different linkage methods.  Let C_n be a new cluster, a merge of C_i and C_j. Let C_k be a remaining cluster.

Instead of picking the two closest clusters anew, this linkage method tries to grow the cluster that was newly merged in the previous iteration.  Let C_(n-1) be the newly merged cluster in the previous iteration.  Let C_m be the closest cluster to C_(n-1), and C_p be the closest cluster to C_m.

If |DIST(C_(n-1), C_m) - DIST(C_m, C_p)| < THRESHOLD, merge C_(n-1) and C_m instead of searching for the two closest clusters globally.
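The threshold test above can be sketched in Python as follows. The function name, the way clusters are labeled, and the positions in the example are all hypothetical; the sketch only evaluates the stated condition and does not reproduce the rest of the linkage method.

```python
def should_grow_previous(dist, prev, active, threshold):
    """Evaluate the rule above for the cluster merged in the previous iteration.

    dist(a, b) returns the distance between two clusters; active lists all
    current clusters, including prev. Returns (grow, closest_to_prev).
    """
    others = [c for c in active if c != prev]
    m = min(others, key=lambda c: dist(prev, c))  # closest cluster to prev
    rest = [c for c in active if c != m]
    p = min(rest, key=lambda c: dist(m, c))       # closest cluster to m
    return abs(dist(prev, m) - dist(m, p)) < threshold, m

# Hypothetical clusters placed on a line, with distance = gap between positions.
pos = {"AB": 0.5, "C": 2.1, "D": 10.0}
dist = lambda a, b: abs(pos[a] - pos[b])
print(should_grow_previous(dist, "AB", list(pos), threshold=1.0))  # (True, 'C')
```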

5.2 Uncheck a set of columns (or samples) that will not take part in the clustering process.

Users can exclude some columns (samples or experimental conditions) from clustering by unchecking them.

5.3 Specify whether users want to do a clustering on rows and/or columns (samples or experimental conditions)

If users uncheck "Cluster Rows", the data items (or rows) will not be clustered, but they will be sorted by the average values of the items.  If users uncheck "Cluster Columns", the original order of columns in the input file will be preserved (neither clustering nor sorting will be done). So, if the input is a time-series data set in which the order of columns is important, it is better to uncheck it because column clustering may change the order of columns.

5.4 Load precomputed similarity matrixes ("Load Similarity Matrix for Rows", "Load Similarity Matrix for Columns")

Users can prepare a precomputed similarity matrix if their favorite distance/similarity measure is not available in HCE. Then users can load the matrix into HCE using this dialog box.  The matrix should be in a tab-delimited text file, a comma-delimited text file, or an Excel file.  The file format should look like this table.

The first row should have a short description of the file.  The first column has identifiers for items. Since the matrix is symmetric, users can either fill the matrix in full or fill only the lower triangular part. Each numerical value represents the similarity between the corresponding row and column.  For example, 46 is the similarity value of 'Row3' and 'Row1'.  Later, HCE will read the similarity matrix and convert it into a distance matrix to do a clustering.  Users need to prepare separate similarity matrixes for rows and columns.
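The Python sketch below shows one way such a file could be read and mirrored into a full matrix. Several details are assumptions, since the example table is not reproduced here: the text is taken to be tab-delimited with no column-heading row, the diagonal is included, and the conversion to distances (largest similarity minus similarity) is a placeholder rather than HCE's documented rule. Apart from the value 46 for 'Row3' and 'Row1', the sample numbers are made up.

```python
def load_similarity(text):
    """Read a description line followed by rows of 'id<TAB>v1<TAB>v2...'."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    data = lines[1:]  # lines[0] is the short description
    ids = [ln.split("\t")[0] for ln in data]
    n = len(ids)
    sim = [[None] * n for _ in range(n)]
    for i, ln in enumerate(data):
        for j, cell in enumerate(ln.split("\t")[1:]):
            if cell.strip():
                sim[i][j] = float(cell)
                if sim[j][i] is None:
                    sim[j][i] = sim[i][j]  # mirror the lower triangle
    return ids, sim

def to_distance(sim):
    """Convert similarities to distances (assumed rule: max similarity - s)."""
    top = max(v for row in sim for v in row if v is not None)
    return [[0.0 if v is None else top - v for v in row] for row in sim]

sample = "similarity of rows\nRow1\t100\nRow2\t80\t100\nRow3\t46\t65\t100\n"
ids, sim = load_similarity(sample)
dist = to_distance(sim)
print(ids, sim[2][0], dist[0][2])
```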

5.5 Choose a node arrangement method

When merging two nodes (subtrees),

"Keep Right Child Redder" puts the subtree of higher average to the right side, so that users can see the increasing average from left to right.

"Keep Right Child Small" puts the small subtree (that has a small number of nodes in it) to the right side, so that users can see the stair-shape subtrees.

5.6 Choose a distance/similarity measure

The available distance/similarity measures are the Pearson correlation coefficient, Euclidean distance, and Manhattan distance.
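For reference, the textbook definitions of the three measures are sketched below in Python for two rows of equal length. How HCE handles missing values, weights, or the conversion of the Pearson coefficient into a distance is not covered by this sketch.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    """Pearson correlation coefficient, between -1 and 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean(x, y), manhattan(x, y), pearson(x, y))
```

The example rows have the same shape at different scales, so the Pearson coefficient is at its maximum while the two distances are clearly non-zero. This is why the choice of measure interacts with the normalization chosen in the preview dialog box.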

If users want to try a different combination of clustering parameters (linkage method, distance measure, and/or node arrangement method) without reloading the data set, click the clustering icon on the main tool bar, and select a different combination on the dialog box.

Note: The check box "Use P-Values as Weights" is for Affymetrix GeneChip experiment data. Please refer to the signal/noise analysis page for more detail.


6. Utilizing clustering results

6.1 Export/Import Clustering Results

File=>Export Clustering Result


 

Users can export the clustering result to tab-delimited text files.  It is saved to two or three text files: the row clustering result (.rcr), the column clustering result (.ccr), and the actual raw data file (.dat). The second file will not be generated if users don't cluster columns.  Users can import these files later for further analysis.

File=>Import Clustering Result

6.2 Save Current Dendrogram View

File=>Save Dendrogram
File=>Save Dendrogram As...

Users can save current dendrogram view to a true-color BMP file which is the standard image file format of Windows operating system.

6.3 Print Current Dendrogram View

File=>Print

Users can print current dendrogram view.

6.4 Copy Current Dendrogram View to Clipboard

Edit=>Copy

Users can copy the current dendrogram view to the clipboard and then paste it into another document.


7. Overview and Zoom

If the data set is so large that the resulting dendrogram won't fit on one screen, it is hard to see an overview of the entire result.  In HCE3, users can zoom in and out by using the horizontal double-sided slider bar.  Selected items (or genes) are marked with a triangle below the color mosaic.

Initial Overview Mode

Initially, HCE3 shows the entire dendrogram and color mosaic in the dendrogram view by averaging values of leaf nodes mapped into the same pixel.
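The averaging in overview mode can be pictured with the Python sketch below, which reduces one variable across all leaves to a fixed number of pixels. The pixel assignment rule (integer scaling of the leaf index) and the function name are assumptions; the guide only states that leaf values mapped to the same pixel are averaged.

```python
def overview_values(values, width):
    """Average consecutive leaf values that fall on the same pixel."""
    n = len(values)
    sums = [0.0] * width
    counts = [0] * width
    for i, value in enumerate(values):
        px = i * width // n  # assumed rule: pixel index of leaf i
        sums[px] += value
        counts[px] += 1
    # Pixels that receive no leaf (only possible when width > n) stay empty.
    return [s / c if c else None for s, c in zip(sums, counts)]

print(overview_values([1, 3, 5, 7, 9, 11], width=3))  # [2.0, 6.0, 10.0]
```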

Zoom-in

Users can adjust the double-sided slider bar to zoom in on a certain part of the dendrogram and color mosaic. The positions of the selected items are marked with tick markers on the slider bar so that users can easily position the slider to see the selected items.

Note: A double click on the slider bar will make the view return to the initial overview mode.


8. Switching between filter-out mode and gray-out mode

The items filtered out by the 'Minimum Similarity Bar' will be either hidden (filter-out mode) or shown in a desaturated color (gray-out mode). By toggling the button on the toolbar indicated by the red arrow, users can switch between the two modes.

Filter-out Mode

- hide items that were filtered out

Gray-out Mode

- desaturate the color bars for items that were filtered out


9. Using minimum similarity bar and detail cutoff bar

There are two dynamic control bars in the main view of HCE: the minimum similarity bar and the detail cutoff bar. Users can drag the minimum similarity bar to change the minimum similarity threshold so that only the subtrees satisfying the threshold will be shown. By using this bar, it is easy to determine the proper number of clusters. A double click on the minimum similarity bar will move the bar to the very top of the view so that there is only one cluster left.
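The effect of the minimum similarity threshold on the number of clusters can be sketched in Python as follows. The merge list is a hypothetical record of a clustering result, with one representative item from each side of every merge, and the function name is made up; the sketch keeps only the merges whose similarity meets the threshold.

```python
def clusters_at(n, merges, min_similarity):
    """Group items 0..n-1 using only merges with similarity >= min_similarity."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, similarity in merges:
        if similarity >= min_similarity:
            parent[find(a)] = find(b)
    groups = {}
    for item in range(n):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())

# Hypothetical dendrogram over four items: (item_a, item_b, similarity).
merges = [(0, 1, 0.9), (2, 3, 0.8), (0, 2, 0.3)]
print(clusters_at(4, merges, 0.5))  # [[0, 1], [2, 3]]
print(clusters_at(4, merges, 0.0))  # [[0, 1, 2, 3]]
```

Lowering the threshold to its minimum corresponds to moving the bar to the very top of the view, which leaves a single cluster.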

Users can also drag up the detail cutoff bar to ignore the detail expression level and see the global pattern of the clustering results. All subtrees (clusters) below the detail cutoff bar will be rendered using the average expression of the subtree. When it is not necessary, users can set the position of the detail cutoff bar to the bottom of the tree by double clicking on the bar.


10. Show/Hide/Select items on the dendrogram view

Users can select a cluster by a right mouse click on the color mosaic.  Users can also select an arbitrary part of the dendrogram by dragging the mouse on the color mosaic with the right button pressed.  The selected part is highlighted with a yellow bounding rectangle.  Then a popup menu shows up, and users can do the following. Note that a left mouse click on a subtree (not on the color mosaic) will select all genes in the subtree right away and highlight them in all views.

- Select : Select all items within the selection rectangle and highlight them in all views.

- Hide : hide the selected part. This is useful to filter out uninteresting or distracting part.

- Show Only This : show only the selected part. Users can concentrate on the part without distraction. Users can enlarge the image by using 'Color Bar Width' slider at 'Control' tab in the dynamic control dialog bar.

- Show All : show all items.

- Save Only This : Users can save the raw data of the selected items for further analysis.

As users move the mouse over the color mosaic, the bar below the mouse cursor will be highlighted with a yellow rectangle and the name of the bar (for a gene) will be shown (see the following figure).  When each row of the color mosaic is high enough, the name for each row (or column name) will be shown right next to the yellow selection rectangle.  Users can click on a bar to fix the selection until the bar is clicked one more time or another bar is clicked. A blue rectangle will be shown to indicate that the bar is selected and the selection is fixed.  Once a bar is selected and the selection is fixed, moving the mouse over the color mosaic will neither select a bar nor show the name of the bar.


11. Switch views between the main dendrogram view and color mosaic tab

Tools => Switch Views

If users only check "Cluster Columns" in the clustering dialog box, the column clustering result will be shown in the main dendrogram view right after clustering. Otherwise, the row clustering result will be shown in the main dendrogram view right after clustering. Afterwards, users can switch the two views (the main dendrogram view and the color mosaic tab) by clicking the toolbar icon or selecting the menu item.  The column clustering on the main view is useful for a signal/noise analysis for an Affymetrix project.

Rows clustering on the main dendrogram view

Columns clustering on the main dendrogram view

 

12. Comparison of two clustering results

Clustering => Compare Results

Users can compare two different clustering results. After loading a file, users can choose two different settings in the following dialog box.

Users can see the mapping of each item between the two different clustering results by double clicking a specific cluster on the first dendrogram.  As users move the cursor over the color mosaic, the mapping between the item under cursor and the corresponding item on the other dendrogram will be shown by a line connecting them.  Please note that hierarchical clustering is computationally too expensive to try this comparison method for a large dataset.  I recommend a small dataset that can fit into a screen in original view mode.

Note: this comparison method has not been improved since the previous version. Don't try this function with a large data file (e.g. >200x30).


13. Dynamic Control Dialog Bar (Vertical Dialog Bar)

13.1 Control Tab

Users who have color deficiencies or who desire different color palettes for their monitors/projectors can change color settings using this control.

- Default color mapping is green, black, and red.
- Choose between 3-color mode and 1-color mode. 
- See the histogram for the whole data set.
- By dragging the green, black, or red vertical lines, users can change the color mapping.
- A right mouse click on a color boundary will pop up the color selection dialog, so users can customize the corresponding color.

Users can change the bar height of the terminal nodes of the dendrogram so that the color mosaic height can range from zero to half of the dendrogram view height.
If users want to hide 'Minimum Similarity Bar', 'Detail Cutoff Bar', 'Clustering Information', or 'Color Scale Bar' (at the top right corner of the dendrogram view)  from dendrogram view, users can just uncheck the corresponding check box.  It might be useful when capturing a picture of a neat dendrogram.

Users can also select their favorite color for the selection markers (triangle marker) which are used to highlight the selected items in the dendrogram view, scatterplot views, histogram views, and other views.

 

13.2 Detail Tab

List control A shows the selected items. The highlighted item is the one under the cursor in the main dendrogram view, whose detail is also shown in list control C.  Users can save the raw data of these selected items in a text file by clicking the 'save' button at the top left corner of this tab.  The number of selected items is shown right next to the 'save' button.

List control C shows all values of the item currently under the cursor in the main dendrogram view or in other views. Every mouse move on the color mosaic shows the information of the item under the cursor in this list control.

Users can drag the bar (B) to adjust the share of list controls, A and C.  A click on the up arrow will move the bar to the top so that the list control C will occupy most space.  Similarly a click on the down arrow will move the bar to the bottom so that the list control A will occupy most space. A double click anywhere on the bar except on the two arrows will move the bar to the middle of the view so that the two list controls share the same amount of space.

13.3 Evaluation Tab

HCE3 implements a very naive evaluation method that shows the average within-cluster distance and the average between-cluster distance of the current clustering results.  A clustering is better when the within-cluster distance is smaller and the between-cluster distance is bigger.
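One plausible reading of the two quantities is sketched below in Python: the average distance over all pairs of items that share a cluster, and the average over all pairs that do not. The exact definition used by HCE is not given in this guide, so the pairwise form, the function name, and the sample data are assumptions.

```python
from itertools import combinations

def evaluate(items, labels, dist):
    """Return (average within-cluster distance, average between-cluster distance)."""
    within, between = [], []
    for i, j in combinations(range(len(items)), 2):
        d = dist(items[i], items[j])
        (within if labels[i] == labels[j] else between).append(d)

    def average(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return average(within), average(between)

# Hypothetical 1D items in two clusters.
items = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
print(evaluate(items, labels, lambda a, b: abs(a - b)))  # (1.0, 10.0)
```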

 


14. Information Dialog Bar (Horizontal Dialog Bar)

14.1 Color Mosaic Tab

This tab shows the color mosaic of the hierarchical clustering result without the items filtered out by the minimum similarity bar.  Users can scroll horizontally to see the entire color mosaic, and scroll vertically to see long item names.  If users clustered columns (experimental conditions, or samples) by checking the 'Cluster Columns' check box in the clustering dialog box, users can also see another dendrogram as shown in the left example. As users can see in this example, HCE3 highlights the selected items by showing their names on a yellow background. Users can see this column clustering result in the main dendrogram view by clicking the toolbar icon or choosing Tools => Switch Views.

When users click on a cluster in the dendrogram view, the names of corresponding items are highlighted in a yellow background, and at the same time, the sample name and the small dendrogram attach to the right or left side of the selected group.

Users can also drag the sample name and the small dendrogram combination next to any item so that users can easily know the column names of items.

By a right mouse button click, users see a popup menu with 3 menu items (Print Current View, Copy Current View, and Save Current View).  Users can print current view, copy current view to clipboard to paste it to other document, or save current view to a bmp file.

Users can drag the minimum similarity bar to determine the right clustering resolution as they do in the dendrogram view.

14.2 Table View Tab

The tabular view is interactively coordinated with other views in HCE 3.0. It shows the input dataset in a simple table.  If users select a group of items in other views, rows for the selected items are highlighted in the tabular view. Each row represents an item and each column represents a variable or an annotation from an external knowledge source.

Users can choose to see either only the selected items or all items in the table with the selected items highlighted. When the "Show Selected Data Only" check box is unchecked, as in the above figure, all items will be shown in the table, and the selected items are highlighted with a light yellow background.  The locations of selected items will also be marked by colored lines shown right next to the scrollbar so that users can easily scroll to the selected items. The selection marker color will change when users change the color in the control tab (see 13.1).

Link a column to a web database

If there is an available web database for the data set, users can specify a URL template for each column to link it to the web database, so that they can look up the database for a cell in that column.  If users right-click on a column header, the following input dialog box pops up.  Users enter a search string consisting of the URL with %s in place of the search term. Afterwards, right-clicking on a cell and selecting a value in the cell will launch a web browser, which opens the web database specified by the URL and shows the search result for the selected value.

 

For example, there are many web databases for biologists.  Here are example URL templates for some biological databases.

Web database  URL template
UniGene http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene&cmd=search&term=%s
LocusLink http://www.ncbi.nlm.nih.gov/LocusLink/LocRpt.cgi?l=%s
SwissProt http://srs.ebi.ac.uk/srsbin/wgetz?-newId+[SWALL-AllText:%s]+-lv+30+-view+SeqSimpleView+-page+qResult
Full Length Ref. Sequences http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&db=Nucleotide&term=%s
 

For example, if  a column contains UniGene identifiers, users can right-click on the column and enter the URL template,  http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene&cmd=search&term=%s.

Remember that %s will be replaced by a value in the cell where users right-click in the table view.
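
The substitution can be sketched in a few lines of Python. This is only an illustration of the idea, not HCE's own code: the function name build_lookup_url and the identifier "Mm.1234" are hypothetical, and percent-encoding the cell value is an assumption about what a robust implementation would do.

```python
from urllib.parse import quote

def build_lookup_url(template: str, cell_value: str) -> str:
    """Replace the %s placeholder in a URL template with a cell value."""
    if "%s" not in template:
        raise ValueError("URL template must contain %s")
    # Assumption: the value is percent-encoded so that spaces and
    # symbols in a cell do not break the URL.
    return template.replace("%s", quote(cell_value, safe=""), 1)

# URL template for UniGene taken from the table above.
UNIGENE = "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene&cmd=search&term=%s"

# "Mm.1234" is a made-up identifier used only for illustration.
print(build_lookup_url(UNIGENE, "Mm.1234"))
```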

Download and attach an annotation file from Affymetrix

If the data set is Affymetrix GeneChip data, users can download an annotation file for their data from the Affymetrix web site. Please note that HCE only accepts annotation files in the CSV format.  For example, if users used the "Mouse Genome 430A 2.0 Array", they can download the annotation file for their data at http://www.affymetrix.com/Auth/analysis/downloads/taf/Mouse430A_2_annot_csv.zip (login required).  After annotation, many columns from the annotation file are added to the table view. At the same time, gene ontology information (see 14.6) is extracted from the annotation and used in the gene ontology tab.

Knowledge integration

By carefully looking at the annotations for the selected item in the table view and looking them up in the corresponding databases, users can gain more insight into the items by utilizing the domain knowledge from the databases. Conversely, if users select a group of rows in the tabular view, the selected items are also highlighted in other views. For example, after sorting by a column and selecting rows with the same value in the column, users can easily verify how closely those items are grouped together in the dendrogram view.

14.3 Histogram Ordering Tab

HCE implements an interface framework (the rank-by-feature framework) in which users can systematically examine each variable (or column) in the data set by sorting the variables according to some ranking criterion. The main display for the rank-by-feature framework for 1D projections shows a combined histogram and boxplot. The interface consists of four coordinated parts: control panel, score overview, ordered list, and histogram browser.


All 1D histograms are ordered according to the current order criterion (A) in the ordered list (C). The score overview (B) shows an overview of scores of all histograms. A mouseover event activates a cell in the score overview, highlights the corresponding item in the ordered list (C) and shows the corresponding histogram in the histogram browser (D) simultaneously. A click on a cell selects the cell and the selection is fixed until another click event occurs. A selected histogram is shown in the histogram browser (D), where users can easily traverse histogram space by changing the dimension for the histogram using item slider. A boxplot is also displayed above the histogram to show the graphical summary of the distribution of the dimension.

Starting a systematic exploration of 1D projections by ranking them

Users can select a ranking criterion from a combo box in the control panel, and then they see the overview of scores for all dimensions in the score overview according to the selected ranking criterion. All dimensions are aligned from top to bottom in the original order, and each dimension is color-coded by the score value. By default, cells of high value have bright red colors and cells of low value have bright green colors. The cell of middle value is black. As a value gets closer to the middle value, the color intensity attenuates. Users can change the colors for minimum, middle, and maximum values by a right mouse click on the corresponding part of the color scale bar. The color scale and mapping are shown at the top right corner of the overview (B). Users can easily see the overall pattern of the score distribution, and more importantly they can preattentively identify the dimension of the highest/lowest score in this overview. Once they identify an interesting row on the score overview, they can just mouse over the row to view the numerical score value and the name of the dimension in a tooltip window.
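
The default color coding described above can be sketched as follows. This is a minimal illustration that assumes a linear ramp on each side of the middle value and 8-bit RGB colors; the function name score_to_rgb is hypothetical and the exact mapping used inside HCE may differ.

```python
def score_to_rgb(score, lo, mid, hi):
    """Map a score to (r, g, b): bright green at lo, black at mid, bright red at hi."""
    if score >= mid:
        span = hi - mid
        t = (score - mid) / span if span else 0.0
        t = min(max(t, 0.0), 1.0)  # clamp scores outside [lo, hi]
        return (round(255 * t), 0, 0)
    span = mid - lo
    t = (mid - score) / span if span else 0.0
    t = min(max(t, 0.0), 1.0)
    return (0, round(255 * t), 0)

# Intensity attenuates as the score approaches the middle value.
for s in (-1.0, -0.2, 0.0, 0.2, 1.0):
    print(s, score_to_rgb(s, lo=-1.0, mid=0.0, hi=1.0))
```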

Examining the ranking result (Score overview, Ordered List)

The mouseover event is also instantaneously relayed to the ordered list and the histogram browser, so that the corresponding list item is highlighted in the ordered list and the corresponding histogram and boxplot are shown in the histogram browser. The score overview, the ordered list, and the histogram browser are interactively coordinated according to the change of the dimension in focus. In other words, a change of dimension in focus in one of the three components leads to the instantaneous change of dimension in focus in the other two components.

In the ordered list, users can see the numerical detail about the distribution of each dimension in an orderly manner. The numerical detail includes the five-number summary of each dimension and the mean and the standard deviation. The numerical score values are also shown at the third column whose background is color-coded using the same color-mapping as in the score overview. While numerical summaries of distributions are very useful, sometimes they are misleading. For example, when there are two peaks in a distribution, neither the median nor the mean explains the center of the distribution. This is one of the cases for which a graphical representation of a distribution (e.g., a histogram) works better.

Manually traversing histograms (Histogram Browser)

In the histogram browser, users can see the visual representation of the distribution of a dimension at a time. A boxplot is a good graphical representation of the five-number summary, which together with a histogram provides an informative visual description of a dimension’s distribution. It is possible to interactively change the dimension in focus just by dragging the item slider attached to the bottom of the histogram.

Ranking criteria

Since different users may be interested in different features in the data sets, it is desirable to allow users to customize the available set of ranking criteria. As a starting point, however, we have chosen the following ranking criteria that we think are fundamental and common for histograms, and we have implemented them in HCE3:

Normality of the distribution (0 to inf)
Many statistical analysis methods such as t-test, ANOVA are based on the assumption that the data set is sampled from a Gaussian normal distribution. Therefore, it is useful to know the normality of the data set. Since a distribution can be nonnormal due to many different reasons, there are at least ten statistical tests for normality including Shapiro-Wilk test and Kolmogorov-Smirnov test. We used the omnibus moments test for normality in the current implementation. The test returns two values, skewness (s) and kurtosis (k). Since s is 0 and k is 3 for a standard normal distribution, we calculate |s|+|k-3| to measure how the distribution deviates from the normal distribution and rank variables according to the measure. Users can confirm the ranking result using the histogram browser to gain an understanding of how the distribution shape deviates from the familiar bell-shaped normal curve.
Uniformity of the distribution (0 to inf)
For the uniformity test, we used an information-based measure called entropy. Given k bins in a histogram, the entropy of a histogram H is $\mathrm{entropy}(H) = -\sum_{i=1}^{k} p_i \log_2 p_i$, where $p_i$ is the probability that an item belongs to the i-th bin. High entropy means that values of the dimension are from a uniform distribution and the histogram for the dimension tends to be flat. While knowing that a distribution is uniform is helpful for understanding the data set, it is sometimes more informative to know how far a distribution deviates from the uniform distribution, since a biased distribution sometimes reveals interesting outliers.
The number of potential outliers (0 to n)
To count outliers in a distribution, we used the 1.5*IQR criterion (IQR, the interquartile range, is the difference between the first quartile (Q1) and the third quartile (Q3)), which is the basis of a rule of thumb in statistics for identifying suspected outliers. An item of value d is considered a suspected (mild) outlier if d > (Q3+1.5*IQR) or d < (Q1-1.5*IQR). To get more restricted outliers (or extreme outliers), the 3*IQR range can be used. It is also possible to use an outlier detection algorithm developed in data mining. Outliers are one of the most important features, not only as noisy signals to be filtered but also as a truly unusual response to a medical treatment worth further investigation or as an indicator of credit card fraud.
The number of unique values (0 to n)
At the beginning of the data analysis, it is useful to know how many unique values are in the data. Only a small number of unique values in a large set may indicate problems in sampling, data collection, or transcription. Sometimes it may also indicate that the data is categorical or that the data was quantized. Special treatment may be necessary to deal with categorical or quantized variables.
Size of the biggest gap (0 to max range of dimensions)
Gap is an important feature that can reveal separation of data items and modality of the distribution. Let t be a tolerance value, n be the number of bins, and mf be the maximum frequency. We define a gap as a set of contiguous bins {bk} where bk (k=0 to n) has less than t*mf items. The procedure sequentially visits each bin and merges the satisfying bins to form a bigger set of such bins. It is a simple and fast procedure. Among all gaps in the data, we rank histograms by the biggest gap in each histogram. Since we use equal-sized bins, the biggest gap has the most bins satisfying the tolerance value t.
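
The 1D criteria above can be approximated with a short NumPy sketch. It is an illustration under stated assumptions, not the HCE implementation: the function names, the bin count of 10, and the tolerance t=0.1 are hypothetical; the normality score uses plain sample moments (the exact omnibus test statistic used in HCE is not specified here); the entropy uses a base-2 logarithm; quartiles use NumPy's default interpolation; and the gap is reported as a number of bins rather than a range of values.

```python
import numpy as np

def normality_score(x):
    """|s| + |k - 3| from moment-based skewness s and kurtosis k (non-constant data)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    s = np.mean(d ** 3) / m2 ** 1.5
    k = np.mean(d ** 4) / m2 ** 2
    return abs(s) + abs(k - 3.0)

def histogram_entropy(x, bins=10):
    """Entropy of the histogram; empty bins contribute nothing."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def count_outliers(x, factor=1.5):
    """Count items outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return int(np.sum((x > q3 + factor * iqr) | (x < q1 - factor * iqr)))

def count_unique(x):
    """Number of distinct values in the dimension."""
    return int(np.unique(np.asarray(x)).size)

def biggest_gap(x, bins=10, t=0.1):
    """Longest run of contiguous bins holding fewer than t * (max frequency) items."""
    counts, _ = np.histogram(x, bins=bins)
    threshold = t * counts.max()
    best = run = 0
    for c in counts:
        run = run + 1 if c < threshold else 0
        best = max(best, run)
    return best
```

Passing factor=3 to count_outliers gives the more restrictive (extreme) outlier count mentioned above. Multiplying the result of biggest_gap by the bin width would express the gap in the units of the dimension.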

14.4 Scatterplot Ordering Tab

Analogous to the interface for 1D projections, the interface consists of four coordinated components: control panel, score overview, ordered list, and scatterplot browser.


All 2D scatterplots are ordered according to the current ordering criterion (A) in the ordered list (C). Users can select multiple scatterplots at the same time and generate separate scatterplot windows for them to compare them on one screen. The score overview (B) shows an overview of scores of all scatterplots. A mouseover event activates a cell in the score overview, highlights the corresponding item in the ordered list (C) and shows the corresponding scatterplot in the scatterplot browser (D) simultaneously. A click on a cell selects the cell and the selection is fixed until another click event occurs. A selected scatterplot is shown in the scatterplot browser (D), where it is also easy to traverse scatterplot space by changing the X or Y axis using item sliders on the horizontal or vertical axis.

Starting a systematic exploration of 2D projections by ranking them

Users select an ordering criterion in the control panel on the left, and then they see the complete ordering of all possible 2D projections according to the selected ordering criterion (Figure 3A). The ordered list shows the result of ordering sorted by the ranking (or scores) with scores color-coded on the background. Users can click on any column header to sort the list by the column. Users can easily find scatterplots of the highest/lowest score by changing the sort order to ascending or descending order of score (or rank). It is also easy to examine the scores of all scatterplots with a certain variable for horizontal or vertical axis after sorting the list according to X or Y column by clicking the corresponding column header.

Examining the ranking result (Score overview, Ordered List)

However, users cannot see an overview of all relationships between variables at a glance in the ordered list. Overviews are important because they can show the whole distribution and reveal interesting parts of data. We have implemented a new version of the score overview for 2D projections. It is an m-by-m grid view where all dimensions are aligned in the rows and columns. Each cell of the score overview represents a scatterplot whose horizontal and vertical axes are dimensions at the corresponding column and row respectively. Since this table is symmetric, we used only the lower-triangular part for showing scores and the diagonal cells for showing the dimension names as shown in (B). Each cell is color-coded by its score value using the same mapping scheme as in 1D ordering. Users can change the colors for minimum, middle, and maximum values by a right mouse click on the corresponding part of the color scale bar. As users move the mouse over a cell, the scatterplot corresponding to the cell is shown in the scatterplot browser simultaneously, and the corresponding item is highlighted in the ordered list (C). Score overview, ordered list, and scatterplot browser are interactively coordinated according to the change of the dimension in focus as in the 1D interface. In the score overview, users can preattentively detect the highest/lowest scored combinations of dimensions thanks to the linear color-coding scheme and the intuitive grid display. Sometimes, users can also easily find a dimension that is the least or most correlated to most of the other dimensions by just locating a whole row or column where the cells are mostly bright green or bright red. It is also possible to find an outlying scatterplot whose cell has distinctive color intensity compared to the rest of the same row or column.
After locating an interesting cell, users can click on the cell to select, and then they can scrutinize it on the scatterplot browser and on other tightly coordinated views in HCE.

Manually traversing scatterplots (Scatterplot Browser)

While the ordered list shows the numerical score values of relationships between two dimensions, the interactive scatterplot browser best displays the relationship graphically. In the scatterplot browser, users can quickly take a look at scatterplots by using item sliders attached to the scatterplot view. Simply by dragging the vertical or horizontal item slider bar, users can change the dimension for the horizontal or vertical axis. With the current version implemented in HCE, users can investigate multiple scatterplots at the same time. They can select several scatterplots in the ordered list by clicking on them with the control key pressed. Then, click “Make Views” button on the top of the ordered list, and each selected scatterplot is shown in a separate child window. Users can select a group of items by dragging a rubber rectangle over a scatterplot, and the items within the rubber rectangle are highlighted in all other views. On some scatterplots they might gather tightly together, while on other scatterplots they scatter around.

Ranking Criteria

Again, interesting ranking criteria might differ from user to user, or from application to application. Initially, we have chosen the following six ranking criteria that we think are fundamental and common for scatterplots, and we have implemented them in HCE. The first three criteria are useful to reveal statistical (linear or quadratic) relationships between two dimensions (or variables), and the next three are useful to find scatterplots of interesting distribution.

Correlation coefficient (-1 to +1)
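For the first criterion, we use Pearson's correlation coefficient (r) for a scatterplot (S) with n points, defined as $r(S) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \sum_{i=1}^{n}(y_i-\bar{y})^2}}$, where $(x_i, y_i)$ is the i-th point and $\bar{x}$ and $\bar{y}$ are the means of the two dimensions.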
Pearson’s r is a number between -1 and 1. The sign tells us direction of the relationship and the magnitude tells us the strength of the linear relationship. The magnitude of r increases as the points lie closer to the straight line. Linear relationships are particularly important because straight line patterns are common and simple to understand. Even though a strong correlation between two variables doesn’t always mean that one variable causes the other, it can provide a good clue to the true cause, which could be another variable. Moreover, dimensionality can be reduced by combining two strongly correlated dimensions, and visualization can be improved by juxtaposing correlated dimensions. As a visual representation of the linear relationship between two variables, the line of best fit or the regression line is drawn over scatterplots.
Least square error for curvilinear regression (0 to 1)
This criterion sorts scatterplots in terms of least-square errors from the optimal quadratic curve fit so that users can easily isolate ones where all points are closely/loosely arranged along a quadratic curve. Users are often interested in finding nonlinear relationships in the data set in addition to linear relationships. For example, economists might expect that there is a negative linear relationship between county income and poverty, which is easily confirmed by correlation ranking. However, they might be intrigued to discover that there is a quadratic relationship between the two, which can be easily revealed using this criterion.
Quadracity (0 to inf)
If two variables show a strong linear relationship, they also produce a small error for curvilinear regression because a linear relationship is a special case of the quadratic relationship, where the coefficient of the highest degree term (x^2) equals zero. To emphasize the real quadratic relationships, we add the “Quadracity” criterion. It ranks scatterplots according to the coefficient of the highest degree term, so that users can easily identify ones that are more quadratic than others. Of course, the least square error criterion should be considered to find more meaningful quadratic relationships, but users can easily see the error by viewing the fitting curve and points in the scatterplot browser.
The number of potential outliers (0 to n)
Even though there is a simple statistical rule of thumb for identifying suspected outliers in 1D, there is no simple counterpart for 2D cases. Instead, there are many outlier detection algorithms developed by data mining and database researchers. Among them, distance-based outlier detection methods such as DB-out define an object as an outlier if at least a fraction p of the objects in the data set lie at a distance greater than a threshold value from the object. Density-based outlier detection methods such as the LOF-based method define an object as an outlier if the relative density in the local neighborhood of the object is less than a threshold, in other words if the local outlier factor (LOF) of the object is greater than a threshold. Since the LOF-based method is more flexible and dynamic in terms of the outlier definition and detection, we included the LOF-based method in the current implementation.
The number of items in the region of interest (0 to n)
This criterion is the most interactive since it requires users to specify a (rectangular, elliptical, or free-formed) region of interest. Then the algorithm uses the number of items in the region to order scatterplots so that users can easily find ones with most/least number of items in the given region. An interesting application of this ranking criterion is when a user specifies an upper left or lower right corner of the scatterplot. Users can easily identify scatterplots where most/least items have low value for one variable (e.g. salary of a baseball player) and high value for the other variable (e.g. the batting average).
Uniformity of scatterplots (0 to inf)
For this criterion, we calculate the entropy in the same way as we did for histograms, but this time we divide the two-dimensional space into regular grid cells and then use each cell as a bin. For example, if we have generated a k-by-k grid, the entropy of a scatterplot S is $\mathrm{entropy}(S) = -\sum_{i=1}^{k}\sum_{j=1}^{k} p_{ij} \log_2 p_{ij}$, where $p_{ij}$ is the probability that an item belongs to the cell at (i, j) of the grid.
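
Three of the 2D criteria above (correlation, the quadratic fit, and uniformity) can be sketched with NumPy as follows. This is a rough illustration rather than the HCE code: the function names and the default grid size k=10 are hypothetical, the entropy uses a base-2 logarithm, and the quadratic fit returns the raw sum of squared residuals, so how HCE normalizes the error into the 0 to 1 range is not reproduced here. The LOF-based outlier count and the region-of-interest count are not covered.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

def quadratic_fit(x, y):
    """Least-squares fit of y = a*x^2 + b*x + c.

    Returns (a, sse): the coefficient of the highest degree term
    (the basis of the "Quadracity" ranking) and the sum of squared residuals.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a, b, c = np.polyfit(x, y, 2)
    residuals = y - (a * x ** 2 + b * x + c)
    return float(a), float((residuals ** 2).sum())

def scatterplot_entropy(x, y, k=10):
    """Entropy over a k-by-k grid of equal-sized cells; empty cells contribute nothing."""
    counts, _, _ = np.histogram2d(x, y, bins=k)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
print(pearson_r(xs, [2 * v + 1 for v in xs]))  # points on a rising straight line
print(quadratic_fit(xs, [v ** 2 for v in xs]))  # points on the curve y = x^2
```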

Change options for scatterplot displays

Show density:

Right-click on a scatterplot and select "Show density" to see the density plot for the scatterplot.  A 100x100 grid shows the density by color-coding each grid cell.  The brighter a grid cell is, the denser the cell is.

Properties:

Right-click on a scatterplot and select "Properties" to change some display options for scatterplots.  On the dialog box shown to the left, users can change the marker size, shape, and color.  The marker color can change dynamically according to the average of the value for X-axis and the value for Y-axis using the global color mapping scheme, or it can be static.

14.5 Profile Search Tab

HCE3 offers a parallel coordinates view to help users compare the patterns of clusters and find more interesting patterns interactively.  'Dynamic Query' is the fundamental technique for Profile Search, which means users can specify search patterns on the view itself by mouse dragging, and see query results instantaneously. The following figure shows the overall layout of the Profile Search tab. The Profile Search tab consists of four parts:  (A) the information space where input profiles are drawn and queries are specified, (B) the selection indicator to visualize the number of selected items, (C) the threshold slider to specify similarity thresholds, and (D) the control panel to specify query parameters.

Basic operations

Users specify a search pattern by simple mouse drags. As they drag the mouse over the information space, the intersection points of the mouse cursor and the vertical time lines define control points. Existing control points, if any, at the intersecting vertical time lines are updated to reflect the dragging. A search pattern is a set of line segments connecting the contiguous control points specified. Users choose a search method and a similarity measure on the control panel. They can change the current search pattern by dragging a control point, by dragging a line segment vertically or horizontally, or by adding or removing control points. All modifications are done by mouse clicks or drags, and the results are updated instantaneously. This integration of the space where the data is shown and where the search pattern is composed reduces users' cognitive load by removing the overhead of context switching between two different spaces.

The light gray polygonal area clearly shows the range of the entire set of profiles (Data Silhouette). If the number of selected profiles is less than or equal to 200, each profile (or gene) is drawn as a polyline.  As users mouse over a profile, the profile is highlighted in red.

If the number of selected profiles is greater than 200, a dark gray polygonal area is drawn instead to avoid the visual clutter caused by too many solid lines. The dark gray polygonal area clearly shows the range of the selected profiles.

 

Users see the solid lines in the Information Space, each of which represents a profile of an item (gene).  In this space, users can also submit a query just by dragging the mouse.  The result of the query is shown interactively in the same space.  Users can modify the query easily by moving a point vertically or by moving a line segment vertically or horizontally.  Users can delete a certain part of the model pattern by dragging the mouse with the left control key pressed or after pressing the 'Delete' button.  The 'Clear ALL' button lets users return to the initial state.

Query refinement

The Profile Search tab supports sequential query refinements. Users can submit a new query over the current query result. If users click the “Pin This Result” button after submitting a query, the query result becomes a new narrowed search space. We call this “pinning.” Pinning enables sequential query refinement, which makes it easy to find target patterns without losing the focus of the current analysis process. For example, if users click on a cluster in the dendrogram view, all items in the cluster are shown in the parallel coordinates view. By pinning this result, users can limit the search to the cluster to isolate more specific patterns in the cluster.  If 'Show Silhouette' is checked, users always see the range of all profiles in the search space in the form of a gray shadowed polygon. Users can refine the query by submitting a new query over the pinned result set.   Users can reset the information space to the original full set by clicking the 'Consider All Profiles' button.

The following three kinds of queries are possible in this tab.

Model-based query

Users can specify a model pattern simply by dragging the mouse with the left button pressed as shown in the following figure.  Another way to specify a model pattern is to make an existing pattern a model pattern.  Right-click on a profile and select "Make it a model pattern" in the popup menu, and the pattern turns bold red to become a model pattern.  Users can delete any part of the model pattern that they don't care about.

Users can use 3 different distance measures and assign threshold values.  All profiles satisfying the threshold range are interactively shown in the Information Space.  For example, the previous figure shows the profiles of items that are 90.1 percent or more similar to the red model pattern in terms of the Pearson correlation coefficient. Users can move the entire model pattern by dragging on a line segment, or move a control point by dragging it.  The 3 measures are the Pearson correlation coefficient, the Euclidean distance, and the Absolute distance from each control point. Assume users select the Absolute distance measure and the threshold values are 0 and 60 (see the following figure). If the distance between each point of a profile and its corresponding control point of the model pattern is between 0 and 60, the profile is selected as a result. It's like selecting profiles that flow through an equi-width (60) pipe whose center line is the model pattern. The light yellow shadow clearly shows the satisfying region.
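
A model-based query with the Absolute distance and Pearson correlation measures can be sketched as below. This is only a sketch under assumptions: the function names and the sample profiles are hypothetical, the model pattern is represented as a mapping from time-point index to value at each control point, and the similarity threshold is applied directly to the correlation coefficient (for instance r >= 0.901), which may not be exactly how HCE converts the slider value.

```python
import numpy as np

def absolute_distance_query(profiles, model, lo, hi):
    """Indices of profiles whose distance to the model lies in [lo, hi] at every control point."""
    profiles = np.asarray(profiles, dtype=float)
    cols = sorted(model)  # time points that carry a control point
    target = np.array([model[c] for c in cols], dtype=float)
    dist = np.abs(profiles[:, cols] - target)
    mask = np.all((dist >= lo) & (dist <= hi), axis=1)
    return np.flatnonzero(mask).tolist()

def pearson_query(profiles, model, min_r):
    """Indices of profiles whose correlation with the model is at least min_r."""
    profiles = np.asarray(profiles, dtype=float)
    cols = sorted(model)
    target = np.array([model[c] for c in cols], dtype=float)
    hits = []
    for i, row in enumerate(profiles[:, cols]):
        if row.std() == 0 or target.std() == 0:
            continue  # correlation is undefined for a flat profile
        if np.corrcoef(row, target)[0, 1] >= min_r:
            hits.append(i)
    return hits

# Three made-up profiles over three time points.
profiles = [[100, 200, 300], [100, 500, 300], [130, 190, 340]]
model = {0: 100, 1: 200, 2: 300}
print(absolute_distance_query(profiles, model, 0, 60))
print(pearson_query(profiles, model, 0.901))
```

Because only the time points that carry a control point are compared, deleting part of the model pattern simply removes those time points from the comparison.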

Ceil-and-Floor query

It is possible to define ceilings and floors on Information Space so that only the profiles below ceilings and above floors are shown as a result. The following figure shows a simple example of a ceil-and-floor query.  Users can specify a ceiling by left mouse button, and a floor by right mouse button. Users can move each individual line segment or control point to change ceiling and floor.
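
The filtering logic of a ceil-and-floor query can be illustrated with the following sketch. It simplifies the interface described above: ceilings and floors are given as bounds at individual time points rather than as draggable line segments, the bounds are treated as inclusive, and the function name and sample data are hypothetical.

```python
import numpy as np

def ceil_and_floor_query(profiles, ceiling=None, floor=None):
    """Indices of profiles that stay at or below every ceiling and at or above every floor.

    ceiling and floor map a time-point index to a bound at that time point.
    """
    profiles = np.asarray(profiles, dtype=float)
    mask = np.ones(len(profiles), dtype=bool)
    for col, bound in (ceiling or {}).items():
        mask &= profiles[:, col] <= bound
    for col, bound in (floor or {}).items():
        mask &= profiles[:, col] >= bound
    return np.flatnonzero(mask).tolist()

# Made-up profiles: only the first one is under the ceiling and over the floor.
data = [[1, 5, 3], [1, 9, 3], [0, 5, 3]]
print(ceil_and_floor_query(data, ceiling={1: 6}, floor={0: 1}))
```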

Search-by-Name query

Users can type in a string to find items whose name (or description) contains the string.  Searches are done either incrementally or not.  For example,  if users want to find items whose name contains the word "muscle", when users type 'm' only the items containing 'm' in their name will be shown.  As users type in 'u', the result will be updated to show only the items whose name has the substring "mu".
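
The incremental behavior amounts to re-running a substring filter after every keystroke, as in this small sketch. The item names are made up, and case-insensitive matching is an assumption rather than documented HCE behavior.

```python
def search_by_name(names, query):
    """Return the names that contain the query string (case-insensitive, by assumption)."""
    q = query.lower()
    return [name for name in names if q in name.lower()]

# Hypothetical item names; the result narrows as more characters are typed.
names = ["muscle actin", "myosin", "tubulin"]
for typed in ("m", "mu", "mus"):
    print(typed, search_by_name(names, typed))
```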

A good combination of a search-by-name query and a model-based query is to search for an item (or a gene) using the search-by-name query and then make one of the search results a model pattern by right-clicking on it and selecting "Make it a model pattern."  By revising the new model pattern and threshold values, users can easily find a group of items similar to a known item.  Interactive coordination with the dendrogram view also enables users to check whether the items are in the same or a similar cluster.

14.6 Gene Ontology Tab

How to prepare the data with GO (Gene Ontology) information

If the input file contains microarray experiment data and gene ontology information is available for the genes, users can utilize this tab. GO IDs can be entered as in the sample file shown in the first section of this manual.  If a gene has more than one GO ID, the GO ID field should be the concatenation of all its GO IDs, for example GO:0004725GO:0005001 if a gene has the two GO IDs GO:0004725 and GO:0005001.  If the data set is from Affymetrix GeneChips, users can annotate each row (gene) with GO IDs by clicking the "Annotation" button and loading an appropriate annotation file as they do in the Table View (14.2).  If not, users need to annotate each gene manually by looking up and joining web databases.
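
When preparing the input file by hand, the concatenated GO ID field can be built and checked with a few lines of Python. This sketch is not part of HCE; the function names are hypothetical, and it relies on the fact that a GO ID is "GO:" followed by seven digits.

```python
import re

# A GO ID is "GO:" followed by seven digits.
GO_ID = re.compile(r"GO:\d{7}")

def join_go_ids(go_ids):
    """Concatenate GO IDs into a single field with no separator."""
    return "".join(go_ids)

def split_go_ids(field):
    """Recover the individual GO IDs from a concatenated field."""
    return GO_ID.findall(field)

field = join_go_ids(["GO:0004725", "GO:0005001"])
print(field)
print(split_go_ids(field))
```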

Users can select a cluster in the dendrogram view. Genes in the cluster are shown in the gene list control at the bottom right corner.  The data set shown is in vivo murine muscle regeneration expression profiling data using Affymetrix U74Av2 (12,488 probe sets) chips measured in 27 time points.

[Figure: the Gene Ontology tab, with its three parts labeled Ontology Tree Control, Control Buttons, and Gene List Control]

Ontology Tree Control

The gene ontology hierarchy is a directed acyclic graph (DAG), but we use a tree structure to show the hierarchy since the tree structure is easier for users to understand and easier for developers to implement than a DAG. Thus, a GO term may appear several times in different branches, but the path from the root to a node is unique.  All paths to the GO IDs selected in the gene list control are shown in this control. The selected GO IDs are highlighted in blue and marked with a red flag icon. Each node has a number in parentheses, which represents the number of genes that have the GO ID of the node or of any descendant of the node.  When users click the "Load Ontology" button to look at the whole gene ontology hierarchy, the number in parentheses represents the number of genes in the whole data set.  When users click either the "<-ALL" or the "<-Selected" button to look at the selected part of the hierarchy, the number in parentheses represents the number of genes among the selected genes.
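
The number shown next to a node can be thought of as the size of the set of genes annotated with that node or with anything below it. The following sketch illustrates this with a tiny made-up hierarchy; the function name, the node names, and the choice to count each gene only once per node are assumptions, not a description of HCE's internals.

```python
def genes_under(node, children, annotations):
    """Set of genes annotated with `node` or with any of its descendants."""
    genes = set(annotations.get(node, ()))
    for child in children.get(node, ()):
        genes |= genes_under(child, children, annotations)
    return genes

# Hypothetical hierarchy: root -> A -> C, root -> B.
children = {"root": ["A", "B"], "A": ["C"]}
# Hypothetical annotations: gene g1 is annotated with both A and C.
annotations = {"A": ["g1"], "B": ["g3"], "C": ["g1", "g2"]}

for node in ("root", "A", "B", "C"):
    print(node, len(genes_under(node, children, annotations)))
```

Using a set means that a gene annotated with several descendants of the same node contributes only once to that node's count.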

‘I’ represents ‘IS-A’ relationship and ‘P’ represents ‘PART-OF’ relationship.  Users can search the current gene ontology either by a GO term (e.g., 'cell cycle') or by a GO ID (e.g., 'GO:0007049'). A right click on a node in this ontology tree control will highlight all genes associated with the GO node or its child nodes (nodes that are below the node) in all other views including the dendrogram view and scatterplot view.

Control Buttons

Gene List Control

This control is populated with the selected genes and their GO information.  All GO terms and IDs associated with a gene are shown below the gene name with a tab indentation.  Users can select one of the three available ontologies (molecular function, biological process, cellular component) using the combo box above the list control.  The number of selected genes and the number of their associated GO terms are also shown right next to the combo box.

14.7 K-means Tab

This tab shows the K-means clustering results in a way similar to the 'Color Mosaic' tab. If users select "Do K-means Clustering" on the popup menu by a right mouse click on the K-means tab, the following dialog box will show up.  Users can cluster rows and/or columns. Users can generate the initial cluster set either randomly or from the current hierarchical clustering result (i.e. the clusters determined by the current minimum similarity bar position will be the initial set of clusters for the K-means clustering algorithm).  Initializing from the hierarchical clustering result has been known to produce better clustering results. When random generation is chosen, users can specify the number of clusters to generate.
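
The idea of seeding K-means with an existing clustering can be sketched as follows. This is a generic K-means loop written for illustration, not the HCE implementation: the function name is hypothetical, Euclidean distance is assumed, and the initial labels stand in for the clusters determined by the minimum similarity bar.

```python
import numpy as np

def kmeans_from_labels(data, init_labels, max_iter=100):
    """K-means whose initial centroids are the means of the given initial clusters."""
    data = np.asarray(data, dtype=float)
    init_labels = np.asarray(init_labels)
    # One initial centroid per initial cluster.
    centroids = np.array([data[init_labels == c].mean(axis=0)
                          for c in np.unique(init_labels)])
    labels = None
    for _ in range(max_iter):
        # Assign every item to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = data[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Made-up data: the initial clustering misplaces the second item.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centers = kmeans_from_labels(points, init_labels=[0, 1, 1, 1])
print(labels.tolist())
```

The number of clusters is implied by the initial labels, which matches the description above: a cluster count only needs to be specified when the initial set is generated randomly.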

Users can see the K-means clustering results of rows (genes) with one pixel gaps between clusters.  Column clusters are separated by horizontal lines between clusters.  Selected items are simultaneously highlighted with item names in yellow background. Users can right-click on the clustering result view and select "Export Clustering Result" to export the current clustering result to a text file, where cluster labels will be added as a row and/or a column.
