Jinwook Seo Bongshin Lee
Department of Computer Science
University of Maryland, College Park, MD USA
May 16, 2001.
Clustering methods have been used in many areas to discover interesting patterns in large amounts of data. These methods, particularly hierarchical clustering, have been put to effective use in genomic data analysis. The results of hierarchical clustering are typically displayed using dendrograms and color mosaics. However, current visualization tools are not effective at visualizing the results of hierarchical clustering for large data sets. The contribution of our work is threefold. First, we propose a new interface called 'Dynamic Filtering Bar' to dynamically filter out genes that do not satisfy a minimum threshold similarity value. Second, we introduce compressed versions of the dendrogram and color mosaic to make better use of the limited screen space. Third, we show detailed information only when users select a cluster.
Clustering methods have been used in many areas to find important patterns in large amounts of data. Considering that a human cell has about 50,000 to 100,000 genes, the large volume of genomic experiment data is a good target for clustering. Many clustering methods are already widely used in genomic data analysis, including hierarchical clustering, K-means clustering, and self-organizing maps. Usually, the result of a clustering analysis is too large to be interpreted easily. In particular, the result of hierarchical clustering is a special form of binary tree called a 'dendrogram'. In general, the number of gene expression profiles generated by microarray hybridization experiments is over 3,000. The resulting dendrogram is so wide that not even one pixel per data item is available on a high-resolution computer screen. Even if we introduce scrolling, which is widely criticized as a poor visualization mechanism, the thumb on the scroll bar becomes very thin.
One of the main problems of most current visualization tools is that they are static. Static visualizations do not allow dynamic queries on the underlying data, that is, queries that are rapid, incremental, and reversible. Static visualizations can produce attractive views of data, but they cannot support direct manipulation to facilitate pattern discovery. In this paper, we present a new scheme for visualizing large dendrograms, which allows direct manipulation and dynamic queries.
We briefly summarize the pattern extraction process of our new visualization scheme to explain the main features of our tool. A dendrogram is usually displayed with a color mosaic at its leaves to show the visual pattern of the underlying data. The gene expression profile data consists of ratio values of the problem gene value to the normal gene value. We take the log of the ratio values and display the result using a color mapping in the form of a 2D color mosaic. Together with the resulting dendrogram visualization, we show a histogram of the data. Based on the data distribution, users can change the color mapping for the color mosaic display by adjusting the range of the color stripe displayed over the histogram. Users can instantly see the result of the new color mapping on the current dendrogram display. After deciding on a proper color mapping, users can perform a dynamic query on the dendrogram. We provide a filtering bar whose y coordinate determines the minimum threshold similarity value. Users can filter out less similar genes by dragging the bar toward the bottom of the screen. In this way, users can concentrate on the more interesting genes without losing the overall dendrogram structure. We also support a compact visualization for large dendrograms. Users can see a compressed view of a large dendrogram on a single screen. They can easily find hot spots in the gene data, such as regions of high or low ratio values, on the compressed color mosaic at the leaves of the dendrogram. For further investigation, users can magnify such a region and see the name of each gene in it.
In the remaining sections of this paper, we first explain the gene expression profile data that is widely used in genomic data analysis. Next, we summarize current work on the visualization of hierarchical clustering results (dendrograms). Then, we describe the functionality of our dendrogram visualization tool and the process of pattern extraction using direct manipulation and dynamic queries. We also suggest future work for improving the usability and efficiency of our tool. A conclusion follows at the end.
The Cancer Genetics Branch research team at NHGRI gathered gene expression profiles for 38 samples, which include 31 melanomas and 7 controls. They chose 3614 genes to analyze the melanoma samples. The purpose of this analysis is to identify the genes underlying the classification of a subset of melanomas (highly invasive melanomas). They found a major cluster of 19 samples among the 38 samples, and they also identified major genes that discriminate melanoma clusters. Table 1 shows a part of the melanoma sample data. Each column represents a sample and each row represents a gene.
Table 1. Melanoma Gene Expression Profile Data
The first row contains the sample names and the first column shows the clone IDs of the genes. Each cell contains the ratio of the gene activation value of the tumor cell to that of the normal cell. A value of 1.0 in a cell therefore indicates that the gene does not react differently from the normal cell. Biologists are not interested in such genes. The ratio values range from 0 to a value near 100. The common way to show the data in visual form is to present a color mosaic in which each cell is represented as a rectangle filled with a specific color. Before the color mapping is applied, a log function is applied so that the difference between negative genes (ratio < 1) and positive genes (ratio > 1) is clearly visible. Usually, negative genes are shown in green and positive genes in red. This is the de facto color mapping convention in gene expression experiments. The color intensity is linearly proportional to the absolute value of the log ratio. The range of the original ratio values differs from experiment to experiment, so we need a different color mapping for each experiment. We use 0.2 and 50 as threshold values for the melanoma data. Genes with values less than 0.2 are shown in green of maximum intensity, and genes with values greater than 50 are shown in red of maximum intensity. Genes between 0.2 and 1 are mapped to green of linearly decreasing intensity. Genes between 1 and 50 are mapped to red of linearly increasing intensity.
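The color mapping described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `ratio_to_rgb`, the 0 to 255 channel scale, and the handling of a zero ratio (treated as maximally green, since its log is undefined) are assumptions made for the example. The thresholds 0.2 and 50 are the ones given for the melanoma data.

```python
import math


def ratio_to_rgb(ratio, low=0.2, high=50.0):
    """Map an expression ratio to an (R, G, B) tuple with 0-255 channels.

    Intensity is linear in the log of the ratio, saturating at the
    `low` and `high` thresholds (0.2 and 50 for the melanoma data).
    """
    if ratio <= 0:
        # Assumption: the log is undefined here, so show full green.
        return (0, 255, 0)
    clamped = min(max(ratio, low), high)
    log_ratio = math.log(clamped)
    if log_ratio < 0:
        # Negative gene (ratio < 1): green, brightest at `low`.
        intensity = log_ratio / math.log(low)
        return (0, round(255 * intensity), 0)
    # Positive gene (ratio >= 1): red, brightest at `high`.
    intensity = log_ratio / math.log(high)
    return (round(255 * intensity), 0, 0)
```

Because the intensity is a quotient of two logarithms, the result does not depend on the base of the logarithm. A ratio of 1.0 maps to black, which matches the observation that such genes carry no information of interest.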
Johnson's hierarchical clustering is a well-known method for gene expression profile analysis. The agglomerative algorithm of Johnson's hierarchical clustering is summarized as follows. Assume that we want to cluster m data points and that we have m(m-1)/2 similarity values, one for every possible pair of the m data points.
There are three different choices for updating the similarity values in step 3: (1) complete-link, (2) average-link, and (3) single-link. Complete-link sets the similarity between the new cluster and each remaining cluster to the minimum of the similarities between each member of the new cluster and the rest. Average-link uses the average of those similarities as the new similarity value. Single-link takes the maximum.
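The three linkage rules can be illustrated with a short sketch of a generic agglomerative procedure. It is an assumption-laden illustration rather than a transcription of the algorithm summarized above: the function name `hierarchical_clustering` and the matrix input format are hypothetical, and for clarity the sketch recomputes cluster similarities from the original pairwise values on every pass instead of updating a similarity table in place. Average-link is interpreted here as the mean over all member pairs.

```python
def hierarchical_clustering(sim, linkage="complete"):
    """Agglomeratively cluster m items from an m x m similarity matrix.

    Returns the merges in order as (members_a, members_b, similarity).
    """
    combine = {
        "complete": min,                        # least similar member pair
        "single": max,                          # most similar member pair
        "average": lambda v: sum(v) / len(v),   # mean over member pairs
    }[linkage]

    clusters = [(i,) for i in range(len(sim))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the highest similarity.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_sims = [sim[i][j]
                             for i in clusters[a] for j in clusters[b]]
                s = combine(pair_sims)
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((clusters[a], clusters[b], s))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (a, b)] + [merged]
    return merges
```

The list of merges, read in order, is the dendrogram: each entry joins two subtrees at the height given by its similarity. The sketch favors readability over speed and would be too slow for thousands of genes.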
Spotfire's Array Explorer supports hierarchical clustering and draws a dendrogram as the result of clustering. When the number of data samples is large, as with the melanoma samples, the dendrogram visualization of Array Explorer suffers from slow speed and too much scrolling. It cannot show an overview of the entire dendrogram; users can see only a part of it. Array Explorer does not show the color mosaic in its dendrogram visualization window. We can use the 'heat map' visualization to see the color mosaic, but this is not efficient because it is not easy to see the relationship between the two visualizations. Users can select a set of clusters that satisfy a minimum threshold value (specified with a red line in Figure 1). The selected clusters are highlighted with red dots, which is not very intuitive.
Array Explorer allows users to extend the selected part of the dendrogram to the whole extent of the window. However, it cannot actually show a detail view because it only increases the heights of the currently selected clusters.
Figure 1. Dendrogram for 3614 genes by Array Explorer
TreeView, proposed by Michael B. Eisen et al., provides a color display of the dendrogram. TreeView supports tree-based and colormap-based browsing of the dendrogram. Users can select a portion of the samples (genes), and the selected region of the colormap image is magnified in a different pane of the window. It can show two dendrograms at the same time, one for samples and the other for genes.
Because the main purpose of this browsing tool is to produce good images in many formats for publication, the current version of TreeView does not allow direct manipulation of the visualization.
Figure 2. Dendrogram for time course of serum stimulation
GeneMaths, developed by Applied Maths, Inc., displays dendrograms for samples and genes on a single screen. Users can select a cluster by clicking the root of a subtree. Its clustering algorithm is one of the fastest, and the screen layout and color map image are well designed. However, users must scroll through a wide range of grids to see a certain region of the color mosaic because the program uses and displays only up to 20 database fields for both rows and columns. Users cannot get an overview of the entire dendrogram because, like most current dendrogram visualization tools, this program uses scroll bars to visualize a large dendrogram.
Figure 3. GeneMaths' Dendrogram and Color Mosaic
4. Dendrogram and Color Mosaic for Small Data Set
Figure 4 shows a snapshot of our system in action. The bottom portion of the figure shows the dendrogram and color mosaic. The top portion shows the part of the color mosaic that the user has selected. The bottom right of the figure shows a histogram, which displays the distribution of the logarithms of the ratio values.
Because each experiment generates a different range of ratio values, it is useful to know the distribution of the underlying data before determining the color mapping for the log values of the data. We provide a histogram view of the primary data. As the following figure shows, users can make the color mapping more appropriate by dragging the lines that indicate its two ends (the red and green lines). Because this direct manipulation instantly updates the actual dendrogram visualization, users can easily find the color mapping that best reveals the pattern in the color mosaic image.
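The histogram behind this view can be computed with a few lines of code. The sketch below is only an illustration under stated assumptions: the function name `log_ratio_histogram`, the choice of base-10 logarithms, the equal-width bins, and the skipping of non-positive ratios are all choices made for the example and are not taken from the tool.

```python
import math


def log_ratio_histogram(ratios, bins=10):
    """Count log10(ratio) values in `bins` equal-width bins.

    Returns (lowest_log, highest_log, counts). Non-positive ratios are
    skipped because their logarithm is undefined (an assumption).
    """
    logs = [math.log10(r) for r in ratios if r > 0]
    if not logs:
        return (0.0, 0.0, [0] * bins)
    lo, hi = min(logs), max(logs)
    width = (hi - lo) / bins or 1.0  # avoid zero width if all values match
    counts = [0] * bins
    for v in logs:
        # The maximum value falls on the upper edge; put it in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return (lo, hi, counts)
```

The two draggable lines would then correspond to two positions on the log axis between `lo` and `hi`, which serve as the saturation thresholds of the color mapping.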
Although the dendrogram and color mosaic are effective in obtaining a quick overview of the clusters, they do not allow the user to fine-tune the clusters that are displayed. For example, certain clusters may be composed of genes with very low similarity values, and these clusters (and their component genes) may not be interesting to the user. In such cases, what is needed is a dynamic querying mechanism. We propose the use of a sliding bar, whose y-coordinate represents the threshold value for similarity. Clusters composed of genes with similarities lower than the threshold are filtered out. The display is updated dynamically to reflect the change in the threshold.
The difficulty of displaying a dendrogram increases with the number of genes to be displayed. If we process more than 10,000 genes, the screen becomes crowded with terminal gene nodes and the lines connecting them. However, since we are interested in displaying similarity among genes, not all genes need to be displayed.
Therefore, we sift out the similar genes from all input genes to convey the information users want, including the overall structure. To filter out gene nodes that have low similarity, we propose a new dynamic interface called "Dynamic Filtering Bar", whose y coordinate determines the minimum similarity. We display only the gene nodes with similarity larger than this minimum similarity. Because we pack the selected genes together by removing the unselected genes, we save screen space. The packed dendrogram and its color mosaic enable users to distinguish the differences between clusters.
Furthermore, users can control the minimum similarity value by dragging the bar to change its position. This dynamic query provides fast results, including the number of clusters.
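One way to read the filtering step is as a cut through the dendrogram at the bar's similarity value. The following sketch shows that reading; it is an interpretation for illustration, not the tool's implementation. The `Node` class and the functions `leaves` and `filter_clusters` are hypothetical names, and the sketch assumes that similarity never decreases from a parent node to its children, as is the case for the three linkage rules discussed earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A dendrogram node: a leaf has a name, an internal node two children."""
    name: Optional[str] = None
    similarity: float = 1.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaves(node):
    """Return the gene names under `node` in left-to-right order."""
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        if n.left is None:
            out.append(n.name)
        else:
            stack.append(n.right)
            stack.append(n.left)
    return out


def filter_clusters(root, min_similarity):
    """Return the clusters that survive the filtering bar.

    A cluster is a maximal subtree whose root similarity is at least
    `min_similarity`. Genes joined only below the threshold are dropped.
    """
    clusters, stack = [], [root]
    while stack:
        n = stack.pop()
        if n.left is None:
            continue  # lone gene, filtered out
        if n.similarity >= min_similarity:
            clusters.append(leaves(n))
        else:
            stack.append(n.right)
            stack.append(n.left)
    return clusters
```

The length of the returned list is the number of clusters at the current bar position, and concatenating the lists gives the packed leaf order with the filtered genes removed. The traversal uses an explicit stack instead of recursion so that deep, chain-like dendrograms do not exceed the interpreter's recursion limit.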
Although this interface works well for small data sets, we found that it has significant limitations when used to visualize larger data sets (e.g., genomic data). For data sets larger than 10,000 points, the corresponding dendrogram and color mosaic do not fit on a single screen, necessitating scroll bars. However, a typical user is likely to lose track of the big picture when scrolling through multiple screens. We address this problem by compressing the dendrogram and color mosaic. The main idea of the compression is to display multiple data points using the same pixel. Clearly, there is some loss of information, but we believe that this is inevitable in a summary view. One problem of both the regular and the compressed view, however, is that they cannot show the gene name, an important piece of information. We therefore allow a user to select (by clicking) any portion of the compressed mosaic to see it in greater detail. The detailed view of the selected region is displayed in the top portion of the window. Figure 5 shows a screenshot with a compressed dendrogram and color mosaic.
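The idea of sharing one pixel among several data points can be sketched for a single row of the color mosaic. The text does not say how the values that share a pixel are combined, so the averaging used below is an assumption, as is the function name `compress_row`.

```python
def compress_row(values, width):
    """Compress one mosaic row to at most `width` pixels.

    Each output pixel holds the mean of the consecutive values mapped
    to it (averaging is an assumed aggregation rule).
    """
    n = len(values)
    if n <= width:
        return list(values)  # already fits, nothing to compress
    out = []
    for p in range(width):
        # Integer arithmetic spreads the n values evenly over the pixels.
        start = p * n // width
        end = (p + 1) * n // width
        bucket = values[start:end]
        out.append(sum(bucket) / len(bucket))
    return out
```

For example, a row of 3614 log ratios compressed to 800 pixels places four or five genes under each pixel. Averaging can hide a single extreme value inside a bucket, so a maximum-absolute-value rule may be preferable when hot spots must remain visible.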
Figure 5. Compressed View of Large Dendrogram
We have proposed several visualization methods to show hierarchical dendrograms and color mosaics for large data sets effectively: (1) the dynamic filtering bar, (2) a compressed view of dendrograms and color mosaics, (3) multiple levels of detail, and (4) direct manipulation of the color mapping. We have also implemented a dynamic visualization tool that supports them. Our tool allows users to interact with the visualization result, which facilitates data analysis. It shows not only a good overview but also sufficient detail through the compressed view and multiple levels of detail. Users can perform dynamic queries using the dynamic filtering bar and instant cluster selection.
There are still possible improvements to our tool. The hierarchical clustering and visualization processes can be separated. Our tool could serve as a front end for various hierarchical clustering algorithms if it supported importing the results of other hierarchical clustering algorithms. It could also be integrated with general visualization tools such as Spotfire. Finally, it could be extended to export the filtering results, that is, the list of genes, so that the data can be used to look up other databases.
We are grateful to Dr. Yidong Chen for his valuable suggestions.