Microarray Analysis with HCE
1. Introduction
In this report, we will use the HCE (Hierarchical Clustering Explorer) visualization software to explore a biological microarray data set. HCE is an interactive visualization tool designed for the analysis of multi-dimensional data sets. It couples numerous hierarchical clustering techniques with correlation analysis and interactive displays of 1D and 2D projections in order to reveal patterns and relationships in complex data. We will present a number of interesting findings revealed by HCE, and evaluate HCE as a visualization tool for multivariate data. This review is meant to give the reader a feel both for microarray data and the strengths of the HCE visualizations.
1.1. Microarray
In order to get started, we first need to understand our data set. Although a little biology background is required to fully grasp what we are about to discuss, I will give brief summaries as we go along and provide Wikipedia links to the details. For example, the term DNA microarray describes a biological research technique whereby gene expression levels can be measured. Just follow the prior links for more details, or read on for the summary. Although the term microarray can refer to a number of different experiments, for this report we will always be referring to a two-channel, spotted DNA array. So what is it? Let's start with a picture.

Figure 1. A microarray slide. The final result of the experiment is a printed slide where each tiny spot represents a single gene and the color of each spot represents the expression ratio for that gene. The ratio is between a control and experimental sample, and simply measures the amount of "stuff" (mRNA) each gene expressed. If we label the control green and the experiment red, then the green spots will indicate a more active control gene, the red will spots indicate a more active experimental gene, and the yellow spots will indicate equality.
We'll omit the details of the biochemistry and assume it all works, but for a more general illustration let's consider an analogy. Say a graduate student is magically turned into a gene. Instead of expressing mRNA, this gene expresses journal articles. We want to determine if this gene expresses more articles in the summer or winter. To do this, we print all of the winter articles on blue paper and all of the summer articles on red paper. Then, we crumple them up, throw them on a pile, and guess if the pile has more blue papers or red papers. This is roughly analogous to a microarray experiment, and as you can see, there is much room for error. So to obtain accurate results for a microarray experiment, multiple samples are collected, even more spots are printed, and the spot ratios are determined by sophisticated scanners and normalization algorithms. As we will show for this experiment, great lab work and high redundancy can produce quite accurate results.
1.2. The Experiments
Now
that we understand microarrays, we need some genes to analyze. For
this report we will look at microarray data for the cyanobacteria Synechococcus sp. WH 8102. More commonly known as blue green algae, Synechococcus sp. is a marine bacteria that captures light energy and plays a major role in marine carbon fixation (the transformation of CO2 into organic compounds). This ability is of obvious interest to the scientific
community - just imagine turning CO2 into diamonds and oxygen, what could be better? More realistically, understanding
the fixation process could lead to better vehicle and industrial CO2 emissions control.
The experimental conditions we will analyze were conducted on a wild type Synechococcus sp. strain under various environmental stresses. From here on, we will simply refer to this strain as 8102. The 13 experimental conditions studied for 8102 are listed below. Many of these conditions share the same control, however some do not. This should not affect our analysis as long as we only compare ratios within the same experiment.
- Salt Shock
- Cadmium Shock
- Low Nickel, High Nitrate
- Low Nickel
- CuEDTA Shock (Copper)
- Low Nickel, High Ammonium
- Low Nitrate, High Ammonium
- 8102 + Vibrio heterogeneous sample
- Urea vs. Nitrate
- Low Phosphate
- High Phosphate
- Mitomycin C Shock
- Ethidium Bromide Shock
The names of the experiments have been kept intentionally vague to keep the focus on the HCE software rather than the biology. Examination of these experimental conditions by Synechococcus sp. biologists is currently underway, so we will not make any assertive conclusions about the biology in this report. Later on we will draw hypotheses about these experiments, but only to demonstrate the ability of HCE to guide investigation, not to infer any biological significance. A manuscript describing these and other findings is currently under preparation, so to honor the privacy of the authors we will omit some of the details.
Previous analysis of this data was conducted on a per-experiment basis, and the entire set of experiments was never examined as a whole. In the following sections, we will treat this sequence of experiments as a sort of time series, and look for relationships between genes that are conserved across all conditions. This multi-dimensional analysis should provide stronger evidence of gene inter-relationships and reveal sets of synergistic genes.
2. Headlines
2.1. Excellent Data Quality
As we mentioned before, microarray data can be difficult to reproduce and may contain wide variance. In an attempt to produce accurate results, redundancy is often built in to the experiments to archive higher data quality. For the 8102 arrays, each gene in each experimental condition was sampled from at least 2 distinct biological replicates. These samples were then placed onto at least 5 microarray chips, with each gene being placed 6 times per chip, yielding a per-gene redundancy of at least 30x. For simplicity, we will average this redundancy to obtain a single ratio per gene, and all ratios will be converted to log base 2.
Even with this redundancy, how can we be sure the data is accurate? To do so, we will perform a couple of sanity checks using HCE to see if the data looks reasonable. Our first test will be to hierarchically cluster the data by experiment. Given some knowledge of each experimental condition, this should result in a clustering where similar experimental conditions are clustered closely with one another. Here is the resulting dendrogram of HCE's UPGMA average linkage clustering.

Figure 2. Experiment clusters represented by a dendrogram. Similarity between two experiments is represented by branch height, therefore the lower a node is vertically, the more similar its subtree.
Without knowing any biochemistry, this tree isn't very informative, so let's walk through the clusters and try to understand the results. The first cluster that stands out is on the far right of the tree (12,13) - Mitomycin C Shock and Ethidium Bromide Shock. These are both highly toxic compounds, and should trigger the cell defense mechanisms. Being the only two toxins on the list, it makes sense for these two experiments to be unlike anything else, and that is exactly how they cluster. We can also see that two other shock experiments cluster (1,2), the nickel experiments cluster (3,4), the ammonium experiments cluster (6,7), and the phosphate experiments cluster (10,11). This all looks good, so we declare success! The experiment clustering makes sense, so the underlying data is likely to be correct.
We will now confirm our hypothesis that the data is reasonable by looking at ribosomal genes and their expression levels across all 13 experiments. A ribosome is a piece of cellular machinery that turns mRNA transcripts into proteins. mRNA is the primary product of most genes, and proteins are the complex organic compounds that no cell could live without, so the importance of ribosomal genes should be clear (even the ribosome itself is made of proteins). Because they are so important to the cell, we would expect the expression of all ribosomal genes to be highly correlated, and to have little variance across the different experiments. In addition, ribosomal genes are arguably the most understood and best annotated genes in any genome. This makes them excellent candidates for our second sanity check. Using HCE to search for "ribosomal protein", we uncover the following expression profile.

Figure 3. Expression ratio for each ribosomal protein gene as a single polyline on a parallel coordinate view. The vertical axes record the log ratio of expression for each experiment, with red representing higher expression in the experiment and green representing higher expression in the control. The profile of all genes is shown as a light gray area on the plot.
The results for our genes of interest look pretty good. The majority of the ribosomal proteins are in the black with a log ratio of zero across all experiments, and the remainder form a secondary pattern with a "W" shape near the end of the plot. Using the same average linkage clustering as before, we will attempt to tease out the individual clusters. Below is a screen shot taken of HCE's dendrogram, color mosaic and parallel coordinate view of a well-formed ribosomal protein cluster.

Figure 4. Clustering of ribosomal protein genes. The orange arrows below the mosaic point to the displayed expression profiles. All but two genes in this cluster are true ribosomal protein genes.
The average linkage clusters for both the experiments and genes look reasonable, so we will move along and take a look at some less understood genes involved in metabolic pathways and stress response.
2.2. Metabolic Gene Discovery
In an effort to demonstrate HCE's ability to uncover correlations in multivariate data, we will now look to annotate genes of previously unknown function. We begin with an overview of all genes and experiments as visualized by HCE.

Figure 5. All genes (columns) and experiments (rows) from our data set as visualized by HCE. The top left panel shows the hierarchical clustering along with a color mosaic. Red indicates a more active experimental gene and green indicates a more active control gene. The bottom panel shows a histogram of expression ratios for the experiment with the least-normal distribution (phosphate starvation).
A quick scan of the color mosaic shows some nice stripes of uniform color, indicating that there may be some interesting relationships. We begin by identifying the least-normal distribution of expression using HCE's ranking options under the histogram view. The phosphate starvation distribution jumps out as having a skinny center of mass and long tails of outliers. Assuming the tails are not experimental error, we would like to investigate the multi-dimensional relationships of these hyper-expressed genes, some of which have a log ratio of nearly 4 (ratio of 16:1). Choosing an experiment with a low variance as our fixed dimension (cadmium shock), we compare the scatter plots of both phosphate experiments.

Figure 6. Comparison of both phosphate experiments shows a number of genes hyper-expressed in the absence of phosphate (upper right), while fairly normal in the presence of phosphate (lower right). Selected genes are highlighted by orange triangles in all views. These genes appear well correlated over all experiments as shown by the hierarchical clustering (upper left). Scatter plots are ranked by uniformity (lower left), with dark blue corresponding to low entropy and dark red corresponding to high entropy. The low phosphate experiment has the lowest entropy, while the high phosphate experiment has a higher entropy.
It's becoming rather clear that these genes are somehow related to the phosphate pathway of this organism. Phosphate is a nutrient of this cell, so we could hypothesize that these genes are synthesizing phosphate when it is absent from the environment. To support our claim, we will look at the details of the cluster identified by HCE along with the expression profiles.

Figure 7. Clustering (top left) and profiles (bottom) of our genes of interest. Selected genes are indicated by orange arrows below the mosaic. Identified genes include a transcription regulator, alkaline phosphatases, porins and hypotheticals as shown in the details view (top right).
As shown above, these genes not only share a similar expression ratio in the phosphate experiments, their ratios are similar across all experimental conditions. This leads us to conclude they must be related in function. In fact, they are mostly annotated as part of this organisms phosphate pathway. More specifically, these genes coordinate the extraction of phosphate ions from available compounds in the environment. Included in this cluster are a number of "hypothetical" genes, meaning their true function is still unknown. The above analysis with HCE gives strong evidence that they are functionally related to 8102's phosphate pathway.
2.3. Heat Shock Gene Discovery
In addition to automated clustering, HCE allows the user to manually search for patterns of interest using interactive widgets and search features in the parallel coordinate view. We will again illustrate HCE's ability to guide the discovery of novel genes, however this time using the parallel coordinate search features. We start out by searching for "shock" in the gene annotation to find all annotated heat shock proteins.

Figure 8. Parallel coordinate view of all annotated heat shock proteins across our 13 experimental conditions. Three genes show a distinct pattern (highlighted in blue), while the remainder stay relatively normal. The red peak in experiment 5 is the copper shock, and the green peak in experiment 11 is the high phosphate environment.
Heat shock proteins (or stress proteins) are part of a cell's stress response, and are common to all organisms. The easiest way to activate these genes is to expose the cells to elevated temperatures, however other environmental stresses can also activate the response. Here we can see over-expression spikes in the presence of copper (5), low phosphate (10), and mitomycin (12), and under-expression spikes in the presence of abundant phosphate (11). This is expected behavior for our stress response genes, since experiments 5,11, and 12 are all stressful environments, and 11 is a nutrient rich environment. However, many of the hypothetical heat shock proteins do not share the same profile, and are steadily expressed among all experiments. It is possible that these genes have been mis-annotated, so let us use HCE to construct a more compressive view of the heat shock response.

Figure 9. Parallel coordinate view of all genes fitting the model of the highlighted heat shock gene in Figure 8. This represents a more comprehensive collection of genes related to the heat shock response including GroEL, GroES, and a number of hypotheticals.
Using a well known heat shock gene as our model (highlighted in Figure 8), we manually adjust the Euclidean distance tolerance provided by HCE's profile search until we are happy with the result set. The set of genes shown above include other heat shock proteins such as GroEL and GroES which are known members of the heat shock response, but don't explicitly have the word "shock" in their annotation. Along with these well studied proteins, the set includes a large number hypotheticals and genes not currently associated with the heat shock response. Once again, HCE gives strong evidence to the function of these uncharacterized genes.
3. Methods
Here are some brief specifics on how HCE was applied to this data set.
- Formatting
- Input to HCE was an Excel spreadsheet - one column per experiment, and one row per gene. Gene ID, name and description were the first three columns of each row. This allows HCE to search for keywords in the gene annotations while searching profiles.
- Filtering and Normalization
- The data set was pre-filtered and pre-normalized by our friends at TIGR and converted to log base 2 ratios. Therefore, all of HCE's filtering and normalization options were turned off. Missing values were listed as "NA" in our spreadsheet and were handled appropriately by HCE.
- Clustering
- HCE's default clustering options were used. The default options used UPGMA average linkage clustering with a Euclidian distance measure on both the rows and columns of the data. Other clustering parameters were tried, but the defaults were the most useful.
- Navigation
- Searching for genes of interest by name in the "Profile Search" tab and then expanding the search for similar profiles was a highly effective technique. This often pointed to genes that were effectively clustered in the dendrogram, but lost in the sea of data. Because of the size of the data set, starting with a narrow scope and working outward was easiest. Using the histograms and scatter plots to guide investigation was only moderately useful because it was difficult to make connections across dimensions. After narrowing the data set, the parallel coordinate view's ability to display all dimensions at once was by far the most useful feature.
4. Remarks
The Good
The above sections have proven HCE a useful tool for the annotation of genes with previously unknown function. The most useful feature for this analysis was the parallel coordinate display and the ability to search the profile space by both name and pattern. After identifying a gene of interest through the histogram and scatter plot displays, finding genes with similar profiles guided me through the hierarchical clustering and helped identify relationships. The ability to set upper and lower bounds for each coordinate, set Euclidian distance tolerance dynamically, and ignore certain dimensions was very appreciated. The parallel coordinate view was so useful, it could warrant a standalone tool (see TimeSearcher). Integrating this view with interactive dendrograms, scatter plots, and histograms helped to quickly confirm my hypotheses by allowing me to look at the same data from multiple angles.
The second most useful feature was the integrated clustering algorithms and the minimum similarity slider for the dendrogram panel. Because the human eye is usually much better at clustering than machines, allowing interaction in the clustering process is helpful. I would not have been able to manually cluster such a large data set, but I was able to make more sense of the automated clustering result by dynamically adjusting the minimum similarity to a meaningful level.
I also appreciated scroll bar selection highlighting. This feature marks the position of each selected item on the scroll bar, allowing you to browse selections more efficiently. Unfortunately, this feature was buggy and didn't always work properly.
The Bad
To go with all the positives, there were also some (mostly technical) negatives. The program (v3.0) was very unstable and crashed often. The crashes were most problematic after I had spent a good deal of time refining parameters, re-clustering, and narrowing the search space. No history was saved, so after each crash I had to start from scratch. It is difficult to get researchers to adopt a tool if it is unstable, so this point requires serious attention. The only other problem worth mentioning was the behavior of the minimum similarity bar. Adjusting this widget adds and removes items in the color mosaic without keeping the center of view preserved, so if you are centered on a highlighted area and decide to adjust the clustering, you have to relocate your selection after each adjustment of the slider. Keeping the user oriented when browsing a large space is critical, and HCE's failure to preserve my field of view negatively affected my experience.
Suggestions
Ranking projections is one of the great ideas implemented in HCE, however I found little use for this feature when searching for higher dimensional clusters. In addition to projection ranking, I would have liked the clusters to be ranked as well - perhaps by size and similarity. I was forced to do this manually, by frequently adjusting the minimum similarity bar and scanning the mosaic for interesting clusters. While effectively ranking clusters may be difficult, even a little direction can be helpful.
Although extremely useful, the profile search feature of HCE was a little clumsy. Setting upper and lower bounds with mouse clicks and drag events was difficult. A "timebox" search method, like that used in TimeSearcher, would be an easier way to narrow and explore the search space. Also, there was no way to search for profiles symmetric around some point in all dimensions. In our application, this would correspond to genes with an inverted expression profile relative to our model.
Conclusions
Overall, HCE is a powerful tool that gracefully combines statistical methods with user interaction and dynamic controls. It proved to be very effective for my tasks, and I am still a bit shocked at how easy it was to find interesting relationships in the data. I am left wondering if it was my data set or HCE that was so wonderful - for now let's just assume the answer is both.
5. Acknowledgements
I would like thank Dr. Ian Paulsen and Katherine Kang of The Institute for Genomic Research for providing the data, and everyone else involved in the Synechococcus sp. DOE Genomes to Life project. I would also like to acknowledge Dr. Jinwook Seo and Dr. Ben Shneiderman their work on HCE.