Human-Computer Interaction Lab / University of Maryland

The most commonly used microarrays for mRNA profiling (Affymetrix) include probe sets, each a series of perfect match and mismatch probes (typically 22 oligonucleotides per probe set). A growing number of reported probe set algorithms differ in how they interpret a probe set to derive a single normalized "signal" representative of the expression of each mRNA. These algorithms are known to differ in accuracy and sensitivity, and they have been optimized using a small set of standardized control microarray data.
We hypothesized that different mRNA profiling projects have varying sources and degrees of confounding noise, and that these should alter the choice of a specific probe set algorithm. We also hypothesized that using the Microarray Suite (MAS) 5.0 probe set detection p-value as a weighting function would improve the performance of all probe set algorithms.
Permutation Study Framework using Unsupervised Clustering in HCE2W (the improved version of the Hierarchical Clustering Explorer 2.0 with p-value weighting and F-measure)
Inputs to the Hierarchical Clustering Explorer are two files: a signal data file and a p-value file. Each column of the two input files holds the values for one sample (or chip), and the known target biological group index is assigned to each column of the signal data file. Success is measured using the F-measure of a dendrogram against the known biological grouping.
Note: The HCE 3.0 test version is a newer version of HCE2W and includes all HCE2W functions.
How to prepare input files 
You have to prepare two files, a probe set signal file and a probe set detection p-value file, for each probe set signal algorithm (e.g., MAS5, dChip, or RMA). As you can see in the figure, the probe set detection p-value file from MAS5 can be used with the signal files generated by any probe set signal algorithm, not only MAS5.
The two files should be in the same folder, and the extension of the detection p-value file should be pvl. Please refer to the following examples.
1. Using Excel files
If the signal file name is mahmas5.xls, the detection p-value file name should be mahmas5.pvl.xls.
2. Using tab delimited text files
If the signal file name is mahmas5.exp, the detection p-value file name should be mahmas5.pvl.
Example: Please take a close look at the small example input files (mahmas5small.exp and masmas5small.pvl) in mahmas5small.zip. There are 4808 probe sets and 40 chips. The data were filtered from the PGA Murine Airway Hyperresponsiveness project using a very stringent present call filter.
Please note that the order of rows and columns in the p-value file must be the same as in the signal file.
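The naming rule above can be sketched in a few lines of Python. This is only an illustration of the convention described on this page, not part of HCE; the function names `pvalue_path_for` and `same_shape` are hypothetical, and the tables passed to `same_shape` are assumed to be already parsed into lists of rows.

```python
from pathlib import Path


def pvalue_path_for(signal_path):
    """Return the detection p-value file expected next to a signal file."""
    signal = Path(signal_path)
    suffix = signal.suffix.lower()
    if suffix == ".xls":
        # Excel input: mahmas5.xls -> mahmas5.pvl.xls
        return signal.with_name(signal.stem + ".pvl.xls")
    if suffix == ".exp":
        # Tab-delimited text input: mahmas5.exp -> mahmas5.pvl
        return signal.with_suffix(".pvl")
    raise ValueError(f"unsupported signal file extension: {signal.suffix!r}")


def same_shape(signal_rows, pvalue_rows):
    """Check that two parsed tables have the same number of rows and columns."""
    if len(signal_rows) != len(pvalue_rows):
        return False
    return all(len(s) == len(p) for s, p in zip(signal_rows, pvalue_rows))
```

For example, `pvalue_path_for("mahmas5.exp")` yields `mahmas5.pvl` in the same folder as the signal file. A shape check like `same_shape` does not prove that rows and columns are in the same order, so the labels should still be compared by eye or by a separate check.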
To use continuous MAS 5 probe set detection p-value as a noise filter
Affymetrix noise calculations give two outputs: a continuous detection p-value and a simple detection call (present/absent). Each signal intensity value has a confidence factor, the detection p-value, which contributes to determining the detection call for the corresponding probe set. When the probe set detection p-value reaches a certain level of significance, the probe set is assigned a "present" call, while probe sets with less robust signal/noise ratios are assigned an "absent" call (see Affymetrix.com, login required, for more detail). This enables the use of a present call threshold noise filter. We reported that a 10% present call noise filter improved the performance of probe set signal algorithms. While such present call-based filtering improves performance, it is clearly an arbitrary threshold method, and it may therefore filter out potentially important signals conveyed by the probe sets.
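A present call threshold filter of this kind can be sketched as follows. This is a minimal illustration under two assumptions that are not stated on this page: that a "10% present call filter" keeps a probe set when it is called present on at least 10% of the chips, and that a probe set is called present when its detection p-value is below a cutoff `alpha` (0.04 is used as a placeholder default; check the settings of your MAS 5.0 analysis). The function name is hypothetical.

```python
def present_call_filter(pvalues, alpha=0.04, min_present_fraction=0.10):
    """Return the row indices of probe sets kept by a present-call filter.

    pvalues: list of rows, one row per probe set, one detection p-value per chip.
    alpha: assumed p-value cutoff below which a probe set is called "present".
    min_present_fraction: assumed minimum fraction of chips with a present call.
    """
    kept = []
    for index, row in enumerate(pvalues):
        present = sum(1 for p in row if p < alpha)
        if present / len(row) >= min_present_fraction:
            kept.append(index)
    return kept
```

The hard cutoff is what makes this filter arbitrary: a probe set just under the threshold is discarded entirely, which is the motivation for using the continuous p-value as a weight instead.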
There are many possible similarity measures for unsupervised clustering methods, and weighted versions can be developed for most of them. For example, a weighted Pearson correlation coefficient can be derived from the Pearson correlation coefficient that is widely used in microarray analysis. Let $x$ and $y$ be the vectors representing two arrays to be compared (these values are prepared in the .exp or .xls files), and let $p^x$ and $p^y$ be the vectors of continuous probe set detection p-values for $x$ and $y$ respectively (these p-values are prepared in the .pvl or .pvl.xls files). Then the weighted Pearson correlation coefficient takes the form
$$r_w(x, y) = \frac{\sum_i w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sqrt{\sum_i w_i (x_i - \bar{x}_w)^2 \, \sum_i w_i (y_i - \bar{y}_w)^2}},$$
where $\bar{x}_w = \sum_i w_i x_i / \sum_i w_i$, $\bar{y}_w = \sum_i w_i y_i / \sum_i w_i$, and $w_i$ is the weight of the $i$-th probe set.
We use the complement of the detection p-value to calculate the weight for each term, since the smaller the p-value is, the more significant the signal is. Other similarity measures such as Euclidean distance, Manhattan distance, and cosine coefficient can be extended to weighted versions in a similar way. In HCE, checking the option checkbox (highlighted with a red oval in the following figure) uses the MAS 5.0 detection p-values as weights for distance/similarity measures.
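The idea of weighting each term by the complement of the detection p-value can be sketched in Python. This is not HCE's code: the way the two complements are combined into one weight is an assumption made here (their product, $w_i = (1 - p^x_i)(1 - p^y_i)$), as is the use of weighted means, and the function name is hypothetical.

```python
import math


def weighted_pearson(x, y, px, py):
    """Pearson correlation of x and y with per-term p-value weights.

    Assumption: the weight of term i is (1 - px[i]) * (1 - py[i]), so a
    term counts fully when both p-values are 0 and not at all when either is 1.
    """
    weights = [(1.0 - a) * (1.0 - b) for a, b in zip(px, py)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero")
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / total
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / total
    cov = sum(w * (xi - mean_x) * (yi - mean_y)
              for w, xi, yi in zip(weights, x, y))
    var_x = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    var_y = sum(w * (yi - mean_y) ** 2 for w, yi in zip(weights, y))
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)
```

With all p-values equal to 0 every weight is 1 and the result reduces to the ordinary Pearson correlation coefficient. A probe set with a p-value of 1 on either array contributes nothing, which is the soft counterpart of removing it with a present call filter.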
To use F-measure for evaluating unsupervised hierarchical clustering results
We applied the F-measure to the entire hierarchical structure of clustering results and also to the set of clusters determined by the minimum similarity threshold in HCE2W. Let $C_1, \ldots, C_i, \ldots, C_n$ be the right clusters according to the target biological variable. Let $\hat{C}_1, \ldots, \hat{C}_j, \ldots, \hat{C}_m$ be the clusters from the hierarchical clustering results. In the F-measure, each cluster is considered a query and each class (or each correct cluster) is considered the correct answer of the query. The F-measure of a correct cluster (or a class) $C_i$ and an actual cluster $\hat{C}_j$ is defined as follows:
$$F(C_i, \hat{C}_j) = \frac{2 \, P(C_i, \hat{C}_j) \, R(C_i, \hat{C}_j)}{P(C_i, \hat{C}_j) + R(C_i, \hat{C}_j)},$$
where $P(C_i, \hat{C}_j) = \frac{|C_i \cap \hat{C}_j|}{|\hat{C}_j|}$, $R(C_i, \hat{C}_j) = \frac{|C_i \cap \hat{C}_j|}{|C_i|}$.
The precision values $P$ and recall values $R$ are defined by the information retrieval concepts. The F-measure of a class $C_i$ is given by
$$F(C_i) = \max_{j} F(C_i, \hat{C}_j).$$
Finally, the F-measure of the entire clustering result is given by
$$F = \sum_{i=1}^{n} \frac{|C_i|}{N} F(C_i),$$
where $N$ is the total number of arrays in the experiment.
The F-measure score is between 0 and 1; the higher the score, the better the clustering result. When we calculate the F-measure for the entire cluster hierarchy, for each external class we traverse the hierarchy recursively and consider each subtree as a cluster. The F-measure for an external class is then the maximum of the F-measures over all subtrees. In the final clustering result visualization, each sample name is color-coded by its biological class as shown in the figure at the top. The overall F-measure is highlighted with a pink oval. The F-measure distribution is shown over the dendrogram display, as the distance from the left side, as indicated by an arrow mark.
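The traversal described above can be sketched in Python. This is an illustration only, not HCE's implementation: the dendrogram is assumed to be given as nested tuples whose leaves are sample names, the class labels as a dictionary from sample name to class, and all function names are hypothetical. Precision and recall follow the usual information retrieval definitions (overlap divided by cluster size and by class size, respectively).

```python
def subtree_leaf_sets(tree):
    """Return the leaf set of every subtree, including leaves and the root."""
    clusters = []

    def walk(node):
        if isinstance(node, tuple):
            leaves = frozenset().union(*(walk(child) for child in node))
        else:
            leaves = frozenset([node])
        clusters.append(leaves)
        return leaves

    walk(tree)
    return clusters


def f_measure(class_members, cluster):
    """F-measure of one class against one cluster (the cluster is the query)."""
    overlap = len(class_members & cluster)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(class_members)
    return 2 * precision * recall / (precision + recall)


def overall_f_measure(tree, labels):
    """Class-size-weighted sum of the best F-measure found for each class."""
    clusters = subtree_leaf_sets(tree)
    classes = {}
    for sample, label in labels.items():
        classes.setdefault(label, set()).add(sample)
    n = len(labels)
    return sum(
        len(members) / n * max(f_measure(members, c) for c in clusters)
        for members in classes.values()
    )
```

A dendrogram whose subtrees match the classes exactly scores 1. For the tree `(("a", "c"), ("b", "d"))` with classes {a, b} and {c, d}, the best subtree for each class reaches only 2/3, so the overall score is 2/3.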
A Permutation Study Result (2 large novel microarray data sets, with/without detection p-value weighting, 5 probe set signal algorithms)
We used HCE 3.0 (HCE2W) to test and define parameters in Affymetrix analyses that optimize the ratio of signal (desired biological variable) versus noise (confounding uncontrolled variables). Five probe set algorithms were studied with and without statistical weighting of probe sets using the Microarray Suite (MAS) 5.0 probe set detection p-values. The signal/noise optimization method was tested in two large novel microarray data sets with different levels of confounding noise: a 105 sample U133A human muscle biopsy data set (11 groups; mutation-defined; extensive noise), and a 40 sample U74A inbred mouse lung data set (8 groups; little noise). Performance was measured by the ability of the specific probe set algorithm, with and without detection p-value weighting, to cluster samples into the appropriate biological groups (unsupervised agglomerative clustering with F-measure values).
Probe set detection p-value weighting had the greatest positive effect on the performance of the dChip difference model, ProbeProfiler, and RMA algorithms. Importantly, probe set algorithms did indeed perform differently depending on the specific project, likely due to the degree of confounding noise. Our data indicate that significantly improved data analysis of mRNA profile projects can be achieved by matching the choice of probe set algorithm to the noise levels intrinsic to a project.
The following graph shows the external evaluation results using the F-measure of unsupervised clustering for the human muscular dystrophy data and the mouse lung biopsy data. The "nowt" bar represents the result without MAS 5.0 detection p-value weighting, and the "wt" bar represents the result with p-value weighting.
Papers 
For more information, please refer to the following papers.
Jinwook Seo, Marina Bakay, Yi-Wen Chen, Sara Hilmer, Ben Shneiderman, Eric P. Hoffman, "Interactively optimizing signal-to-noise ratios in expression profiling: project-specific algorithm selection and detection p-value weighting in Affymetrix microarrays," Bioinformatics, Vol. 20, pp. 2534-2544, 2004.
Jinwook Seo, Ben Shneiderman, "Interactive Exploration of Multidimensional Microarray Data: Scatterplot Ordering, Gene Ontology Browser, and Profile Search," HCIL-2003-25, CS-TR-4486, UMIACS-TR-2003-55.
Download 
HCE is a standalone Windows® application running on a general PC environment. It is freely downloadable for academic and/or research purposes. Commercial licenses can be negotiated with the UM Office of Technology Commercialization (Gayatri Varma, gayatri@umd.edu).
Register and Download HCE 3.0 test version (released on March 29, 2004)
A new user's manual will be up soon. Meanwhile, please refer to the previous manual.
System requirements
Intel® Pentium® processor
Microsoft® Windows 2000®, Windows XP
Last updated 12/14/2004