Human-Computer Interaction Lab / University of Maryland
The most commonly used microarrays for mRNA profiling (Affymetrix) include probe sets consisting of a series of perfect match and mismatch probes (typically 22 oligonucleotides per probe set). A growing number of probe set algorithms have been reported; they differ in how they interpret a probe set to derive a single normalized "signal" representing the expression of each mRNA. These algorithms are known to differ in accuracy and sensitivity, and they have been optimized using a small set of standardized control microarray data.
We hypothesized that different mRNA profiling projects have varying sources and degrees of confounding noise, and that these should influence the choice of probe set algorithm. We also hypothesized that using the Microarray Suite (MAS) 5.0 probe set detection p-value as a weighting function would improve the performance of all probe set algorithms.
| Permutation Study Framework using Unsupervised Clustering in HCE2W (the improved version of the Hierarchical Clustering Explorer 2.0 with p-value weighting and F-measure). The Hierarchical Clustering Explorer takes two input files: a signal data file and a p-value file. Each column of the two input files holds the values for one sample (or chip), and the known target biological group index is assigned to each column of the signal data file. Success is measured by the F-measure of the dendrogram against the known biological grouping.
Note: The HCE 3.0 test version is a newer version of HCE2W and includes all of its functions. |
How to prepare input files |
You have to prepare two files, a probe set signal file and a probe set detection p-value file, for each probe set signal algorithm (e.g., MAS5, dChip, or RMA). As shown in the figure, the probe set detection p-value file from MAS5 can be used with the signal files generated by any other probe set signal algorithm.
The two files should be in the same folder. The extension of the detection p-value file should be pvl, as in the following examples.
1. Using Excel files
If the signal file name is mah-mas5.xls, the detection p-value file name should be mah-mas5.pvl.xls.
2. Using tab delimited text files
If the signal file name is mah-mas5.exp, the detection p-value file name should be mah-mas5.pvl.
Example: Please take a close look at the small example input files (mah-mas5-small.exp and mah-mas5-small.pvl) in mah-mas5-small.zip. They contain 4808 probe sets and 40 chips, filtered from the PGA Murine Airway Hyperresponsiveness project using a very stringent present call filter.
Please note that the order of rows and columns in the detection p-value file must be the same as in the signal file.
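The naming rule above can be sketched in a few lines of Python. This is only an illustration of the convention described on this page, not part of HCE; the function names `pvalue_path_for` and `same_layout` are hypothetical, and the layout check assumes each file has already been read into a list of rows.

```python
from pathlib import Path


def pvalue_path_for(signal_path):
    """Return the detection p-value file name expected for a signal file.

    Follows the naming rule described above:
      mah-mas5.xls -> mah-mas5.pvl.xls   (Excel)
      mah-mas5.exp -> mah-mas5.pvl       (tab-delimited text)
    The result stays in the same folder as the signal file.
    """
    path = Path(signal_path)
    if path.suffix.lower() == ".xls":
        return path.with_name(path.stem + ".pvl.xls")
    return path.with_suffix(".pvl")


def same_layout(signal_rows, pvalue_rows):
    """Check that two tables (lists of rows) have the same shape.

    This only compares row and column counts; it cannot verify that the
    rows and columns are listed in the same order.
    """
    if len(signal_rows) != len(pvalue_rows):
        return False
    return all(len(s) == len(p) for s, p in zip(signal_rows, pvalue_rows))


print(pvalue_path_for("mah-mas5.exp").name)
print(pvalue_path_for("mah-mas5.xls").name)
```

A shape check like `same_layout` catches truncated or mismatched files early, but matching row and column order still has to be ensured when the files are exported.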

To use continuous MAS 5 probe set detection p-value as a noise filter |
Affymetrix noise calculations give two outputs: a continuous detection p-value and a simple detection call (present/absent). Each signal intensity value has a confidence factor, the detection p-value, which contributes to determining the detection call for the corresponding probe set. When the detection p-value reaches a certain level of significance, the probe set is assigned a "present" call, while probe sets with less robust signal/noise ratios are assigned an "absent" call (see Affymetrix.com, login required, for more detail). This enables a noise filter based on a present call threshold. We reported that a 10% present call noise filter improved the performance of probe set signal algorithms. However, present call-based filtering relies on an arbitrary threshold, so potentially important signals carried by the filtered-out probe sets may be lost.
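As a rough illustration of a present call threshold filter, the sketch below keeps a probe set only if it would be called present on at least 10% of the chips. Two things are assumptions rather than statements from this page: that "10% present call filter" means "present on at least 10% of chips", and that a present call corresponds to a detection p-value below 0.04. The function names are hypothetical.

```python
# Assumed significance cutoff for a "present" call; adjust to match the
# settings actually used when the detection calls were generated.
PRESENT_P_CUTOFF = 0.04


def present_fraction(pvalues, cutoff=PRESENT_P_CUTOFF):
    """Fraction of chips on which one probe set would be called present."""
    present = sum(1 for p in pvalues if p < cutoff)
    return present / len(pvalues)


def present_call_filter(pvalue_rows, min_fraction=0.10, cutoff=PRESENT_P_CUTOFF):
    """Return the indices of probe sets that pass the present call filter.

    pvalue_rows holds one row per probe set and one detection p-value
    per chip, in the same layout as the p-value input file.
    """
    return [
        i
        for i, row in enumerate(pvalue_rows)
        if present_fraction(row, cutoff) >= min_fraction
    ]


rows = [
    [0.001, 0.5, 0.6, 0.7],   # present on 1 of 4 chips -> kept
    [0.5, 0.6, 0.7, 0.8],     # never present           -> dropped
    [0.01, 0.02, 0.03, 0.2],  # present on 3 of 4 chips -> kept
]
print(present_call_filter(rows))
```

The second probe set is discarded outright no matter how informative its signal values are, which is the drawback of a hard threshold that the continuous p-value weighting is meant to avoid.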
There are many possible similarity measures for unsupervised clustering methods, and weighted versions can be developed for most of them. For example, a weighted Pearson correlation coefficient can be derived from the Pearson correlation coefficient that is widely used in microarray analysis. Let $x$ and $y$ be the vectors representing two arrays to be compared (these values are prepared in the .exp or .xls files), and let $p_x$ and $p_y$ be the vectors of continuous probe set detection p-values for $x$ and $y$, respectively (these p-values are prepared in the .pvl or .pvl.xls files). Then the weighted Pearson correlation coefficient is given by

$$r_w(x, y) = \frac{\sum_i w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sqrt{\sum_i w_i (x_i - \bar{x}_w)^2 \, \sum_i w_i (y_i - \bar{y}_w)^2}},$$

where $\bar{x}_w = \sum_i w_i x_i / \sum_i w_i$, $\bar{y}_w = \sum_i w_i y_i / \sum_i w_i$, and $w_i$ is the weight of probe set $i$, computed from the detection p-values $p_{x,i}$ and $p_{y,i}$.
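A minimal Python sketch of a weighted Pearson correlation follows, for two arrays `x` and `y` and one weight per probe set. The correlation itself uses the standard weighted form with weighted means. The helper `weights_from_pvalues` is a hypothetical weighting chosen only for illustration (probe sets with large detection p-values on either array get small weights); it should not be read as the weighting HCE actually uses.

```python
import math


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of x and y with per-element weights w."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(var_x * var_y)


def weights_from_pvalues(px, py):
    """ASSUMED weighting for illustration: (1 - p_x) * (1 - p_y) per probe set."""
    return [(1.0 - a) * (1.0 - b) for a, b in zip(px, py)]


# Two arrays that agree except on the third probe set, which has a
# poor (large) detection p-value on both arrays.
x = [1.0, 2.0, 10.0, 4.0]
y = [1.0, 2.0, -5.0, 4.0]
px = [0.01, 0.02, 0.90, 0.03]
py = [0.01, 0.02, 0.90, 0.03]

unweighted = weighted_pearson(x, y, [1.0] * len(x))
weighted = weighted_pearson(x, y, weights_from_pvalues(px, py))
print(unweighted, weighted)
```

With uniform weights the function reduces to the ordinary Pearson correlation. In this toy example the unreliable third probe set drives the unweighted correlation negative, while the weighted version stays positive because that probe set contributes very little.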
To use F-measure for evaluating unsupervised hierarchical clustering results |
We applied the F-measure to the entire hierarchical structure of the clustering results and also to the set of clusters determined by the minimum similarity threshold in HCE2W. Let $C_1, \ldots, C_i, \ldots, C_n$ be the correct clusters according to the target biological variable. Let $K_1, \ldots, K_j, \ldots, K_m$ be the clusters from the hierarchical clustering results. In the F-measure, each cluster is considered a query and each class (or each correct cluster) is considered the correct answer to the query. The F-measure of a correct cluster (or class) $C_i$ and an actual cluster $K_j$ is defined as follows:

$$F(C_i, K_j) = \frac{2\,P(C_i, K_j)\,R(C_i, K_j)}{P(C_i, K_j) + R(C_i, K_j)},$$

where

$$P(C_i, K_j) = \frac{|C_i \cap K_j|}{|K_j|}, \qquad R(C_i, K_j) = \frac{|C_i \cap K_j|}{|C_i|}.$$

The precision values $P(C_i, K_j)$ and recall values $R(C_i, K_j)$ are defined as in information retrieval. The F-measure of a class $C_i$ is given by

$$F(C_i) = \max_j F(C_i, K_j).$$

Finally, the F-measure of the entire clustering result is given by

$$F = \sum_{i=1}^{n} \frac{|C_i|}{N} F(C_i),$$

where $N$ is the total number of arrays in the experiment.
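The F-measure computation can be sketched in Python as follows. The sketch assumes the usual information retrieval definitions (precision is the share of a cluster that belongs to the class, recall is the share of the class found in the cluster), takes the best-matching cluster for each class, and averages over classes weighted by class size. Classes and clusters are represented as sets of sample names; the function names and the toy data are hypothetical.

```python
def f_measure(class_members, cluster_members):
    """F-measure of one class against one cluster (both are sets)."""
    overlap = len(class_members & cluster_members)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster_members)
    recall = overlap / len(class_members)
    return 2 * precision * recall / (precision + recall)


def overall_f_measure(classes, clusters):
    """Size-weighted average of each class's best F-measure over all clusters.

    classes:  list of sets, one per known biological group
    clusters: list of sets, e.g. every node (subtree) of a dendrogram
    """
    total = sum(len(c) for c in classes)  # total number of arrays
    return sum(
        len(c) / total * max(f_measure(c, k) for k in clusters)
        for c in classes
    )


# Toy experiment: five arrays in two biological groups.
classes = [{"a1", "a2", "a3"}, {"b1", "b2"}]

# Subtrees of a dendrogram that misplaces a3 next to b1.
clusters = [
    {"a1", "a2"},
    {"a3", "b1"},
    {"b2"},
    {"a3", "b1", "b2"},
    {"a1", "a2", "a3", "b1", "b2"},
]

print(overall_f_measure(classes, clusters))
```

In this example the best match for the first group is `{"a1", "a2"}` (precision 1, recall 2/3) and the best match for the second is `{"a3", "b1", "b2"}` (precision 2/3, recall 1), so both classes score 0.8. A dendrogram with a subtree exactly matching every group would score 1.0.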
In the final clustering result visualization, each sample name is color-coded by its biological class as shown in the figure at the top. Overall F-measure is highlighted with a pink oval. The F-measure distribution is shown, as the distance from the left side, over the dendrogram display as indicated by an arrow mark.
A Permutation Study Result (2 large novel microarray datasets, with/without detection p-value weighting, 5 probe set signal algorithms) |
We used HCE 3.0 (HCE2W) to test and define parameters in Affymetrix analyses that optimize the ratio of signal (desired biological variable) versus noise (confounding uncontrolled variables). Five probe set algorithms were studied with and without statistical weighting of probe sets using the Microarray Suite (MAS) 5.0 probe set detection p-values. The signal/noise optimization method was tested in two large novel microarray datasets with different levels of confounding noise: a 105-sample U133A human muscle biopsy data set (11 groups; mutation-defined; extensive noise), and a 40-sample U74A inbred mouse lung data set (8 groups; little noise). Performance was measured by the ability of each probe set algorithm, with and without detection p-value weighting, to cluster samples into the appropriate biological groups (unsupervised agglomerative clustering with F-measure values).
Probe set detection p-value weighting had the greatest positive effect on the performance of the dChip difference model, ProbeProfiler, and RMA algorithms. Importantly, probe set algorithms did indeed perform differently depending on the specific project, likely due to the degree of confounding noise. Our data indicate that significantly improved analysis of mRNA profiling projects can be achieved by matching the choice of probe set algorithm to the noise levels intrinsic to a project.
The following graph shows the external evaluation results using the F-measure of unsupervised clustering for the human muscular dystrophy data and the mouse lung biopsy data. The "no-wt" bar shows the result without MAS 5.0 detection p-value weighting, and the "wt" bar shows the result with p-value weighting.

Papers |
For more information, please refer to the following papers.
Jinwook Seo, Marina Bakay, Yi-Wen Chen, Sara Hilmer, Ben Shneiderman, Eric P Hoffman, "Interactively optimizing signal-to-noise ratios in expression profiling: project-specific algorithm selection and detection p-value weighting in Affymetrix microarrays," Bioinformatics, Vol. 20, pp. 2534-2544, 2004.
Jinwook Seo, Ben Shneiderman, "Interactive Exploration of Multidimensional Microarray Data: Scatterplot Ordering, Gene Ontology Browser, and Profile Search," HCIL-2003-25, CS-TR-4486, UMIACS-TR-2003-55.
Download |
HCE is a standalone Windows® application running on a general PC environment. It is freely downloadable for academic and/or research purposes. Commercial licenses can be negotiated with the UM Office of Technology Commercialization (Gayatri Varma, gayatri@umd.edu).
Register and Download HCE 3.0 test version (released on March 29, 2004)
A new user's manual will be available soon. Meanwhile, please refer to the previous manual.
System requirements
Intel® Pentium® processor
Microsoft® Windows 2000®, Windows XP
Last updated 12/14/2004