Human-Computer Interaction Lab / University of Maryland
Multidimensional data sets are common in many research areas, including microarray experiment data sets. Genome researchers are using cluster analysis to find meaningful groups in microarray data. However, the high dimensionality of the data sets hinders users from finding interesting patterns, clusters, and outliers. Determining the biological significance of such features remains problematic due to the difficulties of integrating biological knowledge. In addition, it is not efficient to perform a cluster analysis over the whole data set in cases where researchers know the approximate temporal pattern of the gene expression that they are seeking. To address these problems, we developed the Hierarchical Clustering Explorer 2.0 by adding three new features to HCE:
scatterplot ordering methods so that all 2D projections of a high dimensional data set can be ordered according to relevant criteria.
a gene ontology browser, coupled with clustering results so that known gene functions within a cluster can be easily studied.
a profile search so that genes with a certain temporal pattern can be easily identified.
If you have any comments or questions, send an email to Jinwook Seo (firstname.lastname@example.org).
The current version of HCE is downloadable from this page. [download]
The large number of possible scatterplots for a high dimensional data set can present a problem, so users need efficient mechanisms to investigate the possible scatterplots. HCE 2.0 provides users with five meaningful criteria to order 2D projections. The first three criteria are useful to reveal statistical relationships between two experimental conditions (or samples), and the next two are useful to find projections of interesting distributions:
Pearson's r orders scatterplots according to Pearson's correlation coefficient (from +1.0 to -1.0) so that users can easily find the most/least correlated ones.
Least square error (simple linear regression) sorts scatterplots by the sum of squared errors from the optimal line fit so that users can easily isolate ones where all points are closely/loosely arranged along a straight line.
Least square error (curvilinear regression) sorts scatterplots by the sum of squared errors from the optimal quadratic curve fit so that users can easily isolate ones where all points are closely/loosely arranged along a quadratic curve.
# of items in the region of interest lists scatterplots in order of number of items within a user-defined rectangular, elliptical, or free-formed region of interest so that users can easily find ones with most/least genes in the given region.
Uniformness orders scatterplots according to the significance level of two-dimensional Kolmogorov-Smirnov test (Chakravarti et al., 1967) between a uniform distribution and a scatterplot so that users can easily find the most/least uniform scatterplot.
Data displayed is from a cDNA microarray experiment data set (31 melanoma + 7 controls) by Bittner et al., 2000.
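As a rough illustration of the first three ordering criteria, the following Python sketch scores every pair of columns and sorts the resulting 2D projections. It is not HCE's implementation: the function names (`pearson_r`, `poly_sse`, `rank_projections`) and the sample values are assumptions made up for this example, and the region-of-interest and uniformity criteria are not covered.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient; returns 0.0 if either axis is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def poly_sse(xs, ys, degree):
    """Sum of squared errors from a least-squares polynomial fit.

    degree=1 gives the straight-line fit, degree=2 the quadratic fit.
    Assumes at least degree + 1 distinct x values (otherwise the
    normal equations are singular).
    """
    m = degree + 1
    # Augmented normal equations: sum(x^(i+j)) * c_j = sum(y * x^i)
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    coef = [a[i][m] / a[i][i] for i in range(m)]
    return sum((y - sum(c * x ** k for k, c in enumerate(coef))) ** 2
               for x, y in zip(xs, ys))


def rank_projections(columns, score, descending=True):
    """Score every pair of columns; return [((name_i, name_j), score), ...] sorted."""
    scored = [((i, j), score(columns[i], columns[j]))
              for i, j in combinations(columns, 2)]
    return sorted(scored, key=lambda item: item[1], reverse=descending)


# Made-up sample values (four conditions, five genes each), not real data
samples = {
    "s1": [1, 2, 3, 4, 5],
    "s2": [2, 4, 6, 8, 10],
    "s3": [1, 3, 2, 5, 4],
    "s4": [1, 4, 9, 16, 25],
}
by_r = rank_projections(samples, pearson_r)
by_line_fit = rank_projections(
    samples, lambda x, y: poly_sse(x, y, 1), descending=False)
by_curve_fit = rank_projections(
    samples, lambda x, y: poly_sse(x, y, 2), descending=False)
```

Sorting the correlation scores in descending order puts the most correlated projection first, while the two error scores are sorted in ascending order so that the tightest line or curve fit comes first.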
Gene Ontology Browser
HCE 2.0 combines Gene Ontology (GO) annotation data with clustering results of microarray experiment data sets to present the biological significance of the results in a unified and structured manner. Since most microarray experiment stations don't produce GO annotations in their output by default, scripts or relational database queries are necessary to add GO annotations to the microarray experiment data. We join biological databases to get the gene ontology identifiers of genes. For example, we used UniGene and LocusLink to add GO annotations to the melanoma microarray data set (Bittner et al., 2000). Genes can be compared in terms of the up-to-date GO annotations available at the Gene Ontology Consortium website.
Data displayed is from a cDNA microarray experiment data set (31 melanoma + 7 controls) by Bittner et al., 2000.
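The database join described above can be sketched as a chain of lookups. This is only an illustration under stated assumptions: the lookup order (gene to UniGene cluster to LocusLink entry to GO terms), the function names, and every identifier below are hypothetical placeholders, not real database records or the scripts used for HCE.

```python
from collections import Counter


def annotate_with_go(gene_ids, gene_to_unigene, unigene_to_locus, locus_to_go):
    """Follow gene -> UniGene -> LocusLink -> GO lookups; missing links yield []."""
    annotated = {}
    for gene in gene_ids:
        cluster_id = gene_to_unigene.get(gene)
        locus_id = unigene_to_locus.get(cluster_id)
        annotated[gene] = sorted(locus_to_go.get(locus_id, []))
    return annotated


def go_term_counts(cluster, annotated):
    """Count how many genes in one cluster carry each GO term."""
    return Counter(term for gene in cluster for term in annotated.get(gene, []))


# Placeholder identifiers only (hypothetical, not real UniGene/LocusLink/GO IDs)
gene_to_unigene = {"gene1": "UG-A", "gene2": "UG-B", "gene3": "UG-C"}
unigene_to_locus = {"UG-A": "LL-1", "UG-B": "LL-2"}  # gene3 has no LocusLink entry
locus_to_go = {"LL-1": {"GO:term_x", "GO:term_y"}, "LL-2": {"GO:term_x"}}

annotated = annotate_with_go(
    ["gene1", "gene2", "gene3"], gene_to_unigene, unigene_to_locus, locus_to_go)
counts = go_term_counts(["gene1", "gene2", "gene3"], annotated)
```

Counting terms per cluster gives a quick view of which known functions are shared by the genes that a clustering result groups together; genes without a complete lookup chain simply stay unannotated.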
Many microarray experiments measure gene expression over time. Researchers would like to group genes with similar expression profiles or find interesting time-varying patterns in the data set. Often, they roughly know the time-varying patterns that they want to find. For example, they might be interested in the genes that are up-regulated at a certain time and down-regulated in the remaining periods. In such cases, researchers might benefit from a query environment where they can easily specify queries, instantly see the results of the queries, and easily modify their queries.
HCE 2.0 provides the Profile Search, which allows for rapid creation and modification of desired profiles. Key design concepts are:
interactive specification of a search pattern on the Information Space: Users can submit their queries simply by mouse drags over the Information Space, rather than using a separate query specification window.
dynamic query control: Users get the query results instantaneously as they change the search pattern, similarity function, or similarity threshold.
sequential query refinement: Users can keep the current query results as a new, narrowed information space for subsequent queries. This lets users refine their query results step by step, mirroring the general process of problem solving.
The data set shown is a temporal gene expression profile of mouse muscle regeneration (Zhao et al., 2002).
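The query side of a profile search (a search pattern, a similarity function, a similarity threshold, and refinement over earlier results) can be sketched in a few lines of Python. This is a minimal sketch, not HCE's code: the names `pearson_similarity` and `profile_search`, the choice of Pearson correlation as the similarity function, and the expression values are all assumptions made up for illustration.

```python
from math import sqrt


def pearson_similarity(profile, pattern):
    """Pearson correlation between a gene's profile and the search pattern."""
    n = len(profile)
    mp, mq = sum(profile) / n, sum(pattern) / n
    spq = sum((p - mp) * (q - mq) for p, q in zip(profile, pattern))
    spp = sum((p - mp) ** 2 for p in profile)
    sqq = sum((q - mq) ** 2 for q in pattern)
    if spp == 0 or sqq == 0:
        return 0.0  # a flat profile or pattern has no defined correlation
    return spq / sqrt(spp * sqq)


def profile_search(space, pattern, similarity, threshold):
    """Return the genes in `space` whose similarity to `pattern` meets the threshold."""
    return {gene: values for gene, values in space.items()
            if similarity(values, pattern) >= threshold}


# Made-up expression values over five time points, not real data
information_space = {
    "geneA": [0.0, 0.0, 2.0, 0.0, 0.0],
    "geneB": [0.1, 0.0, 1.8, 0.2, 0.0],
    "geneC": [2.0, 1.0, 0.0, -1.0, -2.0],
    "geneD": [0.0, 0.0, -2.0, 0.0, 0.0],
}
# Pattern: up-regulated at the third time point, flat elsewhere
pattern = [0.0, 0.0, 1.0, 0.0, 0.0]

hits = profile_search(information_space, pattern, pearson_similarity, 0.9)
# Sequential refinement: the previous result becomes the new information space
refined = profile_search(hits, pattern, pearson_similarity, 0.999)
```

Because the similarity function and threshold are plain arguments, changing either one and rerunning the query is cheap, which is the property that dynamic query control relies on.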
HCE is a standalone Windows® application running on a general PC environment. It is freely downloadable for academic and/or research purposes. Commercial licenses can be negotiated with the UM Office of Technology Commercialization (Gayatri Varma, email@example.com).
Register and Download HCE version 2.0 beta now!
User's Guide for HCE version 2.0 beta
Check whether there is a newer version (go to the Download section at the main project page).
Intel® Pentium® processor
Microsoft® Windows 2000®, Windows XP
Last updated 11/19/2004