CMSC 838b: Information Visualization

Zhijian Pan

Application Project: Visualized Pattern Matching of Malignant Melanoma

With Spotfire and Table Lens

Feb. 25, 2001


Motivation and Background

Recent genetics research suggests that, although large amounts of gene expression data can now be obtained in the lab, visualizing the data and discovering its overall expression patterns has become an extremely important and challenging task in identifying previously unrecognizable human cancer taxonomies. This project explores and evaluates the application of Spotfire and TableLens to this task.


The Data set

The data set, provided by NHGRI (National Human Genome Research Institute) as an Excel file, includes 3615 gene probes. Each gene probe is measured under 38 experiment conditions: 31 melanomas (malignant samples) and 7 controls (normal samples). Each gene also has a CloneID and a title (name).


Major discoveries about the data

At present, genes fall into three groups: those whose sequences and functions are both recognized; those whose sequences are recognized but whose functions are not; and those for which neither the sequences nor the functions are recognized. Genes in the first group have specific descriptive titles. Genes in the second group are temporarily titled Expressed Sequence Tags (ESTs).

The Spotfire scatter plot in Fig 1 shows that about one quarter of the 3615 genes are currently still ESTs.
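The same proportion can be checked directly from the title column. The following is only a minimal sketch: it assumes that an EST can be recognized by a title whose first word is "EST" or "ESTs", which is a guess about the file's naming convention, and the function name and sample titles are hypothetical.

```python
def est_fraction(titles):
    """Return the fraction of gene titles that look like EST placeholders.

    Assumption (not confirmed by the data set): an EST is any gene whose
    title begins with the word "EST" or "ESTs".
    """
    if not titles:
        return 0.0
    est_count = 0
    for title in titles:
        words = title.strip().split()
        # Compare only the first word, so that a title such as
        # "esterase D" is not miscounted as an EST.
        if words and words[0].rstrip(",:;").upper() in ("EST", "ESTS"):
            est_count += 1
    return est_count / len(titles)


# Hypothetical titles, for illustration only.
sample_titles = ["ESTs", "esterase D", "ESTs, weakly similar", "gene X"]
print(est_fraction(sample_titles))
```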

Fig 2 uses the Table Lens Spotlight feature to visualize the same data, and it leads to the same conclusion. Fig 3 was intended to visualize how the ESTs are distributed among the known genes. Surprisingly, Table Lens gives the false impression that the majority of the genes are still ESTs (which, I think, is one of the limitations of Table Lens).

The parallel coordinates display in Spotfire 5.0 visualizes a gene expression pattern as in Fig 4. Each experiment condition is treated as a coordinate axis, and the pattern appears as a solid line connecting the gene's values on all axes.

TableLens visualizes the expression patterns as a sequence of bars of different lengths, as shown in Fig 5.

TableLens is also powerful for visualizing column patterns. From Fig 6 it is easy to see that samples uacc93-047 and uacc930 have small and very even ratios for most genes, while the other columns on the right have noticeably larger and more uneven ratios.

After talking to one of the researchers at NHGRI, I learned that genes with the same functionality tend to have similar expression patterns. This implies that if we can find a gene in group 1 whose expression pattern matches that of the EST in question, there is a good chance that the EST has the same functionality as that gene. This gives researchers a way to speed up the full identification of ESTs. So the big question becomes: given the expression pattern of an EST, how do we find a gene in group 1 that matches the EST pattern reasonably well?
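Outside the visualization tools, this question can also be posed as a similarity search. The sketch below is not something done in this project; it assumes Pearson correlation as the measure of how well two patterns match, and the function names and the tiny four-condition profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        # A flat profile has no shape to compare; treat it as uncorrelated.
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(sxx * syy)


def best_matches(est_profile, named_genes, top_n=3):
    """Rank group 1 genes by how closely their pattern follows the EST's.

    named_genes maps a gene title to its list of expression ratios.
    Returns (correlation, title) pairs, best match first.
    """
    scored = [(pearson(est_profile, profile), title)
              for title, profile in named_genes.items()]
    scored.sort(reverse=True)
    return scored[:top_n]


# Hypothetical profiles with 4 conditions instead of the real 38.
named = {
    "gene A": [1.0, 2.0, 3.0, 4.0],
    "gene B": [4.0, 3.0, 2.0, 1.0],
    "gene C": [1.0, 1.0, 2.0, 1.0],
}
print(best_matches([2.0, 4.0, 6.0, 8.0], named, top_n=1))
```

Correlation compares the shape of two patterns rather than their absolute values, which is roughly what the eye does when comparing two lines in parallel coordinates. Whether that is the right notion of "matching" for this data is a question for the researchers.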

With Spotfire, I tried both the scatter plot and the parallel coordinates techniques. In theory, on the scatter plot display, if I first pick one EST and then adjust the scroll bars to narrow each attribute to a range around the value of that EST, the remaining genes should be the ones matching the EST. In practice, however, scatter plots could not handle a problem of this complexity. Every time I tried, the number of remaining genes dropped to zero long before I had adjusted half of the attributes.

The problems are these. First, our data set has as many as 38 attributes. Second, I have no clue how much tolerance I should allow for the attributes I begin with. Third, whatever order I adjust the attributes in sets up an implicit priority for finding the matching pattern.
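A rough calculation shows why the remaining set empties so quickly. Purely as an illustrative assumption, suppose each slider independently keeps half of the genes that are still visible; then after 12 of the 38 sliders only about 3615 × 0.5^12 ≈ 0.9 genes would be expected to remain. The sketch below mimics the slider procedure on a list of profiles; the function name, the tolerance value and the sample profiles are hypothetical.

```python
def filter_by_sliders(target, profiles, tolerance):
    """Narrow one attribute at a time around the target's value.

    Returns the number of profiles still visible after each attribute
    has been narrowed, in the order the attributes are adjusted.
    """
    survivors = list(profiles)
    counts = []
    for i, value in enumerate(target):
        # Equivalent to dragging both ends of slider i to value +/- tolerance.
        survivors = [p for p in survivors if abs(p[i] - value) <= tolerance]
        counts.append(len(survivors))
    return counts


# Hypothetical profiles with 3 conditions instead of the real 38.
target = [1.0, 1.0, 1.0]
profiles = [
    [1.0, 1.0, 1.0],
    [1.1, 1.5, 1.0],
    [2.0, 1.0, 1.0],
]
print(filter_by_sliders(target, profiles, tolerance=0.2))
```

Every slider is a hard cut, so one attribute slightly outside the tolerance removes a gene no matter how well its other attributes match.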

I then decided to switch to parallel coordinates, which is available in Spotfire 5.0. Parallel coordinates can display all 38 attributes on one page, so searching for matching gene patterns becomes a search for matching line patterns. Fig 7 shows an example of a group 1 gene that matches the EST pattern from Fig 4 reasonably well. Fig 8 is simply a zoomed display of Fig 7. Fig 9 is the scatter plot that encloses the same matching genes.

Fig 10 shows an example in which TableLens displays two genes that also have very similar expression patterns, but I found it very difficult to find such matching patterns with TableLens.



Critique and suggestions for improvement

In this application project, I explored four different data sets (the car data, the global climate data, the world education system data, and the malignant melanoma data) and three different visualization tools: Spotfire 4.0, TableLens, and Spotfire 5.0. I found that each tool does reasonably well at visualizing outlines and individual patterns. However, when it comes to visualizing and discovering matching patterns among objects with as many as 38 attributes, I found Spotfire 4.0 the least useful, TableLens of limited use, and Spotfire 5.0 (parallel coordinates) the most useful. Each tool handled the car data set very well. Each tool still has responsiveness problems when fed an extremely large data set, such as the global climate data. In TableLens, with as many as 3615 rows of data on one page, I found it very difficult to zoom in and locate any specific data, since a very minor mouse movement is equivalent to traversing hundreds of rows.

Unexpectedly, TableLens took much longer than Spotfire to import the Excel-formatted melanoma data. To make matters worse, TableLens did not display any loading status, so I kept thinking it had crashed and rebooted the machine several times, until I eventually realized that this is just how TableLens works and how much time it needs to import the data.