Tunable Viewtips:
User Controlled Specification of Interesting Data Patterns
Kartik Parija and Jaime Spacco
Abstract
There are many tools that perform sophisticated
visualization and analysis of data. However, most of these
tools require some familiarity with the dataset in order to
identify interesting characteristics.
We propose Tunable Viewtips, a standalone tool that performs
a "first pass" on an unfamiliar data set. It provides rapid
dynamic profiling of multidimensional data utilizing
common statistical methods. In our first version, we concentrate on single dimensional data
which is often ignored or misrepresented.
We present Tunable Viewtips 1D, which focuses on the
visualization and analysis of one dimensional data in its
true form.
Introduction and Motivation
Spotfire (www.spotfire.com), a commercial information visualization package, provides a "view tip" tool [1] which returns an ordered ranking of potentially interesting correlations. The user can preview the scatter plots of pairs of axes in a small window, or look at histograms of various axes [See Figure 1 (a) and (b)]. This is an excellent beginning with much potential, but in its current form Spotfire's "view tip" tool is insufficient. It is completely static: the user cannot tune the parameters of the ranking algorithm (Pearson's Product Moment Correlation metric), nor select other algorithms to rank results. Currently there is no mechanism for a third-party developer to add new algorithms. Furthermore, little screen space is devoted to displaying these "view tips", and it is difficult to distinguish values. We believe that users would benefit from user controlled specification of what they consider interesting patterns.
The first issue we address is the limited functionality of the view tip feature. We remedy this by including new
algorithms, along with a plug-in architecture for adding additional algorithms; by providing dynamic sliders that
tune these algorithms; and by providing more feedback, such as displaying the mean and standard deviations.
We then focus on visualizing one dimensional data. To our knowledge, this area has been
ignored or (mis)-represented by using line or bar graphs [6]. We want to profile
individual columns of the dataset to find any interesting relationships or patterns that
lie within them.
Finally, we address the problem of occlusion for dense datasets. We've implemented jitter in one dimension that
uses the entire screen space allotted for the plot. Since only the Y coordinate of a point encodes its value,
jittering purely in the X direction does not significantly compromise the
visualization. A 'reflection' technique is used to show as many data
values as possible in densely packed data.
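The jitter idea can be illustrated with a short Python sketch. It is not the tool's code: the function name `jitter_points`, the pixel-space convention (y grows downward) and the seeded random generator are assumptions made for illustration, and the 'reflection' technique is not covered.

```python
import random

def jitter_points(values, width, height, seed=0):
    """Map 1D values to (x, y) pixel positions.

    y encodes the value (the minimum sits at the bottom edge, y == height);
    x is random and carries no information, so equal values spread out
    horizontally instead of overlapping.
    """
    rng = random.Random(seed)          # seeded so the layout is repeatable
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0            # avoid division by zero for constant columns
    points = []
    for v in values:
        x = rng.uniform(0, width)      # jitter in X only
        y = height - (v - lo) / span * height
        points.append((x, y))
    return points
```

Because the random offset is applied only to x, two points with the same value always share the same y coordinate, which is what keeps the plot faithful to the data.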
Previous and Related work
Past research has primarily focused on visualizing textual data in the case of 1D. Examples of these include program listings, documents with many lines, and document search results. Gary Geisler [2] and the On-line Library of Information Visualization Environments (OLIVE) [5] provide an excellent overview of such research.
However, there are very few visualizations of individual columns of numerical data. We found two software packages that claim to visualize and analyze 1D data.
Figure 2: xPloRe: Teaching Quantlet tw1d, to visualize 1D data. [8]
Implementation
Our inspiration was the Java Swing model, which separates the underlying data storage from its display on the screen. This allows several graphical components, or different views within a component, to display the same data without shuffling the data around. We have followed this model as closely as possible for efficiency and simplicity, but in the end we employ a (redundant) intermediate layer of storage for graphical objects that is recreated every time we render a graph. [See Figure 4]
Figure 4: Dataset Architecture
We've tried to separate the graphical display from the data storage model. Furthermore, since we will run a variety of algorithms on the dataset, we have tried to minimize the overhead of calculating new sets of results. Thus, each set of results stores references into the Dataset, which minimizes the need to copy data that is already stored elsewhere. To make the Tunable Viewtips tool easily extensible, we have a simple and reasonably effective object-oriented framework that treats one-dimensional and two-dimensional results (and higher-dimensional results, should we choose to add that functionality) virtually identically, allowing the GUI to process and visualize results appropriately. [See Figure 5]
Having comparatively little experience writing GUI
applications, we made a few rookie mistakes which have made
our software more complicated than we intended. It was more
difficult than we had anticipated to display the data
directly out of our underlying Dataset structure due to the
lack of effective tools for displaying simple things like
scatter plots and bar charts. Ultimately, the need for
results outweighed any purist desires for a clean
implementation, and we made a few compromises. Data is copied
into a redundant intermediate layer before being displayed
by our much-modified version of a third-party tool [17]. This
happens every time we render a new display, and is wildly
inefficient, but it saved a lot of development time.
File formats: We read tab-delimited text files into a simple internal data structure, and then apply our algorithms to this simple data structure.
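As a rough illustration of this loading step, the Python sketch below reads tab-delimited text into one list per column. The function name `read_columns`, the presence of a header row and the all-numeric cells are assumptions made for the example; the text above does not specify them.

```python
import csv
import io

def read_columns(stream):
    """Read tab-delimited text into {column name: [float, ...]}.

    Assumes the first row holds column names and every other cell is numeric.
    """
    reader = csv.reader(stream, delimiter="\t")
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            columns[name].append(float(cell))
    return columns

# Usage: any text stream works, for example an open file or an in-memory string.
sample = io.StringIO("open\tclose\n1.5\t2.0\n3.0\t4.5\n")
table = read_columns(sample)
```

Storing each column as its own list means an algorithm can be handed a single column without copying the rest of the table.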
Implemented Algorithms: We currently have implemented five statistical algorithms [9, 10] to examine single dimensional data. A pseudo-code representation of each algorithm is given below each description.
foreach column c
    deviation = ( c.max - c.min ) / c.size
    total_deviation = 0
    foreach value v in c
        total_deviation += | v - previous(v) | - deviation
    end foreach
    return 1.0 - ( total_deviation / ( c.max - c.min ) )
end foreach
tightness is a percentage of the range (max - min) set by the cluster slider bar
foreach column c
    cluster_root = c.first
    range = ( c.max - c.min ) * tightness
    foreach value v in c
        if ( v <= cluster_root + range ) {
            add v to the cluster
        }
        else {
            start a new cluster
            cluster_root = v
        }
    end foreach
end foreach
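The pseudo-code above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the tool's implementation: the name `root_clusters` is hypothetical, the values are assumed to be visited in ascending order, and the comparison is read as being against tightness times the range of the column.

```python
def root_clusters(values, tightness):
    """Greedy 1D clustering anchored at the first value of each cluster.

    A value joins the current cluster while it lies within
    tightness * (max - min) of the cluster's first (root) value.
    """
    ordered = sorted(values)           # assumption: values are processed in ascending order
    if not ordered:
        return []
    width = (ordered[-1] - ordered[0]) * tightness
    root = ordered[0]
    clusters = [[root]]
    for v in ordered[1:]:
        if v <= root + width:
            clusters[-1].append(v)     # still within reach of the root
        else:
            clusters.append([v])       # start a new cluster rooted at v
            root = v
    return clusters
```

Because every member is compared with the root, no cluster can be wider than the slider-controlled width.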
tightness is a percentage of the range (max - min) set by the slider bar
foreach column c
    cluster 1 begins with the first element
    range = ( c.max - c.min ) * tightness
    foreach value v in c
        if ( v <= previous(v) + range ) {
            add v to the cluster
        }
        else {
            start a new cluster at v
        }
    end foreach
end foreach
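A Python sketch of this second cluster finder is given below, again as an illustration only. The name `chain_clusters` is hypothetical, the values are assumed to be visited in ascending order, and each value is compared with its predecessor plus tightness times the range.

```python
def chain_clusters(values, tightness):
    """Greedy 1D clustering that chains neighbouring values.

    A value joins the current cluster while the gap to the previous
    value is at most tightness * (max - min).
    """
    ordered = sorted(values)           # assumption: values are processed in ascending order
    if not ordered:
        return []
    width = (ordered[-1] - ordered[0]) * tightness
    clusters = [[ordered[0]]]
    for prev, v in zip(ordered, ordered[1:]):
        if v <= prev + width:
            clusters[-1].append(v)     # small gap: extend the current cluster
        else:
            clusters.append([v])       # large gap: start a new cluster at v
    return clusters
```

Unlike a root-anchored rule, chaining lets a cluster grow arbitrarily wide as long as consecutive gaps stay small, so evenly spaced values such as 0, 1, 2, 3, 4 end up in one cluster.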
Easily extensible: New algorithms can easily be plugged into our underlying architecture, and can make use of the existing dynamic feedback features. The GUI need know nothing about how the results were computed. We have written the framework to compute two dimensional results and implemented the same Pearson's Product Moment Correlation metric that Spotfire uses. All we require is a two-dimensional scatterplot display tool. We have already begun work on this.
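For illustration, ranking column pairs by Pearson's product moment correlation might look like the Python sketch below. The names `pearson` and `rank_pairs` are hypothetical, and ranking by the absolute value of the coefficient is an assumption; the text does not say how ties or negative correlations are ordered.

```python
import itertools
import math

def pearson(xs, ys):
    """Pearson product moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0                     # a constant column has no defined correlation
    return sxy / math.sqrt(sxx * syy)

def rank_pairs(columns):
    """Return (|r|, name_a, name_b) for every pair of columns, strongest first."""
    scored = [(abs(pearson(columns[a], columns[b])), a, b)
              for a, b in itertools.combinations(columns, 2)]
    return sorted(scored, reverse=True)
```

With n columns, `itertools.combinations` yields n(n-1)/2 pairs, which is why two-dimensional ranking is noticeably more expensive than profiling each column on its own.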
Visual Statistical Algorithmic Debugging: New statistical algorithms can be integrated at a later date, without the overhead of the Spotfire plug-in API or any related proprietary file formats. Furthermore, the results of these algorithms can be quickly visualized to determine whether they are along expected lines.
Dynamic Query Mechanisms: [11]
Color scheme: Color redundantly encodes the value. The minimum is always a very dark blue (close to black), and the maximum red. We interpolate by subtracting blue and adding green, until we hit pure green. Then we begin subtracting green and adding red. This yields a nice interpolated color encoding which clearly shows the minimum and maximum. Clusters can also be identified based on their color patterns, though it is important to note that the gradations of color are not always consistent. It is easier to use color to identify a cluster that is close to the mid-point of interpolation ( 0, 255, 0 ) than it is to spot one halfway between the mid-point and the maximum (about ( 125, 125, 0 )), since a light green is easier to distinguish than a color somewhere between red and green.
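The interpolation can be sketched in Python as below. The function name `value_to_rgb` is hypothetical, and for simplicity the sketch starts from pure blue ( 0, 0, 255 ) at the minimum, which is an assumption; the exact starting shade used by the tool is not reproduced here.

```python
def value_to_rgb(value, lo, hi):
    """Map a value in [lo, hi] to an (r, g, b) tuple: blue -> green -> red."""
    t = 0.5 if hi == lo else (value - lo) / (hi - lo)
    if t <= 0.5:
        s = t / 0.5                    # 0 at the minimum, 1 at the mid-point
        return (0, round(255 * s), round(255 * (1 - s)))   # subtract blue, add green
    s = (t - 0.5) / 0.5                # 0 at the mid-point, 1 at the maximum
    return (round(255 * s), round(255 * (1 - s)), 0)       # subtract green, add red
```

In this sketch a value three quarters of the way up the range maps to roughly equal red and green components, which is the hard-to-read region mentioned above.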
Demonstration
In this section we present applications of our tool to various kinds of datasets. As far as possible, we have used 'real-world' data, which allows us to interpret the results in some meaningful manner.
Error Detection: The first example considers a dataset with just one column of data, namely the closing price of the Dow Jones between the years 1900 and 1901. This is part of a very large data set obtained from CMU's Statistics repository [3]. By simply plugging this dataset into the tool, it was immediately obvious that there were some errors in the dataset, as there were occurrences of negative closing prices, which are naturally absurd. [See Figure 6 below]
Figure 6: Dow Jones Closing Price, 1900 - 1901 [3]
Advantage of the Jitter Feature: The above example uses the Jitter feature to show as much of the dataset as possible. This is extremely useful for visualizing datasets which have many repeat values or are very densely packed within certain ranges. We present an example to show the difference when Jitter is turned on and off when viewing a dataset consisting of 5 columns of data each containing the price of a particular stock recorded over the past 29 months. They are Intel (INTC), Cisco (CSCO), Microsoft (MSFT), Human Genome Sciences (HGSI) and General Electric (GE). The figure shows the column containing the Microsoft data [Data Courtesy: MSN Moneycentral, www.moneycentral.com]. This output also shows that the Outlier detection algorithm has been performed. [See Figure 7 (a) and (b)]
Figure 7 (a) and (b): Microsoft Stock Price over the last 29 months. With and Without Jitter
Simple Decision making: The figure below shows the grading sheet of CMSC 434 offered in Spring of 2001. The columns are grades assigned in various homework assignments and projects, in addition to a column showing total number of points and overall percentage grade. The current column being viewed shows the percentage grade. By showing the mean, standard deviation, the tool-tip feature and the standard deviation slider bar, we are able to get a fairly accurate view of the "Letter Grade Spread". For instance, with a grading scale in place (A ~ 89 and above, B ~ 79 and above) and taking into account class performance and average, we could assign 19 A's, 2 C's and award the rest of the class B's. [See Figure 8 (a) and (b)]
Finding Outliers: As most data follows patterns and relationships, it is always interesting to find outliers that deviate from such patterns and relations. We ran both of our Outlier detection algorithms on the Cereal data set obtained from the CMU Statistics Library [3]. This data set describes seventy-seven breakfast cereals using the information given on the FDA-mandated nutritional facts label. For example, under the column [potass], there are 77 entries giving the amount of potassium contained in each cereal. Figure 9 (a) and (b) show an important example where the results of the two algorithms vary.
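The two outlier algorithms are not spelled out in this section, so the Python sketch below shows only a generic rule of the kind a standard deviation slider could drive: flag every value more than k standard deviations from the mean. The name `sigma_outliers`, the parameter `k` and the use of the population standard deviation are all assumptions, not a description of the tool.

```python
import statistics

def sigma_outliers(values, k):
    """Return the values lying more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)     # population standard deviation (an assumption)
    return [v for v in values if abs(v - mean) > k * sd]
```

Raising k makes the rule stricter, so fewer points are flagged; this is the kind of immediate feedback a slider provides.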
Finding Clusters: It is often useful to find clusters in data. Either of our cluster algorithms can be run to identify clusters in 1D data. Figure 10 (a) shows the results of the first cluster finder algorithm being run on the same dataset containing the performance of 5 stocks, used in Figure 7. Figure 10 (b) shows the results of the second cluster finder being run on the Melanoma dataset described in Figure 1(b).
Weaknesses
As with many visualizations, our tool is not without problematic areas. Here we list some that we have identified:
Contributions
Our tool makes two contributions. First, we are expanding and improving Spotfire's 'view tips' visualization feature by incorporating dynamically tunable algorithms. Second, we are exploring the profiling of one dimensional data.
What new visualization features does our code add? Initially, we intended to write a tool that emulated Spotfire's 'view tips' tool, and displayed interesting two-dimensional plots based on different algorithms. However, much work has already been done on two dimensional statistical analysis. Instead, we focus on one dimension at a time. This has several advantages: the algorithms are much faster (two dimensional comparisons require pair-wise enumeration over all n columns, which grows at about n^2/2), and there has been comparatively little work done in this area. However, because we've tried to separate the display from the data, the core functionality for two (or more) dimensional display already exists.
We are not aware of a tool that performs "jittering" in one dimension. This is especially effective because it solves the problem of occlusion without altering the sanctity of data points along their axis. The use of the second dimension is only a trick in "pixel space"; the data points still line up with their correct location along the Y axis.
We position our tool as a profiling tool used to glean basic statistical information from a new dataset. Tunable Viewtips is a new way of visualizing a dataset. However, we are also visualizing new aspects of a dataset, namely the individual columns.
Imagine that we have many results measured over time. We
would like to know which of these results show interesting
properties. We are not really interested in the
relationship between the results at different timesteps; we
just want to know which of the vectors of results show
interesting statistical properties (clusters, gaps,
outliers, etc.). Our tool can profile the dataset for such
information and show a ranked list of columns that could be
examined. The user can dynamically tune the parameters of the algorithms
and changes in the results due to these adjustments are instantaneously
displayed.
Possible Application Areas
There are a number of areas where one-dimensional data is very useful. Some of these include fluid dynamics [12], image analysis and enhancement [13], information retrieval [14], and motion in 1D such as uniform and non-uniform acceleration or retardation [15]. We expand briefly on a couple of these application areas.
An area where 1D data is used often is image analysis and enhancement. The
images are examined as a matrix, and each column of the matrix, corresponding to
a single column of pixels, is analyzed individually to spot effects like edges
and repetitiveness. These columns are grouped as histograms and are run through
various mathematical functions. One way to enhance the
visualization of the histogram of images after the application of an
edge-detector operator is by using the logarithm of the histogram. Figures 11 (a)
and (b) show the application of such a technique.
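Taking the logarithm of a histogram can be sketched in a few lines of Python. The function name `log_histogram`, the 256 intensity levels and the use of log(1 + count) to keep empty bins at zero are assumptions made for the example.

```python
import math
from collections import Counter

def log_histogram(pixels, levels=256):
    """Return log(1 + count) for each intensity level 0 .. levels - 1."""
    counts = Counter(pixels)
    return [math.log1p(counts.get(level, 0)) for level in range(levels)]
```

After edge detection most pixels fall into a few bins, so the raw counts dwarf everything else; the logarithm compresses those peaks and makes the sparsely populated bins visible.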
Figure 11 (a) and (b): Application of an edge-detector operator to enhance the image.
Fluid Dynamics is a field where computationally intensive algorithms are used to model complicated flows. While such modelling exercises usually concentrate on 2D and 3D flows, there are instances of important problems where 1D flow needs to be represented and visualized. One such flow occurs in avalanches, where the total time involved is very small.
Both dense and powder snow can produce avalanches. The fluid dynamic calculations involved in simulating such activity involve the calculation of one dimensional flow. These flow calculations help predict the motion (velocities, dynamic pressures) of avalanches and visualization of such data is a key factor in the analysis of such simulations. [See Figure 12]
Figure 12: Artificially triggered powder snow avalanche in the avalanche dynamics test region of Vallée de la Sionne, Switzerland. Picture Courtesy: Swiss Federal Institute for Snow and Avalanche Research Davos [12].
In addition, we think our tool can be used to explore individual columns in datasets that have traditionally been part of multi-dimensional exploration. Colleagues have recommended that we could examine data showing characteristics of Amino Acids (possibly where reduction to 1D has been performed) and traditional temporal data where the behavior of data could be examined outside the consideration of time.
Future Work
There is much room for improvement in our tool. Currently,
version one supports the visualization and analysis of 1D data alone. We would like to reach the proposed goal of
having a tool that will support multi-dimensional data. Using another open source graphing tool [17], we have begun
initial work on adding 2D support to our existing tool [See Figures 13 (a)
and (b)]. We have successfully
implemented the Pearson Correlation metric to rank pairs of columns in a dataset. This already replicates the functionality of
Spotfire's View Tip. Once we've coupled this with the Tunable Viewtips 1D features we have described, we will have
greatly enhanced the View Tip mechanism.
There are two general directions that future work can take. First, this work could be integrated into an existing
visualization tool, such as Spotfire or Stardom [16]. This approach makes sense, as any interesting visualization mined
by our standalone tool would need to be imported into a more mature tool anyway for further analysis. Second, we can add
more features to our current tool.
Regardless of which direction future development takes, we are scoping out some other improvements. First and
foremost, we want to test out new algorithms. We have algorithms for similarity and gaps in two dimensions that
we'd like to run once we find/write a decent 2D display tool. The cluster box mechanism in 1D will correctly draw
the boxes regardless of how they're computed. We'd like to add a non-greedy algorithm that finds the maximum cluster
size in each dimension. Second, we want the ability to zoom in, especially for densely packed data sets. We'd like to
zoom into a part of a 1D or 2D plot and run algorithms on
that subset of our data set. Next, we want a dynamic filtration and selection mechanism where the user can
specify ranges with the mouse and filter the data. This would be most useful in 2D where any limit to the ranges,
gaps and clusters will help narrow down the search space for the algorithms. Finally, we need to fix some glaring
inefficiencies in the intermediate data storage format by
eliminating it. The display widget should not store any data, and any data that it requires it should read out of
the dataset.
Figure 13 (a): Plotting the stock prices of Intel vs. Cisco since Jan '99
Figure 13 (b): 2D plot of UACC383 vs. KA in the Melanoma dataset
Acknowledgements
Our sincere gratitude to Larry Leonard [17] of Definitive Solutions, Inc. for allowing us to use his Microsoft VC++ based 2D Graphing Class. As we are novices on this development platform, it provided a great starting point to develop what we believe is a useful tool. We would like to thank Narendar Shankar for his assistance in the GUI development, Dave Hovemeyer for his help in porting our Unix code to the Windows platform, Brian Postow for his suggestions to improve our Jitter feature, Rezarta Islamaj and Omer Horvitz for sitting through multiple demonstrations, and Jinwook Seo and Bongshin Lee for inspiring us to use the Melanoma dataset. We also greatly appreciate Dr. Ben Shneiderman and Dr. Catherine Plaisant's guidance through the various stages of our project.
References