Tunable Viewtips 1D - K.Parija & J.Spacco

Tunable Viewtips

Kartik Parija and Jaime Spacco

CMSC 838B - Information Visualization
University of Maryland, College Park, MD

Abstract

There are many tools that perform sophisticated visualization and analysis of data. However, most of these tools require some familiarity with the dataset in order to identify interesting characteristics. We propose Tunable Viewtips, a standalone tool that performs a "first pass" on an unfamiliar data set. It provides rapid dynamic profiling of multidimensional data utilizing common statistical methods. In our first version, we concentrate on single dimensional data which is often ignored or misrepresented. We present Tunable Viewtips 1D, which focuses on the visualization and analysis of one dimensional data in its true form.

Introduction and Motivation

Spotfire provides a "view tip" tool [1] which returns an ordered ranking of potentially interesting correlations. The user can preview the scatter plots of pairs of axes in a small window, or look at histograms of various axes [See Figure 1]. This is an excellent beginning with much potential, but in its current form Spotfire's "view tip" tool is insufficient. First, it is static: the user cannot tune parameters to the algorithms, nor can the user select other algorithms. Second, there is no mechanism for a third-party developer to add new algorithms. Finally, too little screen space is devoted to displaying these "view tips", which makes it difficult to distinguish values. We believe that this feature needs an overhaul.

Figure 1: Spotfire View Tip showing Melanoma data [1]

The first issue we address is the limited functionality of the view tip feature. We remedy this by including new algorithms, along with a plug-in architecture for adding additional algorithms; by providing dynamic sliders that tune these algorithms; and by providing more feedback, such as displaying the mean and standard deviations.

We then focus on visualizing true one dimensional data. To our knowledge, this area has been either ignored or misrepresented through the use of line or bar graphs [6]. We want to profile individual columns of the dataset to find any interesting relationships or patterns that lie within them.

Finally, we address the problem of occlusion for dense datasets. We've implemented jitter in one dimension that uses the entire screen space allotted for the plot. Since only the Y coordinate of a point encodes its value, jittering purely in the X direction does not significantly compromise the visualization. A 'reflection' technique is used to show as many data values as possible in densely packed data.

Previous and Related work

Past research has primarily focused on visualizing textual data in the case of 1D. Examples of these include program listings, documents with many lines, and document search results. Gary Geisler [2] and the On-line Library of Information Visualization Environments (OLIVE) [5] provide an excellent overview of such research. Figure 2 shows SeeSoft, a package from Visual Insights, a Lucent Venture [4], that visualizes lines of code in large software engineering endeavors. Using color and scaling, it provides a remarkably effective means of understanding the various components and inter-relationships between modules.

Figure 2: SeeSoft from Lucent Technologies, a system to visualize code in software [4]

However, there are very few visualizations of individual columns of numerical data. We found two software packages that concentrate on the visualization and analysis of 1D data.

xgraph: This tool is billed as a "freely available, lightweight and easy to use visualization client for viewing 1D data files" [7]. Figure 3 shows a screenshot of xgraph. It claims the use of a line plot to show 1D data, which is actually a technique to show 2D data.

xgraph can be used to view 1D data files with the format:

    "Time=0.0
    0.0 0.0
    0.2 0.04
    0.4 0.16
    0.6 0.36
    0.8 0.64
    1.0 1.0

    "Time=1.0
    0.0 0.0
    0.2 -0.04
    0.4 -0.16
    0.6 -0.36
    0.8 -0.64
    1.0 -1.0

Figure 3: Xgraph [7]

xPloRe: This package contains a "teachware quantlet" called tw1d [8] to visualize 1D data. While containing useful statistical features [See Figure 4], the tool uses bar graphs to display 1D data. This again is a 2D visualization technique because the x-axis displays the order of the data which in most cases is time.

Figure 4: xPloRe: Teaching Quantlet tw1d, to visualize 1D data. [8]

Implementation

Architecture:
Our inspiration was the Java Swing model, which separates the underlying data storage from its display on the screen. This allows several graphical components, or different views within a component, to display the same data without shuffling the data around. We have stuck to this model as closely as possible for efficiency and simplicity, but in the end we employ a (redundant) intermediate layer of storage for graphical objects that is recreated every time we render a graph. [See Figure 5]

Figure 5: Dataset Architecture
We've tried to separate the graphical display from the data storage model. Furthermore, since we will run a variety of algorithms on the dataset, we have tried to minimize the overhead of calculating new sets of results. Thus, each set of results stores references into the Dataset, which avoids copying data that is already stored elsewhere. To make the Tunable Viewtips tool easily extensible, we have a simple and reasonably effective object-oriented framework that treats one-dimensional and two-dimensional results (and higher-dimensional results, if we choose to add that functionality) virtually identically, allowing the GUI to process and visualize results appropriately. [See Figure 6]

Figure 6: GUI Architecture

Having comparatively little experience writing GUI applications, we made a few rookie mistakes that have made our software more complicated than we intended. Displaying the data directly out of our underlying Dataset structure was more difficult than we had anticipated, due to the lack of effective tools for displaying simple things like scatter plots and bar charts. Ultimately, the need for results outweighed any purist desires for a clean implementation, and we made a few compromises. Data is stored into a redundant intermediate layer before being displayed by our much-modified version of a third-party tool [17]. This happens every time we render a new display; it is wildly inefficient, but it saved a lot of development time.
Technical Features
File formats: We read tab-delimited text files into a simple internal data structure, and then apply our algorithms to this simple data structure.
Implemented Algorithms: We currently have implemented five statistical algorithms [9, 10] to examine single dimensional data:
- Number of Outliers: This calculates the number of outliers in a single column of data based on whether a particular element is within the 'span' of a multiple [x] of the standard deviation. The factor [x] can be specified by the user. In most cases, a dataset is composed of multi-column data and this algorithm ranks columns of such datasets in descending order of number of outliers found. Changes to the factor [x] are immediately reflected in the graphing window, as are the recalculated outlier counts and the associated ranking of columns.
- "Outlyingness": Similar to the above algorithm, this enables the user to rank columns of data not by the total number of outliers found, but by how 'far' an outlier may lie. A column of data may not have the most outliers, but may contain one outlier that is significantly further from the rest of the data points. A point-based system, based on how far each point is from the standard deviation, is used to rank such columns higher than others.
- Uniformity: This algorithm measures how uniformly the elements of a column of data are spread out between the minimum and maximum data values. Variance between successive points is measured to arrive at a uniformity metric. The higher this number, the more uniform the data, and this fact is used to rank columns of data within a dataset.
- Cluster Finder: Going one step further, we attempt to find clusters within the data. A very simple clustering algorithm measures if data points are within a certain percentage of the range (max - min) of the data once an initial value in the cluster is fixed. This percentage is user defined. Currently, every point can belong to only one cluster, but it is not hard to extend this so that points may belong to different clusters. Bounding boxes are used to indicate clusters. Again, columns of data are ranked by the number of clusters found.
- Cluster Finder II: A slight variation of the above algorithm measures if successive points are within a certain percentage of the range (max - min) of the data. This percentage is again user defined and columns of data are ranked by the number of clusters found.
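The two outlier rankings described above can be sketched roughly as follows. This is only an illustrative Python sketch, not the tool's actual code: the function names, the use of the population standard deviation, and the exact "outlyingness" scoring (how many standard deviations each outlier lies past the threshold) are all assumptions made for the example.

```python
from statistics import mean, pstdev


def count_outliers(values, factor):
    """Count values farther than `factor` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return 0
    return sum(1 for v in values if abs(v - mu) > factor * sigma)


def outlyingness(values, factor):
    """Assumed scoring: sum of how far (in standard deviations) each
    outlier lies beyond the `factor` threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return 0.0
    return sum(max(0.0, abs(v - mu) / sigma - factor) for v in values)


def rank_columns(columns, score, factor):
    """Return column names ordered by descending score."""
    return sorted(columns, key=lambda name: score(columns[name], factor),
                  reverse=True)


if __name__ == "__main__":
    columns = {
        "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "b": [5, 5, 5, 5, 5, 5, 5, 5, 5, 50],
    }
    # Re-running with a different factor mimics moving the slider.
    for factor in (1.0, 2.0, 3.0):
        print(factor, rank_columns(columns, count_outliers, factor))
```

Because both scores take the factor as a plain argument, a slider only needs to call `rank_columns` again with the new value.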
Easily extensible: New algorithms can easily be plugged into our underlying architecture, and can make use of the existing dynamic feedback features. The GUI need know nothing about how the results were computed. We have written the framework to compute two dimensional results and implemented the same Pearson's Product Moment Correlation metric that Spotfire uses. All we require is a two-dimensional scatterplot display tool. We have already begun work on this.
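One way to picture this kind of plug-in arrangement is a small registry that maps algorithm names to scoring functions, so that the display code only ever sees ranked results. The sketch below is hypothetical and written in Python for brevity; `REGISTRY`, `register`, `rank`, and the toy `value_range` algorithm are names invented for this example and are not taken from the tool.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# An algorithm maps one column plus one tunable parameter to a numeric score.
Algorithm = Callable[[Sequence[float], float], float]

REGISTRY: Dict[str, Algorithm] = {}


def register(name: str):
    """Decorator that adds a scoring function to the registry."""
    def decorator(func: Algorithm) -> Algorithm:
        REGISTRY[name] = func
        return func
    return decorator


@register("range")
def value_range(values: Sequence[float], _param: float) -> float:
    """Toy algorithm: score a column by its spread (parameter unused)."""
    return max(values) - min(values)


def rank(columns: Dict[str, Sequence[float]], algorithm: str,
         param: float) -> List[Tuple[str, float]]:
    """Score every column with the named algorithm and sort descending.
    The caller never needs to know how the score was computed."""
    score = REGISTRY[algorithm]
    results = [(name, score(vals, param)) for name, vals in columns.items()]
    return sorted(results, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(rank({"x": [1, 2, 3], "y": [0, 10]}, "range", 0.0))
```

Adding a new algorithm then amounts to writing one more decorated function; the ranking and display paths stay untouched.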
Visual Statistical Algorithmic Debugging: New statistical algorithms can be integrated at a later date, without the overhead of the Spotfire plug-in API or any related proprietary file formats. Furthermore, the results of these algorithms can be quickly visualized to determine whether they are along expected lines.
Dynamic Query Mechanisms: [11]
- Standard deviation slider: Allows dynamic manipulation of the number of standard deviations used to define an outlier for the two outlier algorithms. It would be very easy for any new outlier algorithms to use this feedback mechanism. We have chosen to limit the maximum number of standard deviations to 3 since statisticians consider an outlier of more than 3 standard deviations to be a "hard" outlier [10], and datasets rarely have outliers of more than 3 standard deviations.
- Jitter slider: Controls the "jitterness" of the data. One-dimensional jitter is a surprisingly effective technique. Since the location of points along the Y axis encodes their value, jittering along the X axis reduces occlusion without sacrificing data. [See Figure 8 (a) and (b) ] As we are bound by the physical space of a graphing window and it is not uncommon for one dimensional data to be densely packed, we use a 'reflection' technique to eliminate as much occlusion as possible. [See Figure 7]
- Cluster slider: Controls the "tightness" of the clustering algorithm(s). We define tightness as the percentage of the range (max data value - min data value) that determines the size of a cluster. The mechanism for calculating bounding boxes could easily be used by other clustering algorithms.
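To make the "tightness" parameter concrete, here is a rough Python sketch of the two cluster finders as we read them from the descriptions above. It is not the tool's code: sorting the values first, the function names, and the `min_size` filter for bounding boxes are assumptions made for illustration.

```python
def find_clusters(values, tightness):
    """Variant I (assumed reading): the first point of a cluster anchors it,
    and later points join while they stay within tightness * range of it."""
    if not values:
        return []
    ordered = sorted(values)  # assumption: clusters are found on sorted data
    width = tightness * (ordered[-1] - ordered[0])
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][0] <= width:
            clusters[-1].append(v)
        else:
            clusters.append([v])  # each point belongs to exactly one cluster
    return clusters


def find_clusters_successive(values, tightness):
    """Variant II (assumed reading): a point joins the current cluster if it
    is within tightness * range of the previous point."""
    if not values:
        return []
    ordered = sorted(values)
    width = tightness * (ordered[-1] - ordered[0])
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= width:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def bounding_boxes(clusters, min_size=2):
    """Return (low, high) extents for clusters with at least min_size points."""
    return [(c[0], c[-1]) for c in clusters if len(c) >= min_size]


if __name__ == "__main__":
    data = [1, 2, 3, 10, 11, 12, 20]
    print(find_clusters(data, 0.1))
    print(bounding_boxes(find_clusters_successive(data, 0.1)))
```

Since the boxes are derived only from the cluster extents, any other clustering routine that returns lists of points could reuse `bounding_boxes` unchanged.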
Color scheme: Color redundantly encodes the value. The minimum is always a very dark blue (close to black), and the maximum is red. We interpolate by subtracting blue and adding green until we hit pure green; then we begin subtracting green and adding red. This yields a nice interpolated color encoding which clearly shows the minimum and maximum. Clusters can also be identified based on their color patterns, though it is important to note that the gradations of color are not always consistent. It is easier to use color to identify a cluster that is close to the mid-point of interpolation (0, 255, 0) than it is to spot one halfway between the mean and max (about (125, 125, 0)), since a light green is easier to distinguish than a color somewhere between red and green.
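The color ramp can be approximated with a piecewise linear interpolation through three anchor colors. The Python sketch below is an illustration under assumptions: the exact dark-blue starting color `(0, 0, 64)` is a guess (the text only says "very dark blue"), and the function names are invented for the example.

```python
DARK_BLUE = (0, 0, 64)   # assumed RGB for the "very dark blue" minimum
GREEN = (0, 255, 0)      # mid-point of the interpolation
RED = (255, 0, 0)        # maximum


def _lerp(c0, c1, t):
    """Linearly interpolate between two RGB triples for t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))


def value_to_color(value, lo, hi):
    """Map a value in [lo, hi] to an RGB triple: blue to green to red."""
    if hi == lo:
        return GREEN  # degenerate column: every value gets the same color
    t = (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))
    if t <= 0.5:
        return _lerp(DARK_BLUE, GREEN, t * 2)
    return _lerp(GREEN, RED, (t - 0.5) * 2)


if __name__ == "__main__":
    for v in (0, 25, 50, 75, 100):
        print(v, value_to_color(v, 0, 100))
```

With these anchors, a value three quarters of the way up the range lands near the murky red-green mix that the text notes is hard to distinguish.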

Demonstration


In this section we present application of our tool to various kinds of datasets. As far as possible, we have used 'real-world' data, which allows us to interpret the results in some meaningful manner.

Error Detection: The first example considers a dataset with just one column of data, namely the closing price of the Dow Jones between the years 1900 and 1901. This is part of a very large data set obtained from CMU's Statistics repository [3]. Simply loading this dataset into the tool made it immediately obvious that the dataset contained errors: there were occurrences of negative closing prices, which are naturally absurd. [See Figure 7]

Figure 7: Dow Jones Closing Price, 1900 - 1901 [3]

Advantage of the Jitter Feature: The above example uses the Jitter feature to show as much of the dataset as possible. This is extremely useful for visualizing datasets which have many repeat values or are very densely packed within certain ranges. We present an example to show the difference when Jitter is turned on and off when viewing a dataset consisting of 5 columns of data each containing the price of a particular stock recorded over the past 29 months. They are Intel (INTC), Cisco (CSCO), Microsoft (MSFT), Human Genome Sciences (HGSI) and General Electric (GE). The figure shows the column containing the Microsoft data [Data Courtesy: MSN Moneycentral, www.moneycentral.com]. This output also shows that the Outlier detection algorithm has been performed. [See Figure 8 (a) and (b)]

Figure 8 (a) and (b): Microsoft Stock Price over the last 29 months. With and Without Jitter

Use of Outlier and Standard Deviation Slider Bar: The figure below shows the grading sheet of CMSC 434 offered in Spring of 2001. The columns are grades assigned in various homework assignments and projects, in addition to a column showing total number of points and overall percentage grade. The current column being viewed shows the percentage grade. Using the outlier detection algorithm and the standard deviation slider bar, we are able to get a fairly accurate view of the "Letter Grade Spread". For instance, with a normal grading scale in place (A ~ 90 and above, B ~ 80 and above) and taking into account class performance and average, we could assign about 14 - 15 A's, 4 C's and award the rest of the class B's. [See Figure 9]

Figure 9: Overall Percentage Grade of CMSC 434, Spring 2001

Weaknesses

As with many visualizations, our tool is not without problematic areas. Here we list some that we have identified:

There exists a tool-tip feature that allows the value of the data point to be shown when the mouse is hovered over the particular point. In densely packed data sets, this might not be the best way to view individual points.

Bounding Box Overlap: We use bounding boxes to indicate the presence of clusters. Our cluster algorithms currently require that a data point belong to just one cluster. Figure 10 shows the cluster algorithm being run on a dataset showing grades of a particular CMSC course. The column being examined is the overall grade percentages. The impression given here is that some points belong to 2 clusters. This is not true; it is caused by the fact that the padding for each of the bounding boxes happens to overlap data points that do not belong to the particular cluster. This could often be a problem if we are examining densely packed column data. However, we are limited by pixel size within the physical space.

Figure 10: Example of a bounding box problem

Currently the tool lacks common visualization techniques such as zooming, panning, selection and filtering. However, these features are being targeted as immediate future work. Inclusion of these capabilities will greatly enhance the functionality of Tunable Viewtips.

Contributions

Our tool makes two contributions. First, we are expanding and improving Spotfire's 'view tips' visualization feature by incorporating dynamically tunable algorithms. Second, we are exploring the profiling of true one dimensional data.

What new visualization features does our code add? Initially, we intended to write a tool that emulated Spotfire's 'view tips' tool, and displayed interesting two-dimensional plots based on different algorithms. However, much work has been done on two dimensional statistical analysis. Instead, we focus on one dimension at a time. This has several advantages: the algorithms are much faster (two dimensional comparisons require pair-wise enumeration over all n columns, which grows at about n²/2), and there has been comparatively little work done in this area. However, because we've tried to separate the display from the data, the core functionality for two (or more) dimensional display already exists.
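The quadratic growth of the two-dimensional case is easy to see in a sketch of pairwise Pearson ranking: n columns yield n(n-1)/2 pairs, each needing a full pass over the data. The Python below is illustrative only; the function names and the choice to rank by the absolute value of the coefficient are assumptions, not a description of Spotfire's or our tool's exact behavior.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(vx * vy)


def rank_pairs(columns):
    """Score every unordered pair of columns: n * (n - 1) / 2 comparisons.
    Assumption: pairs are ranked by the magnitude of the correlation."""
    scored = [((a, b), pearson(columns[a], columns[b]))
              for a, b in combinations(columns, 2)]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    columns = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [1, 0, 1, 0]}
    for pair, r in rank_pairs(columns):
        print(pair, round(r, 3))
```

By contrast, a one-dimensional profile touches each of the n columns once, which is why the single-column algorithms stay fast as datasets widen.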

We are not aware of a tool that performs "jittering" in one dimension. This is especially effective because it solves the problem of occlusion without altering the sanctity of data points along their axis. The use of the second dimension is only a trick in "pixel space"; the data points still line up with their correct location along the Y axis.
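A minimal Python sketch of one-dimensional jitter follows. The Y coordinate is left untouched, and only the X coordinate is perturbed within the plot width. The 'reflection' step is not specified in detail in this paper, so the version here is purely an assumed interpretation: an X position that would fall outside the plot is mirrored back inside. The function names, the `amount` parameter, and the fixed seed are likewise hypothetical.

```python
import random


def reflect(x, lo, hi):
    """Mirror x back into [lo, hi] if it falls outside (assumed technique)."""
    span = hi - lo
    if span <= 0:
        return lo
    offset = (x - lo) % (2 * span)
    return lo + (offset if offset <= span else 2 * span - offset)


def jitter_positions(values, width, amount, seed=0):
    """Return (x, y) plot positions: y is the data value itself, while x is a
    random offset around the plot center, reflected to stay within the plot.
    `amount` is the slider setting, as a fraction of the plot width."""
    rng = random.Random(seed)
    center = width / 2
    half = amount * width
    return [(reflect(center + rng.uniform(-half, half), 0.0, width), v)
            for v in values]


if __name__ == "__main__":
    data = [5, 5, 5, 6, 6, 7]
    for x, y in jitter_positions(data, width=100.0, amount=0.8):
        print(round(x, 1), y)
```

With `amount` at zero every point sits on the center line, which reproduces the un-jittered plot; larger settings spread repeated values across the available width.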

We position our tool as a profiling tool used to glean basic statistical information from a new dataset. Tunable Viewtips is a new way of visualizing a dataset. However, we are also visualizing new aspects of a dataset, namely the individual columns.

Imagine that we have many results from time in [0..N]. We would like to know which of these results show interesting properties. We are not really interested in the relationship between the results at different timesteps; we just want to know which of the vectors of results show interesting statistical properties (clusters, gaps, outliers, etc.). Our tool can profile the dataset for such information and show a ranked list of which columns could be examined.

Possible Application Areas

There are a number of areas where one-dimensional data is very useful. Some of these include fluid dynamics [12], image analysis and enhancement [13], information retrieval [14], and motion in 1D such as uniform and non-uniform acceleration or retardation [15]. We expand briefly on a couple of these application areas.

Figure 11: Artificially triggered powder snow avalanche in the avalanche dynamics test region of Vallée de la Sionne, Switzerland,
Picture Courtesy: Swiss Federal Institute for Snow and Avalanche Research Davos [12].

Both dense and powder snow can produce avalanches. The fluid dynamic calculations involved in simulating such activity involve the calculation of one dimensional flow. These flow calculations help predict the motion (velocities, dynamic pressures) of avalanches and visualization of such data is a key factor in the analysis of such simulations. [See Figure 11]

Another area where 1D data is used often is image analysis and enhancement. The images are examined as a matrix, and each column of the matrix, corresponding to a single column of pixels, is analyzed individually to spot effects like edges and repetitiveness. These columns are grouped as histograms and are run through various mathematical functions. One way to enhance the visualization of the histogram of images after the application of an edge-detector operator is by using the logarithm of the histogram. Figures 12 (a) and (b) show the application of such a technique.

Figure 12 (a) and (b): Application of an edge-detector operator to enhance the image.

In addition, we think our tool can be used to explore individual columns in datasets that have traditionally been part of multi-dimensional exploration. Colleagues have recommended that we examine data showing characteristics of Amino Acids (possibly where reduction to 1D has been performed) and traditional temporal data where the behavior of data could be examined outside the consideration of time.

Future Work

There is much room for improvement in our tool. Currently, version one supports the visualization and analysis of 1D data alone. We would like to reach the proposed goal of having a tool that will support multi-dimensional data. Using another open source graphing tool [17], we have begun initial work on adding 2D support to our existing tool [See Figures 13 (a) and (b)]. We have successfully implemented the Pearson Correlation metric to rank pairs of columns in a dataset. This already replicates the functionality of Spotfire's View Tip. Once we've coupled this with the Tunable Viewtips 1D features we have described, we will have greatly enhanced the View Tip mechanism.

There are two general directions that future work can take. First, this work could be integrated into an existing visualization tool, such as Spotfire or Stardom [16]. This approach makes sense, as any interesting visualization mined by our standalone tool would need to be imported into a more mature tool anyway for further analysis. Second, we can add more features to our current tool.

Regardless of which direction future development takes, we are scoping out some other improvements. First and foremost, we want to test out new algorithms. We have algorithms for similarity and gaps in two dimensions that we'd like to run once we find or write a decent 2D display tool. The cluster box mechanism in 1D will correctly draw the boxes regardless of how they're computed. We'd like to add a non-greedy algorithm that finds the maximum cluster size in each dimension. Second, we want the ability to zoom in, especially for densely packed data sets. We'd like to zoom into a part of a 1D or 2D plot and run algorithms on that subset of our data set. Next, we want a dynamic filtration and selection mechanism where the user can specify ranges with the mouse and filter the data. This would be most useful in 2D where any limit to the ranges, gaps and clusters will help narrow down the search space for the algorithms. Finally, we need to fix some glaring inefficiencies in the intermediate data storage format by eliminating it. The display widget should not store any data; any data that it requires should be read directly out of the dataset.

Figure 13 (a): Plotting the stock prices of Intel Vs. Cisco since Jan '99
Figure 13 (b): 2D Plot of UACC383 Vs. KA in the Melanoma dataset

Acknowledgements

Our sincere gratitude to Larry Leonard [17] of Definitive Solutions, Inc. for allowing us to use his Microsoft VC++ based 2D Graphing Class. As we are novices on this development platform, it provided a great starting point to develop what we believe is a useful tool. We would like to thank Narendar Shankar for his assistance in the GUI development, Dave Hovemeyer for his help in porting our Unix code to the Windows platform, Brian Postow for his suggestions to improve our Jitter feature, Rezarta Islamaj and Omer Horvitz for sitting through multiple demonstrations, and Jinwook Seo and Bongshin Lee for inspiring us to use the Melanoma dataset. We also greatly appreciate Dr. Ben Shneiderman and Dr. Catherine Plaisant's guidance through the various stages of our project.

References

[1] Spotfire, "Help on the Viewtip Feature", pp. 131 - 134, Spotfire Manual, www.spotfire.com
[2] Geisler, G., "Making Information More Accessible: A Survey of Information Visualization Applications and Techniques", http://www.ils.unc.edu/~geisg/info/infovis/paper.html
[3] CMU's StatLib repository, http://www.stat.cmu.edu/datasets
[4] SeeSoft, Software visualization tool, Lucent Technologies, Visual Insights, http://www.visualinsights.com/
[5] OLIVE: Multidimensional Data - 1D, http://otal.umd.edu/Olive/1D.html, University of Maryland
[6] Fortner, B., "The Data Handbook: A Guide to Understand the Organization and Visualization of Technical Data", pp. 91-102, Spyglass, 1992.
[7] XGraph: Animated, Easy Client for 1D Line Plots, http://www.cactuscode.org/VizTools/xgraph.html
[8] XPloRe, Teachware quantlet "tw1d", http://www.quantlet.de/scripts/xlg/html/xlghtmlnode22.html
[9] Stockburger, D.W., "Introductory Statistics, Concepts, Models and Applications", http://www.psychstat.smsu.edu/introbook/sbk00.htm, Southwest Missouri State University
[10] Neter, et al., "Applied Linear Statistical Models", IRWIN, 4th Edition, 1996
[11] Shneiderman, B., "Dynamic Queries for Visual Information Seeking", IEEE Software, 11(6), 70-77
[12] AVAL-1D, Numerical Calculation of One Dimensional Flow in Avalanches, http://www.slf.ch/aval-1d/welcome-en.html
[13] Use of 1D Data in Image Analysis and Enhancement, http://www.khoral.com/contrib/contrib/dip2001/html-dip/c4/s6/node3.html
[14] Jonsson, H.A. et al., "Retrieval of One Dimensional Data", Proceedings of the 3rd Basque International Workshop on Information Technology '97.
[15] Pausch, R. et al., "One Dimensional Motion Tailoring for the Disabled: A User Study", pp. 405-411, ACM CHI '92.
[16] Cailleteau, L., "Interfaces for Visualizing Multi-valued Attributes: Design and Implementation using Starfield Displays", ftp://ftp.cs.umd.edu/pub/hcil/Reports-Abstracts-Bibliography/99-20html/99.20.html, University of Maryland
[17] Larry Leonard, "2D Graphing Class", http://www.codeguru.com/controls/SimpleGraphControl.html
[18] Paul Barvinko, "2D Visualization Class", http://www.codeguru.com/controls/graph2d.shtml
[19] Tufte, E., "The Visual Display of Quantitative Information", Graphics Press, Cheshire, CT, 1983.
