SAR - Single Attribute Range-only Prototype
for Query Previews:
Date: 12th March 1999
Work Schedule
Description of the Prototype

The Current Prototype: Work in Progress - Most recent version at Raytheon STX

Images of Screens

Working Information for the Applet
This applet can be viewed in Netscape or in Internet Explorer. The recommended version downloads and installs a plug-in before running the applet. The plug-in allows Netscape to run the applet, which is written in Java 1.1. It also allows Internet Explorer users to view the applet as it will appear in a Unix environment. Since some of the applet interface features still look slightly different in Internet Explorer, the plug-in version is recommended for previewing. The applet can now run on PCs using Netscape and Internet Explorer and on Suns using Netscape.

Various options for the interface layout were explored, and a hybrid of the vertical length and color models was selected for implementation. One-dimensional variables will have histograms whose bar lengths are proportional to the number of datasets present. These bars will also carry redundant color coding. The same color coding scheme will be used for two-dimensional variables by overlaying a colored layer on each grid cell. The color of the overlay will depend on the number of datasets present in that grid cell.
Mockups of the three alternative schemes for the Interface
Selected Layouts - using redundant color and length coding for 1D data and color coding for 2D data

Discarded Interfaces
Vertical Interface with Annual (thick) Histogram Bars
Horizontal Interface
Transparent Grid - Opaque Map

MultiValued Attribute Problem:

Query Previews present visual representations of data distribution patterns along certain meaningful parameters. As users use specialized widgets to select the ranges of the parameters that interest them, the visual representations change and the user gains a better understanding of the data distribution patterns. Earlier implementations of query previews ran into the problem of multi-valued attributes. A multi-valued attribute is one that has more than one value for a given record; for instance, a movie has more than one actor. The first solution to this problem was to duplicate the entry for each instance of the multiple values. In some cases this led to a large explosion of the database and produced noticeably erroneous results (total number of hits). The next solution, applicable to situations where all the parameters are range variables, used Euler's formula to remove the duplicate counts.

Euler's formula computes the number of datasets that actually fulfil a query by using a few simple arrays of data. Temporal data is considered one-dimensional and has two arrays associated with it. The first array specifies the count of granules for each cell, and the second array specifies the number of granules that cross over from one cell to the next. For instance, the first array gives the number of granules that have data for each cell, say March 1979 and April 1979, and the second array gives the number of granules that have data in both months. If a user asks how many granules have data for March and April of 1979, the answer by Euler's formula is the sum of the per-cell counts less the sum of the crossovers between the selected cells. In the case of the geographic parameter, which is two-dimensional, there are two more arrays, corresponding to vertical crossovers and vertex (corner) crossovers. The sum of the vertical edge array is also subtracted from the net count, and the sum of the vertex array is added back in. The appeal of this solution, in addition to the fact that it elegantly takes care of the multi-valued attribute problem, is that all the client needs is a set of four arrays of at most 72 X 36 = 2592 integers each (10,368 integers in all, or about 331 kilobits, roughly 41 KB, assuming 32-bit integers). This is the maximum amount of data that needs to be transferred to the applet for each query.
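
The counting rule described above can be sketched in a few lines of Python. This is only an illustration, not the prototype's code: the function and array names (count_1d, build_2d, count_2d, h_edges, v_edges, corners) are hypothetical, and the sketch assumes that every granule covers a contiguous box of cells and that longitude wrap-around can be ignored.

```python
def count_1d(cells, crossovers, lo, hi):
    """Distinct granules with data anywhere in cells lo..hi (inclusive).

    cells[i]      -- granules that have data in cell i
    crossovers[i] -- granules that have data in both cell i and cell i + 1
    """
    return sum(cells[lo:hi + 1]) - sum(crossovers[lo:hi])


def build_2d(granules, rows, cols):
    """Build the four arrays from granules given as (r0, r1, c0, c1) boxes."""
    cells = [[0] * cols for _ in range(rows)]
    h_edges = [[0] * cols for _ in range(rows)]  # shared by (r, c) and (r, c + 1)
    v_edges = [[0] * cols for _ in range(rows)]  # shared by (r, c) and (r + 1, c)
    corners = [[0] * cols for _ in range(rows)]  # shared by the 2 x 2 block at (r, c)
    for r0, r1, c0, c1 in granules:
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                cells[r][c] += 1
                if c < c1:
                    h_edges[r][c] += 1
                if r < r1:
                    v_edges[r][c] += 1
                if r < r1 and c < c1:
                    corners[r][c] += 1
    return cells, h_edges, v_edges, corners


def count_2d(arrays, r0, r1, c0, c1):
    """Distinct granules touching the box r0..r1, c0..c1 (inclusive)."""
    cells, h_edges, v_edges, corners = arrays

    def total(grid, rows, cols):
        return sum(grid[r][c] for r in rows for c in cols)

    rows_in, cols_in = range(r0, r1 + 1), range(c0, c1 + 1)
    rows_between, cols_between = range(r0, r1), range(c0, c1)
    return (total(cells, rows_in, cols_in)
            - total(h_edges, rows_in, cols_between)
            - total(v_edges, rows_between, cols_in)
            + total(corners, rows_between, cols_between))


# Toy data: three granules over four months; cell 0 = March, cell 1 = April.
march_april = count_1d([1, 3, 1, 1], [1, 1, 1], 0, 1)   # 3

# Toy data: three granules on a 2 x 3 grid.
toy = build_2d([(0, 1, 0, 1), (1, 1, 1, 2), (0, 0, 0, 0)], rows=2, cols=3)
whole_grid = count_2d(toy, 0, 1, 0, 2)                  # 3
lower_right = count_2d(toy, 1, 1, 1, 2)                 # 2
```

Each granule contributes exactly one to the result because, for a box of a by b cells, a*b - a*(b-1) - (a-1)*b + (a-1)*(b-1) = 1, however large the overlap with the query is.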

The interface widgets were developed and a server side program was created to return the answers to the queries made by the client side applet. This paper is a brief description of the techniques that were used as solutions for this server side application.

The CZCS database, with more than 80,000 granules, was used as a trial dataset. The initial solutions have been limited to only two parameters, time and geography.

Interface Overview

Using this prototype tool, users can preview the data distribution along prespecified parameters and narrow their queries while minimizing the possibility of zero-hit and million-hit queries. The parameters can be one-dimensional, like time, or two-dimensional, like geographical area. To select a temporal region of interest, the user is presented with a screen that has a Range Slider, a histogram of the data distribution and a logarithmic scale. As the user slides the double slider to describe the zone of interest, the length of the bar in the scale reflects the number of datasets (or granules) that contain data for any part of that time period. The bar is also color coded according to the amount of data present. A similar screen is used for all one-dimensional attributes. The geographic area selection screen uses a rubber-banding box to select an area of interest. A translucent color grid is overlaid on the map. The color of each grid cell depends on the amount of data present in that area. The total amount of data present in the rubber-banding selection box is reflected in the color and length of the scale.
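
As a rough illustration of how a logarithmic scale with redundant color coding might map a granule count to a bar, consider the sketch below. The function names, the 200-pixel bar, the one-color-per-decade banding and the color names are all assumptions made for this example; the text does not specify the mapping the applet actually uses.

```python
import math


def bar_length(count, max_count, max_pixels=200):
    """Bar length in pixels on a logarithmic scale; zero granules gives zero."""
    if count <= 0:
        return 0
    return round(max_pixels * math.log10(1 + count) / math.log10(1 + max_count))


def color_band(count, bands=("white", "yellow", "orange", "red", "darkred")):
    """One color per decade: 0, 1-9, 10-99, 100-999, 1000 and up."""
    if count <= 0:
        return bands[0]
    digits = len(str(int(count)))          # 1 for 1-9, 2 for 10-99, ...
    return bands[min(digits, len(bands) - 1)]


# A selection holding every granule fills the whole bar.
full = bar_length(80000, 80000)
```

A logarithmic mapping keeps small counts visible next to very large ones, which matters when the goal is to steer users away from both zero-hit and very large result sets.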

Interface Widgets:

The interface reacts dynamically as the zone of interest is changed with the Range Slider.
The month at which each slider button is positioned is displayed dynamically.
The selected zone of the Range Slider is colored with the appropriate scale color.
A histogram of the data is presented. The histogram bars that are not selected are grayed out.
Users can select bars of interest by clicking on one and dragging to the end of the range of interest. The range slider and the scale update accordingly.
Logarithmic scales were developed and used.
In the geographic selection, users can select an area using a rubber-banding box, and the scale shows the number of granules in this area. The bounding box is colored according to the amount of data in the area it encloses.
Two types of transparent interfaces were created:
Transparent Grid - Opaque Map
Transparent Map - Opaque Grid
The transparent map interface was selected. Interface layout improvements were made.
Help Screens were developed for the interface:
Temporal Selection Help Screen
Geographical Selection Help Screen

Server Side Database Solutions

Solution 1: The Datacube Table.

Metadata from the 80,000 granules of CZCS data was reduced to a simple cube. For the two-parameter case, the three dimensions of the cube represent latitude, longitude and time. For a distribution over ten years at a monthly granularity and a geographic cell size of five degrees, the data-cube contains 72 X 36 X 120 = 311,040 cells. To implement Euler's formula, a maximum of four values is required for each of the cells. This can be thought of as four data-cubes. The size of the data-cube thus becomes about 40 megabits (roughly 5 MB, assuming 32-bit values). The size of the cube is independent of the size of the dataset but does depend on the size of the "cells" for the parameters. When the client makes a query, the server-side program identifies the slice of the cube that is relevant to the query and uses it to return the arrays. The data-cube was pre-created by an independent program. Every time a client applet made a preliminary inquiry by supplying a dataset name, the data-cube for that dataset was loaded. This loading process takes about three minutes. Preloading was considered impractical because it is not viable when the tool is used to preview multiple databases and multiple cubes would have to be preloaded. The loading time and the size of this cube are too large, so this solution will not scale up. Therefore a different technique was tried and adopted.
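
The sizing above can be checked with a few lines of arithmetic. The sketch below assumes that each value is stored as a 32-bit integer, which the text does not state; the constant and function names are hypothetical. Under that assumption the four cubes hold about 39.8 million bits, or roughly 5 million bytes.

```python
VALUES_PER_CELL = 4     # four values per cell, as described above
BITS_PER_VALUE = 32     # assumption: one 32-bit integer per value


def cube_cells(cell_degrees, years, bins_per_year=12):
    """Number of cells in a latitude x longitude x time cube."""
    lon_bins = 360 // cell_degrees
    lat_bins = 180 // cell_degrees
    return lon_bins * lat_bins * years * bins_per_year


cells = cube_cells(cell_degrees=5, years=10)     # 72 * 36 * 120 = 311040
values = cells * VALUES_PER_CELL                 # 1244160
total_bits = values * BITS_PER_VALUE             # 39813120
total_bytes = total_bits // 8                    # 4976640

# The cube grows with finer cells, not with the number of granules:
one_degree_cells = cube_cells(cell_degrees=1, years=10)   # 7776000
```

Going from five-degree to one-degree cells multiplies the cell count by 25, which illustrates why the cube size depends on the bin size rather than on the size of the dataset.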

Solution 2: The DataSet Index Table.

The data-cube was converted into a tabular format with an interface to an ActiveX database. The three dimensions of the data-cube represent time, latitude and longitude. The divisions, or cells, along each of these dimensions can be thought of as "bins" for that dimension. The table lists every bin ID together with the array values associated with it. The data table contains the following fields -

DsIndex - An identifier that is specific to the dataset. The table is expected to contain data from more than one dataset, and this ID number helps separate out the granules of the dataset of interest.

TimeBin - The bin that corresponds to each time interval cell. (For instance, in the CZCS case there are 120 TimeBins, twelve for each of the ten years.)

TimeEdgeFlag - If this flag is zero, the array stored in the same record is the array of granule counts for each cell; if the flag is one, it is the array of horizontal intersections.

ParamBin - This is for future use to specify the value of the bin number of the third parameter.

LatBin - The identifying number of the latitude bin.

RowCounts - Creating a separate record for every cell along a row would also have made the database too large, so this hybrid solution was adopted. For every given TimeBin, ParamBin and LatBin there is a row bin. The row bin holds a string that can be parsed into 288 integers: four arrays of 72 cells for the five-degree case. When a spatial query is sent by the client, the temporal and latitude information is sent to the database as a query, and the database returns a set of row counts. These are parsed into four arrays, and the arrays are truncated using the longitude part of the query. When a temporal query is sent, a similar process is followed, except that the TimeEdgeFlag is used to separate the two arrays.
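
The lookup described above might be sketched as follows. This is an illustration under several assumptions, not the prototype's server code: Python's sqlite3 module stands in for the ActiveX database, the table name DataSetIndex is hypothetical, the RowCounts string is assumed to be whitespace-separated with the four arrays stored one after another, and filtering a spatial query on TimeEdgeFlag = 0 is a guess.

```python
import sqlite3

CELLS_PER_ROW = 72  # five-degree longitude cells in one latitude row


def parse_row_counts(text):
    """Split a RowCounts string into four arrays of 72 integers each."""
    values = [int(v) for v in text.split()]
    if len(values) != 4 * CELLS_PER_ROW:
        raise ValueError("expected 288 integers, got %d" % len(values))
    return [values[i * CELLS_PER_ROW:(i + 1) * CELLS_PER_ROW] for i in range(4)]


def spatial_query(conn, ds_index, time_lo, time_hi, lat_lo, lat_hi,
                  lon_lo, lon_hi, time_edge_flag=0):
    """Return {(time_bin, lat_bin): four arrays cut down to lon_lo..lon_hi}."""
    rows = conn.execute(
        "SELECT TimeBin, LatBin, RowCounts FROM DataSetIndex "
        "WHERE DsIndex = ? AND TimeEdgeFlag = ? "
        "AND TimeBin BETWEEN ? AND ? AND LatBin BETWEEN ? AND ?",
        (ds_index, time_edge_flag, time_lo, time_hi, lat_lo, lat_hi))
    result = {}
    for time_bin, lat_bin, text in rows:
        arrays = parse_row_counts(text)
        # The database selects on time and latitude; longitude is cut here.
        result[(time_bin, lat_bin)] = [a[lon_lo:lon_hi + 1] for a in arrays]
    return result


# Toy table with a single record whose RowCounts string holds 0..287.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DataSetIndex (DsIndex INTEGER, TimeBin INTEGER, "
    "TimeEdgeFlag INTEGER, ParamBin INTEGER, LatBin INTEGER, RowCounts TEXT)")
row_text = " ".join(str(n) for n in range(4 * CELLS_PER_ROW))
conn.execute("INSERT INTO DataSetIndex VALUES (1, 14, 0, 0, 20, ?)", (row_text,))

result = spatial_query(conn, 1, 14, 15, 20, 21, lon_lo=10, lon_hi=12)
first_array = result[(14, 20)][0]    # [10, 11, 12]
```

Storing one string per latitude row keeps the number of records at one per TimeBin, ParamBin and LatBin combination, at the cost of parsing the string and cutting out the longitude range on the server for every query.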