Dynamic Aggregation
Lida Tang
Department of Computer Science
University of Maryland
College Park, MD 20720
(301)405-2715
ltang@cs.umd.edu
1. Introduction
Current technologies have enabled
massive collections of data. Unfortunately, disk storage grows at a faster rate
than MooreÕs law. Dynamic queries is an interactive technique for data
exploration.[1] A requirement of dynamic queries is that the visualization must
keep up with the user's manipulation. This poses a problem when the dataset
grows very large, how can the visualization keep up. Since a large portion of
the computer's computation is spent on visualization, when the data sets grow,
the time to complete drawing grows proportionately. Aggregation is a very
effective way of managing large data sets. It summaries groups of similar data
elements and can greatly reduce the number of glyphs that must be shown on the
screen. Because users can specify how to aggregate the data, the important aspects
of the data will be preserved while the data set is reduced.
1.1 Motivation
Large datasets poses two problems to
interactive visualization. One is how to represent the elements on screen fast
enough. Second is even if you draw it on the screen, can the user even
understand it. Visual occlusion is a problem in general for visualization. If
the user canÕt see the data point, then the time spent drawing the item was
wasted. This problem can be solved fairly easily, SpotFire randomly jitter the
data points continuously, so that clusters that occupy the same point can be
seen. This technique canÕt be scaled up easily. [2] With larger data sets, the
occlusion problem grows more and more pressing. The average size of current
displays, in terms of pixels, is 1024x768, with the highest resolution being
1600x1200. The maximum number of data points that can appear on screen with no
occlusion is then around 2 million; with each data item one pixel in size. Due
to the non-uniform nature of most data sets, many data points will still
occlude others. Because occlusion
hides can many data points, the visual representation can lie to the user by
not showing clusters that exist in the data. Figure 1 shows 380016 data points
using a scatter plot, it is a plot of clients to a website versus the time a
request came. Are there any patterns in that data? Size coding of data points
further decreases the maximum number of data points shown and increases
occlusion. The number of data points that can be effectively understood with size
coding enabled is probably orders of magnitude less than the theoretical
maximum. Thus, aggregation is needed to reduce the total number of data points
down to a manageable size. This also will reduces the rendering time of the
display and increases the interactively of the whole system. Figure 2 shows
only 24869 data points with the same axes, but now the data points represent
the number of requests that came from a client in an hour.

Figure 1 Web data when viewed without aggregation

Figure 2 Showing client requested aggregated by hour
1.2 Related Works
Using aggregation and Dynamic
Queries (DQ) together is not a new idea. Goldstein et al [3] proposed it in 1994.
An interface mechanism called Aggregate Manager (AM) was combined with DQ,
which produced a powerful combination. (see Figure 2) DQ is used to select a
subset of the data set; this can be transferred over to AM as an aggregate
group. AM can then do aggregation on different aggregate groups, then pass the
data back to DQ for display. One of the lacking area of DQ is providing
conjunct of disjunct groups, which can be performed by AM. Users can create
multiple groups, and then pass them to the AM. The AM can then create an
aggregation based on those groups and pass them back to the visualization.
Using AM along with DQ provides many possible combinations for data
manipulation, which is powerful but can be hard for users to understand and
fully control. One of the success of DQ is that it is simple to understand and
operate, and yet is sufficiently power for most tasks. The coupling of AM with
DQ disrupts the simplicity.

Figure 3 Data flow for DQ and AM
An alternative approach to
user-controlled aggregation is automatic aggregation. Chuah [4] used automatic
aggregation in SolarPlot, a circular histogram. Elements are mapped to a pixel
on the circumference of a circle, the height of a spike that emanates from the pixel
represents the number of data values that fall with in that pixel. This
aggregation is intuitive and simple, the scale of the aggregation depends on
the diameter of the circle, and the aggregated value is easily understood.
There are several limitations to using SolarPlot. The use of a circle as the
primary visualization leaves a large portion of the display unused, while part
of the motivation for aggregation is to use screen space more efficiently.
SolarPlot only encode one dimension of data in the visualization, any
correlations between fields are harder to find as a result. Rayson's [5]
aggregate towers provide another automatic aggregation algorithm. The data
points are displayed as a block on a 3d plane. As the user zoom in and out,
data points are clustered based on their geospatial location. The aggregate
groups are represented by a stack of data blocks pointing out of the plane.
This automatic technique alleviates 2D occlusion problem by forcing it in to
3D. These stacks of data towers will occlude each other in 3D, but is easily
remedied by allowing the user to freely rotate the view.

Figure 4 SolarPlot showing a histogram

Figure 5 Close up and zoomed out view of Aggregate Towers
2. A Simple Manual Aggregation
Interface
Automatic aggregation is useful as a
way to reduce occlusion. However, having no user control makes automatic
aggregation of limited use. Goldstein's AM is complex and hard to learn. A simple
interface for manual aggregation is a nice compromise.
What are the basic steps involved in
aggregation? First, a group of data item must be selected. In AM, this is done
manual by the user; SolarPlot and Aggregate Tower use a spatial criterion.
Secondly, the fields of the group must be summaries in a meaningful manner. In
AM, this is done by using selecting a summary function on an interested field,
the other interfaces provide a fixed histogram like representation. Lastly, the
group should be able to be divided into subgroups. Because DQ and aggregation
are a powerful combination, the current interface is designed to be integrated
with DQ easily. Spotfire's interface was the starting point of our system.

Figure 6 Spotfire
The data are plotted as a scatter
plot based on two axes. Combo boxes at the edges of the screen select the
fields being plotted. A panel on the side displays DQ controls and detail on
demand. The entire interface is in front of the user. Our system has similar
characteristics as Spotfire. The aggregation controls are located on the left
side so that DQ can be placed on the right side as is. The primary aggregation
control is a combo box that can be enabled or disabled. Specifying a group of data
is easy to achieve using DQ. However, creating many groups manually can be time
consuming and should be automated. The user only needs to select a field to
group on, and have the program figure out which data point belong in which
group. The default grouping algorithm used is equivalence grouping. This is an
easily understood algorithm and is fairly useful. Should the user require a
different grouping criterion, clicking on the "..." button to the
right of the combo box will bring up an options dialog. Here, the user can
choose which algorithm to use and to configure the algorithm to their liking.

Figure 7 Choosing a grouping alg.
Goldstein et al. defined four main types
of decomposition, or grouping algorithms:
Aggregate tower's occlusion
avoidance grouping would fall into the statistical methods, and SolarPlotÕs
scheme is considered an interval division algorithm. Once the grouping
computation is finished, the results are shown on the screen with each dot now
representing a particular group, the size of the dot is currently coded to show
the number of elements in that group. The secondary aggregation controls are
the aggregate method combo boxes. Those are located below the vertical axes
field selector, and to the left of the horizontal axes field selector. The user
can select different aggregation algorithms for each axis independently.
Creation of subgroups is exactly the same procedure as creating a group, select
a field and a grouping algorithm.
3. System demonstration
Our current implementation is in
Java. The main obstacle in using Java as a testing platform was not its speed,
but rather, its appetite for memory. This is a problem with dealing with large
data sets in general, however, Java's large primitive types and other
inefficiencies are ill suited to hold large amounts of data in memory. The
following graph is a plot of CPU time consumed to load the data versus the
number of row in the data file. The data file is in a plain text format read in
from the hard drive. CPU usage time is used instead of physical time because of
the low level of memory in the testing machine resulted in sever thrashing
which affect performance greatly and doesn't accurately reflect the program's
performance.

Figure 8 Performance graph
The dataset used is extracted from
web logs. The data is taken from University of MarylandÕs Computer Science web
server. Only the requests that belonged to the HCIL section of the website (www.cs.umd.edu/hcil) was extracted. Hochheiser et. al. [6] explored the
same dataset using SpotFire. However, some of their visualizations used
preprocessed data that could be created just by using our system. The data have
the following five fields:
The rough correspondence between
number of elements and file size is around 10k data points per Meg. The data is
first read into memory. Nominal data is sorted and given an ordinal value so
they can be represented in screen coordinates. Then all data is wrapped up in
different classes based on the type of data it is. It was surprising that the
load time is linear given that sorting is order n log(n). Perhaps even bigger
data sets are needed to show this effect.
Web log data is very large and have only
a few fields. Most traditional web analyzation packages create tables of
statistics and static graphs. The user merely feed the data to the program, and
it is the program that decides what to report back to the user. Hochheister et
al. argued that an interactive star field visualization, like Spotfire is a
valid way of analyzing web log data. However, find some of the interesting
features involved preprocessing and aggregation. Thus, using the same web data
will be a good test of the flexibility and power of our simple aggregation
interface. The data file used is 35 Meg tab delimited plain text file. The data
covers four weeks of requests from July 1, 2000 to July 28, 2000. Since the web
data consists of individual client requests, one logical grouping would be to
group by user. Just by viewing the size of the groups, one can detect abnormal
large number of requests from a particular user.

Figure 9 Showing frequent visitor
We find
that the Google spider the most frequent visitor of HCIL. To check out how much
bandwidth did Google consume, we just need to change the field we are viewing
to ÒBandwidth usedÓ and set the aggregator function to sum the field. From this
graph, we find that it isnÕt Google, but another crawler, EoExchange that is
taking the most bandwidth. We can also see that EoExchange consumes almost
three times more bandwidth and have one quarter less request.

Figure 10 Showing bandwidth hogs
Grouping by
url can similarly reveal which url is the most popular, and how much each byte
each url used. A more interesting visualization is looking at the first time an
url is accessed, this might show related url accesses. Since url that are close
are logically related in someway, thus lines and clusters represent interesting
information. The two vertical lines represent concurrent access of a set of
related files. They turn out to be Java classes used in an applet. Seeing this,
a web master might decide to make a jar out of those Java classes to save
bandwidth. The horizontal line is a series of slides that was crawled by
Google. This is interesting to see because the crawling happened over three
days, suggesting that Google tries to spread out crawling over time to avoid
affecting the website performance.

Figure 11 Interesting Access pattern
To find
repeat visitors to a particular page, one can group first by user, then by url.
Thus the size of the dot is based on how many times a user used the web page.
This is an important metric for websites, because it is a measure of the siteÕs
Òstickiness.Ó

Figure 12 URL stickiness
Viewing the
user visits across the time domain will show usage patterns of users, and will
show repeat users and one time visitors. Just grouping by time and clients
using equivalence grouping give a cluttered view. By changing the grouping
algorithm for the time field to DateGrouping and setting the scale to 1 minutes
shows a cleaner version of the same graph, but now horizontal lines are clearly
visible.

Figure 13 User activity over time using equivalence grouping
Grouping by day shows us that most users are repeat users, indicated by the solid lines across time. The bandwidth hog EoExchange shows up in this graph as well. GoogleÕs accesses are well hidden and spread out across days, which validate previous observation.

4.
Implementation and Design
The
logical view of the system is a linear flow of data. Data flows from a data
source and passes through many processors and the final product are values that
can be used directly by the visualization.
Figure 14
The flow of converting raw data into a visualization
![]()
This is an
abstraction that is used commonly in computer graphics (called rendering
pipeline), and in audio/video processing (see Java Media Frameworks), and
should be familiar to Unix users as well (pipes). The client of the data, the
visualization, requests data from the end of the pipeline (a.k.a. pulling), and
each processor request data from upstream. If the pipeline is empty, then the
data flows directly from the data source and is processed along the way. Each
processor should also implement caching, so that subsequent requests for data
are fullfilled quickly. The actual implementation is isn't nearly as simple as
the design because of optimization issues. A TextDataSource is
responsible for reading the data in from disk and is at the beginning of the
pipeline. The last processor in the pipeline is the Aggregator, this
class is responsible for the summing of information once the groups are
created. Because this always applied at the end, it doesnÕt need to be
integrated into the pipeline. AggregateDataSource is the class that does all of
the work. It calls GroupingAlgorithm class to break the data into
groups, then uses the Aggregator to produce the output.
Table 1
|
URL |
Client
host |
Group
by URL |
Group
by host |
Group
by both |
|
/hcil/jazz |
L4p26.dav.mother.com |
0 |
0 |
0 |
|
/hcil/elastic-windows |
J200.inktomi.com |
1 |
1 |
4 |
|
/hcil/jazz |
Dialin68.computeron.net |
0 |
2 |
2 |
|
/hcil/jazz/learn/ |
L4p26.dav.mother.com |
2 |
0 |
6 |
The most basic
grouping algorithm is equivalence grouping. This grouping is supported by all
data types and is the default algorithm used in the system. Table 1 shows some
sample data with the results of grouping. The first two fields are fields in
the actual data, the next two fields show the results of grouping on one of
those fields. The grouping algorithm will number the groups consecutively,
starting from 0. Identical entries in the field that we are grouping by produce
the same group number. We could use a hashing function to produce group
numbers, but sequential groups makes creation of subgroups easier. Once
grouping calculations are done for a particular field, it can be cached for
later use. Subgroups can easily be created using the group ids of the fields used
in subgrouping. Column 5 of Table 1 shows the group number when grouping by URL
and Client host. The following simple formula can be used to generate subgroup
ids:
Equation:
![]()
A problem
with using the above equation is when dealing with large group ids, multiplying
two large numbers might result in wrap around. Hashing based on the group ids
can solve this problem. Using this technique is faster than direct calculations
of the subgroups with no caching, since the grouping algorithms may take a long
time to complete. The only draw back to this technique is that the memory
requirements of caching can be quite large. A LRU or priority queue algorithm
can be used to manage the cache space if memory is scarce.
AggregationDataSource also caches the results of aggregation. It uses JavaÕs
hashtable to associate a field name with an array of values returned by the
Aggregator object. The amount of data needed to store these data is much
smaller due to the fact that the aggregated dataset is much smaller.
5.
Future Work
Dynamic
Queries provides a visual and interactive means of filtering of data. Filters
are nothing more than another kind of data processor, and as such can be
integrated with our system very easily. Integrating DQ into our current system
should require little change to our design. Since DQ requires binning of values
so each pixel on the visual control correspond to a group of data that can be
filtered out, we can use the interval grouping or element frequency grouping
algorithm that is already developed for dynamic aggregation. There are two
alternatives to the position that DQ will occupy in the pipeline, one before
aggregation and one after. Putting DQ before aggregation would probably be more
powerful but since this changes the data upstream, all data downstream would
have to be flushed and recomputed. This can be prohibitively slow for
interactive requirements of DQ. Putting DQ after aggregation can also be useful
and is much more computational tractable due to the smaller number of data
items. The integration of DQ to our system would produce a system that can
cover most of the data manipulation techniques.
Another
area of importance is how to compute with much larger datasets. Currently, our
system loads everything into memory before doing any computation. This is
clearly unfeasible for datasets with gigabytes and terabytes of information. If
aggregation means having to go through the entire database before an answer can
be returned, the user would suffer a long delay before any real exploration of
data can occur. A promising technique to integrate with would be combining the
dynamic aggregation interface with ÒOnline AggregationÓ developed by Haas and
Hellerstein [7]. Online Aggregation returns partial answers to database queries
along with a confidence measure. This confidence measure is updated
incrementally, meaning that partial results can be returned and viewed. In this
way, query on a large database merely takes more time to get a good enough
confidence measure, while the user can still explore the data returned.
6.
Conclusion
We have developed a linear data flow design that produces groups to be used in aggregation, unlike the cyclical flow used in AM. Due to the simplicity of the system, we believe that users can understand the state of the system and will be able to use the tool effectively. However, due to the inherent complicity of the aggregation concept, users should have in mind a specific aggregation that they require. Unlike DQ, in which users can explore and experiment with data, aggregation should be thought of as creation of a new dataset. This new dataset can then be explored by DQ. A usability test should be conducted to test how readily users understand using the interface and which grouping algorithm and aggregation algorithm are needed to have a rich set of tools so the user can find answers to more complex questions than what was considered in the paper.
http://www.cs.umd.edu/class/spring2001/cmsc838b/Project/Lida_Tang/aggie.jar
The
data is embedded in the jar itself. So running the demo should load a two days
worth of web data used in the paper. The full dataset is 35 Megs, which would
take a long time to load not to mention download.