Application Presentation: 
Internet Traffic Measurement Visualization
Nada Golmie

Data Set Description
The data set used in this project consists of Round Trip Time (RTT) and Loss measurements on some Internet packet paths. It is available from the National Laboratory for Applied Network Research (NLANR), which conducts performance measurements and traffic analysis of several NSF High Performance Connections sites in order to derive a better understanding of service models and metrics of the Internet. This includes both passive and active measurements. While passive measurements are mainly based on analysis of packet header traces, active measurements probe participating servers for information. There are about 80 monitors at various sites collecting data (Figure 1). Every minute, each machine on the list is pinged once, and the results are collected and stored. Raw data is available through a query mechanism from the Measurement and Operations Analysis Team (MOAT) Active Measurement Program (AMP) site. The data is tabulated and can be obtained in text form. It is indexed by day and source monitor. The measurements include Round Trip Time (Min, Max, Mean) and Loss to the 79 other sites. In order to gain additional insights into the data, I added three fields to the tabular display: (1) the route, (2) the source location, and (3) the destination location, as shown in Figure 2.
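The step of adding the three fields can be sketched as follows. This is only an illustration: the row layout (destination, Min, Max, Mean, Loss), the site names, the region table, and the reading of "route" as a source-to-destination label are all assumptions, not the actual AMP text format, and the sample values are made up.

```python
from dataclasses import dataclass

# Hypothetical lookup from monitor name to a coarse region; illustrative only.
SITE_REGION = {
    "alaska": "NW",
    "bu": "NE",
    "georgetown": "NE",
}


@dataclass
class Measurement:
    source: str
    destination: str
    rtt_min: float
    rtt_max: float
    rtt_mean: float
    loss_pct: float
    source_region: str       # added field
    destination_region: str  # added field
    route: str               # added field (assumed to be a "src->dst" label)


def parse_line(source: str, line: str) -> Measurement:
    """Parse one row (assumed layout: dest min max mean loss) and add fields."""
    dest, rtt_min, rtt_max, rtt_mean, loss = line.split()
    return Measurement(
        source=source,
        destination=dest,
        rtt_min=float(rtt_min),
        rtt_max=float(rtt_max),
        rtt_mean=float(rtt_mean),
        loss_pct=float(loss),
        source_region=SITE_REGION.get(source, "unknown"),
        destination_region=SITE_REGION.get(dest, "unknown"),
        route=f"{source}->{dest}",
    )


# Made-up sample row for the source monitor "alaska".
row = parse_line("alaska", "bu 180.0 250.0 201.5 2.0")
print(row.route, row.source_region, row.destination_region)
```

Sites missing from the lookup table are labeled "unknown" rather than rejected, so incomplete location data does not stop the reformatting.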
 

Figure 1 
 Figure 2



Typical Visualization: Gnuplot/Excel
This type of data set is usually visualized as a 2-D graph using gnuplot (or Excel). The plot in Figure 3 depicts the RTT in milliseconds (y-axis) as a function of the day of the year (x-axis) from one source (in this case the University of Alaska) to a specified destination (Boston University).
For the particular data set chosen, there are on the order of 80x80=6400 source and destination pairs (80x79=6320 if each monitor is paired only with the 79 other sites). In order to gain any insight into RTT we would need to look at several thousand plots, which makes this representation not very effective.
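The pair count can be checked with a few lines of Python. This is a small sketch with placeholder monitor names: the 80x80 figure includes each monitor paired with itself, while pinging only the 79 other sites gives 80x79 ordered pairs.

```python
from itertools import permutations

# Placeholder monitor names; the real site list is not reproduced here.
monitors = [f"site{i}" for i in range(80)]

# Ordered (source, destination) pairs with source != destination.
ordered_pairs = list(permutations(monitors, 2))

print(len(monitors) ** 2)   # all pairs, including a monitor with itself
print(len(ordered_pairs))   # pairs between distinct monitors
```

Either way, one 2-D plot per pair means several thousand plots.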
Figure 3

A More Interesting Visualization: Cichlid
A more interesting visualization of this data set is obtained with Cichlid, an experimental 3D visualization tool developed and maintained by MOAT. It was created in September 1998 to visualize the IP address space utilization of the Super Computing '98 conference network in real time. Cichlid's main features include real-time 3D display, animation, and point-and-click user interaction. It was designed with remote data generation and machine independence in mind; data is transmitted via TCP from any number of sources (data servers) to the visualization code (the client), which displays them concurrently. It is written in C using the OpenGL and GLUT graphics libraries and is publicly available. Although Cichlid could be a rather powerful and somewhat flexible visualization tool, it doesn't have much of a User Interface (UI). In order to make it run with my data set I had to write my own server based on the example servers provided with the distribution. That turned out to be a rather challenging experience.
The result is a 3D image (Figure 4) that associates an RTT with each source and destination pair. This type of display could be useful for troubleshooting, for example to observe the largest delay between two sites. However, it suffers from the occlusion and clutter that are usually associated with visualization of large data sets. In addition, this representation loses the geography associated with the sites, so you still have to look at a map in order to locate the sites and draw conclusions. It is also extremely difficult to find any correlation between the collected statistics, such as RTT and loss, and the source and destination locations.
 Figure 4


An Unconventional Visualization: Spotfire
For this project, my main objective (other than looking at a "cool" 3D visualization) was to try to find some correlation and trends from the data collected. Some interesting questions that I had in mind were:
  1. Does RTT depend on the distance traveled? On the geographic area?
  2. Where are the bottlenecks in the network?
  3. What is the smallest RTT from a particular source? To a particular destination?
  4. Is there any correlation between packet loss and RTT?
I thought that using Spotfire for visualizing a network data set would be rather unusual, since Spotfire does not have any built-in functionality to recognize the inherent relationships that exist between the different elements of the data. Most existing tools for visualizing network data, such as the commercial package netViz or SeeNet [1][2] developed at Bell Labs, tend to focus on the structure of the data and the relationships between the nodes rather than on the data itself and the statistics associated with it. Usually, the geographic placement of the nodes, which represents the physical network, is the most dominant element of the display. The statistics associated with the network structure are usually dealt with through dynamic and interactive control mechanisms such as those in the systems described by Eick [1] and Becker [2].

Here are a few sample screen shots taken while manipulating the data with Spotfire.

Loss - RTT Mean Relationship
The scatter plot in Figure 5 describes the loss percentage as a function of Mean RTT. A third and fourth dimension of the data are visualized by using color coding for the source location and size coding for Min RTT. From the figure we can make the following general observations: (1) routes originating in the NE (red) have relatively low loss (below 20%), and (2) routes originating in the NW (blue) generally have the highest RTT (around 200 ms) and loss percentage (above 20%). An in-depth analysis of the data is possible if we zoom in on the details by manipulating the interactive control panel. For example, we could isolate the routes originating in NW from the rest of the data and look at Min RTT, Max RTT, and Loss.
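The same question can also be put numerically. The sketch below is not part of the Spotfire workflow: it isolates the routes from one source region and computes a Pearson correlation coefficient between Mean RTT and loss. The rows are made-up sample values and the tuple layout is an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Made-up rows: (source region, Mean RTT in ms, loss in percent).
rows = [
    ("NW", 200.0, 25.0),
    ("NW", 210.0, 30.0),
    ("NW", 190.0, 20.0),
    ("NE", 40.0, 5.0),
    ("NE", 60.0, 10.0),
    ("SW", 100.0, 8.0),
]

# Isolate the routes originating in NW, as with the interactive filter.
nw = [(rtt, loss) for region, rtt, loss in rows if region == "NW"]
r_nw = pearson([rtt for rtt, _ in nw], [loss for _, loss in nw])
r_all = pearson([rtt for _, rtt, _ in rows], [loss for _, _, loss in rows])
print(f"NW only: {r_nw:.2f}, all routes: {r_all:.2f}")
```

A coefficient near 1 or -1 would suggest a strong linear relationship between loss and RTT; a value near 0 would suggest none, although it says nothing about non-linear structure that a scatter plot might reveal.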
Figure 5

Source - Destination - RTT Mean
Figure 6 is a 3D scatter plot that describes the distribution of Mean RTT with respect to the source and destination locations. The color and size coding used are the same as in Figure 5. In this case we observe that routes originating in SW (black) have a Mean RTT of around 100 ms to almost all destinations. We also note that Mean RTT for routes originating in NE (red) and ending in SE is smaller than for those starting in NE and ending in SW.
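Comparisons of this kind, such as NE to SE versus NE to SW, amount to averaging Mean RTT per (source region, destination region) pair. Here is a minimal sketch, assuming rows of the form (source region, destination region, Mean RTT); the values are made up for illustration.

```python
from collections import defaultdict


def mean_rtt_by_region_pair(rows):
    """Average Mean RTT for each (source region, destination region) pair."""
    sums = defaultdict(lambda: [0.0, 0])  # pair -> [running total, count]
    for src_region, dst_region, rtt_mean in rows:
        acc = sums[(src_region, dst_region)]
        acc[0] += rtt_mean
        acc[1] += 1
    return {pair: total / count for pair, (total, count) in sums.items()}


# Made-up sample rows: (source region, destination region, Mean RTT in ms).
rows = [
    ("NE", "SE", 30.0),
    ("NE", "SE", 50.0),
    ("NE", "SW", 90.0),
    ("SW", "NE", 100.0),
]

table = mean_rtt_by_region_pair(rows)
for (src, dst), avg in sorted(table.items()):
    print(f"{src} -> {dst}: {avg:.1f} ms")
```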
Figure 6

Path - Destination - RTT Mean
 
Figure 7
Figure 8
Figure 7 is a much busier 3D scatter plot representing all routes with respect to destination locations and Mean RTT. The size coding is set according to the Loss percentage and the color coding is set according to the source location. We note that the display is dominated by paths originating in NE (a majority of red). Paths originating in NW are split into two groups: one with relatively low loss but higher delays, and one with higher loss but lower delays. This 3D display contains close to 4300 data points, so it is not surprising that it suffers from occlusion and clutter. In Figure 8, only routes from and to Georgetown University are shown. We observe that the Mean RTT from NW sites (blue) is split into two groups: one (from and to California) well below 100 ms and another (from and to Alaska) around 200 ms.

Comments
I found the interactive control panel provided by Spotfire extremely useful, user friendly, and very intuitive. It provided a "nice" platform to manipulate the data and look for rather hidden aspects and relationships. Although Spotfire was not developed with network data in mind, it was the right tool to use given the type of questions that I wanted to answer. Spotfire cannot really be compared to Cichlid, where interactive control is limited to rotating the display in 3D and one has to program an interface for each data set.
I had relatively few problems using Spotfire. One thing I noted is that it crashed often, especially when saving the workspace or exporting the display. Also, the Edit/Properties menu button was disabled in the version I used.
In terms of suggested improvements, it would be nice if Spotfire had additional flexibility to manipulate the original data set, such as creating new categories from existing ones (i.e., deriving new columns from existing ones) or even creating subcategories (some hierarchical ordering of the data). In Excel one needs to write macros, which are rather cumbersome. I ended up writing a Perl script to reformat the original data and add new fields.



References
1. Eick, S.G. and Wills, G.J. Navigating Large Networks with Hierarchies. In Proc. IEEE Visualization '93, 1993.
2. Becker, R.A., Eick, S.G., and Wilks, A.R. Visualizing Network Data. IEEE Transactions on Visualization and Computer Graphics, 1(1):16-28, March 1995.
