One Million Items Treemap
40,000 items population of the USA scatter plot (Census 2000 data)
12,134 items Linux Kernel 2.5.33 source
To what extent can information visualization scale?
This project designs new algorithms and interaction techniques to adapt popular information visualization representations, such as treemaps and scatter plot diagrams, for displaying one million of items or more in a effective way, without resorting to any aggregation technique.
To achieve this goal, we are developing special techniques to:
We are relying on hardware graphics acceleration to allow for smooth transitions between views, interpolation between layouts and synthesis of graphics attributes such as "overlaps" (among other things).
Fekete, J.-D., Plaisant, C.
Interactive Information Visualization of a Million Items, Proceedings of IEEE Symposium on Information Visualization 2002 (InfoVis 2002), Boston, USA, Octobre 2002.
Interactive Information Visualization to the Million , Jan 2002 HCIL Technical Report.
See the other Visualization Projects at HCIL.
Visualizing one million of items on a 1600x1200 screen is a challenge in term of visualization, graphics, perception and interaction. We have designed new techniques to achieve it for treemaps and scatter plots.
Visualization systems typically draw items with a one-pixel border, spending two lines and two columns of pixels and sending the geometry twice. We use slightly shaded quadrilaterals so that they remain distinguishable when tiled or stacked.
Whereas treemaps are space-filling visualization techniques where areas never overlap, scatter plots cannot avoid overlap. Even with hundreds of items, the distribution tends to be sparse with areas of high density that are hard to see. Transparency is useful when up to five items overlap, but with one million items, hundreds of overlapping items are not rare. To solve that problem, we synthesize an overlap attribute to show the item density.
The upper right white rectangle shows that several items are overlapping exactly at that position. They are of same type and size and otherwise unnoticeable.
Using accelerated graphics card provides graphic attributes not available on traditional graphics APIs. We have experimented with transparency and stereovision. Transparency is required with overlapped items but is not sufficient by itself to understand the number of overlaps. Furthermore, by blending colors, it interferes with its preattentive processing. Therefore, transparency is only useful when it can be varied interactively to reveal overlaps and density of overlapping items.
Stereovision is now available on all the standard graphics cards for less than $100 through shutter glasses. Stereovision is preattentive, but there again, overlaps interfere with it and cannot be avoided at all, even with space-filling visualization techniques, since stereovision requires a perspective projection that introduces occlusion and therefore overlaps. Like transparency, we have found stereovision mostly useful for transient inspections.
Exploring a large data set without a-priori knowledge requires trying several mappings from data attributes to visual attributes. A special problem arises when changing views using time multiplexing: the whole screen changes and the user has usually no clue on where the regions of the previous view have gone on the new one. We have designed techniques to manage these changes in a smoother manner.
When the positions of the data items are preserved - i.e. when only changing colors or stacking order - flipping between views enables quick comparisons thanks to the retina persistency. See the animation (1Mb file).
In this example, the first configuration shows a directory tree using a coloring by file types (categorical) and the second by file age (numerical). Flipping shows the relationships between them.
When the geometry is modified between two views and the display flips from one to the other, it is hard and long to understand relationships between views, even with a few data items.
We have implemented a set of interpolation techniques to animate the transformations from one view to the other so that the eye can follow one or a set of close items and understand their trends between both views. When some items are selected, we only animate them. The animation lasts one or two second and can be replayed back and forth interactively. We also provide a slider to interactively interpolate from the current view to the previous for further control and understanding of the movements of items.
The simplest technique for animation is linear interpolation. However, when several visual attributes change, linear interpolation is confusing and only allow users to track position changes at best. To help users understanding trends, we split the animation into one or two stages: 1) changing positions and 2) changing dimensions.
For scatter-plot-like visualization, position and size are independent so the middle configuration is easy to compute. Linear interpolation of positions is used to reach the middle configuration from the initial and linear interpolation of sizes is used to reach the final configuration. Both show trends when they exist. See the animation (13Mb file).
Space-filling visualization techniques compute positions according to the size of items. Most of them [SeeSys,SeeSoft, Slice&Dice Treemaps] can be animated by linearly changing the size attributes since they use a layout that remains stable.
Clicking on an image plays the animation of two configurations of Slice&Dice treemap showing the logarithmic size of files on the first and the linear size on the second. Large files, such as videos, use most of the screen real-estate and hide smaller files, only visible on the first configuration. See the animation (12Mb file).
Only some treemaps (squarified, strip, ordered, etc) use layout algorithms that are not geometrically stable. However, we found a general way to stabilize all these layouts for stage 2 by computing the final layout and linearly changing the sizes of all areas according to their initial values.
Clicking on an image plays the animation of a squarified treemap configuration showing the logarithmic size of files changing to a strip treemap showing the linear size of files. The first stage uses linear interpolation whereas the second changes the size of zones but keeps the final layout. See the animation (15Mb file).
Animation between unrelated layouts is also possible, such as squarified treemap to scatter plots. See the animation (14Mb file).
Dynamic queries rely on interactively filtering and redisplaying a data set through a continuous interaction. Current systems use "range-sliders" to filter one attribute either changing the smallest value, the largest, or sweeping a range of values between the smallest and the largest. To achieve the redisplay speed required for smooth interaction, we have designed a technique that relies on hardware acceleration. The data set should be loaded into main memory. When the user activates a slider to perform the query dynamically, all the items are sent to the GPU and stored in a display list. The Z coordinate is calculated according to the attribute being filtered by the dynamic query so, for example, if a film database is displayed and the user wants to filter on the size of the film, the size is assigned to the Z-axis. Each time the slider moves, a new near or far plane value is computed and sent to the GPU and the list is redisplayed, leaving the visibility computation to the hardware.
Sliders or range-sliders are used to control the interaction. On current systems, their precision is limited to their size, augmenting the size increasing the precision at the cost of screen real-estate and longer movements to reach the slider. Our sliders and range-sliders are small (around 100 pixels) but their precision increase when the mouse leave their region. Fast changes can be done by keeping the mouse on the control's region while precision is achieved by going farther away from the control. See the animation (32Mb file).
Textual labels are not preattentive but are nevertheless important to understand the context in which visualized data appear. Labeling each item cannot be done statically on a dense visualization so we have used and improved the Excentric Labeling dynamic techniques.
For Treemaps, when left clicking, a popup menu is triggered to select the depth level of interest. Dynamically, labels at this level are displayed on the outside region, which is also grayed out. See the animation (10Mb file).
For scatter plots, moving the mouse over items shows their labels around the cursor. See the animation (4Mb file).
Displaying readable text over colored areas is achieved by outlining it. We use black text with a white outline. The outline is computed by applying a one pixel morphological dilation on each character.
Our system reads data encoded in XML or a directory structure as input formats. It is made of 23,000 lines of C++, using high-performance techniques such as template metaprogramming  to achieve the required speed. We have used it with an NVidia GeForce3 board on a 2Ghz Pentium and a 3Dlab Wildcat 5110 on a dual 1.7GHz Pentium. To scale to the million, the computation of layouts should be done in time linear with the number of items. This is the case with treemaps and scatter plots but not with VisDB for example. Even using the fastest techniques, layout computation takes about 50% of the redisplay time.
Despite the high theoretical performance of the boards, we have not been able to go beyond 6 millions quads per second on any of the boards we tried. The theoretical speed of 15 million triangles per second is only achievable for triangle strips, which is of no use for scatter plots and would require expensive computation for treemaps.
Combining software and hardware techniques provides a sustained performance around 2.5 million quads per second. By using texture mapping for animating treemaps, we achieve 10 frames per second for animating across any family of treemap.
For scatter plots, we have only reached 3 frames per second for animations on 1 million items and 6 frames per second as worst for dynamic queries. Finding techniques for improving that speed would be useful but the next generation of graphics cards and computers will solve the problem.
These results are an improvement by a factor 20 to 100 on the best systems available.
The system is available under the Free Software Foundation GPL (General Public Licence). You can download the latest version from this page below.
Download the binary version millionvis-0.2-win2k.tar.gz or millionvis-0.2-win2k.zip.
Download the source version millionvis-0.2.tar.gz or millionvis-0.2.zip.