Derek Juba
Feb 28, 2006
CMSC 838S
Application Report
Visualizing Disk Usage with a TreeMap

For my Application project I visualized the disk usage on the server at www.cs.umd.edu with a TreeMap. The TreeMap software I used is called KDirStat. It is a disk management application similar to SequoiaView, although KDirStat runs on Linux while SequoiaView runs on Windows. KDirStat supports both the squarified and slice-and-dice TreeMap algorithms, and can use either cushion or flat shading.
Thar She Blows!

The first thing I noticed was the presence of a few very large files. I clicked on what appeared to be the biggest one, and determined that it was /pub/users/getoor/JIKD.tar . The file browser also revealed it to be 1.88 GB in size. I was about to ssh in to the server to try to determine what the contents of this file were, but that turned out not to be necessary...
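
For comparison, the same "find the biggest files" question can be asked without a TreeMap. The following is only a minimal Python sketch, not how KDirStat works internally; the function name largest_files and its parameters are hypothetical, and it assumes the directory tree is readable from a local path.

```python
import heapq
import os


def largest_files(root, count=5):
    """Return up to `count` (size_in_bytes, path) pairs, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Skip symbolic links so a file is not counted through an alias.
                if os.path.islink(path):
                    continue
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # Unreadable or vanished files are ignored in this sketch.
                continue
    return heapq.nlargest(count, sizes)
```

Calling largest_files on a directory returns a plain list, which answers "what is biggest?" but, unlike the TreeMap, gives no sense of how those files sit within the rest of the hierarchy.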
Seeing Double

In the same directory as the large file, there was a directory called JIKD. Clicking on that directory highlighted its contents in the TreeMap, and revealed that they were just about the same size as the tar file. Further investigation revealed that this directory contained various large data files which were themselves tarred and gzipped. Perhaps the large tar file could be removed?
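
The size comparison I made by eye could also be done numerically. The sketch below is an illustration under stated assumptions: the names directory_size and similar_in_size and the 5% tolerance are hypothetical, and a close match in size is only a hint that an archive duplicates a directory, not proof.

```python
import os


def directory_size(root):
    """Total size in bytes of all regular files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def similar_in_size(archive_path, directory, tolerance=0.05):
    """True if the archive and the directory differ by at most `tolerance`
    (a fraction of the larger of the two sizes)."""
    archive_size = os.path.getsize(archive_path)
    dir_size = directory_size(directory)
    larger = max(archive_size, dir_size)
    if larger == 0:
        return True
    return abs(archive_size - dir_size) <= tolerance * larger
```

A check like this works best when the directory holds a few large files, as it did here; a tar archive of many tiny files carries enough per-file overhead that the sizes would no longer match closely.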
Another Whale

I next investigated what appeared to be the second largest file on the server; this turned out to be /pub/francois/TreeJuxtaposer_Paper_0319.mov , a 1.15 GB file. Since this appeared to be the only copy, its existence is probably justified.
An Interesting Pattern

In one area of the TreeMap I noticed a repeating pattern. This area turned out to be the directory /pub/hcl/Reports-Abstracts-Bibliography . To determine the cause of the pattern I zoomed in a little closer...
Mystery Revealed

The directory turned out to contain a series of subdirectories, each labeled with a different date. Within these subdirectories there were collections of pdf, ps, doc, png, wmz, and other types of files. The coloring in the pattern had been a bit misleading, since the similar colors had seemed to suggest that each directory contained similar file types. It did, however, succeed in revealing that this section of the file system contained a large number of similarly sized directories, each containing several different types of files.
More Experiments

To investigate the effect of cushion shading, I tried visualizing the file system with this feature disabled. I found that the cushion shading made the display much more visually appealing, and also had some practical effect. For example, note how the flat shaded image gives very little indication that the green squares in the upper middle are grouped into two different directories.
Another difference between the cushion and flat shaded images is that in the cushion shaded images, rectangles below a certain size were omitted for performance reasons. The rationale for this was that users of this tool would not care much about small files, since they would be looking for large individual files to delete. It seems to me, though, that showing all files would be useful for getting an overall picture of the data in the file system. If you are willing to pay the performance penalty, the display of small files can be enabled in the options menu.
Experiments Continued

To investigate the effect of the squarified TreeMap algorithm, I tried visualizing the file system using the slice-and-dice algorithm instead. While the alternating horizontal and vertical strips did make the directory hierarchy easier to see, the algorithm produced some very long and skinny rectangles which were hard to distinguish from each other, gauge the size of, or select with a mouse.
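
To make the skinny-rectangle problem concrete, here is a minimal Python sketch of a slice-and-dice layout. It is my own illustration, not KDirStat's code. The tree representation is an assumption: a file is a (name, size) pair and a directory is a (name, list_of_children) pair. The function names are hypothetical as well.

```python
def total_size(node):
    """Size of a file, or the summed size of everything under a directory."""
    _name, payload = node
    if isinstance(payload, int):
        return payload
    return sum(total_size(child) for child in payload)


def slice_and_dice(node, x, y, w, h, vertical=True, out=None):
    """Collect a (name, x, y, w, h) rectangle for every file under `node`.

    Each directory's rectangle is cut into strips proportional to its
    children's sizes, and the cut direction alternates at every level.
    """
    if out is None:
        out = []
    name, payload = node
    if isinstance(payload, int):
        out.append((name, x, y, w, h))
        return out
    total = total_size(node)
    if total == 0:
        return out
    offset = 0.0
    for child in payload:
        share = total_size(child) / total
        if vertical:
            # Side-by-side strips: divide the width, keep the full height.
            slice_and_dice(child, x + offset * w, y, share * w, h, False, out)
        else:
            # Stacked strips: divide the height, keep the full width.
            slice_and_dice(child, x, y + offset * h, w, share * h, True, out)
        offset += share
    return out


def aspect_ratio(w, h):
    """Long side divided by short side; 1.0 is a perfect square."""
    shorter = min(w, h)
    return float("inf") if shorter == 0 else max(w, h) / shorter
```

Laying out a single directory of ten equal-sized files in a square gives ten strips that are each ten times longer than they are wide, and the ratio grows with the number of files. The squarified algorithm instead chooses how to group items so that the rectangles stay closer to square, at the cost of a less obvious hierarchy.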
Conclusion
Overall, KDirStat seemed to be a useful tool. It revealed the presence of some large duplicate files on the CS department web server, and when I applied it to my own file system I discovered some files and directories that were taking up a much larger fraction of the space than I had realized.
One possible improvement to KDirStat might be to associate a unique color with each file extension, which would allow users to more accurately spot duplicated directories or other patterns. Another improvement might be to add one additional level at the top of the tree hierarchy that would allow users to see both the used and unused space in the file system. This would give users a sense of how large the displayed files actually are relative to the file system as a whole.
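
Both suggestions are simple to prototype. The following Python sketch shows one possible approach and is not a patch for KDirStat: the class ExtensionPalette, the function top_level_split, and the saturation and brightness values are all hypothetical choices. Hues are spaced using the golden ratio so that extensions seen one after another get clearly different colors.

```python
import colorsys
import os
import shutil

GOLDEN_RATIO_CONJUGATE = 0.618033988749895


class ExtensionPalette:
    """Hands out one color per file extension, in order of first appearance."""

    def __init__(self):
        self._colors = {}

    def color_for(self, filename):
        # Compare extensions case-insensitively, so .PDF and .pdf match.
        ext = os.path.splitext(filename)[1].lower()
        if ext not in self._colors:
            hue = (len(self._colors) * GOLDEN_RATIO_CONJUGATE) % 1.0
            r, g, b = colorsys.hsv_to_rgb(hue, 0.65, 0.95)
            self._colors[ext] = "#{:02x}{:02x}{:02x}".format(
                round(r * 255), round(g * 255), round(b * 255)
            )
        return self._colors[ext]


def top_level_split(path):
    """Return [("used", bytes), ("free", bytes)] for the file system
    containing `path`, suitable as an extra top level above the tree."""
    usage = shutil.disk_usage(path)
    return [("used", usage.used), ("free", usage.free)]
```

With a very large number of extensions the hues would eventually become hard to tell apart, so a real implementation might reserve distinct colors for the most common extensions only.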