Promoting Insight Based Evaluation of Visualizations:
From Contest to Benchmark Repository
Human-Computer Interaction Lab.
Institute for Vis. & Perception Research
Information Visualization (InfoVis) is now an accepted and growing field with numerous visualization components used in many applications. However, questions about the potential uses and maturity of novel visualizations remain. Usability studies and controlled experiments are helpful but generalization is difficult. We believe that the systematic development of benchmarks will facilitate the comparison of techniques and help identify their strengths under different conditions. A benchmark typically consists of a dataset, a list of tasks, and a list of non-trivial discoveries. We were each involved in the organization of three information visualization contests for the 2003, 2004 and 2005 IEEE Information Visualization Symposia. Our goal is to encourage the development of benchmarks, push the forefront of the InfoVis field by making difficult problems available, create a forum for the discussion of evaluation and provide an interesting event at the InfoVis conference. The materials produced by the contests are archived in the Information Visualization Benchmark Repository. We review the state of the art and challenges of evaluation in InfoVis, describe the three contests, summarize their results, discuss outcomes and lessons learned, and conjecture the future of visualization contests.
Visualization, information, competition, contest, benchmark, repository, measure, metrics
Information Visualization is now an accepted and growing field with numerous visualization components used in mainstream applications such as SPSS/SigmaPlot, SAS/GRAPH, and DataDesk, in commercial products such as Spotfire, Inxight, HumanIT, and ILOG JViews, and in domain specific standalone applications such as interactive financial visualizations [SMo06] and election data maps [NYT06]. Nevertheless, questions remain about the potential uses of these novel techniques, their maturity and their limitations.
Plaisant reviewed evaluation challenges specific to information visualization and proposed initial steps [Pla04] such as refined evaluation methodologies, use of toolkits, dissemination of success stories, and the development of contests (Figure 1), benchmarks and repositories, the focus of this paper.
Figure 1: A collage of sample screens from the InfoVis 2004 contest illustrating the diversity of visualization methods used to address a task.
Empirical user studies are very helpful but take significant time and resources, and are sometimes of limited use because they have been conducted with ad-hoc data and tasks in constrained laboratory situations. Benchmarks facilitate the comparison of different techniques and encourage researchers to work on challenging problems. However, to be convincing, the utility of new techniques needs to be demonstrated in a real setting, within a given application domain and set of users. Contests attempt to create surrogate situations that are representative of real world situations. They engage teams’ competitive spirit to produce materials that can help the community compare visualization tools applied to the same problem.
Competitions help push the forefront of a field quickly. In some cases it is simply the emotional aspect of winning or the excitement of live competition that compels researchers to participate. TREC competitions (the Text REtrieval Conference) [Voo00] exemplify the best of these in being able to bring in many corporate and academic research groups.
A contest poses a problem that many will attempt to solve. If the problem is challenging and representative of a real world situation, then the solutions proposed by contestants provide insights into what techniques are possible, and which ones are potentially better to pursue. Often these solutions provide such good results that other participants are driven to compete in the next year’s contests. The current contest data sets and tasks become part of the baseline against which new techniques can be tested. These contest submissions then describe the insights found with various tools, illustrate the current state of the art and entice researchers to find even better solutions.
In this paper we review the state of the art and challenges of evaluation in the field of information visualization, describe the three contests, summarize their results, discuss the outcome of those three events, and conjecture the future of visualization contests.
Information visualization systems can be very complex [Chen00] and require evaluation efforts targeted at different levels. One approach described in the Visual Analytics research agenda [Tho05] is focused on three levels: the component level, the system level, and the work environment level (Figure 2).
At the component level are the individual algorithms (e.g. clustering or linguistic analysis), visual representations, interactive techniques and interface designs. Data analysis algorithms can typically be evaluated with metrics that can be observed or computed (e.g. speed, accuracy, sensitivity or scalability), while other components require empirical user evaluation to determine their benefits [Che00, Chen00b]. Metrics include effectiveness (e.g. time to complete simple tasks) and efficacy (e.g. number of errors or incomplete tasks). There have been demonstrations of faster task completion, reduced error rates or increased user satisfaction measured in laboratory settings using some visualization components. These studies are helpful for comparing isolated interaction techniques or data representations, e.g. [Ira03, Alo98]. Studies comparing slightly more complex tools combining a few components (at least a choice of interaction and visual representation) are also available, e.g. [Pla02, Kob04, Sta00]. They often reveal that different tools perform better for different types of tasks, but it is often difficult to pull apart what part of the system really impacts the performance of the tool. Some limited techniques allow computed scores to be generated to evaluate the potential quality of simple displays, e.g. [Mac86], but controlled experiments remain the workhorse of evaluation.
Figure 2: The 3 evaluation levels for Visual Analytics (Figure 6.1 in [Tho05])
At the system level, interfaces combine and integrate multiple components and need to be evaluated by comparing them with technology currently used by target users. Metrics need to address the learnability and utility of the system. Those evaluations may take place in the laboratory using surrogate scenarios but address complex tasks conducted over a longer period of time than component-level evaluations. A new approach is to encourage insight-based evaluation. The InfoVis 2003-2006 contests are examples of efforts encouraging insight-based evaluation [Inf03, Inf04, Inf05, Inf06], and recent empirical studies measure insight [Sar04].
At the work environment level, evaluation addresses issues influencing adoption. Metrics might include user satisfaction, trust and productivity. Case studies and ethnographic studies are used but they remain rare in the field of information visualization. Case studies report on users in their natural environment doing real tasks [Gon03, Tra00]. They can describe discoveries, collaborations among users, the frustrations of data cleansing and the excitement of data exploration. They can report on frequency of use and benefits gained. The disadvantage is that they are very time consuming and may not be replicable or applicable to other domains.
Recently the Beliv’06 workshop [BEL06] provided a good overview of the most recent work on improving information evaluation, including the development of specific heuristics, metrics or taxonomies of tasks. Of course usability evaluation remains a cornerstone of user-centered design and evaluation. It is of paramount importance for product engineering but also a powerful tool for researchers as it provides feedback on problems encountered by users and guides designers toward better designs at all three evaluation levels.
benchmark data sets abound. Some repositories
simply make data sets available (e.g. the Council of European Social Science
Archives [CES06]) while others offer tools to help promote research in specific
Although ideally one should be able to evaluate the quality of answers computationally, this is often not possible. The problem is the fuzziness of answers to the contest: an answer may be a collection of articles, a new algorithm, or a new visualization, each of whose correctness may not be computable. This forces human evaluation (TREC still uses human judging to determine the accuracy of the retrieval), and such is the case for the IEEE InfoVis Competitions.
Another difficulty for information visualization comes from the impact of the discovery process, an extremely interactive and personal activity. Whereas computational algorithms can be compared through the accuracy of the results, most often it is not possible to accurately measure the results of visualization. We do not have measures of perceptual information transfer. Research has begun on measures of interestingness and other metrics related to visualization [Kei95, Gri02], but these are in their infancy and too simple to be applied to the current contests.
We can identify simple tasks which yield precise results or we can specify exploratory tasks and thus have much less predictable results. This makes the evaluation process difficult to plan for and forces real time evaluation criteria which end up being reviewer dependent. Despite these constraints, one can still argue for simple tasks. A system which does not make it possible to achieve simple tasks would be a very limiting system and is likely not to support more complex or exploratory tasks. One can also argue that simple tasks are unrealistic. We tried to balance task simplicity vs. complexity to obtain a satisfying tradeoff.
Distinct challenges present themselves for evaluating at the component or system/work environment levels.
At the component level, the main challenge is to move beyond the proliferation of isolated evaluations to a more concerted effort to generate guidelines for selecting techniques based on the tasks and data characteristics. A characteristic of the field of information visualization is the great diversity of approaches available to designers to handle any type of data and the combinatorial explosion of possible implementations. Toolkits [Fek06] or even code repositories [Bor06] can help researchers control some of that diversity to adequately compare individual components. In controlled studies, dataset and task selection has until now been an ad-hoc process, making it difficult to compare results across studies. This would be aided by the development of comprehensive task taxonomies and benchmark repositories of datasets, tasks and results. Another problem is that studies generally include only simple tasks. A literature survey [Kom04] confirms this by stating that experiments usually include locate and identify tasks, but that tasks requiring users to compare, associate, distinguish, rank, cluster, correlate or categorize have only rarely been covered. Those studies are very difficult to design, and better experimental design training for researchers will greatly improve the outcome of evaluation efforts.
Another characteristic of visualization is that the analysis process is rarely an isolated short term process. Users may need to look at the same data from different perspectives and over a long time. Users may also use analytics to answer questions about visible and non-visible patterns. They may also be able to formulate and answer questions they didn’t anticipate having before looking at the visualization. This is in contrast with typical empirical study techniques, which recruit subjects for a short time to work on imposed tasks. Finally, discoveries can have a huge impact but they occur very rarely, or not at all, and are unlikely to be observed during a study. Insight-based studies as described in [Sar04] are a first step, but new evaluation methods need to be devised to address this problem.
At the system level, evaluating information visualizations and their interfaces is a daunting challenge. Success is difficult to quantify and utility measures are elusive. Tasks become significantly more complex and difficult to emulate in a laboratory environment. Working with realistic data is crucial but “ground truth” is not always available. Even when available, the comparison of steps to results is often impossible. Users’ motivation and expertise greatly influence performance. In traditional component level empirical studies the level of training of subjects is typically limited, and subjects are not allowed to consult with colleagues or use outside sources as they normally would in their work environments. Using domain experts will lead to more realistic results, but individual differences between subjects should be controlled for results to be useful. Trust is a particularly important aspect of Visual Analytics (VA) system evaluation. It is challenging to measure while being of paramount importance to user acceptance during product deployment. Discovery is seldom an instantaneous event, but requires studying and manipulating the data repetitively from multiple perspectives and possibly using multiple tools. Facilitating the transfer of data between heterogeneous tools and keeping the history of the investigation might well be just as important for discovery as the functionalities of individual components. Longitudinal studies may be more helpful but they are more difficult to conduct. Measuring the impact of integrated components that require users to manipulate visual as well as textual representations, use the web to find complementary information, integrate analytics and possibly spend hours brainstorming with colleagues remains a challenge. Another challenge is that success may not be due to, nor easily traceable back to, the visualization.
For example, an effective visualization used on a daily basis by an analyst may heighten their awareness of certain activities by allowing them to absorb and remember large amounts of information effortlessly. However, it might be difficult or impossible to link later decisions to a particular tool, as awareness is difficult to identify and measure, and decision-making uses information from diverse sources. In fact, the introduction of visualization might even trigger changes in work practices, exacerbating the problem of identifying cause and effect. Shneiderman and Plaisant have proposed to use Multidimensional In-depth Long-term Case Studies (MILCS) as a way to study and evaluate creativity tools such as visual analytics and information visualization [Shn06].
The first contest took place in 2003 [Inf03] (Figure 3). We invited submissions of case studies on the use of information visualization for the analysis of tree structured data, and in particular to look at differences between pairs of similar trees.
There are hundreds of types of tree with varying characteristics. In an effort to be representative of this diversity while remaining accessible for a contest we selected three very different examples. Three pairs of datasets were provided in a simple XML format.
· Phylogenies
Small binary trees (60 leaf nodes) with a link length attribute. No node attributes except their names.
Very large trees (about 200,000 leaf nodes) with large fanouts. Three node attributes, all nominal. Labeling, search and showing results in context is important. We allowed teams to work on a subset of the dataset (the "mammal" subtree) if they could not handle that many nodes.
· File system and usage logs
The trees are large (about 70,000 leaf nodes). Many attributes, numerical and nominal. Changes between the two trees can be topological changes and attribute value changes. Data for 4 periods was available.
We provided general tasks (about 40 tasks in 11 categories) and tasks specific to the selected datasets. General tasks were low level tasks commonly encountered while analyzing any tree data: topological tasks (e.g., which branch has the largest fan-out?), attribute based tasks (e.g., find nodes with high values of X), or comparison tasks (e.g., did any node or subtrees "move"?). On the other hand, the tasks specific to particular datasets included broader goal-setting tasks (e.g., for the phylogenies, what mapping between the two trees' topologies could indicate co-evolution, and, maybe, the points where the two proteins were not co-evolving?). We made clear that it was acceptable not to work on all tasks and that partial answers were OK. We also clarified that we were not looking for a detailed result list (e.g., a list of deleted nodes for the task “what nodes were deleted”) but an illustration or demonstration of how the visualization helped find the answer. General background information was provided about the data and tasks, which was particularly important for the phylogenies.
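As an illustration only, a low level topological task such as "which branch has the largest fan-out?" can be sketched in a few lines of Python. The contest's actual XML schema is not reproduced in this paper, so the `<node name="...">` layout, the sample data and the function names below are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the <tree>/<node name="..."> layout is an assumption,
# not the schema used for the contest datasets.
SAMPLE = """
<tree>
  <node name="root">
    <node name="a">
      <node name="a1"/>
      <node name="a2"/>
      <node name="a3"/>
    </node>
    <node name="b">
      <node name="b1"/>
    </node>
  </node>
</tree>
"""


def largest_fanout(xml_text):
    """Return (name, fan-out) of the node with the most direct children."""
    root = ET.fromstring(xml_text.strip())
    best_name, best_count = None, -1
    for node in root.iter("node"):
        count = len(node.findall("node"))  # direct children only
        if count > best_count:
            best_name, best_count = node.get("name"), count
    return best_name, best_count


def leaf_count(xml_text):
    """Count nodes that have no child nodes."""
    root = ET.fromstring(xml_text.strip())
    return sum(1 for node in root.iter("node") if node.find("node") is None)


if __name__ == "__main__":
    print(largest_fanout(SAMPLE))
    print(leaf_count(SAMPLE))
```

Such a computed answer is exactly the kind of "detailed result list" the contest did not ask for; it only serves as a reference against which a visualization's answer could be checked.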
Teams had five months to prepare. The participants were required to submit the following materials:
• Two page summary
• Video illustrating the interactive techniques used
• Web page of accompanying information
• Index page with team information
We received eight entries. It was a small number but satisfactory for the first contest.
The first main finding was that the tasks and datasets were too complex for such a contest. Each tool addressed only a subset of the tasks and only for a subset of the datasets. The phylogeny dataset was “real” in that it required domain expertise; even though it consisted of a small binary tree, it was not used, probably because the tasks were complex and required working with biologists.
The second main finding was that it was difficult to compare systems even with specific datasets and tasks. We had hoped to focus the attention of submitters on tasks and results (insights), but the majority of the materials received focused on descriptions of system features. Little information was provided on how users could accomplish the tasks and what the results meant, making it very difficult for the judges to compare entries. The systems presented were extremely diverse, each using different approaches to visualize the data.
There were three first-place entries. TreeJuxtaposer [Mun03] (Figure 3) submitted the most convincing description of how the tasks could be conducted and the results interpreted. Zoomology [Hon03] (Figure 4) demonstrated how a custom design for a single dataset could lead to a useful tool that addressed many of the tasks satisfactorily. InfoZoom [Spe00] (Figure 5) was the most surprising entry. This tool was designed for manipulating tables and not trees. However, the authors impressed the judges by showing that they could perform most of the tasks, find errors in the data and provide insights into the data. The three second-place entries showed promise but provided less information to the judges on how the tasks were conducted and the meaning of the results. EVAT [Aub03] (Figure 6) demonstrated that powerful analytical tools complementing the visualization could assist users in accomplishing their tasks. Taxonote [Mor03] (Figure 7) demonstrated that labeling is an important issue making textual displays attractive. The
All entries were given a chance to revise their materials after the contest. We required participants to fill in a structured form with screenshots and explanations for each task. That information is archived in the Information Visualization Benchmark Repository [Bmr06].
Figure 3: TreeJuxtaposer
Figure 4: Zoomology
Figure 5: InfoZoom
Figure 6: EVAT
Figure 7: Taxonote
Figure 8: A combination of tools -
The second competition coincided with the 10-year anniversary of the InfoVis Symposium. As the visualization of the history of a field of research is a problem interesting in itself, it naturally formed the core part of the contest. The key advantage of the topic is that it was familiar to all participants. The disadvantage was that the selected corpus was not readily available in a usable form.
The set of all publications on a topic is too large a universe of discourse for a competition. We first argued about which conferences or journals to include, then decided to limit the dataset to all the IEEE InfoVis Symposium papers and all of the articles used as references in those papers. Metadata is rich for IEEE and ACM publications and unique keys are available.
Producing a clean file (metadata for the collection of documents) was a much bigger challenge than we had imagined. We first assumed that both the most important articles and the most important authors in information visualization would be referenced by most of the articles published at the InfoVis symposium. The set of references originating from articles published at InfoVis therefore seemed to us both focused on the field and complete: it would be unlikely for an important publication in information visualization to be seldom referenced by those articles.
This was partially correct, but text metadata still yielded numerous ambiguities. IEEE manages the InfoVis articles, which are less curated than those of the ACM. Much text metadata was non-unique (e.g., many-to-one names such as Smith, Smyth, Smithe, …). Reference titles were too noisy and in many cases erroneous, as text is handled by the ACM Digital Library as strings and numerical computations such as string comparisons are still weak. Much curation on our end was necessary, as references were noisy, sometimes missing, and sometimes even pointing to non-existing URLs.
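To make the name ambiguity problem concrete, the following minimal sketch flags pairs of author names that are suspiciously similar, so that a human curator can review them. It is not the procedure used to build the contest dataset, which was largely manual; the function name `similar_names`, the sample names and the 0.75 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_names(names, threshold=0.75):
    """Return pairs of distinct names whose similarity ratio meets the threshold.

    The threshold is an arbitrary assumption; a real curation pass would tune
    it and still require human confirmation of every candidate pair.
    """
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    # Hypothetical author surnames, echoing the example in the text.
    authors = ["Smith", "Smyth", "Smithe", "Jones"]
    for a, b in similar_names(authors):
        print(f"possible duplicate: {a} / {b}")
```

A string similarity score only produces candidates. Two similar strings may be different people and two dissimilar strings may be the same person, which is one reason the cleaning remained manually intensive.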
We thus embarked on cleaning the data. This was a complicated, manually intensive process with multiple passes, even with automatic reference extraction, as we found no reasonable automatic system to suitably resolve the problems. We manually extracted the articles from eight years of PDF files from the symposia available in the digital library. We then semi-automatically retrieved the articles referenced in those papers, again from the digital library. We extracted those that were found and manually cleaned and unified the publications not included in the ACM library.
The result was a file containing 614 descriptions of articles published between 1974 and 2004 by 1,036 authors, referencing 8,502 publications. It took well over 1,000 hours for us to construct that file, with over 30 people involved.
We proposed 4 high level tasks with a great deal of flexibility for a variety of solutions:
1. Create a static visualization showing an overview of the 10 years of InfoVis
2. Characterize the research areas and their evolution
3. The People in InfoVis: where does a particular author/researcher fit within the research areas defined in task 2?
4. The People in InfoVis: what, if any, are the relationships between two or more researchers?
We suggested particular names for task 3 to facilitate comparisons between submissions, and participants used them, along with other names.
The participants were required to submit:
• A two page summary
• A video illustrating the interactive techniques used
• A structured web form providing details as to how the tasks were accomplished and what discovery or insights were identified
were 18 submissions from 6 countries (
Quality improved dramatically between 2003 and 2004. The good news was that most teams provided many more insights than we had seen in the first contest. Still, some teams had tools that seemed promising “on paper” but reported very few insights, and consequently did not do very well in the contest. On the other hand, some teams presented tools that at first seemed of doubtful utility to the reviewers but were able to report useful insights, therefore faring better than we had expected in the results. Of course the best teams had everything at once: promising visualizations, many insights reported, and convincing explanations of how the insights were obtained using the tools.
None of the 12 selected teams answered all the questions. A few of the participants had extensive experience with text analysis and that was visible in their results. Others had background knowledge of the InfoVis community and could provide better hypotheses about what they were seeing. One tool was developed entirely from scratch for the contest but most teams showed interesting new uses of existing techniques. Node-link diagrams were a very commonly used representation for many of the tasks, with some notable exceptions.
This second contest had a single dataset and simpler tasks, so we thought reviewing and comparing results would be much easier. Not so. Ideally one would be able to evaluate the quality of answers computationally but this was not possible. The problem was the fuzziness of the answers and the lack of “ground truth” or even consensus on what the best answer might look like. Teams’ answers took various forms, from a collection of articles or names to a new algorithm to a new visualization, none of whose correctness was computable. Only human evaluation was appropriate to judge the validity of the answers. In information retrieval, TREC for example uses human judging to determine the relevance of documents (i.e. the answers), from which metrics can be computed for a team’s set of results. Short of spending time with the team throughout the discovery process (an extremely interactive and personal activity), we could only base our judgment on the materials provided (video and Web form).
There were three 1st place entries and one student 1st place:
· The entry from
· The entry from Microsoft and
· The student 1st prize went to a team from
place prizes (see Figure 13 to 19) went to the Université de Bordeaux I with
We were satisfied that teams reported useful insights, but we were still surprised by how few were reported, and how few of those were really surprising. Insights about the whole structure were rare and only came from teams who had experience looking at other domains (e.g., the fact that InfoVis is a small world, tightly connected, was mentioned by only 2 teams). One team noticed that the most referenced papers were published at CHI, not at InfoVis. Only three teams noticed the existence of references to future papers, a problem resulting from automatically processing references and confusing multiple versions with similar titles such as a video and a paper. Only one insight dealt with something that was surprisingly missing, namely that there were no papers in the dataset from several other competing InfoVis conferences, despite the fact that they had been held for several years.
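The "references to future papers" anomaly is one of the few insights that can be checked mechanically. A minimal sketch follows, assuming a hypothetical in-memory representation in which each paper has a publication year and a list of referenced paper identifiers; the contest file format itself is not reproduced here.

```python
def future_references(papers):
    """Return (citing_id, cited_id) pairs where the cited paper is newer.

    `papers` maps a paper id to {"year": int, "refs": [paper ids]}.
    This layout is an assumption made for illustration.
    """
    flagged = []
    for paper_id, paper in papers.items():
        for ref in paper["refs"]:
            cited = papers.get(ref)
            # References to papers outside the collection cannot be checked.
            if cited is not None and cited["year"] > paper["year"]:
                flagged.append((paper_id, ref))
    return flagged


if __name__ == "__main__":
    # Hypothetical records: "video95" cites a paper dated one year later.
    sample = {
        "video95": {"year": 1995, "refs": ["paper96", "old90"]},
        "paper96": {"year": 1996, "refs": ["old90"]},
        "old90": {"year": 1990, "refs": []},
    }
    print(future_references(sample))
```

A check like this finds the anomaly but does not explain it; deciding whether a flagged pair is a mismatched version or a genuine data error still needs a person.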
Teams interpreted the tasks and used the data in surprisingly different ways. A task such as “describe the relationships between authors” was interpreted in at least the following nine ways: as a report on co-authorship; co-citation; people working on similar topics; having a similar number of co-authors; being part of big groups or teams; having a similar number of publications; having a similar number of references; working in the same institution; or working in the same clique or empire.
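The first two interpretations already give different answers on the same data, as the following minimal sketch suggests. The record layout, the author names and the definition of author co-citation used here (two authors are co-cited when one paper references work by both) are assumptions for illustration, not a description of any team's method.

```python
from collections import Counter
from itertools import combinations


def coauthor_pairs(papers):
    """Count how often each pair of authors appears on the same paper."""
    counts = Counter()
    for paper in papers.values():
        for a, b in combinations(sorted(set(paper["authors"])), 2):
            counts[(a, b)] += 1
    return counts


def cocited_author_pairs(papers):
    """Count how often two authors are both cited by the same paper."""
    counts = Counter()
    for paper in papers.values():
        cited_authors = set()
        for ref in paper["refs"]:
            if ref in papers:  # skip references outside the collection
                cited_authors.update(papers[ref]["authors"])
        for a, b in combinations(sorted(cited_authors), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical records: p3 cites both p1 and p2.
PAPERS = {
    "p1": {"authors": ["Ann", "Bob"], "refs": []},
    "p2": {"authors": ["Cy"], "refs": []},
    "p3": {"authors": ["Dee"], "refs": ["p1", "p2"]},
}

if __name__ == "__main__":
    print(coauthor_pairs(PAPERS))
    print(cocited_author_pairs(PAPERS))
```

In this toy collection Ann and Cy never wrote together, so they are unrelated under co-authorship, yet they are related under co-citation. Displays built on the two definitions are therefore not directly comparable.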
Teams also used the data differently: they created displays showing either only the IEEE InfoVis symposium papers or all papers including the references. Sometimes they combined both authors and topics and sometimes they used separate displays. In one case we suspected that a team used only the papers’ first authors but could not tell for sure. One team used only references to InfoVis papers but not references to papers from other venues. The data made visible was generally pruned dramatically to work with the tools or to create more useful or possibly appealing displays. Few ever attempted to show complete views. Some teams had a “celebrity” approach, ignoring everything but the star papers or authors based on a single criterion (e.g., number of citations). Some clustered first then pruned later with no clear explanation of what had been pruned.
Reviewing the displays seemed easy at first, but it quickly became impossible to remember what data we had just been looking at, not to mention trying to compare results even when it would have been possible. Many displays had no legends or very poor legends and none had any summary of the process that generated the display. Each team probably had a clear model of the scope of the data and how it was filtered, aggregated and interpreted, but the displays did not reflect that.
teams even attempted to answer question 1, to create a static visualization showing an overview of the 10 years of InfoVis. Teams merely reused one of the screen shots from other tasks, so we felt only one aspect of the data had been portrayed and not the entire 10 years of InfoVis. Teams reported very different topics and different numbers of topics (from 5 to 12), and some created topics on the fly, refining the topics iteratively. Sometimes a seemingly narrow topic would take a prominent place: “parallel coordinates” was a major topic in one case while in another system “taxonomy” was a major topic. Reviewing all the submissions gave us an impression of randomness in the choice or labeling of the topics. One of the student teams used their professor’s notes to extract topics. It was innovative but again affected our ability to make comparisons. Most visualizations limited the total number of topics, which limited the insights to those related to these topics. But topic extraction was not the focus of the contest, so we did not judge the quality of the topics. Nevertheless this made it more difficult to compare insights.
Some tools (e. g., In-spire [Won04] and an entry from
Overall, labeling remained a very big issue. Very rarely could we actually guess paper titles when looking at a display. Better dynamic layout techniques for labels were clearly needed. Labels for papers usually consisted of the first few words or even just the first author making it difficult to remember if we were looking at author relationships, or papers, or even topic relationships, e.g. a large node labeled “Johnson” could represent the often-referenced Treemap paper.
Some tools had only one window [Teo04], but most used multiple windows, showing either variants [Ham04] or very different displays for different tasks [Chen04], [Kei04]. The Paperlens submission [Lee04] illustrated the importance of coordinating views. Only two teams dealt with missing data and uncertainty; others ignored the problem entirely. Visual metaphors seemed to have had an effect on the words teams used to describe their findings, e.g. one team [Ahm04] talked about empires when looking at towers in 3D, while others talked about cliques while looking at clusters on node link diagrams. Unfortunately, we also saw examples of “nice pictures” that didn’t seem to lead to any insight.
The 2004 contest session at the workshop was very well attended and we received extremely positive feedback. Attendees reported being able to appreciate the wide diversity of solutions and contrast the different techniques. We conjecture that the topic we had selected also made the contest more accessible.
Figure 9: Link diagram, from
Figure 10: In-Spire clusters from the Pacific Northwest National Laboratory [Won04]
Figure 11: PaperLens distributions from Microsoft and the
Figure 12: Wilmascope topic flows from the
Figure 13: Document graphs from the U. de Bordeaux I and
Figure 14: Link diagram from the Technische Universiteit Eindhoven [Ham04]
Figure 15: Topical overview and focus from Georgia Institute of Technology [Hsu04]
Figure 16: Document timeline and classes from the
Figure 17: Topic classification from
Figure 18: Author link diagram from
Figure 19: Topic and author timeline from the
In the third competition [Inf05] the chairs aimed for the evaluation of more complete visualization systems and a different type of data. The data set was larger and the questions more targeted. The goal was to identify how well visualization or visual analytics systems or even specific tools could perform with a large but easily understood data set. The chairs missed a key point in that the problem was probably better phrased as a GIS challenge rather than simply an information visualization one. The chairs also released the data set only for the competition. The owner of the data set did not permit an open release, something that the chairs tried to avoid and hopefully will avoid in the future.
We selected a large, information rich, and real data set. The data consisted of information on about 87,659 technology companies in the US, including year founded, zip code, yearly sales, yearly employment information, along with industry and product information using the North American Industry Classification System [Nai01]. This was a large data set with geographic interpretation, one which pushed the limits of many systems. The data was cleaned by graduate students at
The three questions related to the characterization of correlations or other patterns among variables in the data were:
1. Characterize correlations or other patterns among two or more variables in the data.
2. Characterize clusters of products, industries, sales, regions, and/or companies.
3. Characterize unusual products, sales, regions, or companies.
An additional question was more general and open-ended:
4. Characterize any other trend, pattern, or structure that may be of interest.
The chairs felt that these precise questions would make evaluation simpler. Again, this proved incorrect: all questions were open-ended, and comparing the discovery of different correlations was difficult.
The participants were required to submit materials using the same format as in 2004. There were only 10 participants. This was a surprise, but the short time between the release of the data and the submission deadline was probably the most important factor. We had no submissions from student teams, possibly because we released the first version of the data set at the end of February, by which time most university information visualization classes are already well under way.
The chairs managed the review process and evaluated the entries in a manner similar to the previous year, but used specific ratings for insight, presentation, interaction, creativity, flexibility, and novelty. There were two first and two second place awards. Teams led by Iowa State University [Hof05] and Penn State University [Che05] took first place, having answered all questions, while the Universität Karlsruhe [Hos05] and Augsburg University [Zei05] provided strong answers and received second place prizes.
The first place winners took two different approaches. The team from
The second place winners had strong answers. The Company Positioning System from the Universität Karlsruhe was visually stimulating, and had high scores on interaction and novelty (Figure 22) [Hos05] while the team from
Figure 22: Company Positioning System by the Universität Karlsruhe
All in all the four winners covered a broad spectrum of techniques for information visualization solutions. In all cases we found that statistical analysis played a key role. The data was just too large for simple human consumption thereby putting visualization in a collaborating role with analysis. Much processing of the data took place and all contestants used coordinated views to answer the questions.
The contest illustrated the difficulty researchers have in giving adequate evidence that their tools can effectively perform the tasks. Demonstrating the power of a tool can be difficult. Researchers are trained to describe their tools’ novel features more than to illustrate them with convincing examples using real data. In 2003 participants barely reported any insight at all; everyone was focused on describing their tool. By 2004 more participants (though not all) were able to provide insights. In 2005 insights were more common.
Half of the participants were students who built their tools. These tools were not as polished as industrial products or well developed research systems. In 2003, we provided large data sets with some meaningful subsets and in 2004 the data set was not very large. However, in 2005, there was no subset provided and the number of participating students dropped. Providing benchmarks that fit the size of student projects seems important to the success of future contests.
The evaluation process is time consuming and was regarded by some of the chairs as a daunting task. Ideally one could compute metrics and add these up, assuming independence, to get a summary score, but since the questions are open-ended and there is no known “ground truth”, human evaluation is required. Even though there were a limited number of specific tasks, these can be interpreted differently, and participants did interpret them in many ways. We seem to end up having to compare non-comparable steps and results.
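The idealized additive scoring mentioned above can be sketched in a few lines. This is only an illustration, not the procedure the chairs used: the criteria names are borrowed from the 2005 ratings, while the 0 to 5 scale, the equal weighting, the function names, and the two entries are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Criteria named in the 2005 review; the 0-5 rating scale is an assumption.
CRITERIA = ("insight", "presentation", "interaction",
            "creativity", "flexibility", "novelty")


def summary_score(ratings: Dict[str, float]) -> float:
    """Sum per-criterion ratings, treating the criteria as independent."""
    missing = [c for c in CRITERIA if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings: {missing}")
    return sum(ratings[c] for c in CRITERIA)


def rank_entries(entries: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (team, score) pairs sorted from highest to lowest score."""
    scored = [(team, summary_score(r)) for team, r in entries.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical ratings for two made-up entries.
entries = {
    "team_a": {"insight": 4, "presentation": 3, "interaction": 5,
               "creativity": 4, "flexibility": 2, "novelty": 5},
    "team_b": {"insight": 5, "presentation": 4, "interaction": 3,
               "creativity": 3, "flexibility": 4, "novelty": 3},
}
ranking = rank_entries(entries)
```

A sum of this kind hides exactly the difficulty described above: two entries can receive nearly identical totals while reporting entirely different, non-comparable insights, which is why human judgment remained necessary.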
Evaluating the results remained a subjective activity. After seeing the submissions, the 2003 and 2004 contest reviewers decided to classify the teams in three categories: no evidence of insights gathered using the tool, some insight, and lots of insight, i.e. worthy of a first prize. The 2005 contest used more than 6 categories. This helped with discussions but required a great deal more detailed reviewing of the submitted videos and papers.
Since the data is the same, repeatedly looking at entries is taxing. It can be very hard to remember “who did what” or “who had this insight”. An insight implies novelty of the finding, so reviewers might weigh a reported insight more positively the first time they encounter it and undervalue it later on when it is reported by another team. Videos were extremely important. Without them it would have been impossible to understand how most tools worked and what process was used to answer the questions. With videos, interactions become understandable. Verbal comments on the videos were indispensable in explaining what the participants were highlighting. This is quite different from simply reviewing a paper and ranking the results. Labeling of the results was less necessary because the verbal descriptions on the videos were sufficient, but this required the reviewers to remember key points. On the downside, dealing with videos was very time consuming. Videos were large, download times high, and distribution to reviewers slow (something we now know how to resolve). For the first two contests we were flexible about the format of the videos submitted, but this created problems such as finding converters or hunting for missing codecs. For the 2005 contest we required a single format and this simplified the process.
With the InfoVis 2003 contest we attempted to provide real data and tasks while trying to narrow the problem to one data type (trees) and three representative tree types. The contest taught us that the problem was still too large for a contest and that the vague nature of the tasks made it impossible to compare answers effectively. In contrast, the 2004 contest had only one dataset, far fewer tasks and a more structured reporting format. Nevertheless, the open-ended nature of realistic tasks and the diversity of approaches still made judging the submitted entries a challenge.
We felt that the time needed to generate a reasonably clean data set was too long, around 1000 man-hours each year. This is a serious issue for the development of benchmarks. Domain experts should be solicited for cleanup, and experimentation on various task solutions should be attempted before the data is released. We hope that industry groups or government agencies wishing to see more research conducted on specific data of interest to them will take on the burden of developing the benchmark datasets or support groups to do so.
Participating in the contest takes time and motivation. Most participants reported working very hard to prepare their submission. Many acknowledged that it pushed them to improve and test their tools. Some students were encouraged to work on the contest as a class project. Some wanted to test their PhD research. A small company reported appreciating the exposure.
In the three contests we gave small prizes to the first and second place teams. Sponsors provided various prizes. Those were appreciated, especially by the students who liked the gaming stations. We also presented winners with certificates that many told us they were happy to hang in their office. Participants appreciated being able to mention the award and add the small publication to their resumes. On the other hand, some tenure-track faculty reported being interested but preferring to focus on writing full papers.
We realized that we should in the future anticipate the data set and plan earlier. Given that we ran into errors and noise in almost all the data sets, having more time will help clean the data and prepare better tasks. The 2006 contest selected the census data and made the announcement at the 2005 conference, thereby providing potential participants a great deal of time, almost a year, to work on the problem. Pushing the deadline further into the late summer would allow summer interns to work on the contest, but would reduce the reviewing period dramatically.
Many people downloaded the dataset without submitting results and we collected names and emails. The chairs performed an informal survey of those who had downloaded the 2005 contest data to see why there were so few participants. The participants stated that there were no problems with the data set or questions, that the data set was a great one for showing system and tool capabilities, and that all had enjoyed the process and would do it again. Most said that they wanted a better organized website and automated email on data or news updates, and would have preferred the data in a database format. Some expressed strong interest in splitting entries into commercial and academic categories. Four of the non-participants stated that the requirement to attend the conference hindered their participation, and most said that they were too busy in their company to tackle such a project. Several expressed a desire for some mini-questions such as “find a more elegant way to look at …”.
There was one recurring theme which all participants and non-participants expressed and that was the need for more time. That was the reason the 2006 contest data was made available at the 2005 conference.
For all three contests we were able to have a whole session at the conference to summarize the results and allow some authors to present. In 2003 only the first place authors presented and we summarized the second place submissions. Attendees commented that it would be better to have shorter presentations but allow more presenters to speak. The following year we arranged for all first and second place winners to present with the second place ones having only 2 minutes. This format was very well received. We specified tasks presenters should focus on so the attendees could better compare the different entries, at the cost of not seeing every feature of the tools. We found that handing out the awards rapidly and keeping photos to a minimum (a group picture at the end of the session) was preferable. This left more time for the presentations and still gave a festive atmosphere to the event. All winners were also given a chance to have a poster displayed during the normal poster session.
A contest is only a first step. The revised materials provided by the authors and the datasets have to be available after the event. We have strived to keep the contest pages active and we also have made the submissions available in the InfoVis repository hosted at the
Contests represent an artificial testing situation where the opinion of judges reflects the quality of the submitted materials, as opposed to the actual merits exhibited when tools are tested interactively and discussed with designers. The impact of contests is most obvious with those who participate and those who see the results, but the datasets and tasks remain available after the contests, thereby extending their impact. They can be used by developers to exercise their tools and identify missing features, and by evaluators to enrich their testing procedures with complex tasks. These developers and evaluators then have baseline results with which to compare their own. We hope that these data can also be used in controlled experiments, and that the more specific lists of tasks used for those experiments can be added to the repository for reuse.
Benchmarks are difficult to create, promote and use. Our belief is that we are developing solid and evolving benchmarks and are beginning to understand how to better evaluate submissions. Good benchmarks must be real (witness the success of TREC and CAMDA) both to draw audiences and participants and to strongly push the technology curve. Good benchmark tasks must be open-ended to allow flexibility in solutions. We know that this makes the evaluations more difficult to measure analytically, but it is realistic. We need to accept that more human evaluation will be required in the future and to develop a pool of volunteer judges. These contests continue to demonstrate the challenges of benchmark design and especially of system and tool evaluation.
By making the results of analyses available to the community we provide a repository of baselines for developers to compare to. Teams did interpret our tasks in many different ways, making comparison difficult; nevertheless we feel strongly that it was extremely useful to compare with the same data set and tasks.
The integration of analysis is becoming more necessary as data sets are more complex, large, and coming from diverse sources. The identification of anomalous patterns of data from phone calls, from bank transactions, and from news articles requires new techniques and strong analytical tools. We believe that such data sets and competitions will continue to encourage the community to work on difficult problems while building a baseline of comparable tasks and datasets.
We thank the organizers of the IEEE Visualization InfoVis Symposium, in particular John Dill, Tamara Munzner and Stephen Spencer, for their continual support, members of the InfoVis community for their intellectual stimulation, and the participants without whom there would be no contest. We would also like to thank the students for their help in extracting the metadata, cleansing the data set, and producing a richly usable data set: Caroline Appert (Université Paris-Sud, France) and Urska Cvek, Alexander Gee, Howie Goodell, Vivek Gupta, Christine Lawrence, Hongli Li, Mary Beth Smrtic, Min Yu and Jianping Zhou (University of Massachusetts at Lowell).
We thank the sponsors who provided first prizes, The Hive Group, ILOG and Stephen North personally. After the first release of the datasets many others offered their help, including Jeff Klingner from Stanford, Kevin Stamper, Tzu-Wei Hsu, Dave McColgin, Chris Plaue, Jason Day, Bob Amar, Justin Godfrey and Lee Inman Farabaugh, from Georgia Tech, Niklas Elmqvist from Chalmers, Sweden, Jung-Rung Han, Chia-Ning Chiang and Tamara Munzner from UBC, and Maylis Delest from the Université de Bordeaux.
We thank ACM and IEEE and in particular Mark Mandelbaum and Bernard Rous for helping make the 2004 data available and working with us to prepare the dataset, Shabnam Tafreshi for help with the website, and finally but not least Paolo Buono from the University of Bari, Italy, for participating in the review process. We thank Michael Best for working with us on releasing the technology company data for the 2005 contest.
We also thank Sharon Laskowski for working with Catherine Plaisant on the evaluation section of the NVAC research agenda [Tho05], which helped refine some of the sections of this paper and led to Figure 2.
[Ahm04] Adel Ahmed, Tim Dwyer, Colin Murray, Le Song, Ying Xin Wu, WilmaScope, Poster Compendium of IEEE Information Visualization (2004)
[Alo98] Alonso, D., Rose, A., Plaisant, C., and Norman, K., Viewing personal history records: A Comparison of tabular format and graphical presentation using LifeLines, Behavior and Information Technology 17, 5, 1998, 249-262.
[Aub03] Auber, D., Delest, M., Domenger, J-P., Ferraro, P., Strandh, R., EVAT - Environment for Visualization and Analysis of Trees, in Poster Compendium of IEEE Information Visualization (2003)
[Bel06] BELIV’06, BEyond time and errors: novel evaluation methods for Information Visualization, a workshop of the AVI 2006 International Working Conference. http://www.dis.uniroma1.it/~beliv06/
[Bmr06] Information Visualization Benchmark Repository www.cs.umd.edu/hcil/InfovisRepository
[Bor06] InfoVis CyberInfrastructure — http://iv.slis.indiana.edu
[CAM06] Critical Assessment of Microarray Data Analysis (CAMDA) conference, http://www.camda.duke.edu/camda06
[CES06] Council of European Social Science Data Archives (CESSDA) – http://www.nsd.uib.no/cessda
[Che00] Chen, C., Czerwinski, M. (Eds.) Introduction to the Special Issue on Empirical evaluation of information visualizations, International Journal of Human-Computer Studies, 53, 5, (2000), 631-635.
[Che04] Chen, C., Citation and Co-Citation Perspective, Poster Compendium of IEEE Information Visualization (2004)
[Che05] Jin Chen, Diansheng Guo, Alan M. MacEachren, Space-Time-Attribute Analysis and Visualization of US Company Data, Poster Compendium of IEEE Information Visualization (2005)
[Chi93] Chinchor, N., Hirschman, L., Evaluating message understanding systems: an analysis of the third message understanding conference (MUC-3), Computational Linguistics 19, 3 (1993) 409 - 449
[Cow05] Cowley, P., Nowell, L., Scholtz, J., Glassbox: an instrumented infrastructure for supporting human-interaction with information, Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS'05), pp. 296.3 (2005)
[Del04] Maylis Delest, Tamara Munzner, David Auber, Jean-Philippe Domenger, Tulip, Poster Compendium of IEEE Information Visualization (2004)
[Fek03] Fekete, J-D and Plaisant, C., InfoVis 2003 Contest, www.cs.umd.edu/hcil/iv03contest (2003)
[Fek06] Fekete, J.-D., Infovis Toolkit, http://ivtk.sourceforge.net/
[Geh04] Gehrke, J., Ginsparg, P., Kleinberg, J., Overview of the 2003 KDD cup, SIGKDD Explorations, 5, 2 (2004) 149-151.
[Gon03] Gonzales, V., Kobsa, A., Benefits of information visualization for administrative data analysts, Proceedings of the Seventh International Conference on Information Visualization,
[Gri02] Grinstein, G., Hoffman, P., Pickett, R., Laskowski, S., Benchmark Development for the Evaluation of Visualization for Data Mining, in Fayyad, U., Grinstein, G., Wierse, A. (Eds.) Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, San Francisco (2002) 129-176.
[Ham04] Frank van Ham, Technische Universiteit Eindhoven Contest Submission, Poster Compendium of IEEE Information Visualization (2004)
[Hof05] Heike Hofmann, Hadley Wickham, Dianne Cook, Junjie Sun, Christian Röttger, Boom and Bust of Technology Companies at the Turn of the 21st Century, Poster Compendium of IEEE Information Visualization (2005)
[Hon03] Hong, J. Y., D'Andries, J., Richman, M., Westfall, M., Zoomology: Comparing Two Large Hierarchical Trees, in Poster Compendium of IEEE Information Visualization (2003)
[Hos05] Bettina Hoser, Michael Blume, Jan Schröder, and Markus Franke, CPS- Company Positioning System: Visualizing the Economic Environment, Poster Compendium of IEEE Information Visualization (2005)
[Hsu04] Hsu Tzu-Wei, Lee Inman Farabaugh, Dave McColgin, Kevin Stamper, MonkEllipse, Poster Compendium of IEEE Information Visualization (2004)
[Inf04] Fekete, J.-D., Grinstein, G. and Plaisant, C., InfoVis 2004 Contest, www.cs.umd.edu/hcil/iv04contest
[Inf05] Grinstein, G., Cvek, U., Derthick, M., Trutschl, M., IEEE InfoVis 2005 Contest, Technology Data in the
[Inf06] InfoVis 2006 Contest http://sun.cs.lsus.edu/iv06/
[Ira03] Irani, P., Ware, C., Diagramming information structures using 3D perceptual primitives, ACM Transactions on Computer-Human Interaction, 10, 1 (2003), 1-19
[Kei04] Keim, D., Christian Panse, Mike Sips, Joern Schneidewind, Helmut Barro,
[Kei95] Keim, D., Bergeron, R. D., Pickett, R., Test datasets for evaluating data visualization techniques. In Grinstein, G., Levkowitz, H., Perceptual Issues in Visualization,
[Kom04] Komlodi, A., Sears, A., Stanziola, E., Information Visualization Evaluation Review, ISRC Tech. Report, Dept. of Information Systems, UMBC. UMBC-ISRC-2004-1 http://www.research.umbc.edu/~komlodi/IV_eval (2004).
[Lee04] Lee Bongshin, Mary Czerwinski, George Robertson, Benjamin B. Bederson, PaperLens, Poster Compendium of IEEE Information Visualization (2004)
[Lin04] Lin Xia, Jan Buzydlowski, Howard D. White, Associative Information Visualizer, Poster Compendium of IEEE Information Visualization (2004)
[Kob04] Kobsa, A., User experiments with tree visualization systems, Proc. of IEEE Symposium on Information Visualization (2004) 9-16
[Mac86] Mackinlay, J., Automating the design of graphical presentations of relational information, ACM Trans. on Graphics, 5, 2 (1986) 110-141
[Mor03] Morse, D. R., Ytow, N., Roberts, D. McL., Sato, A., Comparison of Multiple Taxonomic Hierarchies Using TaxoNote, in Poster Compendium of IEEE Information Visualization (2003)
[Mul97] Mullet, K., Fry, C., Schiano, D., On your marks, get set, browse! (the great CHI'97 Browse Off), Panel description in ACM CHI'97 extended abstracts, ACM,
[Mun03] Munzner, T., Guimbretière, F., Tasiran, S., Zhang, L. and Zhou, Y., TreeJuxtaposer: Scalable tree comparison using Focus+Context with guaranteed visibility. ACM Transactions on Graphics, SIGGRAPH 03 (2003) 453-462
[Nai01] North American Industry Classification System, www.census.gov/epcd/www/naics.html
[NYT06] New York Times – Election 2004 http://www.nytimes.com/packages/html/politics/2004_ELECTIONGUIDE_GRAPHIC/?oref=login (retrieved June 2005)
[Pal00] Pallett, D., Garofolo, J., Fiscus, J., Measurement in support of research accomplishments, Communications of the ACM, 43, 2 (2000) 75-79
[Pla02] Plaisant, C., Grosjean, J., and Bederson, B. B., SpaceTree: Supporting exploration in large node-link tree: design evolution and empirical evaluation, IEEE Symposium on Information Visualization (2002), 57-64.
[Pla04] Plaisant, C., The Challenge of Information Visualization Evaluation, in Proceedings of the working conference on Advanced Visual Interfaces (AVI 2004), pp.
[Sar04] Saraiya, P., North, C., Duca, K., An evaluation of microarray visualization tools for biological insight, Proc. of IEEE Symposium on Information Visualization (2004) 1-8
[Sch02] J. Scholtz, L. Arnstein, M. Kim, T. Kindberg, and S. Consolvo, User-Centered Evaluations of Ubicomp Applications, Intel Corporation IRS-TR-02-006, May 2002.
[Sch05] Scholtz, J., Steves, M.P., A Framework for Evaluating Collaborative Systems in the Real World, to appear in Proc. Hawaii International Conference on System Sciences, 2005
[She03] Sheth, N., Börner, K., Baumgartner, J., Mane, K., Wernert, E., Treemap, Radial Tree, and 3D Tree Visualizations, in Poster Compendium of IEEE Information Visualization (2003)
[Shn06] Shneiderman, B., Plaisant, C., Strategies for Evaluating Information Visualization Tools: Multidimensional In-depth Long-term Case Studies, Proc. of BELIV’06, BEyond time and errors: novel evaLuation methods for Information Visualization, a workshop of the AVI 2006 International Working Conference, ACM (2006) 38-43
[SMo06] Smart Money Map of the Market www.smartmoney.com (retrieved June 2005)
[Spen00] Spenke, M., Beilken, C., InfoZoom - Analysing Formula One racing results with an interactive data mining and visualization tool, in Ebecken, N. Data mining II, (2000), 455–464
[Sta00] Stasko, J. Catrambone, R., Guzdial, M. and McDonald, K., An Evaluation of Space-Filling Information Visualizations for Depicting Hierarchical Structures, International Journal of Human-Computer Studies, 53, 5 (2000) 663-694.
[Teo04] Soon Tee Teoh, Kwan-Liu Ma, One-For-All -
[Tra00] Trafton, J., Tsui, T., Miyamoto, R.; Ballas, J., Raymond, P., Turning pictures into numbers: extracting and generating information from complex visualizations. International Journal of Human Computer Studies, 53, 5 (2000), 827-850.
[Tho05] Thomas, J. and Cook, K. (Eds.) Illuminating the Path: The Research and Development Agenda for Visual Analytics, IEEE CS Press (2005), http://nvac.pnl.gov/agenda.stm
[TRE06] Text REtrieval Conference (TREC), http://trec.nist.gov/
[Tym04] Jaroslav Tyman, Grant P. Gruetzmacher, John Stasko, InfoVisExplorer, Poster Compendium of IEEE Information Visualization (2004)
[VAS06] Grinstein, G., O’Connell, T., Laskowski, S., Plaisant, C., Scholtz, J., Whiting, M., VAST 2006 Contest: A tale of Alderwood, Proc. of IEEE Visual Analytics Science and Technology conference (2006) to appear.
[VAS06b] VAST 2006 Contest: www.cs.umd.edu/hcil/VASTcontest06
[Voo00] Voorhees, E., Harman, D., Overview of the sixth Text Retrieval Conference (TREC-6), Information Processing and Management, 36 (2000) 3-35
[Wei04] Weimao Ke, Katy Borner,
[Won04] Wong Pak Chung, Beth Hetzler, Christian Posse, Mark Whiting, Sue Havre, Nick Cramer, Anuj Shah, Mudita Singhal, Alan Turner, Jim Thomas, IN-SPIRE, Poster Compendium of IEEE Information Visualization (2004)
[Zei05] Zeis Annerose, Sergej Potapov, Martin Theus,