Promoting Insight-Based Evaluation of Visualizations: From Contest to Benchmark Repository

Catherine Plaisant
Human-Computer Interaction Lab.
plaisant@cs.umd.edu

Jean-Daniel Fekete
INRIA Futurs/LRI, Université Paris-Sud
Jean-Daniel.Fekete@inria.fr

Georges Grinstein
Institute for Vis. & Perception Research
grinstein@cs.umd.edu
ABSTRACT
Information
Visualization (InfoVis) is now an accepted and growing field with numerous
visualization components used in many applications. However, questions about
the potential uses and maturity of novel visualizations remain. Usability
studies and controlled experiments are helpful but generalization is difficult.
We believe that the systematic development of benchmarks will facilitate the
comparison of techniques and help identify their strengths under different
conditions. A benchmark typically
consists of a dataset, a list of tasks, and a list of non-trivial discoveries. We
were each involved in the organization of three information visualization
contests for the 2003, 2004 and 2005 IEEE Information Visualization
Symposia. Our goal is to encourage the
development of benchmarks, push the forefront of the InfoVis field by making
difficult problems available, create a forum for the discussion of evaluation
and provide an interesting event at the InfoVis conference. The materials produced by the contests are
archived in the Information Visualization Benchmark Repository. We review the state of the art and challenges
of evaluation in InfoVis, describe the three contests, summarize their results,
discuss outcomes and lessons learned, and conjecture the future of
visualization contests.
General Terms
Visualization,
information, competition, contest, benchmark, repository, measure, metrics
Information
Visualization is now an accepted and growing field with numerous visualization
components used in mainstream applications such as SPSS/SigmaPlot, SAS/GRAPH,
and DataDesk, in commercial products such as Spotfire, Inxight, HumanIT, and
ILOG JViews, and in domain specific standalone applications such as interactive
financial visualizations [SMo06] and election data maps [NYT06]. Nevertheless, questions remain about the
potential uses of these novel techniques, their maturity and their limitations.
Plaisant
reviewed evaluation challenges specific to information visualization and
proposed initial steps [Pla04] such as refined evaluation methodologies, use of
toolkits, dissemination of success stories, and the development of contests
(Figure 1), benchmarks, and repositories, which are the focus of this paper.
Figure 1: A collage of sample
screens from the InfoVis 2004 contest illustrating the diversity of
visualization methods used to address a task.
Empirical user studies are very helpful but take significant time and resources, and are sometimes of limited use because they have been conducted with ad-hoc data and tasks in constrained laboratory situations. Benchmarks facilitate the comparison of
different techniques and encourage researchers to work on challenging problems.
However, to be convincing, the utility of new techniques needs to be
demonstrated in a real setting, within a given application domain and set of
users. Contests attempt to create
surrogate situations that are representative of real world situations. They
engage teams’ competitive spirit to produce materials that can help the
community compare visualization tools applied to the same problem.
Competitions
help push the forefront of a field quickly. In some cases it is simply the
emotional aspect of winning or the excitement of live competition that compels
researchers to participate. TREC competitions (the Text REtrieval Conference) [Voo00]
exemplify the best of these in being able to bring in many corporate and
academic research groups.
A
contest poses a problem that many will attempt to solve. If the problem is
challenging and representative of a real world situation, then the solutions
proposed by contestants provide insights into what techniques are possible, and
which ones are potentially better to pursue. Often these solutions provide such
good results that other participants are driven to compete in the next year’s
contests. The current contest data sets and tasks become part of the baseline
against which new techniques can be tested.
These contest submissions then describe the insights found with various
tools, illustrate the current state of the art and entice researchers to find
even better solutions.
In
this paper we review the state of the art and challenges of evaluation in the
field of information visualization, describe the three contests, summarize
their results, discuss the outcome of those three events, and conjecture the
future of visualization contests.
Information
visualization systems can be very complex [Chen00] and require evaluation
efforts targeted at different levels. One approach described in the Visual
Analytics research agenda [Tho05] is focused on three levels: the component
level, the system level, and the work environment level (Figure 2).
At the component level are the individual algorithms
(e.g. clustering or linguistic analysis), visual representations, interactive
techniques and interface designs. Data
analysis algorithms can typically be evaluated with metrics that can be
observed or computed (e.g. speed, accuracy, sensitivity or scalability), while
other components require empirical user evaluation to determine their benefits
[Che00, Chen00b]. Metrics include efficiency (e.g., time to complete simple tasks) and effectiveness (e.g., number of errors or incomplete tasks). There have been demonstrations of faster task completion, reduced error rates, and increased user satisfaction measured in laboratory settings using some visualization components. These studies are helpful for comparing isolated interaction techniques or data representations, e.g., [Ira03, Alo98]. Studies comparing slightly more complex tools combining a few components (at least a choice of interaction and visual representation) are also available, e.g., [Pla02, Kob04, Sta00]. They often reveal that different tools perform better for different types of tasks, but it is often difficult to pull apart what part of the system really impacts the performance of the tool. Some limited techniques allow computed scores to be generated to evaluate the potential quality of simple displays, e.g., [Mac86], but controlled experiments remain the workhorse of evaluation.
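As a rough illustration only (not drawn from any of the studies cited above), the short Python sketch below shows how such efficiency and effectiveness measures might be aggregated from user-study trial logs; every field name and value in it is invented.

# Hypothetical sketch: aggregating component-level metrics from user-study trial logs.
# The record fields (participant, task, seconds, errors, completed) are illustrative only.
from statistics import mean

trials = [
    {"participant": "P1", "task": "locate", "seconds": 12.4, "errors": 0, "completed": True},
    {"participant": "P2", "task": "locate", "seconds": 18.1, "errors": 1, "completed": True},
    {"participant": "P3", "task": "locate", "seconds": 25.0, "errors": 2, "completed": False},
]

def summarize(trials):
    # Efficiency: mean time on completed trials; effectiveness: error and completion rates.
    completed = [t for t in trials if t["completed"]]
    return {
        "mean_time_completed": mean(t["seconds"] for t in completed) if completed else None,
        "mean_errors": mean(t["errors"] for t in trials),
        "completion_rate": len(completed) / len(trials),
    }

print(summarize(trials))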
Figure 2: The 3
evaluation levels for Visual Analytics (Figure 6.1 in [Tho05])
At
the system level, interfaces combine and integrate multiple components and need
to be evaluated by comparing them with technology currently used by target
users. Metrics need to address the
learnability and utility of the system.
Those evaluations may take place in the laboratory using surrogate
scenarios but address complex tasks conducted over a longer period of time than
component-level evaluations. A new
approach is to encourage insight-based evaluation. The InfoVis 2003-2006 contests are examples
of efforts encouraging insight-based evaluation [Inf03, Inf04, Inf05, Inf06],
and there are recent empirical studies which measure insight [Sar04].
At
the work environment level evaluation addresses issues influencing
adoption. Metrics might include user
satisfaction, trust and productivity. Case studies and ethnographic studies are
used, but they remain rare in the field of information visualization. Case studies report on users in their natural environment doing real tasks [Gon03, Tra00].
They can describe discoveries, collaborations among users, the frustrations of
data cleansing and the excitement of data exploration. They can report on
frequency of use and benefits gained. The disadvantage is that they are very
time consuming and may not be replicable or applicable to other domains.
Recently the BELIV’06 workshop [BEL06] provided a good overview of the most recent work on improving information visualization evaluation, including the development of specific heuristics, metrics, or taxonomies of tasks. Of course usability evaluation remains a
cornerstone of user-centered design and evaluation. It is of paramount
importance for product engineering but also a powerful tool for researchers as
it provides feedback on problems encountered by users and guides designers
toward better designs at all three evaluation levels.
Simple benchmark data sets abound. Some repositories simply make data sets available (e.g., the Council of European Social Science Data Archives [CES06]) while others offer tools to help promote research in specific domains (e.g., the CAMDA contest on microarray data analysis [CAM06]).
Although ideally one should be able to evaluate the quality of answers computationally, this is often not possible. The problem is the fuzziness of answers to the contest: an answer may be a collection of articles, a new algorithm, or a new visualization, whose correctness may not be computable. This forces human evaluation (TREC still uses human judging to determine the accuracy of retrieval), and such is the case for the IEEE InfoVis contests.
Another
difficulty for information visualization comes from the impact of the discovery
process, an extremely interactive and personal activity. Whereas computational
algorithms can be compared through the accuracy of the results, most often it
is not possible to accurately measure the results of visualization. We do not have measures of perceptual information transfer. Research on measures of interestingness and other metrics related to visualization is beginning [Kei95, Gri02], but these measures are in their infancy and too simple to be applied to the current contests.
We
can identify simple tasks which yield precise results or we can specify
exploratory tasks and thus have much less predictable results. This makes the evaluation process difficult to plan for and forces real-time evaluation criteria, which end up being reviewer-dependent. Despite these constraints, one
can still argue for simple tasks. A system which does not make it possible to
achieve simple tasks would be a very limiting system and is likely not to
support more complex or exploratory tasks. One can also argue that simple tasks
are unrealistic. We tried to balance task simplicity vs. complexity to obtain a
satisfying tradeoff.
Distinct
challenges present themselves for evaluating at the component or system/work
environment levels.
At the component level, the
main challenge is to move beyond the proliferation of isolated evaluations to a
more concerted effort to generate guidelines for selecting techniques based on
the tasks and data characteristics. A
characteristic of the field of information visualization is the great diversity
of approaches available to designers to handle any type of data and the
combinatorial explosion of possible implementations. Toolkits [Fek06] or even code repositories
[Bor06] can help researchers control some of that diversity to adequately
compare individual components. In controlled studies, dataset and task selection has until now been an ad-hoc process, making it difficult to compare results across studies. This would be aided by the development of comprehensive task taxonomies and benchmark repositories of datasets, tasks and results. Another problem is that studies generally include only simple tasks. A literature survey [Kom04] confirms this by stating that experiments usually include locate and identify tasks, but that tasks requiring users to compare, associate, distinguish, rank, cluster, correlate or categorize have rarely been covered. Those studies
are very difficult to design, and better experimental design training for
researchers will greatly improve the outcome of evaluation efforts.
Another characteristic of visualization is that the
analysis process is rarely an isolated short term process. Users may need to
look at the same data from different perspectives and over a long time. Users may also use analytics to answer
questions about visible and non-visible patterns. They may also be able to
formulate and answer questions they didn’t anticipate having before looking at
the visualization. This is in contrast with typical empirical study techniques, which recruit subjects for a short time to work on imposed tasks. Finally, discoveries can have a huge impact but occur very rarely, or not at all, and are unlikely to be observed during a study. Insight-based studies as described in [Sar04] are a first step, but new evaluation methods need to be devised to address this problem.
At the system
level, evaluating information
visualizations and their
interfaces is a daunting challenge.
Success is difficult to quantify and utility measures are elusive. Tasks become significantly more complex and
difficult to emulate in a laboratory environment. Working with realistic data is crucial but
“ground truth” is not always available. Even when it is available, comparing steps to results is often impossible. Users’ motivation and expertise greatly influence performance. In traditional component-level empirical studies the level of training of subjects is typically limited and subjects are not allowed to consult with colleagues or use outside sources as they normally would in their work environments. Using domain experts will lead to more realistic results, but individual differences between subjects must be controlled for the results to be useful.
Trust is a particularly important aspect of Visual Analytics (VA) system evaluation. It is challenging to measure, yet of paramount importance to user acceptance during product deployment. Discovery is
seldom an instantaneous event, but requires studying and manipulating the data
repetitively from multiple perspectives and possibly using multiple tools. Facilitating the transfer of data between
heterogeneous tools and keeping the history of the investigation might well be
just as important for discovery as the functionalities of individual
components. Longitudinal studies may be
more helpful but they are more difficult to conduct. Measuring the impact of integrated components that require users to manipulate visual as well as textual representations, use the web to find complementary information, integrate analytics and possibly spend hours brainstorming with colleagues remains a challenge. Another challenge is that success may not be due to, nor easily traceable back to, the visualization. For example, an effective visualization used on a daily basis by an analyst may heighten their awareness of certain activities by allowing them to absorb and remember large amounts of information
effortlessly. However it might be
difficult or impossible to link later decisions to a particular tool as
awareness is difficult to identify and measure, and decision-making uses
information from diverse sources. In
fact, the introduction of visualization might even trigger changes in work practices,
exacerbating the problem of identifying cause and effect. Shneiderman and Plaisant have proposed to
use Multidimensional In-depth Long-term Case Studies (MILCS) as a way to study
and evaluate creativity tools such as visual analytics and information
visualization [Shn06].
The first contest took place in 2003 [Inf03]
(Figure 3). We invited submissions of case studies on the use of information
visualization for the analysis of tree structured data, and in particular to
look at differences between pairs of similar trees.
There are hundreds of types of trees with varying characteristics. In an effort to be representative of this diversity while remaining accessible for a contest, we selected three very different examples. Three pairs of datasets were provided in a simple XML format (a hypothetical sketch of such a format appears after the list below).
· Phylogenies: Small binary trees (60 leaf nodes) with a link length attribute. No node attributes except their names.
· Classifications: Very large trees (about 200,000 leaf nodes) with large fanouts. Three node attributes, all nominal. Labeling, search, and showing results in context are important. We allowed teams to work on a subset of the dataset (the "mammal" subtree) if they could not handle that many nodes.
· File system and usage logs: The trees are large (about 70,000 leaf nodes). Many attributes, numerical and nominal. Changes between the two trees can be topological changes and attribute value changes. Data for 4 periods was available.
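The exact schema distributed to the teams is not reproduced here; the following Python sketch only illustrates, with invented element and attribute names, what loading such a simple XML encoding of an attributed tree might look like.

# Hypothetical sketch of a simple XML encoding of an attributed tree
# (element and attribute names are invented, not the actual 2003 schema).
import xml.etree.ElementTree as ET

SAMPLE = """
<tree name="filesystem-A">
  <node name="root" size="0">
    <node name="src" size="1200">
      <node name="main.c" size="300"/>
      <node name="util.c" size="900"/>
    </node>
    <node name="docs" size="450"/>
  </node>
</tree>
"""

root = ET.fromstring(SAMPLE)
leaves = [n for n in root.iter("node") if len(list(n)) == 0]
print("leaf nodes:", len(leaves))  # e.g., a quick sanity check of dataset size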
We
provided general tasks (about 40 tasks in 11 categories) and tasks specific to
the selected datasets. General tasks were low level tasks commonly encountered
while analyzing any tree data: topological tasks (e.g., which branch has the largest fan-out?), attribute-based tasks (e.g., find nodes with high values of X), or comparison tasks (e.g., did any nodes or subtrees "move"?). On the other hand, the tasks specific to particular datasets included broader goal-setting tasks (e.g., for the phylogenies, what mapping between the two trees’ topologies could indicate co-evolution, and, maybe, the points where the two proteins were not co-evolving?). We made clear that it was acceptable not to work on all tasks and that partial answers were OK. We also clarified that we were not looking for a detailed result list (e.g., a list of deleted nodes for the task “what nodes were deleted”) but an illustration or demonstration of how the visualization helped find the answer. General background information
was provided about the data and tasks, which was particularly important for the
phylogenies.
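To make the distinction concrete, the following Python sketch computes three of these low-level general tasks directly from an invented nested-dictionary tree; it is not taken from any contest entry, and it also shows why we asked for an illustration of the visual process rather than such raw result lists.

# Hypothetical sketch of three low-level tree tasks, computed on an invented
# nested-dictionary representation (not the contest's actual data or tooling).

tree_a = {"name": "root", "x": 1, "children": [
    {"name": "a", "x": 9, "children": [
        {"name": "a1", "x": 2, "children": []},
        {"name": "a2", "x": 7, "children": []}]},
    {"name": "b", "x": 3, "children": []}]}

tree_b = {"name": "root", "x": 1, "children": [
    {"name": "b", "x": 3, "children": [
        {"name": "a1", "x": 2, "children": []}]},   # "a1" moved under "b"
    {"name": "a", "x": 9, "children": []}]}

def walk(node, parent=None):
    yield node, parent
    for child in node["children"]:
        yield from walk(child, node)

# Topological task: which node has the largest fan-out?
widest = max(walk(tree_a), key=lambda np: len(np[0]["children"]))[0]

# Attribute-based task: find nodes with high values of X.
high_x = [n["name"] for n, _ in walk(tree_a) if n["x"] > 5]

# Comparison task: which nodes changed parent ("moved") between the two trees?
parents_a = {n["name"]: (p["name"] if p else None) for n, p in walk(tree_a)}
parents_b = {n["name"]: (p["name"] if p else None) for n, p in walk(tree_b)}
moved = sorted(name for name in parents_a.keys() & parents_b.keys()
               if parents_a[name] != parents_b[name])

print(widest["name"], high_x, moved)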
Teams
had five months to prepare. The participants were required to submit the
following materials:
• A two-page summary
• A video illustrating the interactive techniques used
• A web page of accompanying information
• An index page with team information
We
received eight entries. It was a small
number but satisfactory for the first contest.
The
first main finding was that the tasks and datasets were too complex for such a
contest. Each tool addressed only a subset of the tasks and only for a subset
of the datasets. The phylogeny chosen
required domain expertise hence was “real”, and even though it consisted of a
small binary tree, it was not used, probably because the tasks were complex and
required working with biologists.
The
second main finding was that it was difficult to compare systems even with
specific datasets and tasks. We had hoped to focus the attention of submitters
on tasks and results (insights), but the majority of the materials received
focused on descriptions of system features.
Little information was provided on how users could accomplish the tasks
and what the results meant, making it very difficult for the judge to compare.
The systems presented were extremely diverse, each using different approaches
to visualize the data.
There
were three first-place entries. TreeJuxtaposer [Mun03] (Figure 3) submitted the
most convincing description of how the tasks could be conducted and the results
interpreted. Zoomology [Hon03] (Figure 4) demonstrated how a custom design for
a single dataset could lead to a useful tool that addressed many of the tasks
satisfactorily. InfoZoom [Spen00] (Figure 5) was the most surprising entry. This tool was designed for manipulating tables, not trees. However, the authors impressed the judges by showing that they could perform most of the tasks, find errors in the data, and provide insights into the data. The three
second-place entries showed promise but provided less information to the judges
on how the tasks were conducted and the meaning of the results. EVAT [Aub03] (Figure
6) demonstrated that powerful analytical tools complementing the visualization
could assist users in accomplishing their tasks. Taxonote [Mor03] (Figure 7) demonstrated that labeling is an important issue, making textual displays attractive. The third second-place entry combined several existing tools (Figure 8).
All
entries were given a chance to revise their materials after the contest. We
required participants to fill a structured form with screenshots and
explanations for each task. That information is archived in the Information
Visualization Benchmark Repository [Bmr06].
Figure 3: TreeJuxtaposer
Figure 4: Zoomology
Figure 5: InfoZoom
Figure 6: EVAT
Figure 7: Taxonote
Figure 8: A combination of tools
The
second competition coincided with the 10 year anniversary of the InfoVis
Symposium. As the visualization of the history of a field of research is a
problem interesting in itself, it naturally formed the core part of the
contest. The key advantage of the topic was that it was familiar to all participants. The disadvantage was that the selected corpus was not readily available in a usable form.
The set of all publications on a topic is too large a universe of discourse for a competition. We first argued about which conferences or journals to include, then decided to limit the dataset to all the IEEE InfoVis Symposium papers and all of the articles used as references in those papers. Metadata is rich for IEEE and ACM publications, and unique keys are available.
Producing
a clean file (metadata for the collection of documents) was a much bigger
challenge than we had imagined. We first assumed that both the most important articles and the most important authors in information visualization would be referred to by most of the articles published within the InfoVis symposium. The set of references originating from articles published within InfoVis seemed to us both focused on the field and complete. It would be unlikely that an important publication in information visualization would seldom be referenced by other articles.
This was partially correct, but text metadata still yielded numerous ambiguities. IEEE manages the InfoVis articles, which are less curated than those of the ACM. Much text metadata was non-unique (e.g., many-to-one names such as Smith, Smyth, Smithe, …). Reference titles were too noisy and in many cases erroneous, as the ACM Digital Library handles text as plain strings and computations such as string comparison are still weak. Much curation on our end was necessary, as references were noisy, sometimes missing, and sometimes even pointing to non-existing URLs.
We
thus embarked on cleaning the data. This was a complicated, manually intensive process with multiple passes; even with automatic reference extraction, we found no reasonable automatic system to suitably resolve the problems. We
manually extracted the articles from eight years of pdf files from the symposia
available in the digital library. We
then semi-automatically retrieved the articles referenced in those papers again
from the digital library. We extracted those that could be found there, and manually cleaned and unified the publications not included in the ACM library.
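As a rough, purely hypothetical illustration of the kind of semi-automatic matching this cleaning involved, the sketch below flags author strings that are likely variants of the same name for manual review; the names, the similarity threshold, and the method are invented, and the real process remained largely manual.

# Hypothetical sketch: flagging likely variants of the same author name for hand review.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    # Crude string similarity; real disambiguation needs much more context.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

raw_authors = ["B. Shneiderman", "Ben Shneiderman", "B. Schneiderman", "J. Smith", "J. Smyth"]

clusters = []                       # each cluster collects spellings judged similar
for name in raw_authors:
    for cluster in clusters:
        if similar(name, cluster[0]):
            cluster.append(name)
            break
    else:
        clusters.append([name])

# Candidate merges would still be reviewed by a person before unifying the metadata.
for cluster in clusters:
    print(cluster)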
The
result was a file containing 614 descriptions of articles published between
1974 and 2004 by 1,036 authors, referencing 8,502 publications. It took well over 1,000 hours for us to
construct that file, with over 30 people involved.
We
proposed 4 high level tasks with a great deal of flexibility for a variety of
solutions:
1. Create a static visualization showing an overview of the 10 years of InfoVis.
2. Characterize the research areas and their evolution.
3. The people in InfoVis: where does a particular author/researcher fit within the research areas defined in task 2?
4. The people in InfoVis: what, if any, are the relationships between two or more researchers?
We
suggested particular names for task 3 to facilitate comparisons between
submissions, and participants used them, along with other names.
The participants were required to submit:
• A two-page summary
• A video illustrating the interactive techniques used
• A structured web form providing details as to how the tasks were accomplished and what discoveries or insights were identified
There were 18 submissions from 6 countries.
Quality
improved dramatically between 2003 and 2004.
The good news was that most teams had provided a lot more insights than
we had seen in the 1st contest.
Still, some teams had tools that seemed promising “on paper” but reported very few insights (in consequence they did not do very well in the contest). On the other hand, some teams presented tools that seemed of doubtful utility to the reviewers at first but were able to report useful insights, therefore faring better than we had expected. Of course the
best teams had everything at once:
promising visualizations, lots of insight reported, and convincing
explanations of how the insights were obtained using the tools.
None
of the 12 selected teams answered all the questions. A few of the
participants had extensive experience with text analysis and that was visible
in their results. Others had background knowledge of the InfoVis community and could provide better hypotheses about what they were seeing. One tool was developed entirely from scratch for the contest, but most teams showed interesting new uses of existing techniques. Node-link diagrams were a very commonly used representation for many of the tasks, with some notable exceptions.
This
second contest had a single dataset and simpler tasks so we thought reviewing
and comparing results would be much easier.
Not so. Ideally one would be able to evaluate the quality of answers
computationally but this was not possible. The problem was the fuzziness of the
answers and the lack of “ground truth” or even consensus on what the best
answer might look like. Teams’ answers took various forms: from a collection of
articles or names to a new algorithm to a new visualization, all of whose
correctness was not computable. Only human evaluation was appropriate to judge
the validity of the answers. In information retrieval, TREC for example does use human judging to determine the relevance of documents (i.e., the answers), from which metrics can be computed for a team’s set of results. Short
of spending time with the team throughout the discovery process (an extremely
interactive and personal activity) we could only base our judgment on the
materials provided (video and Web form).
There were three 1st
place entries and one student 1st place:
· The entry from
· The entry from Microsoft and
· The student 1st prize went to a team from
the
Second
place prizes (see Figure 13 to 19) went to the Université de Bordeaux I with
the
We were satisfied that teams reported useful insights, but we were still surprised by how few were reported, and by how few of those were really surprising. Insights about the whole
structure were rare and only came from teams who had experience looking at
other domains (e.g., the fact that InfoVis is a small world, tightly connected,
was mentioned only by 2 teams.) One team
noticed that the most referenced papers were published at CHI, not at
InfoVis. Only three teams noticed the
existence of references to future papers, a problem resulting from automatically
processing references and confusing multiple versions with similar titles such
as a video and a paper. Only one insight dealt with something that was surprisingly missing, namely that there were no papers in the dataset from several other competing InfoVis conferences, despite the fact that they had been held for several years.
Teams interpreted the tasks and used the data in surprisingly different ways. A task such as “describe the relationships between authors” was interpreted in at least nine different ways: as reporting on co-authorship; co-citation; people working on similar topics; people having a similar number of co-authors; being part of big groups or teams; having a similar number of publications; having a similar number of references; working in the same institution; or working in the same clique or empire.
Teams also used the data differently: they created
displays showing either only the IEEE Infovis symposium papers or all papers
including the references. Sometimes they
combined both authors and topics and sometimes they used separate
displays. In one case we suspected that a team used only the papers’ first authors but could not tell for sure. One team used only references between InfoVis papers, but not references to papers from other venues. The data made visible was generally pruned dramatically to work with the tools or to create more useful or possibly more appealing displays. Few ever attempted to show complete views. Some teams had a “celebrity” approach, ignoring everything but the star papers or authors based on some single criterion (e.g., number of citations). Some clustered first then pruned later, with no clear explanation of what had been pruned.
Reviewing the displays seemed easy at first, but it quickly became impossible to remember what data we had just been looking at, not to mention trying to compare results even when that would have been possible.
Many displays had no legends or very poor legends and none had any
summary of the process that generated the display. Each team probably had a clear model of the
scope of the data and how it was filtered, aggregated and interpreted, but the
displays did not reflect that.
Few
teams even attempted to answer question 1, to create a static visualization
showing an overview of the 10 years of InfoVis.
Teams merely reused one of the screen shots from other tasks, so we felt only one aspect of the data had been portrayed and not the entire 10 years of InfoVis. Teams reported very different topics and different numbers of topics (from 5 to 12), and some created topics on the fly, refining the topics iteratively. Sometimes a seemingly narrow topic would take a prominent place: “parallel coordinates” was a major topic in one case, while in another system “taxonomy” was a major topic. Reviewing all the submissions gave us an impression of randomness in the choice or labeling of the topics. One of the student teams used their professor’s notes to extract topics. It was innovative but, again, affected our ability to make comparisons. Most visualizations limited the total number of topics, which limited the insights to those topics. But topic extraction was not the focus of the contest, so we did not judge the quality of the topics. Nevertheless, this made it more difficult to compare insights.
Some tools (e.g., In-Spire [Won04] and an entry from
Overall, labeling
remained a very big issue. Very rarely could we actually guess paper titles
when looking at a display. Better
dynamic layout techniques for labels were clearly needed. Labels for papers usually consisted of the first
few words or even just the first author, making it difficult to remember if we
were looking at author relationships, or papers, or even topic relationships,
e.g. a large node labeled “Johnson” could represent the often-referenced
Treemap paper.
Some tools had
only one window [Teo04], but most used multiple windows, showing either
variants [Ham04] or very different displays for different tasks [Chen04], [Kei04].
The Paperlens submission [Lee04] illustrated the importance of coordinating
views. Only two teams dealt with missing data and uncertainty; others ignored the problem entirely. Visual metaphors seemed to have had an effect on the words teams used to describe their findings, e.g., one team [Ahm04] talked about empires when looking at towers in 3D, while others talked about cliques when looking at clusters on node-link diagrams. Unfortunately, we also
saw examples of “nice pictures” that didn’t seem to lead to any insight.
The
2004 contest session at the workshop was very well attended and we received
extremely positive feedback. Attendees
reported being able to appreciate the wide diversity of solutions and contrast
the different techniques. We conjecture
that the topic we had selected also made the contest more accessible.
Figure 9: Link diagram, from
Figure 10: In-Spire clusters from the Pacific Northwest National Laboratory [Won04]
Figure 11: PaperLens distributions from Microsoft and the
Figure 12:
Wilmascope topic flows from the
Figure 13: Document graphs from the U. de Bordeaux I and
the
Figure 14: Link diagram from the
Technische Universiteit Eindhoven [Ham04]
Figure 15: Topical overview and focus from Georgia
Institute of Technology [Hsu04]
Figure 16: Document timeline and classes from the
Figure 17: Topic classification from
Figure 18: Author link diagram from
Figure 19: Topic and author timeline from the
In
the third competition [Inf05] the chairs aimed for the evaluation of more
complete visualization systems and a different type of data. The data set was larger and the questions
more targeted. The goal was to identify how well visualization or visual
analytics systems or even specific tools could perform with a large but easily
understood data set. The chairs missed a key point in that the problem was
probably better phrased as a GIS challenge rather than simply an information
visualization one. The chairs also released the data set only for the
competition. The owner of the data set did not permit an open release,
something that the chairs tried to avoid and hopefully will avoid in the
future.
We selected a large,
information rich, and real data set. The
data consisted of information on about 87,659 technology companies in the US,
including year founded, zip code, yearly sales, yearly employment information,
along with industry and product information using the North American Industry
Classification System [Nai01]. This was a large data set with geographic
interpretation, one which pushed the limits of many systems. The data was cleaned by graduate students at
the
The three questions related to the characterization of correlations or other patterns amongst variables in the data were:
1. Characterize correlations or other patterns among two or more variables in the data.
2. Characterize clusters of products, industries, sales, regions, and/or companies.
3. Characterize unusual products, sales, regions, or companies.
One additional question was more general and open-ended:
4. Characterize any other trend, pattern, or structure that may be of interest.
The chairs felt that these precise questions would make evaluation simpler. Again this was not correct, as all questions were open-ended and comparing the discovery of different correlations was difficult.
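As a purely illustrative sketch of the kind of open-ended analysis question 1 invited (and of why the correlations different teams reported were hard to compare), the Python fragment below characterizes correlations on a tiny invented table; the column names only loosely echo the real fields.

# Hypothetical sketch of question 1: characterizing correlations among company variables.
# The table is invented; the real data set had on the order of 87,659 rows.
import pandas as pd

companies = pd.DataFrame({
    "year_founded": [1985, 1992, 1999, 2001, 1978],
    "employees":    [120,  45,   900,  12,   3000],
    "sales_musd":   [15.2, 3.4,  210.0, 0.8, 890.0],
})

# Pairwise correlations; with the real data one would first clean the fields and
# probably log-transform heavily skewed variables such as sales.
print(companies.corr(numeric_only=True))

# A simple grouping by founding decade, loosely echoing the "boom and bust" theme of [Hof05].
companies["decade"] = (companies["year_founded"] // 10) * 10
print(companies.groupby("decade")["sales_musd"].median())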
The participants were required to submit materials using the same format as in 2004. There were only 10 participants. This was a surprise, but the short time from data availability to submission deadline was probably the most important factor. We had no submissions from student teams, possibly because we released the first version of the data set at the end of February, when most university information visualization classes are already well under way.
The
chairs managed the review process and evaluated the entries in a similar manner
as the previous year, but used specific ratings for insight, presentation,
interaction, creativity, flexibility, and novelty. There were two first and two second place
awards. Teams led by Iowa State University [Hof05] and Penn State University [Che05] took first place, having answered all questions, while the Universität Karlsruhe [Hos05] and Augsburg University [Zei05] provided strong answers and received second place prizes.
The
first place winners took two different approaches. The team from
The
second place winners had strong answers. The Company Positioning System from the
Universität Karlsruhe was visually stimulating, and had high scores on
interaction and novelty (Figure 22) [Hos05] while the team from
Figure 20:
Figure 21:
Figure 22: Company Positioning
System by the Universität Karlsruhe
Figure 23:
All in all the four winners
covered a broad spectrum of techniques for information visualization solutions.
In all cases we found that statistical analysis played a key role. The data was
just too large for simple human consumption thereby putting visualization in a
collaborating role with analysis. Much processing of the data took place and
all contestants used coordinated views to answer the questions.
The contest illustrated the difficulty researchers have in giving adequate evidence that their tools can effectively support the tasks. Demonstrating the power of a tool can be difficult. Researchers are trained to describe their tools’ novel features more than to illustrate them with convincing examples using real data. In
2003 participants barely reported any insight at all. Everyone was focused on the description of
their tool. By 2004 more participants
(not all) were able to provide insights. In 2005 insights were more common.
Half
of the participants were students who built their tools. These tools were not as polished as
industrial products or well developed research systems. In 2003, we provided large data sets with
some meaningful subsets and in 2004 the data set was not very large. However, in 2005, there was no subset
provided and the number of participating students dropped. Providing benchmarks that fit the size of student projects seems important to the success of future contests.
The evaluation process is time consuming and was seen by some of the chairs as a daunting task. Ideally one could compute metrics and add them up, assuming independence, to get a summary score, but since the questions are open-ended and there is no known “ground truth”, human evaluation is required. Even though there were a limited number of specific tasks, these can be interpreted differently, and of course participants did interpret them in many ways. We ended up having to compare non-comparable steps and results.
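For illustration only, the sketch below spells out the naive summary score alluded to above: average each category across reviewers and combine the averages with weights, assuming independence. The categories echo those used in 2005, but the weights and ratings are invented and no such formula was actually applied.

# Hypothetical sketch of a naive summary score over the 2005 rating categories.
from statistics import mean

CATEGORIES = ["insight", "presentation", "interaction", "creativity", "flexibility", "novelty"]
WEIGHTS = {"insight": 3.0, "presentation": 1.0, "interaction": 1.0,
           "creativity": 1.0, "flexibility": 0.5, "novelty": 1.0}   # invented weights

reviews = [   # 1-5 ratings from two hypothetical reviewers for one entry
    {"insight": 4, "presentation": 3, "interaction": 4, "creativity": 5, "flexibility": 3, "novelty": 4},
    {"insight": 5, "presentation": 2, "interaction": 3, "creativity": 4, "flexibility": 3, "novelty": 4},
]

def summary_score(reviews):
    per_category = {c: mean(r[c] for r in reviews) for c in CATEGORIES}
    return sum(WEIGHTS[c] * per_category[c] for c in CATEGORIES) / sum(WEIGHTS.values())

print(round(summary_score(reviews), 2))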
Evaluating
the results remained a subjective activity. After seeing the submissions, the
2003 and 2004 contest reviewers decided to classify the teams in three
categories: no evidence of insights gathered using the tool, some insight, and
lots of insight, i.e. worthy of a first prize.
The 2005 contest used more than 6 categories. This helped with
discussions but required a great deal more detailed reviewing of the submitted
videos and papers.
Since the data is the same,
repeatedly looking at entries is taxing.
It can be very hard to remember “who did what” or “who had this
insight”. An insight implies novelty of
the finding, so reviewers might more positively weigh a reported insight the
first time they encounter it and undervalue it later on when reported by
another team. Videos were extremely important. Without them it would have been impossible to understand how most tools worked and what process was used to answer the questions. With videos, interactions become understandable. Verbal comments on the videos were indispensable in explaining what the participants were highlighting. This is quite different from simply reviewing a paper and ranking the results. Labeling of the results was less necessary because the verbal descriptions on the videos were sufficient, but this required the reviewers to remember key points. On the downside,
dealing with videos was very time
consuming. Videos were large, download
times high, and distribution to reviewers slow (something we now know how to
resolve). For the first two contests we
were flexible about the format of the videos submitted but this created
problems such as finding converters or hunting for missing codecs. For the 2005 contest we required a single
format and this simplified the process.
With
the InfoVis 2003 contest we attempted to provide real data and tasks while
trying to narrow the problem to one data type (trees) and three representative
tree types. The contest taught us that
the problem was still too large for a contest and that the vague nature of the
tasks made it impossible to compare answers effectively. In contrast, the 2004 contest had only one dataset, far fewer tasks, and a more structured reporting format. Nevertheless, the open-ended nature of
realistic tasks and the diversity of approaches still made judging the
submitted entries a challenge.
We felt that the
time to generate a reasonably clean data set was too large, around 1000
man-hours each year. This is a serious issue for the development of
benchmarks. Domain experts should be
solicited for cleanup and experimentation on various task solutions should be
attempted before the data is released.
We hope that industry groups or government agencies wishing to see more research conducted on specific data of interest to them will take on the burden of developing benchmark datasets or will support groups to do so.
Participating
in the contest takes time and motivation.
Most participants reported working very hard to prepare their
submission. Many acknowledged that it
pushed them to improve and test their tools.
Some students were encouraged to work on the contest as a class
project. Some wanted to test their PhD
research. A small company reported
appreciating the exposure.
In
the 3 contests we gave small prizes to the first and second place teams. Sponsors provided various prizes. Those were appreciated, especially by the
students who liked the gaming stations.
We also presented winners with certificates that many told us they were
happy to hang in their office.
Participants appreciated being able to mention the award and the small publication on their resumes. On the other hand, some tenure-track faculty reported being interested but preferring to focus on writing full papers.
We
realized that we should in the future anticipate the data set and plan
earlier. Given that we ran into errors
and noise in almost all the data sets, having more time will help clean the
data and prepare better tasks. The 2006
contest selected the census data and made the announcement at the 2005
conference thereby providing potential participants a great deal of time,
almost a year, to work on the problem.
Pushing the deadline further into the late summer would allow summer interns to work on the contest, but would reduce the reviewing period dramatically.
Many
people downloaded the dataset without submitting results, and we collected names and emails. The chairs performed an informal survey of those who had downloaded the 2005 contest data to see why there were so few participants. The
participants stated that there were no problems with the data set or questions,
that the data set was a great data set to show system and tool capabilities,
and that all had enjoyed the process and would do it again. Most expressed that
they wanted a better organized website, automated email on data or news
updates, and would have preferred the data in a database format. Some expressed
strong interest in splitting entries into commercial and academic categories.
Four of the non-participants stated that the requirement to attend the
conference hindered their participation and most expressed that they were too
busy in their company to tackle such a project. Several expressed a desire for
some mini-questions such as “find a more elegant way to look at …”.
There was one recurring theme which all participants
and non-participants expressed and that was the need for more time. That was
the reason the 2006 contest data was made available at the 2005 conference.
For all three contests we were
able to have a whole session at the conference to summarize the results and
allow some authors to present. In 2003
only the first place authors presented and we summarized the second place
submissions. Attendees commented that it
would be better to have shorter presentations but allow more presenters to
speak. The following year we arranged for all first and second place winners to present, with the second place ones having only 2 minutes. This format was very
well received. We specified tasks
presenters should focus on so the attendees could better compare the different
entries, at the cost of not seeing every feature of the tools. We found that handing out the awards rapidly
and keeping photos to a minimum (a group picture at the end of the session) was
preferable. This left more time for the
presentations and still gave a festive atmosphere to the event. All winners were also given a chance to have
a poster displayed during the normal poster session.
A
contest is only a first step. The
revised materials provided by the authors and the datasets have to be available
after the event. We have strived to keep the contest pages active, and we have also made the submissions available in the InfoVis Benchmark Repository [Bmr06] hosted at the University of Maryland.
Contests
represent an artificial testing situation where the opinion of judges reflects
the quality of the submitted materials, as opposed to the actual merits
exhibited when tools are tested interactively and discussed with
designers. The impact of contests is
most obvious with those that participate and those that see the results but the
datasets and tasks remain available after the contests thereby extending their
impact. They can be used by developers to exercise their tools and identify
missing features, and by evaluators to enrich their testing procedures with
complex tasks. These developers and evaluators then have baseline results with
which to compare their results. We hope that these data can also be used in controlled experiments, and that the more specific lists of tasks used for those experiments can be added to the repository for reuse.
Benchmarks
are difficult to create, promote and use. Our belief is that we are developing
solid and evolving benchmarks and are beginning to understand how to better
evaluate submissions. Good benchmarks
must be real (witness the success of TREC and CAMDA) to both draw the audiences
and participants and to strongly push the technology curve. Good benchmark tasks must be open-ended to provide for flexibility in solutions. We know that this makes the evaluations more difficult to measure analytically, but this is realistic. We need to accept that more human evaluation will be required in the future and to build up a pool of volunteer judges. These contests continue
to demonstrate the challenges of benchmark design and especially of system and
tools evaluation.
By
making the results of analyses available to the community we provide a
repository of baselines for developers to compare to. Teams did interpret our tasks in many different ways, making comparison difficult; nevertheless, we feel strongly that it was extremely useful to compare tools using the same data set and tasks.
The
integration of analysis is becoming more necessary as data sets are more
complex, large, and coming from diverse sources. The identification of
anomalous patterns of data from phone calls, from bank transactions, and from
news articles requires new techniques and strong analytical tools. We believe
that such data sets and competitions will continue to encourage the community
to work on difficult problems while building a baseline of comparable tasks and
datasets.
We
thank the organizers of the IEEE Visualization InfoVis Symposium, in particular
John Dill, Tamara Munzner and Stephen Spencer, for their continual support,
members of the InfoVis community for their intellectual stimulations, and the
participants without whom there would be no contest. We would also like to
thank the students for their help in extracting the metadata, cleansing the
data set, and producing a richly usable data set: Caroline Appert (Université
Paris-Sud, France) and Urska Cvek, Alexander Gee, Howie Goodell, Vivek Gupta,
Christine Lawrence, Hongli Li, Mary Beth Smrtic, Min Yu and Jianping Zhou
(University of Massachusetts at Lowell).
We
thank the sponsors who provided first prizes, The Hive Group, ILOG and Stephen
North personally. After the first
release of the datasets many others offered their help, including Jeff Klingner
from Stanford, Kevin Stamper, Tzu-Wei Hsu, Dave McColgin, Chris Plaue, Jason
Day, Bob Amar, Justin Godfrey and Lee Inman Farabaugh, from Georgia Tech,
Niklas Elmqvist from Chalmers, Sweden, Jung-Rung Han, Chia-Ning Chiang and
Tamara Munzner from UBC, and Maylis Delest from the Université de Bordeaux.
We
thank ACM and IEEE and in particular Mark Mandelbaum and Bernard Rous for
helping make the 2004 data available and working with us to prepare the
dataset, Shabnam Tafreshi for help with the website, and finally but not least Paolo Buono from the University of
Bari, Italy, for participating in the review process. We thank Michael Best for working with us on
releasing the technology company data for the 2005 contest.
We also thank Sharon Laskowski for working with Catherine Plaisant on the evaluation section of the NVAC research agenda [Tho05], which helped refine some of the sections of this paper and led to Figure 2.
References
[Ahm04] Adel Ahmed, Tim Dwyer, Colin Murray, Le Song,
Ying Xin Wu, WilmaScope, Poster
Compendium of IEEE Information Visualization (2004)
[Alo98] Alonso,
D., Rose, A., Plaisant, C., and Norman, K., Viewing personal history records: A
Comparison of tabular format and graphical presentation using LifeLines, Behavior and Information Technology 17,
5, 1998, 249-262.
[Aub03] Auber, D., Delest, M., Domenger, J-P.,
Ferraro, P., Strandh, R., EVAT - Environment for Visualization and Analysis of
Trees, in Poster Compendium of IEEE
Information Visualization (2003)
[Bel06] BELIV’06, BEyond time and errors: novel
evaluation methods for Information Visualization, a workshop of the AVI 2006
International Working Conference.
http://www.dis.uniroma1.it/~beliv06/
[Bmr06] Information Visualization Benchmark Repository
www.cs.umd.edu/hcil/InfovisRepository
[Bor06] InfoVis
CyberInfrastructure — http://iv.slis.indiana.edu
[CAM06] Critical Assessment of Microarray Data
Analysis (CAMDA) conference, http://www.camda.duke.edu/camda06
[CES06] Council of European Social Science Data
Archives (CESSDA) – http://www.nsd.uib.no/cessda
[Che00] Chen, C., Czerwinski, M. (Eds.) Introduction
to the Special Issue on Empirical evaluation of information visualizations, International Journal of Human-Computer
Studies, 53, 5, (2000), 631-635.
[Che04] Chen, C., Citation and Co-Citation
Perspective, Poster Compendium of IEEE
Information Visualization (2004)
[Che05] Jin Chen, Diansheng Guo, Alan M. MacEachren,
Space-Time-Attribute Analysis and Visualization of US Company Data, Poster Compendium of IEEE Information
Visualization (2005)
[Chi93] Chinchor, N., Hirschman, L., Evaluating
message understanding systems: an analysis of the third message understanding
conference (MUC-3), Computational
Linguistics 19, 3 (1993) 409 - 449
[Cow05] Cowley,
P., Nowell, L., Scholtz, J., Glassbox: an instrumented infrastructure for
supporting human-interaction with information, Proceedings of the Proceedings of the 38th Annual Hawaii International
Conference on System Sciences (HICSS'05) , pp. 296.3
(2005)
[Del04] Maylis Delest, Tamara Munzner, David Auber, Jean-Philippe Domenger, Tulip, Poster Compendium of IEEE Information Visualization (2004)
[Fek03] Fekete, J-D and Plaisant, C., InfoVis 2003 Contest, www.cs.umd.edu/hcil/iv03contest (2003)
[Fek06] Fekete, J.-D.,
Infovis Toolkit,
http://ivtk.sourceforge.net/
[Geh04] Gehre, J., Ginsparg, P., Kleinburg, J.,
Overview of the 2003 KDD cup, SIGKDD
Explorations, 5,2 (2004) 149-151.
[Gon03] Gonzales, V., Kobsa, A., Benefits of
information visualization for administrative data analysts, Proceedings of the Seventh International
Conference on Information Visualization,
[Gri02] Grinstein, G., Hoffman, P., Pickett, R.,
Laskowski, S., Benchmark Development for the Evaluation of Visualization for
Data Mining, in Fayyad, U., Grinstein, G., Wierse, A. (Eds.) Information Visualization in Data Mining and
Knowledge Discovery, Morgan Kaufmann, San Francisco (2002) 129-176.
[Ham04] Frank van Ham, Technische Universiteit
Eindhoven Contest Submission, Poster
Compendium of IEEE Information Visualization (2004)
[Hof05] Heike Hofmann, Hadley Wickham, Dianne Cook,
Junjie Sun, Christian Röttger, Boom and Bust of Technology Companies at the
Turn of the 21st Century, Poster
Compendium of IEEE Information Visualization (2005)
[Hon03] Hong, J. Y., D'Andries, J., Richman, M.,
Westfall, M., Zoomology: Comparing Two Large Hierarchical Trees, in Poster Compendium of IEEE Information
Visualization (2003)
[Hos05] Bettina Hoser, Michael Blume, Jan Schröder,
and Markus Franke, CPS- Company Positioning System: Visualizing the Economic
Environment, Poster Compendium of IEEE
Information Visualization (2005)
[Hsu04] Hsu Tzu-Wei,
Lee Inman Farabaugh, Dave McColgin, Kevin Stamper, MonkEllipse, Poster Compendium of IEEE Information
Visualization (2004)
[Inf04] Fekete, J.-D., Grinstein, G. and Plaisant, C.,
InfoVis 2004 Contest, www.cs.umd.edu/hcil/iv04contest
[Inf05] Grinstein, G., U. Cvek, M. Derthick, M.
Trutschl, IEEE InfoVis 2005 Contest, Technology Data in the
[Inf06] InfoVis 2006 Contest http://sun.cs.lsus.edu/iv06/
[Ira03] Irani, P. , Ware, C., Diagramming information
structures using 3D perceptual primitives, ACM
Transactions on Computer-Human Interaction, 10, 1 (2003), 1-19
[Kei04] Keim, D., Christian
Panse, Mike Sips, Joern Schneidewind, Helmut Barro,
[Kei95] Keim, D., Bergeron, R. D., Pickett, R., Test
datasets for evaluating data visualization techniques. In Grinstein, G., Levkowitz, H. , Perceptual Issues in Visualization,
Springer,
[Kom04] Komlodi, A., Sears, A., Stanziola, E.,
Information Visualization Evaluation Review, ISRC
Tech. Report, Dept. of Information Systems, UMBC. UMBC-ISRC-2004-1
http://www.research.umbc.edu/~komlodi/IV_eval (2004).
[Lee04] Lee Bongshin, Mary Czerwinski, George Robertson, Benjamin B. Bederson, PaperLens, Poster Compendium of IEEE Information
Visualization (2004)
[Lin04] Lin Xia,
Jan Buzydlowski, Howard D. White, Associative Information Visualizer, Poster Compendium of IEEE Information
Visualization (2004)
[Kob04] Kobsa, A., User experiments with tree
visualization systems, Proc. of IEEE
Symposium on Information Visualization (2004) 9-16
[Mac86] Mackinlay, J., Automating the design of
graphical presentations of relational information, ACM Trans. on Graphics, 5, 2 (1986) 110-141
[MLR06]
[Mor03] Morse,
D. R., Ytow, N., Roberts, D. McL., Sato, A., Comparison of Multiple Taxonomic
Hierarchies Using TaxoNote, in Poster
Compendium of IEEE Information Visualization (2003)
[Mul97] Mullet, K., Fry, C., Schiano, D., On your
marks, get set, browse! (the great CHI'97 Browse Off), Panel description in ACM CHI'97 extended abstracts, ACM,
[Mun03] Munzner, T., Guimbretière, F., Tasiran, S.,
Zhang, L. and Zhou, Y., TreeJuxtaposer: Scalable tree comparison using
Focus+Context with guaranteed visibility. ACM
Transactions on Graphics, SIGGRAPH 03 (2003) 453-462
[Nai01] North American Industry Classification System, www.census.gov/epcd/www/naics.html
[NYT06] New York Times – Election 2004
http://www.nytimes.com/packages/html/politics/2004_ELECTIONGUIDE_GRAPHIC/?oref=login (retrieved June 2005)
[Pal00] Pallett, D., Garofolo, J., Fiscus, J.,
Measurement in support of research accomplishments, Communications of the ACM, 43, 2 (2000) 75-79
[Pla02] Plaisant, C., Grosjean, J., and Bederson, B.
B., SpaceTree: Supporting exploration in large node-link tree: design evolution
and empirical evaluation, IEEE Symposium
on Information Visualization (2002), 57-64.
[Pla04] Plaisant, C. The Challenge of Information
Visualization Evaluation, in Proceedings
of the working conference on Advanced Visual Interfaces (AVI 2004), pp.
109—116,
[Sar04] Saraiya, P., North, C., Duca, K., An
evaluation of microarray visualization tools for biological insight, Proc. of IEEE Symposium on Information
Visualization (2004) 1-8
[Sch02]
J. Scholtz, L. Arnstein, M. Kim, T. Kindberg, and S. Consolvo, User-Centered Evaluations of Ubicomp Applications, Intel Corporation IRS-TR-02-006, May
2002.
[Sch05] Scholtz, J., Steves, M.P., A Framework for
Evaluating Collaborative Systems in the Real World, to appear in Proc. Hawaii International Conference on
System Sciences, 2005
[She03] Sheth, N., Börner, K., Baumgartner, J., Mane,
K., Wernert, E., Treemap, Radial Tree, and 3D Tree Visualizations, in Poster Compendium of IEEE Information
Visualization (2003)
[Shn06] Strategies for Evaluating Information
Visualization Tools: Multidimensional In-depth Long-term Case Studies,
Shneiderman, B., Plaisant, C., Proc.
of BELIV’06, BEyond time and errors:
novel evaLuation methods for Information Visualization, a workshop of the AVI
2006 International Working Conference, ACM (2006) 38-43
[SMo06] Smart Money Map of the Market www.smartmoney.com (retrieved June 2005)
[Spen00] Spenke, M., Beilken, C., InfoZoom - Analysing
Formula One racing results with an interactive data mining and visualization
tool, in Ebecken, N. Data mining II, (2000), 455–464
[Sta00] Stasko, J. Catrambone, R., Guzdial, M. and
McDonald, K., An Evaluation of Space-Filling Information Visualizations for
Depicting Hierarchical Structures, International
Journal of Human-Computer Studies, 53, 5 (2000) 663-694.
[Teo04] Soon Tee Teoh, Kwan-Liu Ma, One-For-All -
University of
[Tra00] Trafton, J., Tsui, T., Miyamoto, R.; Ballas,
J., Raymond, P., Turning pictures into numbers: extracting and generating
information from complex visualizations. International
Journal of Human Computer Studies,
53, 5 (2000), 827-850.
[Tho05] Thomas,
J. and Cook, K. (Eds.) Illuminating the
Path: The Research and Development Agenda for Visual Analytics, IEEE CS
Press (2005), http://nvac.pnl.gov/agenda.stm
[TRE06] Text
REtrieval Conference (TREC), http://trec.nist.gov/
[Tym04] Jaroslav Tyman, Grant P. Gruetzmacher, John
Stasko, InfoVisExplorer, Poster Compendium of IEEE Information Visualization
(2004)
[VAS06] Grinstein, G., O’Connell, T., Laskowski, S.,
Plaisant, C., Scholtz, J., Whiting, M., VAST 2006 Contest: A tale of Alderwood,
Proc. of
IEEE Visual Analytics Science and Technology conference (2006) to
appear.
[VAS06b] VAST 2006 Contest:
www.cs.umd.edu/hcil/VASTcontest06
[Voo00] Voorhees, E., Harman, D., Overview of the
sixth Text Retrieval Conference (TREC-6), Information
Processing and Management, 36 (2000) 3-35
[Wei04] Weimao Ke, Katy Borner,
[Won04] Wong Pak Chung, Beth Hetzler, Christian Posse, Mark Whiting, Sue Havre, Nick Cramer,
Anuj Shah, Mudita Singhal, Alan Turner, Jim Thomas, IN-SPIRE, Poster Compendium of IEEE Information
Visualization (2004)
[Zei05] Zeis Annerose, Sergej Potapov, Martin Theus,