InfoVis 2004 Contest
Data and Tasks
Data Format Description and Examples

Contents (last updated February 13)

Description
Identifiers
Examples
How You Can Help

Description

The dataset of the InfoVis 2004 Contest contains the metadata of important articles and books about information visualization, collected from several sources.

It is one large XML file containing a list of article or book descriptions. The metadata of each article or book is described using the following elements:

<article>

The main entry describing an article. It also defines the unique identifier of the article using the attribute "id". This identifier is described in the section Identifiers. The article element contains the following elements, in this order: "title", "source", "pages", "url", "abstract", "keywords", "authors", "date" and "references". They are described below.

<title>

The article or book title. If there is a subtitle, it is separated from the main title with a colon (:) character.

<source>

The description of the book or article, such as the name of the proceedings or the ISBN of the book. When the article belongs to a conference series referenced by the ACM Digital Library, the attribute "ref" contains the identifier of the series.

<pages>

Empty element with one or two attributes. The attribute "from" contains the first page of the article for a conference or journal article. It contains the number of pages for a book. The attribute "to", when it exists, contains the last page number for a conference or journal article.

<url>

Contains a URL, which may or may not still be valid (optional).

<abstract>

Contains the abstract of the article or book, as a sequence of <par> elements (optional).

<keywords>

A comma-separated list of keywords (optional).

<authors>

A list of author names in <author_ref> elements. Each of these elements contains the name as specified on the first page of the article and, when provided by the ACM Digital Library, a unique identifier in the attribute "ref".

<date>

The publication date of the article in the attribute "from". When an attribute "to" is defined, it refers to the last date of the conference where the article was published.

<references>

A list of articles or books cited in the article, each in a <ref> element. The element contains the reference as it has been captured from the article (it can contain typos). When the referenced article or book has been found in the ACM Digital Library or in the dataset, the element defines the "ref" attribute that uniquely identifies the article.
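
The element descriptions above can be turned into a small reader. The following is a minimal sketch in Python using the standard xml.etree.ElementTree module, not an official contest tool. It parses a single <article> element (the book entry from the Examples section) because the name of the root element of the full file is not given here. The function name parse_article and the dictionary layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The book entry from the Examples section, copied verbatim.
SAMPLE = """<article id="acm33404">
<title>The visual display of quantitative information</title>
<source> ISBN: 0-9613921-0-</source>
<pages from="197"/>
<authors>
<author_ref ref="P75580">Edward R. Tufte</author_ref>
</authors>
<date from="1986"/>
<references>
</references>
</article>"""


def parse_article(elem):
    """Turn one <article> element into a plain dictionary (hypothetical layout)."""

    def text_of(tag):
        # Optional elements (url, abstract, keywords) may be absent.
        child = elem.find(tag)
        if child is None or child.text is None:
            return None
        return child.text.strip()

    def attrs_of(tag):
        child = elem.find(tag)
        return dict(child.attrib) if child is not None else {}

    keywords = text_of("keywords")
    return {
        "id": elem.get("id"),
        "title": text_of("title"),
        "source": text_of("source"),
        "source_ref": attrs_of("source").get("ref"),
        "pages": attrs_of("pages"),
        "url": text_of("url"),
        "abstract": [p.text or "" for p in elem.findall("abstract/par")],
        "keywords": [k.strip() for k in keywords.split(",")] if keywords else [],
        "authors": [(a.get("ref"), (a.text or "").strip())
                    for a in elem.findall("authors/author_ref")],
        "date": attrs_of("date"),
        "references": [(r.get("ref"), (r.text or "").strip())
                       for r in elem.findall("references/ref")],
    }


article = parse_article(ET.fromstring(SAMPLE))
print(article["title"], article["authors"])
```

Authors and references are kept as (identifier, text) pairs, so an entry without a "ref" attribute simply has None as its identifier.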

Identifiers

Identifiers are used in the dataset to unambiguously reference articles, conferences or author names. They follow the rules of XML identifiers and some simple naming conventions:

The ACM Digital Library can be queried from an identifier. For example, if the identifier is "acm721078", then the following query will retrieve its entry in the DL: http://portal.acm.org/citation.cfm?id=721078
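
As a small illustration of this rule, the sketch below builds the query URL from an identifier. The function name acm_dl_url is hypothetical. The sketch assumes that only identifiers made of the prefix "acm" followed by digits map to the ACM DL; other identifiers seen in the examples (such as "id6160" or "P74503") are returned as None because their lookup rule is not described here.

```python
def acm_dl_url(identifier):
    """Return the ACM DL query URL for an "acm<digits>" identifier, else None."""
    prefix = "acm"
    number = identifier[len(prefix):]
    if identifier.startswith(prefix) and number.isdigit():
        return "http://portal.acm.org/citation.cfm?id=" + number
    # Assumption: anything else is not an ACM DL article identifier.
    return None


print(acm_dl_url("acm721078"))
```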

Examples

Here is a complete metadata entry for a paper of the InfoVis series:

 
<article id="acm721078"><title>An Operator Interaction Framework for Visualization Systems</title>
<source ref="acm647341">Proceedings of the 1998 IEEE Symposium Information Visualization</source>
<pages from="63" to="70"/>
<abstract><par>Information visualization encounters a wide variety of different data domains. The visualization community has developed representation methods and interactive techniques. As a community, we have realized that the requirements in each domain are often dramatically different. In order to easily apply existing methods, researchers have developed a semiology of graphic representations. We have extended this research into a framework that includes operators and interactions in visualization systems, such as a visualization spreadsheet. We discuss properties of this framework and use it to characterize operations spanning a variety of different visualization techniques. The framework developed in this paper enables a new way of exploring and evaluating the design space of visualization operators, and helps end--users in their analysis tasks.</par>
</abstract>
<authors>
<author_ref ref="P74503">Ed Huai-hsin Chi</author_ref>
<author_ref ref="P145715">John Riedl</author_ref>
</authors>
<date from="10-19-1998" to="10-20-1998"/>
<references><ref>Graph Visualizer 3D. http://www.omg.unb.ca/hci/projects/gv3d/, March 1998.</ref>
<ref ref="acm191775">C. Ahlberg and B. Shneiderman. "Visual information seeking: Tight coupling of dynamic query filters with starfield displays". In Proceedings of ACM CHI'94 Conference on Human Factors in Computing Systems, volume 1 of Information Visualization, pages 313-317, 1994. Color plates on pages 479-480.</ref>
<ref ref="acm37086">R. A. Becker and W. S. Cleveland. "Brushing scatterplots". Technometrics, 29(2):127-142, 1987.</ref>
<ref ref="acm614292">R. A. Becker, S. G. Eick, and A. R. Wilks. "Visualizing network data". IEEE Transaction on Visualization and Computer Graphics, 1(1):16-28, 1995.</ref>
<ref ref="id6160">J. Bertin. "Semiology of Graphics: Diagrams, Networks, Maps". University of Wisconsin Press, Madison, WI, 1967/1983.</ref>
<ref ref="acm857632">S. K. Card and J. Mackinlay. "The structure of the information visualization design space". In Processings of Information Visualization Symposium (InfoVis'97), pages 92-99. IEEE, IEEE CS Press, 1997.</ref>
<ref ref="acm238446">S. K. Card, G. G. Robertson, and W. York. "The webbook and the web forager: An information workspace for the world-wide web". In Proceedings of ACM CHI'96 Conference on Human Factors in Computing Systems, pages 111-117. ACM, ACM Press, 1996.</ref>
<ref ref="id6974">D. A. Norman. "The Design of Everyday Things". Doubleday, 1988.</ref>
.../... (I cut some of the references for compactness)
<ref ref="acm617498">C. Upson, T. Faulhaber, Jr., D. Kamins, D. Laidlaw, D. Schlegel, J. Vroom, R. Gurwitz, and A. van Dam. "The application visualization system: A computational environment for scientific visualization". IEEE Computer Graphics and Applications, pages 30-42, July 1989.</ref>
<ref ref="acm857601">A. Varshney and A. Kaufman. "FINESSE: A financial information spreadsheet". In IEEE Information Visualization Symposium, pages 70-71, 125, 1996.</ref>
<ref>Advanced Visualization System home page. http://www.avs.com, Feb. 1997.</ref>
<ref>IBM Visualization Data Explorer (DX). http://www.almaden.ibm.com/dx/, Feb. 1997. (current as of date).</ref>
<ref>IRIS Explorer home page. http://www.nag.co.uk:80/Welcome IEC.html, Feb. 1997.</ref>
<ref ref="acm857579">J. A. Wise, J. J. Thomas, K. Pennock, D. Lantrip, M. Pottier, A. Schur, and V. Crow. "Visualizing the non-visual: Spatial analysis and interaction with information from text documents". In Proc. Information Visualization Symposium (InfoVis '95), pages 51-58. IEEE, IEEE CS, 1995.</ref>
</references>
</article>

A simple book entry:

 
<article id="acm33404">
<title>The visual display of quantitative information</title>
<source> ISBN: 0-9613921-0-</source>
<pages from="197"/>
<authors>
<author_ref ref="P75580">Edward R. Tufte</author_ref>
</authors>
<date from="1986"/>
<references>
</references>
</article>
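
The two entries above show that "date" and "pages" need a little interpretation: dates appear either as month-day-year ("10-19-1998") or as a bare year ("1986"), and the "from" attribute of "pages" is a page count for a book but a first page for an article. The following Python sketch handles only these cases. The function names are hypothetical, and since the format has no explicit book flag, the caller is assumed to know whether an entry is a book.

```python
from datetime import datetime


def parse_contest_date(value):
    """Parse "MM-DD-YYYY" or a bare "YYYY" into (year, month, day).

    Month and day are None when only the year is given.
    Other date formats, if the dataset contains any, raise ValueError.
    """
    value = value.strip()
    if value.isdigit():
        return (int(value), None, None)
    parsed = datetime.strptime(value, "%m-%d-%Y")
    return (parsed.year, parsed.month, parsed.day)


def page_count(pages_attrs, is_book):
    """Number of pages from the attributes of a <pages> element."""
    first = int(pages_attrs["from"])
    if is_book:
        return first  # for a book, "from" already holds the number of pages
    if "to" not in pages_attrs:
        return None  # only the first page is known
    return int(pages_attrs["to"]) - first + 1


print(parse_contest_date("10-19-1998"), page_count({"from": "63", "to": "70"}, False))
```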

An article missing the abstract, and with references that do not have ref IDs (therefore no metadata beyond the reference string):

 
<article id="acm97933">
<title>Worlds within worlds metaphors for exploring n-dimensional virtual worlds</title>
<source ref="acm97924">Symposium on User Interface Software and Technology: Proceedings of the 3rd annual ACM SIGGRAPH symposium on User interface software and technology</source>
<pages from="76" to="83"/>
<authors>
<author_ref ref="P253376">S. K. Feiner</author_ref>
<author_ref ref="P49409">Clifford Beshers</author_ref>
</authors>
<date from="1990"/>
<references>
<ref>BANC78   Banchoff, T. "Computer Animation and the Geometry of Surfaces in 3- and 4-Space." Proc. Int. Cong. of Math, 1978, 1005-1013. </ref>
<ref ref="acm102314"> C. M. Beshers , S. K. Feiner, Real-time 4D animation on a 3D graphics workstation, Proceedings on Graphics interface '88, p.1-7, December 1989, Edmonton, Alberta, Canada  </ref>
<ref ref="acm73670"> C. M. Beshers , S. Feiner, Scope: automated generation of graphical interfaces, Proceedings of the 2nd annual ACM SIGGRAPH symposium on User interface software and technology, p.76-85, November 13-15, 1989, Williamsburg, Virginia, United States  </ref>
<ref ref="acm807391"> William C. Donelson, Spatial management of information, Proceedings of the 5th annual conference on Computer graphics and interactive techniques, p.203-209, August 23-25, 1978  </ref>
<ref>FEIN82   Feiner, S., D. Salesin, and T. Banchoff. "DIAL: A diagrammatic animation language." IEEE Computer Graphics and Applications, 2:7, September 1982, 43-54. </ref>
<ref>HULL89   Hull, J. Options, Futures, and Other Derivative Securities, Prentice-Hall, NJ, 1989. </ref>
<ref>KILP76   Kilpatrick, P.J. The Use of a Kinesthetic Supplement in an Interactive Graphics System, Ph.D. Thesis, Univ. of North Carolina, Chapel Hi!l, 1976. </ref>
<ref ref="acm363544"> A. Michael Noll, A computer technique for displaying n
-dimensional hyperobjects, Communications of the ACM, v.10 n.8, p.469-473, Aug. 1967  </ref>
<ref>OUHY89   Ouh-young, M., D. Beard and F. Brooks, Jr. "Force Display Performs Better than Visual Display in a Simple 6-D Docking Task." Proc. IEEE Robotics and Automation Conf., May 1989, 1462-6. </ref>
.../...
<ref ref="acm37421"> William C. Thibault , Bruce F. Naylor, Set operations on polyhedra using binary space partitioning trees, ACM SIGGRAPH Computer Graphics, v.21 n.4, p.153-162, July 1987  </ref>
<ref ref="acm275628"> Thomas G. Zimmerman , Jaron Lanier , Chuck Blanchard , Steve Bryson , Young Harvill, A hand gesture interface device, Proceedings of the SIGCHI/GI conference on Human factors in computing systems and graphics interface, p.189-192, April 05-09, 1987, Toronto, Ontario, Canada  </ref>
</references>
</article>
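
Because only some <ref> elements carry a "ref" attribute, a first processing step is often to separate resolved references (usable as citation links) from unresolved ones (reference strings only). The sketch below does this in Python for a shortened copy of the entry above that keeps three of its references. The function name split_references is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the entry above: only three of its references are kept.
SAMPLE = """<article id="acm97933">
<references>
<ref>HULL89   Hull, J. Options, Futures, and Other Derivative Securities, Prentice-Hall, NJ, 1989. </ref>
<ref ref="acm102314"> C. M. Beshers , S. K. Feiner, Real-time 4D animation on a 3D graphics workstation, Proceedings on Graphics interface '88, p.1-7, December 1989, Edmonton, Alberta, Canada  </ref>
<ref ref="acm37421"> William C. Thibault , Bruce F. Naylor, Set operations on polyhedra using binary space partitioning trees, ACM SIGGRAPH Computer Graphics, v.21 n.4, p.153-162, July 1987  </ref>
</references>
</article>"""


def split_references(article):
    """Return (citation_edges, unresolved_strings) for one <article> element."""
    edges, unresolved = [], []
    for ref in article.findall("references/ref"):
        target = ref.get("ref")
        if target is None:
            # No identifier: all we have is the captured reference string.
            unresolved.append((ref.text or "").strip())
        else:
            edges.append((article.get("id"), target))
    return edges, unresolved


edges, unresolved = split_references(ET.fromstring(SAMPLE))
print(edges)
print(unresolved)
```

Collecting the edges over all articles gives a citation graph in which each pair reads "first identifier cites second identifier".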

How the Dataset was Built  (this may answer many of your questions...)

We chose to represent the field of Information Visualization through the articles published in the InfoVis Symposium series and all their references. The assumption was that all the important papers on Information Visualization should be referenced from articles in the InfoVis series.

The dataset has been built in several phases: collecting the InfoVis articles, extracting their references, finding (or not) the referenced articles in the ACM Digital Library, collecting the metadata of the referenced articles from the ACM Digital Library, and putting everything together. The process is VERY tedious and not quite finished yet.  We need your help to complete the task and clean the dataset.

Collecting the InfoVis Articles
We gathered all the available articles from the proceedings of the entire InfoVis series from 1995 to 2002 (we hope to add 2003 as it becomes available). They belong to and are stored in the IEEE CS Digital Library (DL), but ACM also references IEEE articles by collecting the IEEE CS DL metadata, so it was easier to work solely with the ACM DL.
Unfortunately, the IEEE metadata does not contain the reference lists, so we had to extract them by hand (see the next section).  On the other hand, each article stored in the ACM DL is described by metadata that contains the list of references taken from the reference section of the article. Each reference is linked to its entry in the ACM DL when the entry can be found. Our dataset has been built from this metadata.

Extracting the references from the InfoVis articles
There are two well-known sources of public citation information: CiteSeer (ResearchIndex) (http://citeseer.nj.nec.com/) and ParaCite (http://paracite.eprints.org/). They automatically extract references from PDF files through complex heuristics implemented as Perl scripts. These scripts are not reliable and don't work on PDF files containing bitmaps, such as the proceedings of InfoVis from 1995 to 1997. Therefore, we had to extract the references by hand.
Ten students helped collect and check the references: Caroline Appert, Urska Cvek, Alexander G. Gee, Howie Goodell, Vivek Gupta, Christine Lawrence, Hongli Li, Mary Beth Smrtic, Min Yu and Jianping Zhou. It took about 3 weeks to collect 9 years of references.  First we had to cut and paste each reference line from Acrobat Reader.  Acrobat has some OCR capabilities that produce an acceptable result on references. However, the results had to be carefully cleaned by hand. We also manually added quotes around the article title in each reference, as there is simply no reliable way to automatically extract titles from a reference string.
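
Since the titles were marked with quotes by hand, they can be recovered from a reference string with a simple pattern. The Python sketch below assumes plain double quotes (") as in the first example entry; the function name quoted_title is hypothetical. References without a quoted title, such as web pages or the references taken directly from the ACM DL, yield None.

```python
import re


def quoted_title(reference):
    """Return the first double-quoted span of a reference string, or None."""
    match = re.search(r'"([^"]+)"', reference)
    if match is None:
        return None
    # Some references keep the final period inside the quotes; drop it.
    return match.group(1).strip().rstrip(".")


print(quoted_title('R. A. Becker and W. S. Cleveland. "Brushing scatterplots". '
                   'Technometrics, 29(2):127-142, 1987.'))
```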

Collecting the metadata of references – when available
Once collected, the references had to be linked to ACM DL articles when possible and their metadata collected. To do that, we tried several methods, none of which worked perfectly.
Using the title of the reference, which had been marked by hand, we could automatically query the ACM Digital Library and parse the returned results. This approach didn't work as well as we expected. The search engine of the ACM DL is not very resistant to noise in the title. The presence of subtitles also creates problems. We might try using Google searches in the future.  We also had some problems querying the ACM DL, which is very busy, so some articles are missing; they should be included eventually.
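
One possible way to tolerate noisy titles and missing subtitles is approximate string matching against titles that are already known. The sketch below is only an illustration of that idea, not the method used to build the dataset. It uses difflib.SequenceMatcher from the Python standard library, compares both the full titles and the part before the first colon, and accepts a match above a threshold. The function names and the threshold of 0.85 are arbitrary assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def main_title(title):
    """Part of the title before the first colon (the subtitle separator)."""
    return title.split(":", 1)[0]


def similarity(a, b):
    """Best similarity of the full titles or of the main titles alone."""
    full = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    main = SequenceMatcher(None, normalize(main_title(a)),
                           normalize(main_title(b))).ratio()
    return max(full, main)


def best_match(query, candidates, threshold=0.85):
    """Return the most similar candidate title, or None if none is close enough."""
    if not candidates:
        return None
    score, title = max((similarity(query, c), c) for c in candidates)
    return title if score >= threshold else None


KNOWN_TITLES = [
    "Visual information seeking: Tight coupling of dynamic query filters with starfield displays",
    "Brushing scatterplots",
    "Visualizing network data",
]
# A query with a typo and without its subtitle still finds the first title.
print(best_match("Visual infomation seeking", KNOWN_TITLES))
```

Matching on the main title alone is deliberately permissive, so two different articles sharing a main title would be confused; any match found this way still needs a manual check.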

How You Can Help

We now have to clean up the linking process by hand, trying to resolve and check the references using Google and the ACM Digital Library. This process will continue into March.

If you know how to help in that process, please let us know and spare us some precious time. We need a reliable way to find ACM articles in the Digital Library from the references in the InfoVis articles. There are certainly many ways to do it, but we are not specialists in parsing and information retrieval.

We also need to clean up the dataset by checking that the resolved references are right. For example, some articles have also been published as technical reports or PhD theses. Since we only search for articles by title, a reference may point to an ACM article when it actually describes the technical report or PhD thesis. There must also be errors with articles sharing the same title.

Finally, we rely on the ACM Digital Library to resolve the article authors. Each author is assigned a unique identifier, but it turns out this assignment is not as reliable as we hoped. We would be glad to have authors correctly unified in the dataset.
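
As a starting point for unifying authors, one could list names that look alike but carry different identifiers. The Python sketch below groups (identifier, name) pairs by last name and first initial and reports the groups with more than one identifier. This grouping key is a crude assumption that will produce false positives, and the function names are hypothetical. In the usage example, "P253376" and "P49409" come from the entries above, while the pair ("P999999", "Steven Feiner") is made up purely to show a conflict.

```python
from collections import defaultdict


def name_key(name):
    """(last name, first initial) in lowercase, or None for an empty name."""
    parts = name.replace(".", " ").split()
    if not parts:
        return None
    return (parts[-1].lower(), parts[0][0].lower())


def possible_duplicates(authors):
    """Map each name key to its (identifier, name) pairs when identifiers differ."""
    groups = defaultdict(set)
    for ref, name in authors:
        key = name_key(name)
        if ref is None or key is None:
            continue  # authors without an identifier are ignored in this sketch
        groups[key].add((ref, name))
    return {key: sorted(found) for key, found in groups.items()
            if len({ref for ref, _ in found}) > 1}


authors = [
    ("P253376", "S. K. Feiner"),
    ("P49409", "Clifford Beshers"),
    ("P999999", "Steven Feiner"),  # hypothetical identifier, for illustration only
]
print(possible_duplicates(authors))
```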

Please send us email with corrections.

Acknowledgment

We would like to thank ACM and IEEE for their help and permission.
