The dataset of the Infovis 2004 Contest contains the metadata of important articles and books about information visualization collected from several sources.
It is one large XML file containing a list of article or book descriptions. The metadata of each article or book is described using the following elements:
<article>: The main entry describing an article. It also defines the unique identifier of the article in the attribute "id". This identifier is described in the section Identifiers. The article element contains the following elements, in this order: "title", "source", "pages", "url", "abstract", "keywords", "authors", "date" and "references". They are described below.
<title>: The article or book title. If there is a subtitle, it is separated from the main title with a colon (:) character.
<source>: The description of the book or article, such as the name of the proceedings or the ISBN of the book. When the article belongs to a conference series referenced by the ACM Digital Library, the attribute "ref" contains the identifier of the series.
<pages>: Empty element with one or two attributes. The attribute "from" contains the first page of the article for a conference or journal article. It contains the number of pages for a book. The attribute "to", when it exists, contains the last page number for a conference or journal article.
<url>: Contains a URL, which may or may not be valid (optional).
<abstract>: Contains the abstract of the article or book, as a sequence of <par> elements (optional).
<keywords>: A comma-separated list of keywords (optional).
<authors>: A list of author names in <author_ref> elements. Each of these elements contains the name as specified on the first page of the article and, when provided by the ACM Digital Library, a unique identifier in the attribute "ref".
<date>: The publication date of the article in the attribute "from". When an attribute "to" is defined, it refers to the last date of the conference where the article was published.
<references>: A list of articles or books cited in the article, each in a <ref> element. The element contains the reference as it has been captured from the article (it can contain typos). When the referenced article or book has been found in the ACM Digital Library or in the dataset, the element defines the "ref" attribute that uniquely identifies the article.
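As an illustration of the structure described above, here is a minimal sketch that reads one article entry with Python's standard xml.etree.ElementTree module. The sample entry is a shortened version of an InfoVis paper entry; the function name parse_article and the returned dictionary layout are hypothetical, and the month-day-year date format is an assumption inferred from values such as "10-19-1998".

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Shortened sample entry that follows the element order described above.
SAMPLE = """
<article id="acm721078">
  <title>An Operator Interaction Framework for Visualization Systems</title>
  <source ref="acm647341">Proceedings of the 1998 IEEE Symposium Information Visualization</source>
  <pages from="63" to="70"/>
  <authors>
    <author_ref ref="P74503">Ed Huai-hsin Chi</author_ref>
    <author_ref ref="P145715">John Riedl</author_ref>
  </authors>
  <date from="10-19-1998" to="10-20-1998"/>
  <references>
    <ref>Graph Visualizer 3D. http://www.omg.unb.ca/hci/projects/gv3d/, March 1998.</ref>
    <ref ref="acm37086">R. A. Becker and W. S. Cleveland. "Brushing scatterplots". Technometrics, 29(2):127-142, 1987.</ref>
  </references>
</article>
"""


def parse_article(xml_text):
    """Turn one <article> element into a plain dictionary (hypothetical layout)."""
    article = ET.fromstring(xml_text.strip())
    source = article.find("source")
    date = article.find("date")
    return {
        "id": article.get("id"),
        "title": article.findtext("title"),
        # The "ref" attribute is only present for series known to the ACM DL.
        "source_ref": source.get("ref") if source is not None else None,
        "authors": [(a.get("ref"), a.text) for a in article.iter("author_ref")],
        # Assumption: dates are written month-day-year.
        "date": (datetime.strptime(date.get("from"), "%m-%d-%Y").date()
                 if date is not None else None),
        # Optional elements: findtext returns None when they are missing.
        "keywords": article.findtext("keywords"),
        # Only references that carry a "ref" attribute have been resolved.
        "resolved_refs": [r.get("ref") for r in article.iter("ref") if r.get("ref")],
    }


entry = parse_article(SAMPLE)
print(entry["id"], entry["date"], entry["resolved_refs"])
```

The optional elements ("url", "abstract", "keywords") are handled by checking for None, since an entry may omit them.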
Identifiers are used in the dataset to unambiguously reference articles, conferences or author names. They follow the rules of XML identifiers and some simple naming conventions:
The ACM Digital Library can be queried from an identifier. For example, if the identifier is "acm721078", then the following query will retrieve its entry in the DL: http://portal.acm.org/citation.cfm?id=721078
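The mapping from identifier to query can be sketched in a few lines of Python. The function name acm_url is hypothetical, and the sketch assumes that only identifiers made of the prefix "acm" followed by digits correspond to entries in the DL; any other identifier returns None.

```python
ACM_QUERY = "http://portal.acm.org/citation.cfm?id="


def acm_url(identifier):
    """Return the DL query URL for an "acm"-prefixed identifier, else None."""
    prefix = "acm"
    number = identifier[len(prefix):]
    # Assumption: DL identifiers are the prefix "acm" followed by digits only.
    if identifier.startswith(prefix) and number.isdigit():
        return ACM_QUERY + number
    return None


print(acm_url("acm721078"))
print(acm_url("id6160"))
```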
Here is a complete metadata entry for a paper of the InfoVis series:
<article id="acm721078"><title>An Operator Interaction Framework for Visualization Systems</title>
<source ref="acm647341">Proceedings of the 1998 IEEE Symposium Information Visualization</source>
<pages from="63" to="70"/>
<abstract><par>Information visualization encounters a wide variety of different data domains. The visualization community has developed representation methods and interactive techniques. As a community, we have realized that the requirements in each domain are often dramatically different. In order to easily apply existing methods, researchers have developed a semiology of graphic representations. We have extended this research into a framework that includes operators and interactions in visualization systems, such as a visualization spreadsheet. We discuss properties of this framework and use it to characterize operations spanning a variety of different visualization techniques. The framework developed in this paper enables a new way of exploring and evaluating the design space of visualization operators, and helps end--users in their analysis tasks.</par>
</abstract>
<authors>
<author_ref ref="P74503">Ed Huai-hsin Chi</author_ref>
<author_ref ref="P145715">John Riedl</author_ref>
</authors>
<date from="10-19-1998" to="10-20-1998"/>
<references><ref>Graph Visualizer 3D. http://www.omg.unb.ca/hci/projects/gv3d/, March 1998.</ref>
<ref ref="acm191775">C. Ahlberg and B. Shneiderman. "Visual information seeking: Tight coupling of dynamic query filters with starfield displays". In Proceedings of ACM CHI'94 Conference on Human Factors in Computing Systems, volume 1 of Information Visualization, pages 313-317, 1994. Color plates on pages 479-480.</ref>
<ref ref="acm37086">R. A. Becker and W. S. Cleveland. "Brushing scatterplots". Technometrics, 29(2):127-142, 1987.</ref>
<ref ref="acm614292">R. A. Becker, S. G. Eick, and A. R. Wilks. "Visualizing network data". IEEE Transaction on Visualization and Computer Graphics, 1(1):16-28, 1995.</ref>
<ref ref="id6160">J. Bertin. "Semiology of Graphics: Diagrams, Networks, Maps". University of Wisconsin Press, Madison, WI, 1967/1983.</ref>
<ref ref="acm857632">S. K. Card and J. Mackinlay. "The structure of the information visualization design space". In Processings of Information Visualization Symposium (InfoVis'97), pages 92-99. IEEE, IEEE CS Press, 1997.</ref>
<ref ref="acm238446">S. K. Card, G. G. Robertson, and W. York. "The webbook and the web forager: An information workspace for the world-wide web". In Proceedings of ACM CHI'96 Conference on Human Factors in Computing Systems, pages 111-117. ACM, ACM Press, 1996.</ref>
<ref ref="id6974">D. A. Norman. "The Design of Everyday Things". Doubleday, 1988.</ref>
.../... (I cut some of the references for compactness)
<ref ref="acm617498">C. Upson, T. Faulhaber, Jr., D. Kamins, D. Laidlaw, D. Schlegel, J. Vroom, R. Gurwitz, and A. van Dam. "The application visualization system: A computational environment for scientific visualization". IEEE Computer Graphics and Applications, pages 30-42, July 1989.</ref>
<ref ref="acm857601">A. Varshney and A. Kaufman. "FINESSE: A financial information spreadsheet". In IEEE Information Visualization Symposium, pages 70-71, 125, 1996.</ref>
<ref>Advanced Visualization System home page. http://www.avs.com, Feb. 1997.</ref>
<ref>IBM Visualization Data Explorer (DX). http://www.almaden.ibm.com/dx/, Feb. 1997. (current as of date).</ref>
<ref>IRIS Explorer home page. http://www.nag.co.uk:80/Welcome IEC.html, Feb. 1997.</ref>
<ref ref="acm857579">J. A. Wise, J. J. Thomas, K. Pennock, D. Lantrip, M. Pottier, A. Schur, and V. Crow. "Visualizing the non-visual: Spatial analysis and interaction with information from text documents". In Proc. Information Visualization Symposium (InfoVis '95), pages 51-58. IEEE, IEEE CS, 1995.</ref>
</references>
</article>
A simple book entry:
<title>The visual display of quantitative information</title>
<source> ISBN: 0-9613921-0-</source>
<author_ref ref="P75580">Edward R. Tufte</author_ref>
An article missing the abstract, and with some references that do not have ref IDs (and therefore no metadata beyond the reference string):
<title>Worlds within worlds metaphors for exploring n-dimensional virtual worlds</title>
<source ref="acm97924">Symposium on User Interface Software and Technology: Proceedings of the 3rd annual ACM SIGGRAPH symposium on User interface software and technology</source>
<pages from="76" to="83"/>
<author_ref ref="P253376">S. K. Feiner</author_ref>
<author_ref ref="P49409">Clifford Beshers</author_ref>
<ref>BANC78 Banchoff, T. "Computer Animation and the Geometry of Surfaces in 3- and 4-Space." Proc. Int. Cong. of Math, 1978, 1005-1013. </ref>
<ref ref="acm102314"> C. M. Beshers , S. K. Feiner, Real-time 4D animation on a 3D graphics workstation, Proceedings on Graphics interface '88, p.1-7, December 1989, Edmonton, Alberta, Canada </ref>
<ref ref="acm73670"> C. M. Beshers , S. Feiner, Scope: automated generation of graphical interfaces, Proceedings of the 2nd annual ACM SIGGRAPH symposium on User interface software and technology, p.76-85, November 13-15, 1989, Williamsburg, Virginia, United States </ref>
<ref ref="acm807391"> William C. Donelson, Spatial management of information, Proceedings of the 5th annual conference on Computer graphics and interactive techniques, p.203-209, August 23-25, 1978</ref>
<ref>FEIN82 Feiner, S., D. Salesin, and T. Banchoff. "DIAL: A diagrammatic animation language." IEEE Computer Graphics and Applications, 2:7, September 1982, 43-54. </ref>
<ref>HULL89 Hull, J. Options, Futures, and Other Derivative Securities, Prentice-Hall, NJ, 1989. </ref>
<ref>KILP76 Kilpatrick, P.J. The Use of a Kinesthetic Supplement in an Interactive Graphics System, Ph.D. Thesis, Univ. of North Carolina, Chapel Hi!l, 1976. </ref>
<ref ref="acm363544"> A. Michael Noll, A computer technique for displaying n-dimensional hyperobjects, Communications of the ACM, v.10 n.8, p.469-473, Aug. 1967 </ref>
<ref>OUHY89 Ouh-young, M., D. Beard and F. Brooks, Jr. "Force Display Performs Better than Visual Display in a Simple 6-D Docking Task." Proc. IEEE Robotics and Automation Conf., May 1989, 1462-6. </ref>
<ref ref="acm37421"> William C. Thibault , Bruce F. Naylor, Set operations on polyhedra using binary space partitioning trees, ACM SIGGRAPH Computer Graphics, v.21 n.4, p.153-162, July 1987 </ref>
<ref ref="acm275628"> Thomas G. Zimmerman , Jaron Lanier , Chuck Blanchard , Steve Bryson , Young Harvill, A hand gesture interface device, Proceedings of the SIGCHI/GI conference on Human factors in computing systems and graphics interface, p.189-192, April 05-09, 1987, Toronto, Ontario, Canada </ref>
We chose to represent the field of Information Visualization through the articles published in the Infovis Symposium series and all their references. The assumption was that all the important papers on Information Visualization should be referenced from articles in the Infovis series.
The dataset has been built in several phases: collecting the InfoVis articles, extracting their references, finding (or not) the referenced articles in the ACM Digital Library, collecting the metadata of the referenced articles from the ACM Digital Library, and putting everything together. The process is VERY tedious and not quite finished yet. We need your help to complete the task and clean the dataset.
Extracting the references from the InfoVis articles
There are two well-known sources of public citation information: CiteSeer (ResearchIndex) (http://citeseer.nj.nec.com/) and ParaCite (http://paracite.eprints.org/). They automatically extract references from PDF files through complex heuristics implemented as Perl scripts. These scripts are not reliable and don't work on PDF files containing bitmaps, such as the proceedings of Infovis from 1995 to 1997. Therefore, we had to extract the references by hand.
Ten students helped collect and check the references: Caroline Appert, Urska Cvek, Alexander G. Gee, Howie Goodell, Vivek Gupta, Christine Lawrence, Hongli Li, Mary Beth Smrtic, Min Yu and Jianping Zhou. It took about 3 weeks to collect 9 years of references. First we had to cut and paste each reference line from Acrobat Reader. Acrobat has some OCR capabilities that produce an acceptable result on references. However, the results had to be carefully cleaned by hand. We also manually added quotes around the article title in each reference, as there is simply no reliable way to automatically extract titles from a reference string.
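Because the titles were quoted by hand, they can be pulled back out of a hand-collected reference string with a simple pattern. The following is only a sketch: the function name quoted_title is hypothetical, and it assumes straight double quotes with the title being the first quoted span. References without quotes, such as web pages or references copied from the ACM Digital Library, yield None.

```python
import re

# Matches the first span enclosed in straight double quotes.
QUOTED = re.compile(r'"([^"]+)"')


def quoted_title(reference):
    """Return the first quoted span of a reference string, or None."""
    match = QUOTED.search(reference)
    if match is None:
        return None
    # Some references keep the final period inside the quotes; drop it.
    return match.group(1).strip().rstrip(".")


print(quoted_title('R. A. Becker and W. S. Cleveland. "Brushing scatterplots". '
                   'Technometrics, 29(2):127-142, 1987.'))
print(quoted_title('Graph Visualizer 3D. http://www.omg.unb.ca/hci/projects/gv3d/, March 1998.'))
```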
We now have to clean up the linking process by hand, trying to resolve and check the references using Google and the ACM Digital Library. This process will continue into March.
If you know how to help in that process, please let us know and spare us some precious time. We need a reliable way to find ACM articles in the Digital Library from the references in the Infovis articles. There are certainly many ways to do it, but we are not specialists in parsing and information retrieval.
We also need to clean up the dataset by checking that the resolved references are right. For example, some articles have also been published as technical reports or PhD theses. Since we only search for articles by title, we may point to an ACM article when the reference actually describes the technical report or PhD thesis. There must also be errors with articles sharing the same title.
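One possible starting point, offered only as a sketch and not as the method used to build the dataset, is approximate title matching with Python's standard difflib module. It assumes a local table of already known titles keyed by identifier (the dictionary known below is a hypothetical stand-in, filled with two titles from the sample entry). The names normalize and best_match and the 0.9 threshold are likewise illustrative choices.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase the title and keep only letters and digits, separated by spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def best_match(title, known_titles, threshold=0.9):
    """Return (identifier, score) of the closest known title, or None."""
    target = normalize(title)
    best = None
    for identifier, candidate in known_titles.items():
        score = SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (identifier, score)
    return best


# Hypothetical local index of titles already present in the dataset.
known = {
    "acm37086": "Brushing scatterplots",
    "acm614292": "Visualizing network data",
}

print(best_match("Brushing scatter-plots", known))
print(best_match("Semiology of Graphics", known))
```

A title-only match can still point at the wrong publication, so every match found this way would need to be checked against the authors, venue and year.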
Finally, we rely on the ACM Digital Library to resolve the article authors. Each author is assigned a unique identifier, but it turns out this assignment is not as reliable as we hoped. We would be glad to have authors correctly unified in the dataset.
Please send us email with corrections.
We would like to thank ACM and IEEE for their help and permission.