InfoVis 2003 Contest
Background information about the Classification Data
These two datasets are different versions of scientific classifications of living organisms in the Animal Kingdom, Animalia. They are not comprehensive, including only about 15% of the approximately one million described species, but they are still very large trees. Each classification is organized hierarchically, using the system developed by the Swedish scientist Linnaeus, in which levels in the hierarchy are given ranks (in increasing specificity: phylum, class, order, family, tribe, genus, species, with the prefixes sub-, infra-, and super- handling the levels in between). In the XML files you will see a series of nodes, each with a scientific name and rank, nested to show which organisms fall into successively larger named groups. A child node is interpreted as belonging to the group named in its parent node. Some nodes also have common, or vernacular, names. For example, this dataset includes 57 species of treefrogs in the genus Hyla, which together with other related frogs make up the family Hylidae, one of many families in the Order Anura (toads and frogs), and so on. By walking the path of nodes from root (Kingdom Animalia) to leaf, one gets the complete formal classification for a particular species. You will find that not all paths include all ranks, however. You may assume that a name refers to a comparable group of animals in each of the two datasets, though the exact children and tree topology may differ. In some classifications branch order reflects ideas about which children branched off first, but branch order is not significant in these particular datasets.
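The root-to-leaf walk described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element name `node` and the attributes `name` and `rank` are hypothetical and may not match the contest XML files, and the small sample tree is hand-written rather than taken from either dataset.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: the element name "node" and the attributes "name" and "rank"
# are hypothetical; check the contest XML files for the real schema.
# The sample below is hand-written for illustration, not taken from the data.
SAMPLE = """
<node name="Animalia" rank="kingdom">
  <node name="Chordata" rank="phylum">
    <node name="Amphibia" rank="class">
      <node name="Anura" rank="order">
        <node name="Hylidae" rank="family">
          <node name="Hyla" rank="genus">
            <node name="Hyla cinerea" rank="species"/>
          </node>
        </node>
      </node>
    </node>
  </node>
</node>
"""


def leaf_paths(element, prefix=()):
    """Yield one root-to-leaf path per leaf, as a tuple of (rank, name) pairs."""
    path = prefix + ((element.get("rank"), element.get("name")),)
    children = list(element)
    if not children:
        # A node without children is a leaf: its path is a full classification.
        yield path
    for child in children:
        yield from leaf_paths(child, path)


root = ET.fromstring(SAMPLE.strip())
paths = list(leaf_paths(root))
for path in paths:
    print(" > ".join(f"{rank}: {name}" for rank, name in path))
```

Because each path stores the rank alongside the name, a path that skips a rank (this sample has no tribe, for instance) is still easy to interpret.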
Biologists use classifications as filing systems. For example, classifications govern how scientific specimens are stored in museums, how field guides present maps, pictures, and text about species, and how libraries store and provide access to scientific studies about organisms. Thus, to be useful, a classification needs to be broadly understandable and relatively stable. Organisms are given scientific names, usually Latin or Latin-like, that must follow certain rules. One rule is that the name must be unique. Organisms thought to be close relatives are put into a group together, and related groups are likewise grouped, forming a nested hierarchy. Groupings are given a rank at each level of the hierarchy, such as Phylum, Order, or Family, to facilitate comparison across groups, and these scientific names and ranks are used in communication among biologists around the world. Common names are informal ways of referring to organisms. While they are not standardized (they often differ according to the language and dialect of the laypeople using them), common names can be very useful for non-experts, and that is why they are included here.
People in many walks of life often want to know more about a particular animal. Knowing its scientific name or where it fits into the Animal Kingdom can lead to more information. Alas, those simple tasks are complicated by the fact that, for various reasons, there are different versions of classifications. While the tasks listed below may seem trivial, they are examples of the broad sorts of information retrieval questions of interest to biologists, library scientists, educators, and the general public.
Comparing the two classification datasets:
- To what extent are the differences in the classifications due to differences in how animals are thought to be related? Are there other kinds of differences and can you explain them?
or, considering one dataset or the other:
- Can you say in how many different subtrees a particular common name (such as "dolphin" or "horse") is used? How closely are these animals related? Are common names a good guide to understanding relationships?
- How many species or subspecies are named after biologists named "Townsend"? Note that the answer will be different if you are looking at common names versus Latin names. Can you look at the pattern of names to deduce where in the world they might have done research? On what kinds of animals?
- Some scientific names are maddeningly similar. For example, Spirulida and Spirurida are two nodes in two different subtrees. A user types in the wrong one. What kind of feedback does your tool provide to alert the user quickly? Do the names have the same rank? Is the typed name in the expected part of the tree?
- Consider the five subtrees with the most nodes. Are they likely to have a parent of a particular rank, or does this happen at many ranks? Can you comment on how useful "rank" is?
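As one possible starting point for the look-alike-name task (Spirulida versus Spirurida), the sketch below uses Python's standard `difflib` module to flag names that closely resemble what the user typed, and reports each look-alike's rank and position in the tree. The `INDEX` dictionary, the helper names `near_misses` and `describe`, and the 0.8 similarity cutoff are all assumptions made for illustration. The sample entries are hand-written, and their ranks and paths are not copied from the contest files.

```python
import difflib

# ASSUMPTION: a hypothetical index mapping scientific name -> (rank, path from root).
# In practice it would be built by walking the XML tree. These entries are
# hand-written for illustration and are not copied from the contest files.
INDEX = {
    "Spirulida": ("order", ("Animalia", "Mollusca", "Cephalopoda", "Spirulida")),
    "Spirurida": ("order", ("Animalia", "Nematoda", "Spirurida")),
    "Anura": ("order", ("Animalia", "Chordata", "Amphibia", "Anura")),
}


def near_misses(typed, index, cutoff=0.8):
    """Return indexed names that closely resemble `typed`, excluding `typed` itself."""
    candidates = [name for name in index if name != typed]
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio.
    return difflib.get_close_matches(typed, candidates, n=5, cutoff=cutoff)


def describe(typed, index):
    """List each look-alike with its rank and path so the user can compare them."""
    lines = []
    for name in near_misses(typed, index):
        rank, path = index[name]
        lines.append(f"{name} ({rank}): {' > '.join(path)}")
    return lines


for line in describe("Spirulida", INDEX):
    print("Did you mean?", line)
```

Showing the rank and the path next to each suggestion speaks to the two follow-up questions in that task: whether the names share a rank, and whether the typed name sits in the expected part of the tree.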
These datasets were provided by Cynthia Parr, Bongshin Lee, and Dana Campbell at the University of Maryland and the University of Michigan.
For more background about animals, see the Animal Diversity Web.
The classification-A dataset is taken directly from ITIS, the Integrated Taxonomic Information System. The classification-B dataset includes much of the information in ITIS but is also informed by several other sources: the EMBL Reptile Database, the University of Michigan's Bird Division, and the Smithsonian Institution's Mammal Species of the World Database. We have done our best to eliminate errors in these datasets, but minor errors from the original sources may persist.