Automatically Generated Topic Maps of World Wide Web Resources
Carol Jean Godby, Eric Miller, Ray Reighart
OCLC Online Computer Library Center
6565 Frantz Road
Dublin, Ohio, 43017
{godby, miller, reighart} @ oclc.org
Abstract
At the world's major center for research and development in library automation, we are interested in technology that organizes, standardizes, and facilitates access to electronic text. Currently, the World Wide Web affords unprecedented access to globally distributed, primarily textual, information, but users quickly learn that sites of interest can be difficult to find without expert guidance or a measure of luck. Though search engines have been available for five or six years, other methods for discovering information, such as personal recommendations or collections of links from trusted sources, are just as popular. But the reliance on serendipitous discovery strategies is odd because the information needs of an individual user are often confined to a single subject or closely related subjects, as evidenced by the growing number of personal Web pages that collect links on such diverse subjects as computational linguistics, freight trains, Japanese art, calligraphy, computer viruses, non-invasive breast cancer, or the worldwide disappearance of frogs. This behavior suggests both a problem and an opportunity. The problem is that subject access to Web information is still primitive, at best. The opportunity is that many users are taking the initiative to organize portions of the Web themselves, making it possible to leverage their work with the help of appropriately designed software.
To address this problem, OCLC and the World Wide Web Consortium are sponsoring the development of the Resource Description Framework (RDF), an international standard for an infrastructure that enables the encoding, exchange and reuse of descriptive data about Web resources. At the core of RDF is a syntax-independent model for describing resources, properties associated with these resources, and relationships among resources. RDF enables groups that manage data collections to define the vocabularies for describing resources (e.g., "author, title, subject" for describing Web documents, or "name, address, hair-color" for describing people) and to build collections of these properties. Figure 1 illustrates a simple RDF description of a Web page that has properties including a title and an author. The author property, in turn, has the properties name, email and affiliation.
Since the RDF graph is a generic data model, a syntax is required for communicating this model. RDF uses XML (eXtensible Markup Language) as the syntax for transmitting the model among Web-based applications. Figure 2 shows the XML representation corresponding to Figure 1. See Miller (1998) for additional information regarding this data model and its corresponding syntactic representation.
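[Figure 2 listing; XML tags not preserved. Surviving property values: name "John Smith"; affiliation "Home, Inc."; email "smith@home.com"; title "The Smith family tree".]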
Figure 2. XML code for a simple RDF graph.
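As a rough illustration of the description in Figures 1 and 2, the following sketch builds the same nested property structure (a page with a title and an author, where the author has a name, email and affiliation) and serializes it as XML. The element names, the `about` attribute, the helper `describe_page`, and the page URL are assumptions chosen for readability; they do not reproduce the namespaces or exact syntax of the original figure.

```python
import xml.etree.ElementTree as ET


def describe_page(url, title, name, email, affiliation):
    """Build a nested description: page -> title, author -> name/email/affiliation.

    Element and attribute names are illustrative, not the original RDF syntax.
    """
    desc = ET.Element("Description", {"about": url})
    ET.SubElement(desc, "title").text = title
    # The author is itself a node with its own properties.
    author = ET.SubElement(desc, "author")
    ET.SubElement(author, "name").text = name
    ET.SubElement(author, "email").text = email
    ET.SubElement(author, "affiliation").text = affiliation
    return desc


page = describe_page(
    "http://example.org/smith/tree.html",  # hypothetical URL
    "The Smith family tree",
    "John Smith",
    "smith@home.com",
    "Home, Inc.",
)
xml_text = ET.tostring(page, encoding="unicode")
print(xml_text)
```

The nesting of `author` inside the page description mirrors the point made above: a property value can itself be a resource with further properties.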
If collections of Web resources are described as in Figure 2, it is possible to create generic tools for the navigation of this graph based on particular properties. To aid in the creation of such navigation tools, Netscape ( http://www.mozilla.org/rdf/doc ) has developed an engine that produces a memory-resident acyclic graph from collections of RDF descriptions. The content of the graph and the details of the display are left to the application developer, but the RDF engine provides an easy way to create meta-indexes of Web pages that are now laboriously created by hand. For example, one possible input to an application using the RDF engine is a bookmark file. If application software can automatically extract authors, titles and URLs from the Web pages referenced in a bookmark file, browseable author/title displays can be created with no human input.
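The author/title display mentioned above can be sketched as a simple grouping step. This is only an illustration under stated assumptions: the records are hypothetical, the function `build_author_index` is not part of the Netscape RDF engine, and the extraction of authors, titles and URLs from the bookmarked pages is assumed to have already happened.

```python
from collections import defaultdict


def build_author_index(records):
    """Group (author, title, url) records by author, sorted for browsing."""
    index = defaultdict(list)
    for author, title, url in records:
        index[author].append((title, url))
    # Sort authors alphabetically and each author's entries by title.
    return {author: sorted(items) for author, items in sorted(index.items())}


# Hypothetical records, as if extracted from pages listed in a bookmark file.
records = [
    ("Smith, John", "The Smith family tree", "http://example.org/smith/tree.html"),
    ("Doe, Jane", "Freight trains", "http://example.org/doe/trains.html"),
    ("Smith, John", "Calligraphy links", "http://example.org/smith/calligraphy.html"),
]

index = build_author_index(records)
for author, items in index.items():
    print(author)
    for title, url in items:
        print("   ", title, "->", url)
```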
In our demo, an early research prototype, we show how RDF graphs can be enhanced to include subject access through terminology automatically extracted from Web pages. This work is a natural extension of the research reported by Ibekwe-San Juan (1998), which showed that syntactic term variants such as root hair, deformed root hair, and root hair deformation form a graph structure that produces an effective schematic representation of the growth and development of topics in a corpus of academic research articles in plant biology. The broad goals of our research program in terminology identification are to identify high-quality subject terminology in ways that are fully automated, computationally efficient, conceptually simple, and generic; to impose reasonable organizations on this terminology; and to embed this terminology into useful applications that enhance access to electronic text.
Our terminology identification system is conceptually similar to that described by Daille (1994) and is implemented in Java as a set of filters that perform three steps. First, the input text is preprocessed with association statistics to identify ngrams most likely to contain subject vocabulary. The ngrams serve as input to the second step, a partial parser that extracts noun phrases according to patterns specified by the software application. Finally, the set of noun phrases is filtered with heuristics like those described in Nakagawa and Kori (1998) and Wacholder (1998) that identify subject-rich compound nominals as well as atomic nouns and can be used to create simple hierarchical or graphical structures based on syntactic similarity. Our system, described more fully in Godby and Reighart (1999), has many configurable parameters and can drastically reduce the input text. The final terminology graph may represent as little as 0.0003% of the original file, a good result considering our goal of creating relatively sparse terminology maps.
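The first of the three steps, ranking ngrams by an association statistic, can be sketched as follows. This is not the authors' Java implementation: the source does not say which statistic is used, so pointwise mutual information (PMI) over bigrams is swapped in here as an assumption, and the function name `score_bigrams`, the `min_count` threshold, and the sample text are all hypothetical. The parsing and filtering steps are not covered.

```python
import math
import re
from collections import Counter


def score_bigrams(text, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    PMI is an assumed stand-in for the association statistic in the paper.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


sample = (
    "The electronic library is a model. "
    "An electronic library model needs an electronic library."
)
ranked = score_bigrams(sample)
for pair, score in ranked:
    print(" ".join(pair), round(score, 2))
```

In a fuller pipeline, the highest-scoring ngrams would be passed on to the noun phrase parser, which is where pairs such as a determiner followed by an adjective would be discarded.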
Figure 3 shows a portion of an RDF graph populated with automatically extracted subject terminology. In this example, electronic library has two narrower terms, electronic library model and electronic library model implementation. The RDF graph links the terms, identifies their relationships, and clusters the pointers to the resources that contain them.
[Figure 3 listing; XML tags not preserved. Surviving term labels: "electronic library", ..., "electronic library model".]
Figure 3. An RDF graph with subject vocabulary.
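The relationships in the Figure 3 example can be approximated with a small sketch that links each term to its narrower terms and clusters the resources that contain it. The criterion used here, that a narrower term contains the broader term as a contiguous word sequence, is an assumption standing in for the syntactic similarity heuristics described above; the function names and URLs are hypothetical.

```python
def contains_phrase(longer, shorter):
    """True if `longer` has more words than `shorter` and contains it contiguously."""
    lw, sw = longer.split(), shorter.split()
    if len(lw) <= len(sw):
        return False
    return any(lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1))


def build_term_graph(term_resources):
    """Map each term to its narrower terms and the resources that contain it."""
    graph = {}
    for term, urls in term_resources.items():
        narrower = sorted(t for t in term_resources if contains_phrase(t, term))
        graph[term] = {"narrower": narrower, "resources": sorted(urls)}
    return graph


# Hypothetical extracted terms with pointers to the pages that contain them.
term_resources = {
    "electronic library": {"http://example.org/a.html", "http://example.org/b.html"},
    "electronic library model": {"http://example.org/b.html"},
    "electronic library model implementation": {"http://example.org/c.html"},
}

graph = build_term_graph(term_resources)
for term, node in graph.items():
    print(term, "->", node["narrower"], node["resources"])
```

Under this containment rule, a variant such as deformed root hair would also be linked as narrower than root hair, which is in the spirit of the term variants discussed earlier.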