You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format. However, this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the Department of Computer Science of the University of Maryland at College Park under terms that include this permission. All other rights are reserved by the author(s).
Chinese-English Semantic Resource Construction. Bonnie J. Dorr. Gina-Anne Levow. Dekang Lin. Scott Thomas. June 2000.
We describe an approach to large-scale construction of a semantic lexicon for Chinese verbs. We draw on three existing resources: a classification of English verbs called EVCA (English Verb Classes and Alternations) [Levin, 1993], a Chinese conceptual database called HowNet [Zhendong, 1988c, Zhendong, 1988b] (http://www.how-net.com), and a large machine-readable dictionary called Optilex. The resulting lexicon is used for determining appropriate word senses in applications such as machine translation and cross-language information retrieval. (Also cross-referenced as UMIACS-TR-2000-27) (Also cross-referenced as LAMP-TR-044) University of Maryland Institute for Advanced Computer Studies, Department of Computer Science, University of Maryland,
Large-Scale Construction of a Chinese-English Semantic Hierarchy. Bonnie J. Dorr. Gina-Anne Levow. Dekang Lin. June 2000.
This paper addresses the problem of building conceptual resources for multilingual applications. We describe new techniques for large-scale construction of a semantic hierarchy for Chinese verbs, using thematic-role information to create links between Chinese concepts and English classes. We then present an approach to compensating for gaps in the existing resources. The resulting hierarchy is used for a multilingual lexicon for Chinese-English machine translation and cross-language information retrieval applications. (Also cross-referenced as UMIACS-TR-2000-17) (Also cross-referenced as LAMP-TR-040) University of Maryland Institute for Advanced Computer Studies, Department of Computer Science, University of Maryland,
A Survey of Current Paradigms in Machine Translation. Bonnie J. Dorr. Pamela W. Jordan. John W. Benoit. December 1998.
This paper is a survey of current machine translation research in the US, Europe, and Japan. A short history of machine translation is presented first, followed by an overview of the current research work. Representative examples of a wide range of different approaches adopted by machine translation researchers are presented. These are described in detail along with a discussion of the practicalities of scaling up these approaches for operational environments. In support of this discussion, issues in, and techniques for, evaluating machine translation systems are discussed. (Also cross-referenced as UMIACS-TR-98-72) University of Maryland Institute for Advanced Computer Studies, Department of Computer Science, University of Maryland,
A Thematic Hierarchy for Efficient Generation from Lexical-Conceptual. Bonnie J. Dorr. Nizar Habash. David Traum. October 1998.
This paper describes an implemented algorithm for syntactic realization of a target-language sentence from an interlingual representation called Lexical Conceptual Structure (LCS). We provide a mapping between LCS thematic roles and Abstract Meaning Representation (AMR) relations; these relations serve as input to an off-the-shelf generator (Nitrogen). There are two contributions of this work: (1) the development of a thematic hierarchy that provides ordering information for realization of arguments in their surface positions; (2) the provision of a diagnostic tool for detecting inconsistencies in an existing online LCS-based lexicon that allows us to enhance principles for thematic-role assignment. (Also cross-referenced as UMIACS-TR-98-50) (Also cross-referenced as LAMP-TR-022) University of Maryland Institute for Advanced Computer Studies, Department of Computer Science, University of Maryland,
Lexical Selection for Cross-Language Applications: Combining LCS with. Bonnie J. Dorr. Maria Katsova. October 1998.
This paper describes experiments for testing the power of large-scale resources for lexical selection in machine translation (MT) and cross-language information retrieval (CLIR). We adopt the view that verbs with similar argument structure share certain meaning components, but that those meaning components are more relevant to argument realization than to idiosyncratic verb meaning. We verify this by demonstrating that verbs with similar argument structure as encoded in Lexical Conceptual Structure (LCS) are rarely synonymous in WordNet. We then use the results of this work to guide our implementation of an algorithm for cross-language selection of lexical items, exploiting the strengths of each resource: LCS for semantic structure and WordNet for semantic content. We use the Parka Knowledge-Based System to encode LCS representations and WordNet synonym sets, and we implement our lexical-selection algorithm as Parka-based queries into a knowledge base containing both information types. (Also cross-referenced as UMIACS-TR-98-49) (Also cross-referenced as LAMP-TR-021) University of Maryland Institute for Advanced Computer Studies, Department of Computer Science, University of Maryland,
A Comparative Study of Knowledge-Based Approaches for Cross-Language. Douglas W. Oard. Bonnie J. Dorr. Paul G. Hackett. Maria Katsova. April 1998.
Cross-language retrieval systems seek to use queries in one natural language to guide the retrieval of documents that might be written in another. Acquisition and representation of translation knowledge play a central role in this process. This paper explores the utility of two sources of manually encoded translation knowledge, bilingual dictionaries and translation lexicons, for cross-language retrieval. We have implemented six query translation techniques that use bilingual dictionaries, one based on lexical-semantic analysis, and one based on direct use of the translation output from an existing machine translation system; these are compared with a document translation technique that uses output from the same existing translation system. Average precision measures on portions of the TREC collection suggest that arbitrarily selecting a single translation from a bilingual dictionary is typically no less effective than using every translation in the dictionary, that query translation using an existing machine translation system can achieve somewhat better effectiveness than simple dictionary-based techniques, and that performing document translation rather than query translation may result in further improvements in retrieval effectiveness under some conditions. (Also cross-referenced as UMIACS-TR-98-27) University of Maryland Institute for Advanced Computer Studies, Department of Computer Science, University of Maryland,
Toward Compact Monotonically Compositional Interlingua Using Lexical Aspect. Bonnie J. Dorr. Mari Broman Olsen. Scott C. Thomas. December 1997.
We describe a theoretical investigation into the semantic space described by our interlingua (IL), which currently has 191 main verb classes divided into 434 subclasses, represented by 237 distinct Lexical Conceptual Structures (LCSs). Using the model of aspect in Olsen (1994b, 1997a)---monotonic aspectual composition---we have identified 71 aspectually basic subclasses that are associated with one or more of 68 aspectually non-basic classes via some lexical ("type-shifting") rule (Bresnan 1982, Pinker 1984, Levin and Rappaport Hovav 1995). This allows us to refine the IL and address certain computational and theoretical issues at the same time. (1) From a linguistic viewpoint, the expected benefits include a refinement of the aspectual model in (Olsen 1994b, 1997a) (which provides necessary but not sufficient conditions for aspectual composition), and a refinement of the verb classifications in (Levin 1993); we also expect our approach to eventually produce a systematic definition (in terms of LCSs and compositional operations) of the precise meaning components responsible for Levin's classification. (2) Computationally, the lexicon is made more compact. (Also cross-referenced as UMIACS-TR-97-86) (Also cross-referenced as LAMP-TR-012) University of Maryland Institute for Advanced Computer Studies, University of Maryland Laboratory for Language and Media Processing, Department of Computer Science, University of Maryland,
Using WordNet to Posit Hierarchical Structure in Levin's Verb Classes. Mari Broman Olsen. Bonnie J. Dorr. David J. Clark. December 1997.
In this paper we report on experiments using WordNet synset tags to evaluate the semantic properties of the verb classes cataloged by Levin (1993). This paper represents ongoing research begun at the University of Pennsylvania (Rosenzweig et al. 1997, Palmer et al. 1997) and the University of Maryland (Dorr and Jones 1996b, 1996d, 1996e). Using WordNet sense tags to constrain the intersection of Levin classes, we avoid spurious class intersections introduced by homonymy and polysemy (_run a bath, run a mile_). By adding class intersections based on a single shared sense-tagged word, we minimize the impact of the non-exhaustiveness of Levin's database (Dorr and Olsen 1996, Dorr to appear). By examining the syntactic properties of the intersective classes, we provide a clearer picture of the relationship between WordNet/EuroWordNet and the LCS interlingua for machine translation and other NLP applications. (Also cross-referenced as UMIACS-TR-97-85) (Also cross-referenced as LAMP-TR-011) University of Maryland Institute for Advanced Computer Studies, University of Maryland Laboratory for Language and Media Processing, Department of Computer Science, University of Maryland,
Development of an Object Oriented Parser/Generator, Ontologies, and. Bonnie J. Dorr. February 1996.
This document reports on research conducted at the University of Maryland for the Korean/English Machine Translation (MT) project. Our primary objective was to develop an interlingual representation based on lexical conceptual structure (LCS) and to examine the relation between this representation and a set of linguistically motivated semantic classes. We have focused on several areas in support of our objectives: (1) updating a Korean message-passing parser to handle more Korean linguistic phenomena and porting this to Windows on the PC so that it runs with LCS composition; (2) scaling up the Korean lexicon to include thousands of new words converted by the Yale-romanization program, to be integrated with the Korean message-passing parser; (3) investigation of the syntax-semantics relation and use of this relation in automatic classification of verbs; (4) investigation of the aspectual dimension as it impacts lexical semantics and the lexical choice process in multilingual generation; and (5) automatic construction of LCS's from lexical-semantic templates and thematic grids. (Also cross-referenced as UMIACS-TR-97-37) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Aspectual Modifications to a LCS Database for NLP Applications. Bonnie J. Dorr. Mari Broman Olsen. May 1997.
Verbal and compositional lexical aspect provide the underlying temporal structure of events. Knowledge of lexical aspect, e.g., (a)telicity, is therefore required for interpreting event sequences in discourse (Dowty, 1986; Moens and Steedman, 1988; Passonneau, 1988), interfacing to temporal databases (Androutsopoulos, 1996), processing temporal modifiers (Antonisse, 1994), describing allowable alternations and their semantic effects (Resnik, 1996; Tenny, 1994), and selecting tense and lexical items for natural language generation ((Dorr and Olsen, 1996; Klavans and Chodorow, 1992), cf. (Slobin and Bocaz, 1988)). We show that it is possible to represent lexical aspect---both verbal and compositional---on a large scale, using Lexical Conceptual Structure (LCS) representations of verbs in the classes cataloged by Levin (1993). We show how proper consideration of these universal pieces of verb meaning may be used to refine lexical representations and derive a range of meanings from combinations of LCS representations. A single algorithm may therefore be used to determine lexical aspect classes and features at both verbal and sentence levels. Finally, we illustrate how knowledge of lexical aspect facilitates the interpretation of events in NLP applications. (Also cross-referenced as UMIACS-TR-97-21) (Also cross-referenced as LAMP-TR-007) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
LEXICALL: Lexicon Construction for Foreign Language Tutoring. Bonnie J. Dorr. February 1997.
We focus on the problem of building large repositories of lexical conceptual structure (LCS) representations for verbs in multiple languages. One of the main results of this work is the definition of a relation between broad semantic classes and LCS meaning components. Our acquisition program---LEXICALL---takes, as input, the result of previous work on verb classification and thematic grid tagging, and outputs LCS representations for different languages. These representations have been ported into English, Arabic and Spanish lexicons, each containing approximately 9000 verbs. We are currently using these lexicons in operational foreign language tutoring and machine translation. (Also cross-referenced as UMIACS-TR-97-09) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Bilingual Lexicon Construction Using Large Corpora. Wade Shen. Bonnie J. Dorr. October 1997.
This paper introduces a method for learning bilingual term- and sentence-level alignments for the purpose of building lexicons. Combining statistical techniques with linguistic knowledge, a general algorithm is developed for learning term and sentence alignments from large bilingual corpora with high accuracy. This is achieved through the use of filtered linguistic feedback between term and sentence alignment processes. An implementation of this algorithm, TAG-ALIGN, is evaluated against approaches similar to [Brown et al. 1993], which apply Bayesian techniques for term alignment, and [Gale and Church 1991], a dynamic programming method for aligning sentences. The ultimate goal is to produce large bilingual lexicons with a high degree of accuracy from potentially noisy corpora. (Also cross-referenced as UMIACS-TR-97-50) Institute for Advanced Computer Studies, Department of Computer Science,
A Survey of Multilingual Text Retrieval. Douglas W. Oard. Bonnie J. Dorr. April 1996.
This report reviews the present state of the art in selection of texts in one language based on queries in another, a problem we refer to as "multilingual" text retrieval. Present applications of multilingual text retrieval systems are limited by the cost and complexity of developing and using the multilingual thesauri on which they are based and by the level of user training that is required to achieve satisfactory search effectiveness. A general model for multilingual text retrieval is used to review the development of the field and to describe modern production and experimental systems. The report concludes with some observations on the present state of the art and an extensive bibliography of the technical literature on multilingual text retrieval. The research reported herein was supported, in part, by Army Research Office contract DAAL03-91-C-0034 through Battelle Corporation, NSF NYI IRI-9357731, Alfred P. Sloan Research Fellow Award BR3336, and a General Research Board Semester Award. (Also cross-referenced as UMIACS-TR-96-19) Electrical Engineering Department, Univ. of Maryland, University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Automatic Extraction of Semantic Classes from Syntactic. Bonnie J. Dorr. Doug Jones. December 1995.
This paper addresses the issue of word-sense ambiguity in extraction from machine-readable resources for the construction of large-scale knowledge sources. We describe two experiments: one which took word-sense distinctions into account, resulting in 97.9% accuracy for semantic classification of verbs based on (Levin, 1993); and one which ignored word-sense distinctions, resulting in 6.3% accuracy. These experiments were dual purpose: (1) to validate the central thesis of the work of (Levin, 1993), i.e., that verb semantics and syntactic behavior are predictably related; (2) to demonstrate that a 20-fold improvement can be achieved in deriving semantic information from syntactic cues if we first divide the syntactic cues into distinct groupings that correlate with different word senses. Finally, we show that we can provide effective acquisition techniques for novel word senses using a combination of online sources. (Also cross-referenced as UMIACS-TR-95-65) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Development of Interlingual Lexical Conceptual Structures with. Bonnie J. Dorr. Jye-hoon Lee. Clare Voss. Sungki Suh. February 1995.
This document reports on research conducted at the University of Maryland for the Korean/English Machine Translation (MT) project. Our primary objective was to develop an interlingual representation based on lexical conceptual structure (LCS) and to examine the relation between this representation and a set of linguistically motivated semantic classes. We view the work of the past year as a critical step toward achieving our goal of building a generator: the classification of LCS's into a semantic hierarchy provides a systematic mapping between semantic knowledge about verbs and their surface syntactic structures. We have focused on several areas in support of our objectives: (1) investigation of morphological structure including distinctions between Korean and English; (2) porting a fast, message-passing parser to Korean (and to the IBM PC); (3) study of free word order and development of the associated processing algorithm; (4) investigation of the aspectual dimension as it impacts morphology, syntax, and lexical semantics; (5) investigation of the relation between semantic classes and syntactic structure; (6) development of theta-role and lexical-semantic templates through lexical acquisition techniques; (7) definition of a mapping between KR concepts and interlingual representations; (8) formalization of the lexical conceptual structure. (Also cross-referenced as UMIACS-TR-95-16) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Development of Cross-Linguistic Syntactic and Semantic Parameters for Parsing and Generation. Bonnie J. Dorr. March 1994.
This document reports on research conducted at the University of Maryland for the Korean/English Machine Translation (MT) project. The translation approach adopted here is interlingual, i.e., a single underlying representation called Lexical Conceptual Structure (LCS) is used for both Korean and English. The primary focus of this investigation concerns the notion of 'parameterization', i.e., a mechanism that accounts for both syntactic and lexical-semantic distinctions between Korean and English. We present our assumptions about the syntactic structure of Korean-type languages vs. English-type languages and describe our investigation of syntactic parameterization for distinguishing between these two types of languages. We also present the details of the LCS structure and describe how this representation is parameterized so that it accommodates both languages. We address critical issues concerning interlingual machine translation such as locative postpositions and the dividing line between the interlingua and the knowledge representation. Difficulties in translation and transliteration of Korean are discussed and complex morphological properties of Korean are presented. Finally, we describe recent work on lexical acquisition and conclude with a discussion about two hypotheses concerning semantic classification that are currently being tested. (Also cross-referenced as UMIACS-TR-94-26) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Last Generated Fri Aug 11 04:01:01 EDT 2000