<?xml version="1.0" standalone="yes"?> <Paper uid="J87-3002"> <Title>LARGE LEXICONS FOR NATURAL LANGUAGE PROCESSING: UTILISING THE GRAMMAR CODING SYSTEM OF LDOCE</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 INTRODUCTION </SectionTitle> <Paragraph position="0"> The grammar coding system employed by the Longman Dictionary of Contemporary English (henceforth LDOCE) is the most comprehensive description of grammatical properties of words to be found in any published dictionary available in machine readable form. This paper describes the extraction of this, and other, information from LDOCE and discusses the utility of the coding system for automated natural language processing.</Paragraph> <Paragraph position="1"> Recent developments in linguistics, and especially in grammatical theory -- for example, Generalised Phrase Structure Grammar (GPSG) (Gazdar et al., 1985), Lexical Functional Grammar (LFG) (Kaplan and Bresnan, 1982) -- and in natural language parsing frameworks -- for example, Functional Unification Grammar (FUG) (Kay, 1984a), PATR-II (Shieber, 1984) -- make it feasible to consider the implementation of efficient systems for the syntactic analysis of substantial fragments of natural language. These developments also emphasise that if natural language processing systems are to be able to handle the grammatical and semantic idiosyncrasies of individual lexical items elegantly and efficiently, then the lexicon must be a central component of the parsing system. Real-time parsing imposes stringent requirements on a dictionary support environment; at the very least it must allow frequent and rapid access to the information in the dictionary via the dictionary head words. The research described below is taking place in the context of three collaborative projects (Boguraev, 1987; Russell et al., 1986; Phillips and Thompson, 1986) to develop a general-purpose, wide coverage morphological and syntactic analyser for English. 
One motivation for our interest in machine readable dictionaries is to provide a substantial lexicon whose entries contain grammatical information compatible with the grammatical framework employed by the analyser.</Paragraph> <!-- Copyright 1987 by the Association for Computational Linguistics. Permission to copy without fee all or part of this material is granted provided that the copies are not made for direct commercial advantage and the CL reference and this copyright notice are included on the first page. To copy otherwise, or to republish, requires a fee and/or specific permission. 0362-613X/87/030203-218$03.00. Computational Linguistics, Volume 13, Numbers 3-4, July-December 1987, p. 203. Bran Boguraev and Ted Briscoe, Large Lexicons for Natural Language Processing. --> <Paragraph position="2"> The idea of using the machine readable source of a published dictionary has occurred to a wide range of researchers, for spelling correction, lexical analysis, thesaurus construction, and machine translation, to name but a few applications. Most of the work on automated dictionaries has concentrated on extracting lexical or other information, essentially by batch processing (e.g. Amsler, 1981; Walker and Amsler, 1986), or on developing dictionary servers for office automation systems (Kay, 1984b). Few established parsing systems have substantial lexicons, and even those which employ very comprehensive grammars (e.g. Robinson, 1982; Bobrow, 1978) consult relatively small lexicons, typically generated by hand. Two exceptions to this generalisation are the Linguistic String Project (Sager, 1981) and the IBM CRITIQUE (formerly EPISTLE) Project (Heidorn et al., 1982; Byrd, 1983); the former employs a dictionary of approximately 10,000 words, most of which are specialist medical terms, while the latter has well over 100,000 entries, gathered from machine readable sources. 
In addition, there are a number of projects under way to develop substantial lexicons from machine readable sources (see Boguraev, 1986 for details). However, as yet few results have been published concerning the utility of electronic versions of published dictionaries as sources for such lexicons. In this paper we provide an evaluation of the LDOCE grammar coding system from this perspective.</Paragraph> <Paragraph position="3"> We chose to employ LDOCE as the machine readable source to aid the development of a substantial lexicon because this dictionary has several properties which make it uniquely appropriate for use as the core knowledge base of a natural language processing system. Most prominent among these are the rich grammatical subcategorisations of the 60,000 entries; the large amount of information concerning phrasal verbs, noun compounds and idioms; the individual subject, collocational and semantic codes for the entries; and the consistent use of a controlled 'core' vocabulary in defining the words throughout the dictionary. (Michiels (1982) contains further description and discussion of LDOCE.) In this paper we focus on the exploitation of the LDOCE grammar coding system; Alshawi et al.</Paragraph> <Paragraph position="4"> (1985) and Alshawi (1987) describe further research in Cambridge utilising different types of information available in LDOCE.</Paragraph> <Paragraph position="5"> The information available in the dictionary is very rich and diverse, but typically only semi-formalised, as it is intended for human, rather than machine, interpretation. As a consequence, the programs we are developing, both to restructure and to exploit this information, need to undergo constant revision as they are being used. The system we describe is not intended for off-line use, where one might attempt to derive, completely automatically, a lexicon for natural language analysis. 
Rather than batch processing the electronic source, we develop the lexicon from the LDOCE tape incrementally and interactively. Our system is designed as an integral part of a larger grammar (and lexicon) development environment, where new lexical entries are automatically generated from the on-line version of the dictionary, checked for correctness and consistency, and only then added to the 'final' lexicon.</Paragraph> <Paragraph position="6"> The problem of utilising LDOCE in natural language processing falls into two areas. Firstly, we must provide an environment in which the machine readable source is linked to the development environment in an appropriate fashion; secondly, we must restructure the information in the dictionary, using the development environment, in such a way that natural language processing systems are able to utilise it effectively. As an example, we demonstrate how the LDOCE grammar codes can be put to practical use by linking up the system with the experimental PATR-II parsing system.</Paragraph> <Paragraph position="7"> Finally, we offer an evaluation of the utility of the LDOCE grammar coding system from the perspective of natural language processing.</Paragraph> </Section> </Paper>