<?xml version="1.0" standalone="yes"?> <Paper uid="P86-1018"> <Title>SEMANTICALLY SIGNIFICANT PATTERNS IN DICTIONARY DEFINITIONS *</Title> <Section position="3" start_page="0" end_page="112" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> Large natural language processing systems need lexicons much larger than those available today, with explicit information about lexical-semantic relationships, usage, forms, morphology, case frames, selection restrictions, and other kinds of collocational information. Apresyan, Mel'cuk, and Zholkovsky studied the kind of explicit lexical information needed by non-native speakers of a language. Their Explanatory-Combinatory Dictionary (1970) explains how each word is used and how it combines with others in phrases and sentences. Their dream has now been realized in a full-scale dictionary of Russian (Mel'cuk and Zholkovsky, 1985) and in example entries for French (Mel'cuk et al., 1984). Computer programs need still more explicit and detailed information. We have discussed elsewhere the kind of lexical information needed in a question answering system (Evens and Smith, 1978) and by a system to generate medical case reports (Li et al., 1985).</Paragraph> <Paragraph position="1"> This research was supported by the</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> National Science Foundation under IST-85- </SectionTitle> <Paragraph position="0"> 10069.</Paragraph> <Paragraph position="1"> A number of experiments have shown that relational thesauri can significantly improve the effectiveness of an information retrieval system (Fox, 1980; Evens et al., 1985; Wang et al., 1985). A relational thesaurus is used to add further terms to the query, terms that are related to the original by lexical relations like synonymy, taxonomy, set-membership, or the part-whole relation, among others. 
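A minimal sketch of this kind of relational query expansion, with an invented toy thesaurus (the terms and relation table below are illustrative, not from the paper):

```python
# Toy relational thesaurus: each entry maps a term to the terms related
# to it under a named lexical relation (synonymy, taxonomy, part-whole).
# All entries here are invented for illustration.
THESAURUS = {
    "car": {
        "synonymy": ["automobile"],
        "taxonomy": ["vehicle"],
        "part-whole": ["engine", "wheel"],
    },
    "engine": {"synonymy": ["motor"]},
}

def expand_query(terms, relations=("synonymy", "taxonomy", "part-whole")):
    """Add to the query every term related to an original query term."""
    expanded = list(terms)
    for term in terms:
        for rel in relations:
            for related in THESAURUS.get(term, {}).get(rel, []):
                if related not in expanded:
                    expanded.append(related)
    return expanded

print(expand_query(["car"]))
# -> ['car', 'automobile', 'vehicle', 'engine', 'wheel']
```

A retrieval system would then match documents against the expanded term list rather than the original query alone, which is what lets it find documents that mention "automobile" when the user asked about "car".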
The addition of these related terms enables the system to identify more relevant documents. The development of such relational thesauri would be comparatively simple if we had a large lexicon containing relational information. (A comparative study of lexical relations can be found in Evens et al., 1980.)</Paragraph> <Paragraph position="2"> The work involved in developing a lexicon for a large subset of English is so overwhelming that it seems appropriate to try to build a lexicon automatically by analyzing information in a machine-readable dictionary. A collegiate-level dictionary contains an enormous amount of information about thousands of words in the natural language it describes. This information is presented in a form intended to be easily understood and used by a human being with at least some command of the language. Unfortunately, even when the dictionary has been transcribed into machine-readable form, the knowledge which a human user can acquire from the dictionary is not readily available to the computer.</Paragraph> <Paragraph position="3"> There have been a number of efforts to extract information from machine-readable dictionaries. Amsler (1980, 1981, 1982) and Amsler and John White (1979) mapped out the taxonomic hierarchies of nouns and verbs in the</Paragraph> </Section> <Section position="2" start_page="0" end_page="112" type="sub_section"> <SectionTitle> Merriam-Webster Pocket Dictionary. </SectionTitle> <Paragraph position="0"> Michiels (1981, 1983) analyzed the Longman Dictionary of Contemporary English (LDOCE), taking advantage of the fact that that dictionary was designed to some extent to facilitate computer manipulation. 
Smith (1981) studied the &quot;defining formulae&quot; - significant recurring phrases - in a selection of adjective definitions from Webster's Seventh New Collegiate Dictionary (W7). Carolyn White (1983) has developed a program to create entries for Sager's Linguistic String Parser (1981) from W7.</Paragraph> <Paragraph position="1"> Chodorow and Byrd (1985) have extracted taxonomic hierarchies, associated with feature information, from LDOCE and W7.</Paragraph> <Paragraph position="2"> We have parsed W7 adjective definitions (Ahlswede, 1985b) using</Paragraph> </Section> <Section position="3" start_page="112" end_page="112" type="sub_section"> <SectionTitle> Sager's Linguistic String Parser (Sager, </SectionTitle> <Paragraph position="0"> 1981) in order to automatically identify lexical-semantic relations associated with defining formulae. We have also identified defining formulae in noun, verb, and adverb definitions from W7 (Ahlswede and Evens, 1983). At present we are working on three interrelated projects: identification and analysis of lexical-semantic relations in or out of W7; generation of computed definitions for words which are used or referred to but not defined in W7; and parsing of the entire dictionary (or as much of it as possible) to generate from it a large general lexical knowledge base.</Paragraph> <Paragraph position="1"> This paper represents a continuation of our work on defining formulae in dictionary definitions, in particular definitions from W7. The patterns we deal with are limited to recurring phrases, such as &quot;any of a&quot; or &quot;a quality or state of&quot; (common in noun definitions) and &quot;of or relating to&quot; (common in adjective definitions). From such phrases, we gain information not only about the words being defined but also about the words used in the definitions and other words in the lexicon.</Paragraph> <Paragraph position="2"> Specifically, we can extract selectional information, co-occurrence relations, and lexical-semantic relations. 
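A minimal sketch of how such defining formulae can be recognized. The three phrases are the ones named in the text; the relation labels are our own glosses, and the sample definition is invented rather than taken from W7:

```python
import re

# Defining formulae: recurring opening phrases that signal a
# lexical-semantic relation between the defined word and the
# words that follow the formula. Relation labels are our glosses.
FORMULAE = [
    (re.compile(r"^any of a\b"), "taxonomy (member of a class)"),
    (re.compile(r"^a quality or state of\b"), "nominalization of a quality"),
    (re.compile(r"^of or relating to\b"), "pertainym (adjective-to-noun)"),
]

def classify_definition(definition):
    """Return the relation suggested by the definition's opening formula."""
    for pattern, relation in FORMULAE:
        if pattern.match(definition):
            return relation
    return None  # no known formula matched

# Invented sample entry, for illustration only.
print(classify_definition("of or relating to the sea"))
# -> "pertainym (adjective-to-noun)"
```

In practice a Keyword in Context index over the definition texts makes such recurring phrases easy to spot, since every occurrence of a candidate formula lines up in one place.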
These methods of extracting information from W7 were designed for use in the lexicon builder described earlier by Ahlswede (1985a).</Paragraph> <Paragraph position="3"> The computational steps involved in this study were relatively simple. First, W7 definitions were divided by part of speech into separate files for nouns, verbs, adjectives, and others. Then a separate Keyword in Context (KWIC) index was made for each part of speech.</Paragraph> <Paragraph position="4"> Hypotheses were tried out initially on a subset of the dictionary containing only those words which appeared eight or more times in the Kucera and Francis corpus (1968) of a million words of running English text. Those that proved valid for this subset were then tested on the full dictionary. This work would have been impossible without the kind permission of the G. &amp; C. Merriam Company to use the machine-readable version of W7 (Olney et al., 1967).</Paragraph> </Section> </Section> </Paper>