<?xml version="1.0" standalone="yes"?> <Paper uid="W91-0216"> <Title>For the Lexicon That Has Everything</Title> <Section position="2" start_page="0" end_page="181" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Any natural language processing system needs both knowledge about words and knowledge about the world. Many natural language systems divide these two kinds of knowledge into two knowledge bases, which we call the lexicon and the encyclopedia for the purposes of this discussion. We argue that the distinction between the lexicon and the encyclopedia is difficult to maintain both in theory and in practice. We describe the design and development of a large lexical database intended to support parsing, generation, and information retrieval applications. We claim that these applications require information of many different kinds, some of which is traditionally stored in a dictionary, some in a thesaurus, and some in an encyclopedia.</Paragraph> <Paragraph position="1"> We need to support not just alphabetic access to this information but access through semantic links. Bierman \[1964\] was one of the first to describe lexical-semantic links between words. They define the basic organization of semantic information, he claims.</Paragraph> <Paragraph position="2"> He paints an image of a very large single-page dictionary with language-specific nodes connected by semantic relations.</Paragraph> <Paragraph position="3"> Can we distinguish between the lexicon and the encyclopedia in this context? Bierwisch and Kiefer \[1970\] assume that both kinds of information are contained in the same lexical entry. 
The distinction between linguistic or lexical and encyclopedic knowledge, they say, corresponds to the difference between the core and the periphery of a lexical entry, where: The core of a lexical reading comprises all and only those semantic specifications that determine, roughly speaking, its place within the system of dictionary entries, i.e., delimit it from other (non-synonymous) entries. The periphery consists of those semantic specifications which could be removed from its reading without changing its relation to other lexical readings within the same grammar. \[Bierwisch and Kiefer 1970, 69-70\] The major difficulty with this criterion is its instability. As new entries are added to the system, information sufficient to distinguish one entry from another may have to be shifted from the periphery to the core - and thus from the encyclopedia to the lexicon. For instance, suppose a new entry, &quot;leopard - a large wild cat,&quot; is to be added. The entire lexicon must be searched for entries that mention large wild cats. If one is found, say &quot;lion - a large wild cat,&quot; then enough information must be added to both definitions to differentiate leopard and lion from each other.</Paragraph> <Paragraph position="4"> Apresyan, Mel'čuk, and Zolkovsky run into the same difficulty of distinguishing lexical and encyclopedic information in attempting to define the lexical universe of a word CO. The main themes dealt with under the heading 'lexical universe' are: 1) the types of CO; 2) the main part or phases of CO; 3) typical situations occurring before and after CO, etc. Thus, the section lexical universe for the word skis consists of a list of the types of skis (racing, mountain, jumping, hunting), their main parts (skis proper and bindings), the main objects and actions necessary for the correct use (exploitation) of skis (poles, grease, to wax), the main types of activities connected with skis (a ski-trip, a ski-race ...) ... 
the sections contain only such words as are necessary for talking on the topic, and nothing else. \[Apresyan et al. 1970\] The problem is that &quot;what is needed for talking about the topic&quot; depends very much on who is going to do the talking. The definition of ski in Webster's New International (2nd edition) begins: One of a pair of narrow strips of wood, metal, or plastic, usually in combination, bound one on each foot and used for gliding over a snow-covered surface.</Paragraph> <Paragraph position="5"> Apresyan et al. do not provide for three of the items mentioned here: what skis are made of (wood, plastic, or metal), what shape they come in (long and narrow), and where they belong spatially (on the human foot). Yet these items could be essential to understanding implicit inferences in a story.</Paragraph> <Paragraph position="6"> It was snowing. Jim took out his skis and the can of wax. He began to wax the wood carefully. Then he looked for the poles.</Paragraph> <Paragraph position="7"> They could also be needed to answer questions: Jim skied rapidly down the mountain.</Paragraph> <Paragraph position="8"> Question: What was Jim wearing? (slippers, skis, sandals) Although in English and Russian it is possible to refer to skis without knowing that they are long and narrow, it is not possible in Navajo or certain African languages, where physical shapes determine verb forms. While the entry in Webster's New International goes on at length beyond the sentence given above, it does not include all the items that Apresyan et al. mention. Clearly the boundaries of the lexical universe are not well defined. The dichotomy between the lexicon and the encyclopedia is particularly hard to preserve during the updating process. Recognizing definitions phrased in ordinary English is difficult \[Bierwisch and Kiefer, 1970\]. This information does not come neatly packaged and marked &quot;for the lexicon&quot; and &quot;for the encyclopedia&quot;. How do we tell which is which? 
Addition of information to one part of the entry may necessitate updating other parts of the entry. For example, if we learn that record is a verb as well as a noun, we need to add morphological information and describe the relation between record and erase. We should probably describe recording materials as well. We also need to add that the verb record is a factive, i.e., the assertion that someone records an action implies the assertion that the action really occurred. Which of this information is lexical and which is encyclopedic? Both theoretical and practical arguments convince us that the lexicon-encyclopedia dichotomy is not valid.</Paragraph> <Paragraph position="9"> Information about semantic relationships between words - thesaurus information - is needed for many reasons. It is crucial to semantic access. Hirst and Morris \[1990\] have shown that it is fundamental to language understanding. Fox \[1980, 1988\], Nutter et al. \[1988\] and Wang et al. \[1985\] have used thesaurus information to improve the results of an information retrieval system. Eiler \[1979\] has shown the importance of lexical relationships in human text generation. Lee \[1991\] is using this kind of information to generate cohesive text by machine. Zhang \[1990\] is using it to generate explanations.</Paragraph> <Paragraph position="10"> We need to store all this information not just for words but for phrases. Becker \[1975\] argues cogently that language is ordinarily generated in large swatches, not a word at a time. Commercial dictionaries include many phrasal entries. For example, approximately 160~ of the main entries in Webster's Seventh Collegiate Dictionary are phrases and many other phrases appear as &quot;run-ons&quot; at the foot of other entries. 
Charniak \[1972\] makes it clear that &quot;birthday party&quot; needs an entry of its own - and shows also that its lexical universe is huge.</Paragraph> <Paragraph position="11"> 2 Organization of the Lexical Database Because we want to see our lexicon used by as many people as possible, we have sought sources for our data that will permit us to distribute the database to anyone who plans to use it for research purposes. Collins Publishers has generously agreed that we may give a copy of our lexicon, including data derived from the first edition of the Collins English Dictionary (CED), to anyone who qualifies to obtain a machine-readable copy from the Data Collection Initiative. Another valuable source of lexical data is the Brandeis Verb Lexicon constructed by Grimshaw and Jackendoff \[1985\]. Sven Jacobson \[1964, 1978\] has kindly allowed us to keyboard and distribute his Dictionary of Adverb Placement. We have also put into machine-readable form the Adjective, Adverb, Noun, and Verb Lists developed from Householder's NSF project of twenty-five years ago \[1964, 1965\].</Paragraph> <Paragraph position="12"> Our lexical database is organized and stored using the Oracle Relational Database Management System. Our database relations (which we will call tables to distinguish them from semantic relations) include a main table with the word, and its homograph and sense number (from the CED) combined into one field, the part of speech, and the source. Each word with a different homograph and sense number (assigned in CED) is put into a different entry for the purpose of lexical disambiguation. Words that have no homograph or sense number in CED are assigned a code of -1. We have also designed separate tables for each part of speech. 
Each table contains different information specific to that part of speech.</Paragraph> <Paragraph position="13"> The noun table contains information about whether a noun is regular or irregular, abstract or concrete, count or mass, human, animate or inanimate, singular or plural, common or proper, collective or not, what gender it is, and whether it appears in an Indiana noun list. We have a separate table for Indiana nouns (those that support that-clauses) giving the number of the Indiana list \[Bridgeman 1965\]. Then there is still another table that gives the definition and an example for each Indiana noun list. We also have a separate table for nouns with irregular plurals, like child, goose, and ox.</Paragraph> <Paragraph position="14"> There are a number of different verb tables. The main verb table tells whether a verb is regular or irregular, dynamic or stative, transitive or intransitive (or both), takes a sentential complement or not, and can be put into the passive voice or not. If it is in a speech act class \[Wierzbicka, 1989\] or a performative class \[McCawley, 1979\] then the class will be given. Then there is a table for strong verbs with their forms. There is a case table giving information about verb arguments. If a verb takes sentential complements it appears in a special table that tells what complementizers the verb takes, its implicative class (factive, etc.), whether it is subject to raising, and whether it appears in an Indiana verb list. The Indiana verb table gives the Indiana verb classes in which the verb appears. There is yet another table that gives the defining information for the Indiana verb lists.</Paragraph> <Paragraph position="15"> The adjective table indicates whether the adjective is dynamic or stative, gradable or non-gradable, inherent or non-inherent. An adjective may be intensitive. It may appear as a post-determiner. 
It may be a general adjective susceptible to subjective measure, a general adjective susceptible to objective measure including size or shape, or color. It may be a denominal adjective denoting material, or a denominal adjective denoting provenance or style. Information about the semantic classes an adjective belongs to is essential to determining its position in the sentence during text generation. While most adjectives can occur in both attributive and predicate positions, some are non-attributive and others are non-predicative. We also have a table for unpredictable adjective inflections and another for Indiana adjectives \[Householder et al. 1965\]. Our adverb tables have been fully discussed elsewhere \[Pin-Ngern et al. 1990\].</Paragraph> <Paragraph position="16"> We also have a table listing lexical-semantic relations with definitions and examples and then several tables of lexical-semantic relationships \[Ahlswede and Evens, 1988a\]. Our plans include tables containing other information from CED such as definitions, pronunciations, and etymologies, but these have not been built since none of us is currently using that information.</Paragraph> </Section> </Paper>