<?xml version="1.0" standalone="yes"?> <Paper uid="C88-2166"> <Title>Machine-Readable Dictionary&quot; in The Uses of Large Text Databases, Proceedings of the Third</Title> <Section position="3" start_page="0" end_page="816" type="metho"> <SectionTitle> 2. Applications </SectionTitle> <Paragraph position="0"> The initial impetus for building a computational lexicon arose from the needs of the CRITIQUE text-critiquing system (previously called EPISTLE, Heidorn et al. 1982). Basic syntactic information such as part of speech, subcategorization for verbs (e.g. trans, intrans, complement-taking properties), irregular forms, some inherent semantic information (such as male, female for nouns), and some graphemic, phonological, and stylistic features were gathered from a range of (primarily) machine-readable sources. This system (called UDICT, the ultimate dictionary) is described in Byrd 1983 and Byrd et al. 1986. A modified version of the original dictionary is still in use by that project.</Paragraph> <Paragraph position="1"> Our experience in attempting to build a solid broad-coverage computational lexicon revealed to us the range of projects potentially in need of such a lexical resource. Unfortunately, it also revealed to us a range of problems. First, the projects: we received requests for information from NLP projects such as the experimental English-to-German machine translation system LMT /McCord 1988/, the natural language database query project TQA /Damerau et al. 1982, Johnson 1984/, the kind-types Knowledge Representation system KT /Dahlgren and McDowell 1986/, and others. In fact, the LMT system uses UDICT for lexicon back-up when the LMT lexicon does not contain or does not analyze an item /McCord and Wolff 1987/. The analyses output from UDICT are compiled into LMT internal format for use by LMT.
This is exactly the use we envision for COMPLEX.</Paragraph> <Paragraph position="2"> In addition to use by NLP systems, some of the information in COMPLEX might be used directly by lexicographers to aid in creating lexicographers' workstations for projects such as dictionary building and machine-assisted translation. It could also be useful to psycholinguists seeking lists of words with particular lexical properties for test materials /Taft and Forster 1976, Cutler 1983/. Since COMPLEX is machine readable, it is a simple matter to extract lists with selected features.</Paragraph> <Paragraph position="3"> Some of the problems that arose as a result of our experience in attempting to build and provide a solid broad-coverage computational lexicon for NLP projects are discussed in the next section.</Paragraph> <Paragraph position="4"> Most important is the problem of polysemy. We realized that until the problem of sense distinctions is tackled, any computational lexicon will be of limited usefulness. The other problem particular to using machine-readable dictionaries is the Mapping problem, also discussed below.</Paragraph> <Paragraph position="5"> 3. The Polysemy Problem and The</Paragraph> <Section position="1" start_page="0" end_page="816" type="sub_section"> <SectionTitle> Mapping Problem </SectionTitle> <Paragraph position="0"> Each entry in UDICT consists of lists of features and attribute-value pairs. There is one list for each part of speech. For example, the word &quot;claim&quot; has two parts of speech in UDICT:</Paragraph> <Paragraph position="2"> In this case, &quot;claim&quot; is morphologically simple so the STRUCTURE value is the same as the input word.</Paragraph> <Paragraph position="3"> The polysemy problem arises because of the fact that there is only one list of features permitted for each part of speech. The question is to decide what features to put into the feature bundle. This is not a trivial matter, but there are several options.
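The one-bundle-per-part-of-speech organization described above can be sketched as follows. This is a hedged illustration, not UDICT itself: the feature names and helper functions are invented, since the paper's own feature display for "claim" was lost in extraction.

```python
# Hypothetical sketch of a UDICT-style entry: one flat feature bundle per
# part of speech, mixing bare features with (attribute, value) pairs.
# Feature names below are invented for illustration.

def make_entry(word, bundles):
    """An entry maps each part of speech to a single unstructured list."""
    return {"word": word, "pos": bundles}

claim = make_entry("claim", {
    "noun": ["SING", ("STRUCTURE", "claim")],
    "verb": ["TRAN", "THATCOMP", ("STRUCTURE", "claim")],
})

def parts_of_speech(entry):
    """Return the parts of speech an entry carries, in sorted order."""
    return sorted(entry["pos"].keys())
```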
One is to put only those features that apply to all senses of a word, that is, the intersection of the set of features for each sense. Another would be to list the union of all features for each sense. Of course, there is the option of representing different senses of a word, with the corresponding set of features, but then this brings along another more fundamental problem: what is a sense? Consider a system such as that reported in Boguraev 1986 and 1987 in which sense distinctions are in fact made. The grammar development system, intended for a GPSG-style parser, utilizes the grammatical codes in the Longman Dictionary of Contemporary English /1978/, henceforth LDOCE, as the basis for listing of feature-value sets. However, notice that this system is forced to accept the sense distinctions from LDOCE, for better or for worse. Similarly, the project described in Wilks et al. 1987 uses LDOCE definitions as the basis for lexical semantic structures. Semantic information is to be extracted from dictionary entries in LDOCE to build sense frames.</Paragraph> <Paragraph position="4"> These structures (with some enhancements) are to provide the basis for knowledge-based parsing.</Paragraph> <Paragraph position="5"> Both projects are pursuing important paths in NLP research, and in particular in the use of machine-readable dictionaries. However, each is constrained by the sense distinctions dictated by LDOCE.</Paragraph> <Paragraph position="6"> LDOCE is a small dictionary, so there are many distinctions omitted. Furthermore, important grammatical distinctions are often merged for the sake of space. (From now on, the term &quot;features&quot; is used to apply to both features and attribute-value pairs in UDICT.) As human readers, we may be able to decode such abbreviations, but it is doubtful that computers are capable of such interpretation.
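The intersection and union options just described can be made concrete with a short sketch. The per-sense feature sets below are invented for illustration and are not taken from UDICT; the point is that neither strategy records which sense licenses which feature.

```python
# Sketch of the two bundling strategies: intersection vs. union of
# per-sense feature sets. The sense data here is invented.

senses = {
    "sense1": {"TRAN", "THATCOMP"},   # e.g. "claim that S"
    "sense2": {"TRAN", "NPTOV"},      # e.g. "claim to V"
}

def intersect_features(senses):
    """Keep only the features shared by every sense (most of UDICT)."""
    sets = iter(senses.values())
    out = set(next(sets))
    for s in sets:
        out &= s
    return out

def union_features(senses):
    """Pool the features of all senses (the UDICT treatment of verbs)."""
    out = set()
    for s in senses.values():
        out |= s
    return out
```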
Take, for example, the entry for the verb &quot;button&quot;: button: to button (up) one's shirt; My shirt doesn't button (up) easily. The entry is listed as requiring a human subject, yet the example sentence has the surface subject &quot;shirt&quot;. The problem here is that the underlying Agent is &quot;human&quot;, but not the surface subject. Regular alternations like this are sometimes captured implicitly in the definition in the form of the parenthesized &quot;(cause to)&quot;, but this is in no way explicit in the dictionary resource. A detailed study of the semantic codes for subject from LDOCE is given below.</Paragraph> <Paragraph position="7"> To sum up, there are various solutions to the problem of senses, each of them inadequate in one way or another. The solution to list only the intersection of features (the approach in most of UDICT) or the solution to list the union of features (taken for the verbs in UDICT) does not capture the fact that different senses of a word exhibit different syntactic behavior. Important information is obscured and omitted by these approaches. On the other hand, the solution chosen by Wilks et al. 1987 or by Boguraev 1986 and 1987 is to take the sense distinctions provided by LDOCE. But this then requires a system to adopt LDOCE senses, even when they are incomplete or incorrect. In order to use more than one MRD, a way to map senses in one dictionary onto senses in another is required, since sense distinctions across dictionaries rarely correspond.</Paragraph> <Paragraph position="8"> Alternatively, one could compose a set of ideal data structures and then hunt in various resources, including dictionaries, for information which completes the required fields.
This is the proposal set forth in Atkins 1987,2 and it is the route we are currently pursuing, although our results are still too preliminary to be reported.</Paragraph> <Paragraph position="10"/> </Section> </Section> <Section position="4" start_page="816" end_page="817" type="metho"> <SectionTitle> 4.1 COMPLEX Structure </SectionTitle> <Paragraph position="0"> The previous sections of this paper have described the limitations of UDICT. With this in mind, this section gives the information to be contained in COMPLEX. Currently, we draw on the following sources:3 1. enhanced UDICT (LexSys) 2. Brandeis Verb Lexicon 3. definitions and grammatical information from LDOCE. (We too are using the sense distinctions from LDOCE, although we are aware of its limitations; see also Michiels 1982. Our system is not hard-wired into LDOCE.) We acknowledge the valuable input of Beryl T. (Sue) Atkins, who was visiting the Lexical Systems Group at IBM during April, 1988. We also acknowledge input from Beth Levin. The Brandeis Verb Lexicon was developed by Jane Grimshaw and Ray Jackendoff, NSF grant number NSF IST-81-20403 awarded to Brandeis University. Consider the design for one sense of the verb &quot;bring&quot;: --LDOCE :SENSENUM. 1 :SGRAMCODES. D1 ( to, for ); T1 :SUBJCODES. NONE :SEL_RES_SUBJ. NONE :SEL_RES_DO. NONE :SEL_RES_IO. NONE :DEF. to come with or lead. Note that there are three distinct data sets. Each of these structures will be described in turn.</Paragraph> <Section position="1" start_page="817" end_page="817" type="sub_section"> <SectionTitle> 4.2 Lexical Systems </SectionTitle> <Paragraph position="0"> In the example above, the Lexical Systems data show four feature types: two MORPHological, one PHONological, nine SYNTACTIC, and one SYSTEM feature. Other feature types not shown in this analysis are SEMANTIC, STYLISTIC, and GRAPHEMIC.
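As a rough illustration, the LDOCE-derived data set for "bring" shown above might be held in a record such as the following. The Python field names are our own transcription of the display's labels; they are an assumption, not part of any system described here.

```python
# Hedged sketch of one LDOCE-derived sense record, mirroring the fields
# in the "bring" display (field names are our own transcription).
from dataclasses import dataclass

@dataclass
class LdoceSense:
    sensenum: int
    sgramcodes: str     # e.g. "D1 (to, for); T1"
    subjcodes: str
    sel_res_subj: str   # selectional restriction on subject
    sel_res_do: str     # ... on direct object
    sel_res_io: str     # ... on indirect object
    definition: str

bring_1 = LdoceSense(
    sensenum=1,
    sgramcodes="D1 (to, for); T1",
    subjcodes="NONE",
    sel_res_subj="NONE",
    sel_res_do="NONE",
    sel_res_io="NONE",
    definition="to come with or lead",
)
```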
The two morphological features (MORPH) give the irregular inflectional attribute-value pairs for the past and past participial forms of the verb (PAST brought) and (PASTPART brought). The next feature is phonological (PHON); AXNT means that the word is accented on the final syllable. In the case of &quot;bring&quot; the word is monosyllabic, but in a word like &quot;persuade&quot; the AXNT feature distinguishes word-initial from word-final stress. This phonological feature is needed for some morphological rules in English. The next nine features are syntactic: &quot;bring&quot; can start multi-word constructions such as &quot;bring about&quot; (MWESTART); it is an infinitival form (INF), and it is inherently irregular (IRREG); its number is PLUR; it subcategorizes as a di-transitive (DITRAN, i.e. it takes two objects), takes an NPING and an NPTOV complement, and is a transitive verb; its tense is PRES. The SYSTEM feature STORED shows that the word is stored in our database rather than resulting from analysis by our affixation and compounding rules.</Paragraph> <Paragraph position="1"> The data structure displayed under the Lexical Systems Analysis (LexSys) is based on UDICT. As shown in the example above for &quot;claim&quot;, UDICT data is an unstructured list of features and attribute-value pairs. This output is then structured into a feature hierarchy according to feature type. There are six categories at the top level: SYNTACTIC, PHONological, MORPHological, SEMANTIC, STYLISTIC, and GRAPHEMIC. Features are then listed under part of speech for each category, and there are up to five levels of depth. This has important implications for feature addition, since the system needs to forbid occurrence of certain features under certain nodes. For example, THATCOMP cannot apply to determiners in English, and MALE cannot be an inherent property of verbs, although a verb could have the contextual property of selecting for MALE arguments.
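The constraint on feature addition just described, forbidding certain features under certain nodes, might be sketched as follows. The constraint table and function names are hypothetical, built only from the two examples in the text; they are not the real LexSys rule set.

```python
# Sketch of feature addition under a category/part-of-speech hierarchy,
# with certain (node, feature) pairs forbidden. Constraint table invented
# from the two examples in the text.

FORBIDDEN = {
    ("determiner", "THATCOMP"),   # THATCOMP cannot apply to determiners
    ("verb", "MALE"),             # MALE cannot be inherent to verbs
}

def add_feature(hierarchy, category, pos, feature):
    """Add a feature under hierarchy[category][pos], refusing forbidden pairs."""
    if (pos, feature) in FORBIDDEN:
        raise ValueError(f"{feature} may not appear under {pos}")
    hierarchy.setdefault(category, {}).setdefault(pos, set()).add(feature)

h = {}
add_feature(h, "SYNTACTIC", "verb", "DITRAN")
```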
The arrangement of the data in a structure also permits efficient querying. Thus, if an application requires only one type of feature, such as phonological or syntactic, this feature set is easily extracted from the larger data structure.</Paragraph> </Section> <Section position="2" start_page="817" end_page="817" type="sub_section"> <SectionTitle> 4.3 Brandeis Codes for &quot;bring&quot; </SectionTitle> <Paragraph position="0"> The Brandeis Codes subcategorize &quot;bring&quot; for direct object (DO). Furthermore, if the verb takes a DO with the preposition &quot;to&quot; (Pl0), then it also takes an NP. If an indirect object is present (IO), then so is a DO. Finally, &quot;bring&quot; will take a DO followed by an indirect object introduced by &quot;to&quot;; this code is not intended to apply to other uses of &quot;to&quot;.</Paragraph> <Paragraph position="1"> Observe that, like the features for UDICT, Brandeis Codes represent the intersection of subcategorization properties of verbs. There are about 900 verbs, 28 features, and 19 prepositions or preposition types. The codes characterize some inherent features (such as &quot;Modal&quot;), control properties, and contextual features (such as ACCING, &quot;accusative followed by -ing phrase&quot;). Cases where combinations of features are required are indicated in the codes.</Paragraph> <Paragraph position="2"> Note also that there is some overlap of information between the Lexical Systems analysis and the Brandeis analysis, such as SUBCAT(TRAN) and DO. This is a clear example of identical information in different systems. By gathering different computational lexicons together into one general repository, we can both eliminate duplication when two systems overlap and increase coverage when they differ.
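The pooling of overlapping lexicons described here might be sketched as follows: collapse known equivalences (such as the SUBCAT(TRAN)/DO overlap noted above) and then take the union, so coverage increases where the sources differ. The equivalence table and function names are illustrative assumptions, not the repository's actual mechanism.

```python
# Sketch of pooling two lexicons into one repository. The equivalence
# table is invented for illustration (Brandeis DO taken as equivalent
# to Lexical Systems SUBCAT(TRAN)).

EQUIV = {"DO": "TRAN"}

def normalize(features):
    """Map source-specific feature names onto shared ones."""
    return {EQUIV.get(f, f) for f in features}

def merge(lexsys_feats, brandeis_feats):
    """Union of both sources after collapsing known equivalences."""
    return normalize(lexsys_feats) | normalize(brandeis_feats)
```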
Of course, we will also need methods for resolving disagreements when they arise.</Paragraph> <Paragraph position="4"> The LDOCE data first gives the headword and part of speech; these two values hold for each subsequent sense. Then entries are broken into sense numbers. In this example, sense one has the grammatical codes &quot;D1&quot; (ditransitive verb used with two or more objects) and &quot;T1&quot; (transitive with one object when used with the prepositions &quot;to&quot; and &quot;for&quot;). There is no subject area (such as &quot;medicine&quot;, &quot;mathematics&quot;, &quot;law&quot;), nor are there any selectional restrictions. Next follows the definition and example sentences, which are included for the purpose of helping the human user. They are not relevant to a computational lexicon except as a potential source of implicit information. (See Atkins et al. 1986.)</Paragraph> </Section> </Section> <Section position="5" start_page="817" end_page="817" type="metho"> <SectionTitle> BIB </SectionTitle> <Paragraph position="0"> Questions were put to us concerning the accuracy and completeness of the LDOCE codes.</Paragraph> <Paragraph position="1"> We decided to undertake an in-depth study of selectional restrictions for subject to get some concrete data on how precise and thorough the LDOCE codes really are. This study is described in the next section.</Paragraph> </Section> </Paper>