<?xml version="1.0" standalone="yes"?> <Paper uid="C92-2081"> <Title>The Automatic Creation of Lexical Entries for a Multilingual MT System</Title> <Section position="5" start_page="0" end_page="0" type="concl"> <SectionTitle> LDOCE </SectionTitle> <Paragraph position="0"> The Longman Dictionary of Contemporary English \[Procter et al., 1978\] is a full-sized dictionary designed for learners of English as a second language. It contains 41,122 headword entries, defined in terms of 72,177 word senses, in machine-readable form (a type-setting tape).</Paragraph> <Paragraph position="1"> With few exceptions, the definitions in LDOCE are stated using a control vocabulary of approximately 2,000 words. The control vocabulary words tend to be highly ambiguous (approximately 17,000 senses are listed in LDOCE for the 2,000 spelling forms).</Paragraph> <Paragraph position="2"> Both the book and tape versions of LDOCE use a system of grammatical codes of about 110 syntactic (sub)categories which vary in generality. Nouns, for example, may be assigned categories such as noun, or count-noun, or count-noun-followed-by-infinitive-with-TO, or vocative-noun-used-in-direct-address. The syntactic categories for verbs are particularly extensive and include categories such as transitive-verb-followed-by-the-infinitive-without-TO. <!-- Actes de COLING-92, Nantes, 23-28 août 1992 / Proc. of COLING-92, Nantes, Aug. 23-28, 1992, p. 535 --> In addition, the machine-readable version of LDOCE contains codes which are not found in the book, and among them are codes which specify the semantic class of a noun (as one of 34 categories) and the semantic preferences on the complements of verbs and adjectives.</Paragraph> <Paragraph position="3"> From LDOCE to a Partially Specified Entry. The mapping process from LDOCE to ULTRA word sense entries assumes a particular linguistic context. 
All the information contained in the LDOCE definition is automatically extracted and used in the appropriate ULTRA specification. For some parts of speech (e.g., nouns), most of the information stored in the interlingual entry can be extracted automatically; for others (e.g., verbs and adjectives), only a portion of the information is available.</Paragraph> <Paragraph position="4"> For this project we began with a Lisp version of LDOCE, which formats the information from the type-setting tape \[Boguraev et al., 1987\]. To date, we have extracted information from LDOCE nouns for specifying IR entries for entities, from verbs and adjectives for specifying IR entries for relations, and from adverbs for specifying IR entries for relation modifiers and proposition modifiers. These are the major open-class categories of IR word sense tokens and constitute over 95% of the tokens defined thus far. Below we summarize the information required by the categories corresponding to nouns and to verbs (the information which is currently provided automatically is marked by @).</Paragraph> <Paragraph position="5"> Entities: @ the sense token indexes a corresponding LDOCE word sense definition, @ whether it is a class term, the name of an individual, or an anaphoric element, @ whether it is countable or not,</Paragraph> <Paragraph position="7"> Below is a sample screen of the interactive session for completing the IR lexical entry for one sense of &quot;bank&quot; in LDOCE. The first screen is created automatically and completed manually to produce the second screen. Note that for entities (nouns) only one feature, described above as &quot;the semantic class,&quot; is not provided automatically from LDOCE. This field corresponds to the semantic categories used in ULTRA prior to the use of LDOCE for automatic extraction. These categories were hand-crafted, based on surface linguistic phenomena, and are used to satisfy the semantic preferences of adjectives and verbs. 
The automatically created entries for entities contain the LDOCE semantic categories as well, but these will not be used by ULTRA until we have examined the consistency of the LDOCE categories as a basis for semantic preferences. Relations: fillers of the case roles, @ the LDOCE subject domain; In the case of relations, LDOCE does not provide case roles or semantic classes (for verbs), or a direct marking as to whether a verb is stative or dynamic. We have developed a verb hierarchy from LDOCE, based on the genus (hypernym) of a verb definition, and are in the process of disambiguating the terms in this hierarchy. These then will be used as the verb classes for ULTRA's relations. We have been able to extract case role information in some cases \[Wilks et al., 90\] from implicit information in Longman's and will include this in the lexical entries. Again, the semantic preferences for the fillers of the case roles are those originally used in ULTRA. As in the case of entities above, the LDOCE semantic preferences are also included in the entry for future use.</Paragraph> <Paragraph position="8"> Extraction is performed by applying a sequence of flex programs (a new-generation version of the UNIX lexical analyzer utility, lex) which transform information from the LDOCE Lisp format into a Lisp association list, the data structure used by the interactive lexical entry interface for the ULTRA system (sample screens appear in the previous section).</Paragraph> <Paragraph position="9"> The word senses added to the ULTRA system using these techniques were chosen first on the basis of whether they were exemplified in the dictionary entry, and second, whether they were one of the first three senses of a given homonym (the LDOCE senses are listed in order of frequency of use). Files containing the definitions of all noun, verb, adverb and adjective senses for which there were example sentences were first automatically generated. 
An additional file containing example sentences tagged by the word sense being exemplified was also created. Next, association lists corresponding to IR entries for each of the word senses were generated. Finally, another procedure was applied which automatically supplied a pointer to the example context in the example sentence file.</Paragraph> <Paragraph position="10"> 4. Approaches to Achieving Full Specification. It was clear at the outset of this project that a great deal of lexical acquisition could be done automatically, and we have initiated projects to investigate whether the missing information can be identified automatically through further analysis of the definitions, examples, grammatical categories, etc.</Paragraph> <Paragraph position="11"> Finally, in order to automate the construction of lexical items fully on the fly during translation, procedures must be defined to select specific senses on the basis of the source language linguistic context of the item being defined. Similarly, procedures must be developed to automatically specify the different language-particular lexical entries (these procedures do exist in English to a limited extent), and these must be adapted to other languages.</Paragraph> <Paragraph position="12"> Finally, techniques for using bilingual dictionaries in the language-specific lexical specification process must be developed.</Paragraph> </Section> </Paper>