<?xml version="1.0" standalone="yes"?> <Paper uid="P87-1027"> <Title>The Derivation of a Grammatically Indexed Lexicon from the Longman Dictionary of Contemporary English</Title> <Section position="4" start_page="193" end_page="194" type="metho"> <SectionTitle> VERBALHEAD {PRD FIN AUX VFORM PAST AGR} NOMINALHEAD {PLU POSS CASE PN COUNT PRD PRO PART NFORM PER} PREPHEAD {PFORM LOC PRD} ADJHEAD {AFORM PRD QUA ADV NUM NEG PART AGR DEF}. </SectionTitle> <Paragraph position="0"> The features appearing on certain categories in addition to the sets defined above are COMP, INV, NEG and SUBCAT, which are relevant to verbal categories; SPEC, DEF and SUBCAT, applicable to nominal categories; GERUND, POSS and SUBCAT for prepositional categories; and SUBCAT alone for adjectival categories. With the exception of SUBCAT, which must be specified for all lexical entries, and the respective head feature sets, the only other features required by the lexical nodes in the grammar are NEG and DEF. Features like SLASH, WH, UB and EVER, which are required by the grammar to implement the GPSG treatment of certain linguistic phenomena, are of no relevance to this paper.</Paragraph> <Paragraph position="1"> The feature set in Figure 2 overleaf defines the information about lexical items which will be required to construct a lexicon compatible both in form and content with the rest of the analysis system. Some of these features (such as FIX) are specific to bound morphemes (these include, for example, entries for "-ative", "-ing" or "-ness"). Other features (for instance WH, REFL) are specific to closed class vocabulary items, such as interrogative, relative and reflexive pronouns. 
Bound morphemes and closed class vocabulary are exhaustively defined in the hand crafted lexicon.</Paragraph> <Paragraph position="2"> However, this lexicon inevitably only contains a few examples of the much larger open class vocabulary. In order for the word and sentence grammars to function correctly, open class vocabulary must be defined in terms of the feature set illustrated overleaf (Figure 2a). The features relevant to the open class vocabulary can be divided into those which are predictable on the basis of the part of speech of the item involved, those which follow from the inflectional or derivational morphological rules incorporated into the system, and those which rely on more specific information than part of speech, but nevertheless must be specified for each individual entry. For example, the values for the features N, V and BAR in the sample entries above follow from the part of speech of "believe". The values of PLU and PER are predictable on the basis of the word grammar rules and need not be independently specified for each entry. On the other hand, the values of SUBCAT and LAT are not predictable from either part of speech or general morphological information.</Paragraph> <Paragraph position="3"> We concentrate on this last class of features, which must be specified on an entry-by-entry basis in any lexicon which is going to be adequate for supporting the analysis system. Within this class of features some (eg. LAT, AT or BARE_ADJ) are only relevant to the word grammar. It is clear that those features that are derivable from the part of speech information are recoverable from virtually any MRD. However, most (if not all) of the features in the third class above are not recoverable from the majority of MRDs. 
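As a concrete illustration of the first, part-of-speech-predictable class of features, the following sketch fills in N, V and BAR from the part of speech and leaves entry-specific features such as SUBCAT and LAT unspecified. The encoding is our own hypothetical one, not the system's actual representation:

```python
# Hypothetical sketch: POS-predictable GPSG category features (N, V, BAR)
# are filled in automatically; entry-specific features (SUBCAT, LAT) must
# be supplied per entry. Feature names follow the paper; the value
# encodings and function are illustrative assumptions.

POS_FEATURES = {
    "verb":        {"N": "-", "V": "+", "BAR": 0},
    "noun":        {"N": "+", "V": "-", "BAR": 0},
    "adjective":   {"N": "+", "V": "+", "BAR": 0},
    "preposition": {"N": "-", "V": "-", "BAR": 0},
}

def base_entry(word, pos):
    """Build a skeleton entry: POS-predictable features are filled in,
    entry-specific ones are left unspecified (None)."""
    entry = dict(POS_FEATURES[pos])
    entry.update({"WORD": word, "SUBCAT": None, "LAT": None})
    return entry

print(base_entry("believe", "verb"))
```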
As indicated above, LDOCE appears to be an exception to this generalisation, because it employs a system of grammatical tagging of major syntactic classes, offering detailed information about subcategorisation, morphological irregularity and broad syntactico-semantic information.</Paragraph> <Paragraph position="5"/> </Section> <Section position="5" start_page="194" end_page="198" type="metho"> <SectionTitle> 3 The source data </SectionTitle> <Paragraph position="0"> It turns out that even though the grammar coding system of LDOCE is not GPSG specific, it encodes much of the information which GPSG requires relating to the subcategorisation classes in the lexicon. The Longman lexicographers have developed a representational system which is capable of describing compactly a variety of data relevant to the task of building a lexicon with grammatical definitions; in particular, they are capable of denoting distinctions between count and mass nouns ("dog" vs. "desire"), predicative, postpositive and attributive adjectives ("asleep" vs. "elect" vs. "jocular"), noun and adjective complementation ("fondness", "fact") and, most importantly, verb complementation and valency.</Paragraph> <Section position="1" start_page="194" end_page="195" type="sub_section"> <SectionTitle> 3.1 The Longman grammar coding system </SectionTitle> <Paragraph position="0"> Grammar codes typically contain a capital letter, followed by a number and, occasionally, a small letter, for example [T5a] or [V3]. The capital letters encode information "about the way a word works in a sentence or about the position it can fill" (Procter, 1978: xxviii); the numbers "give information about the way the rest of a phrase or clause is made up in relation to the word described" (ibid.). For example, "T" denotes a transitive verb with one object, while "5" specifies that what follows the verb must be a that clause. (The small letters, eg. 
"a" in the case above, provide information related to the status of various complementisers, adverbs and prepositions in compound verb constructions: here it indicates that the complementiser is optional.) As another example, "V3" introduces a verb followed by one object and a verb form (V) which must be an infinitive with to (3).</Paragraph> <Paragraph position="1"> In addition, codes can be qualified with words or phrases which provide further information concerning the linguistic context in which the described item is likely, and able, to occur; for example [D1(to)] or [L(to be)1]. Sets of codes, separated by semicolons, are associated with individual word senses in the lexical entry for a particular item, as the entry for "feel", with extracts from its printed form shown in Figure 3, illustrates. These sets are elided and abbreviated in the code field associated with the word sense to save space in the dictionary. Partial codes sharing an initial letter can be separated by commas, for example [T1,5a]. Word qualifiers relating to a complete sequence of codes can occur at the end of a code field, delimited by a colon, for example [T1;I0: (DOWN)].</Paragraph> <Paragraph position="2"> feel v 1 [T1,6] to get the knowledge of by touching with the fingers: ... 2 [Wv6;T1] to experience (the touch or movement of something): ... 3 [L7] to experience (a condition of the mind or body); be consciously: ... 4 [L1] to seem to oneself to be: ... 5 [T1,5;V3] to believe, esp. for the moment: ... 6 [L7] to give (a sensation): ... 7 [Wv6;I0] to (be able to) experience sensations: ... 8 [Wv6;T1] to suffer because of (a state or event): ... 9 [L9 (after, for)] to search with the fingers rather than with the eyes: ...</Paragraph> <Paragraph position="3"> This apparent formal syntax for describing grammatical information in a compact form occasionally breaks down: different classes of error occur in the tagging of word senses. 
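Before turning to the errors, the regular core of this code syntax can be captured by a small parser. The following sketch (our own simplification, not the actual transformation program) expands a code field into individual codes, undoing the comma elision and splitting off a field-final qualifier:

```python
import re

def parse_code_field(field):
    """Expand an LDOCE grammar code field such as "[T1,5a;V3]" or
    "[T1;I0: (DOWN)]" into a list of full codes plus an optional
    field-final qualifier. Illustrative sketch of the conventions
    described in the text, not the authors' program."""
    field = field.strip("[]")
    qualifier = None
    if ":" in field:                       # field-final qualifier, e.g. ": (DOWN)"
        field, qualifier = field.split(":", 1)
        qualifier = qualifier.strip()
    codes = []
    for group in field.split(";"):         # full codes separated by semicolons
        parts = [p.strip() for p in group.split(",")]
        m = re.match(r"([A-Z][a-z]*)", parts[0])
        letter = m.group(1) if m else ""
        codes.append(parts[0])
        for p in parts[1:]:                # elided codes share the initial letter
            codes.append(p if p[:1].isupper() else letter + p)
    return codes, qualifier

print(parse_code_field("[T1,5a;V3]"))       # (['T1', 'T5a', 'V3'], None)
print(parse_code_field("[T1;I0: (DOWN)]"))  # (['T1', 'I0'], '(DOWN)')
```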
These include, for example, misplaced commas or colon delimiters and occasional migration of other lexical information (e.g. usage labels) into the grammar code fields.</Paragraph> <Paragraph position="4"> This type of error and inconsistency arises because grammar codes are constructed by hand and no automatic checking procedure is attempted (Michiels, 1982). They provide much of the motivation for our interactive approach to lexicon development, since any attempt at batch processing without extensive user intervention would inevitably result in an incomplete and inaccurate lexicon.</Paragraph> <Paragraph position="5"> 3.2 Making use of the grammar codes The program which transforms the LDOCE grammar codes into lexical entries utilisable by the analyser first produces a relatively theory-neutral representation of the lexical entry for a particular word. As an illustration of the process of transforming a dictionary entry into a lexical template, we show below the mapping of the third verb sense of "believe" into a lexical entry incorporating information about the grammatical category, syntactic subcategorisation frames and semantic type of the verb -- for example, a label like (Type 2 ORaising) indicates that under the given sense the verb is a two-place predicate and that if it occurs with a syntactic direct object, this will function as the logical subject of the predicate complement.</Paragraph> <Paragraph position="6"> be·lieve ... v 1 [I0] to have a firm religious faith 2 [T1] to consider to be true or honest: to believe someone | to believe someone's reports 3 [T5a,b;V3;X (to be) 1, (to be) 7] to hold as an opinion; suppose: I believe he has come. | He has come, I believe. | "Has he come?" "I believe so." | I believe him to have done it. 
| I believe him (to be) honest. This resulting structure is a lexical template, designed as a formal representation for the kind of syntactico-semantic information which can be extracted from the dictionary and which is relevant to a system for automatic morphological and syntactic analysis of English texts.</Paragraph> <Paragraph position="7"> The overall transformation strategy employed by our system attempts to derive both subcategorisation frames relevant to a particular word sense and information about the semantic nature (i.e. the predicate-argument structure and the logical type) of, especially, verbs. In the main, the code numbers determine a unique subcategorisation. However, such semantic information is not explicitly encoded in the LDOCE grammar codes, so we have adopted an approach attempting to deduce a semantic classification of the particular sense of the verb under consideration on the basis of the complete set of codes assigned to that sense. In any subcategorisation frame which involves a predicate complement there will be a non-transparent relationship between the superficial syntactic form and the underlying logical relations in the sentence. In these situations the parser can use the semantic type of the verb to compute this relationship. Expanding on a suggestion of Michiels (1982), we classify verbs as subject equi (SEqui), object equi (OEqui), subject raising (SRaising) or object raising (ORaising) for each sense which has a predicate complement code associated with it. These terms, which derive from Transformational Grammar, are used as convenient labels for what we regard as a semantic distinction.</Paragraph> <Paragraph position="8"> The five rules which are applied to the grammar codes associated with a verb sense are ordered in a way which reflects the filtering of the verb sense through a series of syntactic tests. Verb senses with an [it+T5] code are classified as SRaising. 
Next, verb senses which contain a [V] or [X] code and one of the [D5], [D5a], [D6] or [D6a] codes are classified as OEqui. Then, verb senses which contain a [V] or [X] code and a [T5] or [T5a] code in the associated grammar code field (but none of the D codes mentioned above) are classified as ORaising. Verb senses with a [V] or [X (to be)] code (but no [T5] or [T5a] codes) are classified as OEqui. Finally, verb senses containing a [T2], [T3] or [T4] code, or an [I2], [I3] or [I4] code, are classified as SEqui. Below we give examples of each type; for a detailed description see Boguraev and Briscoe (1987).</Paragraph> <Paragraph position="9"> cluster within the features and feature set declarations used by the dictionary and grammar projects. A comparison of the existing entries for "believe" in the hand crafted lexicon (Figure 1) and the third word sense for "believe" extracted from LDOCE demonstrates that much of the information available from LDOCE is of direct utility -- for example, the SUBCAT values can be derived by an analysis of the Takes values and the ORaising logical type specification above. Indeed, we have demonstrated the feasibility (Alshawi et al., 1985) of driving a parsing system directly from the information available in LDOCE by constructing dictionary entries for the PATR-II system (Shieber, 1984). 
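The five ordered rules can be sketched as a simple filtering function. The code normalisation and rule encoding below are our own simplification of the procedure described in the text, not the authors' implementation:

```python
def classify_verb_sense(codes):
    """Ordered filtering of a verb sense's grammar codes into a semantic
    type, following the five rules described in the text. `codes` is a
    set of normalised LDOCE codes, e.g. {"V3", "T5a"}. Illustrative
    sketch only."""
    has = lambda *cs: any(c in codes for c in cs)
    has_VX = any(c.startswith(("V", "X")) for c in codes)
    if has("it+T5"):                              # e.g. "it seems that ..."
        return "SRaising"
    if has_VX and has("D5", "D5a", "D6", "D6a"):
        return "OEqui"
    if has_VX and has("T5", "T5a"):
        return "ORaising"
    if has_VX:                                    # [V] or [X (to be)] alone
        return "OEqui"
    if has("T2", "T3", "T4", "I2", "I3", "I4"):
        return "SEqui"
    return None                                   # no predicate complement code

# Third sense of "believe": [T5a,b;V3;X (to be) 1, (to be) 7]
print(classify_verb_sense({"T5a", "V3", "X (to be) 1"}))  # → ORaising
```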
It is also clear, however, that it is unrealistic to expect that on the basis of only the information available in the machine-readable source we will be able to derive a fully fleshed out lexical entry, capable of fulfilling all the run-time requirements of the analysis system that the lexicon under construction here is intended for.</Paragraph> </Section> <Section position="2" start_page="195" end_page="198" type="sub_section"> <SectionTitle> 3.3 Utility of LDOCE </SectionTitle> <Paragraph position="0"> for automatic lexicon generation Firstly, the information recoverable from LDOCE which is of direct utility is not totally reliable. Errors of omission and assignment occur in the dictionary; for example, the entry for "consider" (Figure 5) lacks a code allowing it to function in frames with a sentential complement (eg. I consider that it is a great honour to be here). The entry for "expect", on the other hand, spuriously separates two very similar word senses (1 and 5), assigning them different grammar codes.</Paragraph> <Paragraph position="1"> con·sid·er ... 2 [Wv5; X (to be) 1,7; V3] to regard as; think of in a stated way: I consider you a fool (= I regard you as a fool). | I consider it a great honour to be here with you today. | The old man considered me (to be) too lazy to be a good worker. | The Shetland Islands are usually considered a part of Scotland .........</Paragraph> <Paragraph position="2"> expect ... 1 [T3,5a,b] to think (that something will happen): I expect (that) he'll pass the examination. | He expects to fail the examination. | "Will she come soon?" "I expect so." ........ 5 [V3] to believe, hope and think (that someone will do something): The officer expected the men to do their duty in the coming battle ....... acknowledge ... 1 [T1,4,5 (to)] to agree to the truth of; recognise the fact or existence (of): I acknowledge the truth of your statement. 
| They acknowledged (to us) that they were defeated | They acknowledged having been defeated 2 [T1 (as); X (to be) 1,7] to recognise, accept, or admit (as): He was acknowledged to be the best player. | They acknowledged ... Errors like these ultimately cause the transformation program to fail in the mapping of grammar codes to feature clusters. We have limited our use of LDOCE to verb entries because these appear to be coded most carefully. However, the techniques outlined here are equally applicable to other open class items.</Paragraph> <Paragraph position="3"> Furthermore, since some of the information required is only recoverable on the basis of a comparison of codes within a word sense specified in the source dictionary, additional errors can be introduced. For example, we assign ORaising to verbs which contain subcategorisation frames for a sentential complement, a noun phrase object and an infinitive complement within the same sense. However, this rule breaks down in the case of an entry such as "acknowledge", where the two codes corresponding to different subcategorisation frames are split between two (spuriously separated) word senses (Figure 6), and consequently incorrectly assigns OEqui to this verb. The rule similarly breaks down for "consider", which is incorrectly assigned the logical type of an Equi verb.</Paragraph> <Paragraph position="4"> We have tested the classification of verbs into semantic types using a verb list of 139 pre-classified items available in various published sources (eg. Stockwell et al., 1973). The overall error rate in the process of grammar code analysis and transformation was 14%; however, the rules discussed above classify verbs into SRaising, SEqui and OEqui very successfully.</Paragraph> <Paragraph position="5"> The main source of error comes from the misclassification of ORaising into OEqui verbs. 
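An evaluation of this kind reduces to comparing predicted semantic types against the pre-classified list and tallying the confusion pairs. The sketch below illustrates the bookkeeping with invented toy data; the actual 139-verb test set is not reproduced here:

```python
def error_report(gold, predicted):
    """Compare predicted semantic types against a pre-classified verb
    list; return the overall error rate and a tally of confusion pairs
    (gold type, predicted type). Illustrative bookkeeping only."""
    confusions = {}
    errors = 0
    for verb, g in gold.items():
        p = predicted.get(verb)
        if p != g:
            errors += 1
            confusions[(g, p)] = confusions.get((g, p), 0) + 1
    return errors / len(gold), confusions

# Toy data, not the published test set.
gold = {"seem": "SRaising", "believe": "ORaising", "persuade": "OEqui"}
pred = {"seem": "SRaising", "believe": "OEqui", "persuade": "OEqui"}
rate, conf = error_report(gold, pred)
print(rate, conf)   # one ORaising verb misclassified as OEqui
```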
This was confirmed by another test, involving applying the rules for determining the semantic types of verbs over the 7,965 verb entries in LDOCE. The resulting lists, assigning the 719 verb senses which have the potential for predicate complementation to appropriate semantic classes, confirm that errors in our procedure are mostly localised to the (mis)application of the ORaising rule. Arguably, these errors derive mostly from errors in the dictionary, rather than a defect of the rule; see Boguraev and Briscoe (1987) for further discussion.</Paragraph> <Paragraph position="6"> Secondly, the analysis system requires information which is simply not encoded in the LDOCE entries; for example, the morphological features AT, LAT and BARE_ADJ are not there. This type of feature is critical to the analysis of derivational variants, and such information is necessary for the correct application of the word grammar. Otherwise many morphologically productive, but non-existent, lexical forms will be defined and be potentially analysable by the lexicon system. Therefore, lexical templates are not converted directly to target lexical entries, but form the input to a second phase in which errors and inadequacies in the source are corrected.</Paragraph> <Paragraph position="7"> 4 A methodology and a system for lexicon development In order to provide for fast, simple, but accurate development of a lexicon for the analysis system, we have implemented a software environment which is integrated with the transformation program described above and which offers an integrated morphological generation package and editing facilities for the semi-automatic production of the target lexicon. The system is designed on the assumption that no machine-readable dictionary can provide a complete, consistent, and totally accurate source of lexical information. 
Therefore, rather than batch process the MRD source, the lexicon development software is based around the concept of semi-automatic and rapid construction of entries, involving the continuous intervention of the end user, typically a linguist / lexicographer.</Paragraph> <Paragraph position="8"> In the course of an interactive cycle of development, a number of entries are hypothesised and automatically generated from a single base form. The family of related surface forms is output by the morphological generator, which employs the same word grammar used for inflectional and derivational morphology by the analysis system and creates new entries by adding affixes to the base form in legitimate ways. The generation and refinement of new entries is based on repeated application of the morphological generator to suitable base forms, followed by user intervention involving either rejecting, or minimally editing, the surface forms proposed by the system. Below we sketch a typical pattern of use.</Paragraph> <Paragraph position="9"> If the user asks the system to create an entry for 'believe', the transformation program described in section 3.2 (see Figure 4) will create an entry which contains all the syntactic information specified in Figure 1. In addition, many surface forms with associated grammatical definitions will be generated automatically: cobelieve overbelieve subbelieve believed disbelieve postbelieve unbelieve believe interbelieve prebelieve underbelieve believer misbelieve rebelieve believable believing outbelieve semibelieve believal believes The system generates these forms from the base entry in batches and displays the results in syntactic frames associated with subcategorisation possibilities. These frames, which are used to tap the user's grammaticality judgements, are as semantically 'bleached' as possible, so that they will be as compatible as possible with the semantic restrictions that verbs place on their arguments. 
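The candidate-generation step can be caricatured as affixation over the base form. The affix lists and the crude spelling rule below are illustrative only; the real system drives the word grammar used by the analyser rather than concatenating strings:

```python
# Sketch of candidate generation: apply prefixes and suffixes to a base
# form to propose surface forms for the user to accept, reject or edit.
# Affix inventories and the naive e-deletion rule are our own assumptions.

PREFIXES = ["co", "over", "sub", "dis", "post", "un", "inter",
            "pre", "under", "mis", "re", "out"]
SUFFIXES = ["ed", "er", "ing", "able", "al", "s"]

def candidate_forms(base):
    """Return prefixed and suffixed candidate surface forms for `base`,
    including deliberately overgenerated, non-existent forms."""
    forms = [p + base for p in PREFIXES]
    stem = base[:-1] if base.endswith("e") else base   # crude e-deletion
    for s in SUFFIXES:
        forms.append((stem if s[0] in "aei" else base) + s)
    return forms

print(candidate_forms("believe"))
```

Overgenerated forms such as "believal" are exactly the ones the user is expected to reject or edit in the interactive cycle.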
Each possible SUBCAT feature value in the grammar is associated with such frames. Internally, frames are more complex than illustrated above. Surface phrasal forms with marked slots in them are associated with more detailed feature specifications of lexical categories which are compatible with the fully instantiated lexical items allowed by the grammar to fill the slots. Such detailed frame specifications are automatically generated on the basis of syntactic analysis of sentences made up from the frame phrase skeleton with valid lexical items substituted for the blank slot filler. Figure 9 below shows a fragment of the system's inventory of frames.</Paragraph> <Paragraph position="10"> They ___ that someone is something.</Paragraph> <Paragraph position="11"> [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM NORM, PER 3, PLU +, COUNT +, CASE NOM], SUBCAT SFIN] They ___ someone to be something. [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM NORM, PER 3, PLU +, COUNT +, CASE NOM], SUBCAT OE] [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM NORM, PER 3, PLU +, COUNT +, CASE NOM], SUBCAT OR] [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM NORM, PER 3, PLU +, COUNT +, CASE NOM], SUBCAT SE2] They ___ there to be a problem. [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM NORM, PER 3, PLU +, COUNT +, CASE NOM], SUBCAT SR] [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM NORM, PER 3, PLU +, COUNT +, CASE NOM], SUBCAT OR] * They ___ there to be a problem. [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM NORM, PER 3, PLU +, COUNT +, CASE NOM], SUBCAT OE] The system ensures that slots in syntactic frames are filled by surface forms which have the syntactic features the sentence grammar requires. 
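The inventory can be thought of as a table from SUBCAT values to semantically 'bleached' frame skeletons, each instantiated by substituting the candidate verb into the slot. The data structure below, and the particular SUBCAT-to-frame mapping (the value names SFIN, OE, OR, SE2 included), are our own illustration rather than the system's internal representation:

```python
# Hypothetical sketch of the frame inventory: each SUBCAT value maps to
# one or more bleached frame skeletons with a slot ("___") for the verb.
# Frame wording mirrors the examples in the text; SUBCAT names are
# partly assumed.

FRAMES = {
    "SFIN": ["They ___ that someone is something."],
    "OE":   ["They ___ someone to be something."],
    "OR":   ["They ___ someone to be something.",
             "They ___ there to be a problem."],
    "SE2":  ["They ___ someone to be something."],
}

def instantiate(verb, subcat):
    """Fill the slot of every frame associated with a SUBCAT value,
    producing the sentences shown to the user for a grammaticality
    judgement."""
    return [f.replace("___", verb) for f in FRAMES[subcat]]

for sentence in instantiate("believe", "OR"):
    print(sentence)
```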
Displaying such instantiated frames provides a double check, both on the outright correctness of the surface form and on the correctness of a surface form paired with a particular definition. For example, the user can reject They overbelieve that someone is something completely, but They believer that someone is something is indicative of an incorrect definition, rather than surface form. Syntactic frames encoding other 'transformational' possibilities are often associated with particular SUBCAT values, since these provide the user with more helpful data to accept or reject a particular assignment. Thus, for example, selecting between Raising and OEqui verbs is made easier if the frames for [SUBCAT OR] are instantiated simultaneously: They ___ someone to be something / They ___ there to be a problem. The user has two broad options: to reject a set of frames and associated surface form outright, or to edit either the surface form or definition associated with a set of frames. Exercising the first option causes all instances of the surface form and associated syntactic frames to be removed from the screen and from further consideration by the user. However, this action has no effect on the eventual output of the system, so these morphologically productive but non-existent forms and definitions will still be implicit in the lexicon and morphology component of the English analyser. It is assumed that this overgeneration is harmless though, because such forms will not occur in actual input.</Paragraph> <Paragraph position="13"> Editing a surface form or associated definition results in a new (non-productive) entry which will form part of the system's output, to be included as an independent irregular entry in the target lexicon. If the user edits a surface form, the edited version is substituted in all the relevant syntactic frames. 
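The accept / reject / edit cycle itself can be sketched as follows. Here `judge` stands in for the user's response to the displayed frames, and the whole protocol is our own simplification of the interface described above:

```python
# Sketch of the accept / reject / edit cycle. `judge(form, definition)`
# returns "ok", "reject", or an edited surface form. Rejected forms are
# dropped from consideration (though they remain implicit in the
# productive morphology); edited forms become new irregular entries.
# Names and protocol are illustrative assumptions.

def review(candidates, judge):
    lexicon = []
    for form, definition in candidates:
        verdict = judge(form, definition)
        if verdict == "reject":
            continue                 # productive but non-existent form
        if verdict != "ok":          # verdict is an edited surface form,
            form = verdict           # stored as an indivisible morpheme
        lexicon.append((form, definition))
    return lexicon

# Toy interaction: "believal" is edited to "belief", "overbelieve" rejected.
cands = [("believal", "nominal of believe"), ("overbelieve", "verb")]
edits = {"believal": "belief", "overbelieve": "reject"}
print(review(cands, lambda f, d: edits.get(f, "ok")))
# [('belief', 'nominal of believe')]
```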
Provided the user is satisfied with the modified frames, a new entry is created with the new surface form, treated as an indivisible morpheme, and paired with the existing definition. Similarly, if the user edits a definition associated with a set of syntactic frames, a new set of frames will be constructed and, if he or she is happy with these, a new entry will be created with the existing surface form and modified definition. (The English analyser can be run in a mode where non-productive separate entries are 'preferred' to productive ones.) The user can modify both the surface form and the associated definition during one interaction with a particular potential entry; for example, the definition for "believal" contains both an incorrect surface form and definition for a nominal form of the base form "believe". After the associated syntactic frames are displayed to the user, instead of rejecting the entire entry at this point, he or she can modify the surface form to create a new entry for "belief" -- a process which results in the revised syntactic frames: The user now has three options. Rejecting the third syntactic frame, or alternatively deleting the associated sub-entry with a [SUBCAT OR] feature definition, followed by confirmation, will result in the construction of a new entry for the lexicon. The third option, should the user decide that nominal forms never take OR complements, is to edit the morphological rules themselves. This option is more radical and would presumably only be exercised when the user was certain about the linguistic data.</Paragraph> <Paragraph position="14"> The system described so far allows the semi-automatic, computer-aided production of base entries and irregular, non-productive derived entries</Paragraph> <Paragraph position="15"> on the basis of selection and editing of candidate surface forms and definitions thrown up by the derivational generator. 
However, this approach is only as good as the initial base entry constructed from LDOCE. If the base entry is inadequate, the predictions produced by the generator are likely to be inadequate too. This will result in too much editing for the system to be much help in the rapid production of a sizeable lexicon. Fortunately, the system of syntactic frames and editing facilities outlined above can also be used to refine base entries and make up for inadequacies in the LDOCE grammar code system (from the perspective of the target grammar). For example, LDOCE encodes transitivity adequately but does not represent systematically whether a particular transitive verb has a passive form. In the target grammar, there are two SUBCAT values, NP and NOPASS, which distinguish these types of verb. Therefore, all verbs with a transitive LDOCE code are inserted into the two sets of syntactic frames shown below. When these frames are instantiated with particular verbs, rejection of one or the other is enough to refine the LDOCE code to the appropriate SUBCAT value. For example, the instantiated frames for "cost" are: The fact that "cost" does not fit into the NP passive (second) frame, behaving in a way compatible with the NOPASS predictions, means it acquires a NOPASS SUBCAT value. Since these frames will be displayed first and the operation changes the base entry, subsequent forms and definitions generated by the system will be based on the new edited base entry.</Paragraph> <Paragraph position="16"> This example also highlights one of the inherent problems in our approach to lexicon development. Syntactic frames are used in preference to direct perusal of definitions in terms of feature lists, to speed up lexicon development by tapping the user's grammaticality judgements directly and to reduce the amount of editing and keyboard input. They also provide the user with a degree of insulation from the technical details of the morphological and syntactic formalism. 
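The NP / NOPASS refinement amounts to instantiating a verb in an active and a passive frame and letting rejection of the passive frame select NOPASS. Frame wording, naive morphology, and function names below are illustrative assumptions, not the system's actual frames:

```python
# Sketch of refining a transitive LDOCE code to SUBCAT NP or NOPASS.
# `accepts` stands in for the user's grammaticality judgement on an
# instantiated frame; the frame strings and the crude "-ed" passive
# participle are our own simplifications.

ACTIVE = "They ___ something."
PASSIVE = "Something is ___ed by them."

def refine_transitive(verb, accepts):
    active_ok = accepts(ACTIVE.replace("___", verb))
    passive_ok = accepts(PASSIVE.replace("___", verb))
    if active_ok and not passive_ok:
        return "NOPASS"              # e.g. "cost": no passive form
    return "NP" if active_ok else None

# Toy judgement: the passive frame for "cost" is rejected.
judgements = {"Something is costed by them.": False}
print(refine_transitive("cost", lambda s: judgements.get(s, True)))  # NOPASS
```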
However, semantically 'bleached' frames can lead to confusion when they interact with word sense ambiguity. For example, "weigh" has two senses, one of which allows passive and one of which does not (compare The baby was weighed by the doctor with * Ten pounds was weighed by the baby).</Paragraph> <Paragraph position="17"> Unfortunately, the syntactic frames given for NP / NOPASS are not 'bleached' enough, because they tend to select the sense of "weigh" which does allow passive. The example raises wider issues about the integration of some treatment of word meaning with the production of such a lexicon. These issues go beyond this paper, but the problem illustrated demonstrates that the type of techniques we have described are heuristic aids rather than failsafe procedures for the rapid construction of a sizeable and accurate lexicon from a machine-readable dictionary of variable accuracy and consistency.</Paragraph> </Section> </Section> </Paper>