File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/w00-1427_metho.xml
Size: 19,254 bytes
Last Modified: 2025-10-06 14:07:32
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1427"> <Title>Robust, Applied Morphological Generation</Title> <Section position="3" start_page="201" end_page="203" type="metho"> <SectionTitle> 2 Morphological Generation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="201" end_page="202" type="sub_section"> <SectionTitle> 2.1 The Generator </SectionTitle> <Paragraph position="0"> The morphological generator covers the productive English affixes s for the plural form of nouns and the third person singular present tense of verbs, and ed for the past tense, en for the past participle, and ing for the present participle forms of verbs. 1 The generator is implemented in Flex.</Paragraph> <Paragraph position="1"> The standard use of Flex is to construct 'scanners', programs that recognise lexical patterns in text (Levine et al., 1992). A Flex description--the high-level description of a scanner that Flex takes as input--consists of a set of 'rules': pairs of regular expression patterns (which Flex compiles into deterministic finite-state automata (Aho et al., 1986)) and actions consisting of arbitrary C code. Flex creates as output a C program which at run-time scans a text looking for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code. Flex is part of the Berkeley Unix distribution and as a result Flex programs are very portable. The standard version of Flex works with any ISO-8859 character set; Unicode support is also available.</Paragraph> <Paragraph position="2"> The morphological generator expects to receive as input a sequence of tokens of the form lemma+inflection_label, where lemma specifies the lemma of the word form to be generated, inflection specifies the type of inflection (i.e. s, ed, en or ing), and label specifies the PoS of the word form. 
The PoS labels follow the same pattern as in the Lancaster CLAWS tag sets (Garside et al., 1987; Burnard, 1995), with noun tags starting with N, etc. The symbols + and _ are delimiters.</Paragraph> <Paragraph position="3"> An example of a morphological generator rule is given in (1).</Paragraph> <Paragraph position="4"> 1 We do not currently cover comparative and superlative forms of adjectives or adverbs since their productivity is much less predictable.</Paragraph> <Paragraph position="5"> The left-hand side of the rule is a regular expression. The braces signify exactly one occurrence of an element of the character set abbreviated by the symbol A; we assume here that A abbreviates the upper and lower case letters of the alphabet. The next symbol + specifies that there must be a sequence of one or more characters, each belonging to the character set abbreviated by A. Double quotes indicate literal character symbols. The right-hand side of the rule gives the C code to be executed when an input string matches the regular expression. When the Flex rule matches the input address+s_N, for example, the C function np_word_form (defined elsewhere in the generator) is called to determine the word form corresponding to the input: the function deletes the inflection type and PoS label specifications and the delimiters, removes the last character of the lemma, and finally attaches the characters es; the word form generated is thus addresses.</Paragraph> <Paragraph position="6"> Of course not all plural noun inflections are correctly generated by the rule in (1) since there are many irregularities and subregularities. These are dealt with using additional, more specific, rules. The order in which these rules are applied to the input follows exactly the order in which the rules appear in the Flex description. This makes for a very simple and perspicuous way to express generalizations and exceptions. 
For instance, the rule in (2) generates the plural form of many English nouns that originate from Latin, such as stimulus.</Paragraph> <Paragraph position="7"> (2)</Paragraph> <Paragraph position="9"> With the input stimulus+s_N, the output is stimuli rather than the incorrect *stimuluses that would follow from the application of the more general rule in (1). By ensuring that this rule precedes the rule in (1) in the description, nouns such as stimulus get the correct plural form inflection. Some other words in this class, though, do not have the Latinate plural form (e.g. *boni as a plural form of bonus); in these cases the generator contains rules specifying the correct forms as exceptions.</Paragraph> </Section> <Section position="2" start_page="202" end_page="202" type="sub_section"> <SectionTitle> 2.2 Inflectional Preferences </SectionTitle> <Paragraph position="0"> The rules constituting the generator do not necessarily have to be mutually exclusive, so they can be used to capture the inflectional morphology of lemmata that have more than one possible inflected form given a specific PoS label and inflectional type. An example of this is the multiple inflections of the noun cactus, which has not only the Latinate plural form cacti but also the English plural form cactuses. In addition, inflections of some words differ according to dialect. For example, the past participle form of the verb to bear is borne in British English, whereas in American English the preferred word form is born.</Paragraph> <Paragraph position="1"> In cases where there is more than one possible inflection for a particular input lemma, the order of the rules in the Flex description determines the inflectional preference. For example, with the noun cactus, the fact that the rule in (2) precedes the one in (1) causes the generator to output the word form cacti rather than cactuses even though both rules are applicable. 
2 It is important to note, though, that the generator will always choose between multiple inflections: there is no way for it to output all possible word forms for a particular input. 3</Paragraph> </Section> <Section position="3" start_page="202" end_page="202" type="sub_section"> <SectionTitle> 2.3 Consonant Doubling </SectionTitle> <Paragraph position="0"> An important issue concerning morphological generation that is closely related to that of inflectional preference is consonant doubling.</Paragraph> <Paragraph position="1"> This phenomenon, occurring mainly in British English, involves the doubling of a consonant at the end of a lemma when the lemma is inflected. For example, the past tense/participle inflection of the verb to travel is travelled in British English, where the final consonant of the lemma is doubled before the suffix is attached.</Paragraph> <Paragraph position="2"> 2 Rule choice based on ordering in the description can in fact be overridden by arranging for the second or subsequent match to cover a larger part of the input so that the longest match heuristic applies (Levine et al., 1992). But note that the rules in (1) and (2) will always match the same input span.</Paragraph> <Paragraph position="3"> 3 Flex does not allow the use of rules that have identical left-hand side regular expressions.</Paragraph> <Paragraph position="4"> In American English the past tense/participle inflection of the verb to travel is usually spelt traveled. Consonant doubling is triggered on the basis of both orthographic and phonological information: when a word ends in one vowel followed by one consonant and the last part of the word is stressed, in general the consonant is doubled (Procter, 1995). 
However, there are exceptions to this, and in any case the input to the morphological generator does not contain information about stress.</Paragraph> <Paragraph position="5"> Consider the Flex rule in (3), where the symbols C and V abbreviate the character sets consisting of (upper and lower case) consonants and vowels, respectively.</Paragraph> <Paragraph position="6"> (3) {A}*{C}{V}&quot;t+ed_V&quot; {return(cb_word_form(0,&quot;t&quot;,&quot;ed&quot;));} Given the input submit+ed_V this rule correctly generates submitted. However, the verb to exhibit does not undergo consonant doubling so this rule will generate, incorrectly, the word form exhibitted.</Paragraph> <Paragraph position="7"> In order to ensure that the correct inflection of a verb is generated, the morphological generator uses a list of (around 1,100) lemmata that allow consonant doubling, extracted automatically from the British National Corpus (BNC; Burnard, 1995). The list is checked before inflecting verbs. Given the fact that there are many more verbs that do not allow consonant doubling, listing the verbs that do is the most economic solution. An added benefit is that if a lemma does allow consonant doubling but is not included in the list then the word form generated will still be correct with respect to American English.</Paragraph> </Section> <Section position="4" start_page="202" end_page="203" type="sub_section"> <SectionTitle> 2.4 Deriving the Generator </SectionTitle> <Paragraph position="0"> The morphological generator comprises a set of approximately 1,650 rules expressing morphological regularities, subregularities, and exceptions for specific words; also around 350 lines of C/Flex code for program initialisation and defining the functions called by the rule actions.</Paragraph> <Paragraph position="1"> The rule set is in fact obtained by automatically reversing a morphological analyser. 
This is a much enhanced version of the analyser originally developed for the GATE system (Cunningham et al., 1996). Minnen and Carroll (Under review) describe in detail how the reversal is performed. The generator executable occupies around 700Kb on disc.</Paragraph> <Paragraph position="2"> The analyser--and therefore the generator--includes exception lists derived from WordNet (version 1.5: Miller et al., 1993). In addition, we have incorporated data acquired semi-automatically from the following corpora and machine readable dictionaries: the LOB corpus (Garside et al., 1987), the Penn Treebank (Marcus et al., 1993), the SUSANNE corpus (Sampson, 1995), the Spoken English Corpus (Taylor and Knowles, 1988), the Oxford Psycholinguistic Database (Quinlan, 1992), and the &quot;Computer-Usable&quot; version of the Oxford Advanced Learner's Dictionary of Current English (OALDCE; Mitton, 1992).</Paragraph> </Section> <Section position="5" start_page="203" end_page="203" type="sub_section"> <SectionTitle> 2.5 Evaluation </SectionTitle> <Paragraph position="0"> Minnen and Carroll (Under review) report an evaluation of the accuracy of the morphological generator with respect to the CELEX lexical database (version 2.5; Baayen et al., 1993).</Paragraph> <Paragraph position="1"> This threw up a small number of errors which we have now fixed. We have rerun the CELEX-based evaluation against the past tense, past and present participle, and third person singular present tense inflections of verbs, and all plural nouns. After excluding multi-word entries (phrasal verbs, etc.) we were left with 38,882 out of the original 160,595 word forms. 
For each of these word forms we fed the corresponding input (derived automatically from the lemmatisation and inflection specification provided by CELEX) to the generator.</Paragraph> <Paragraph position="2"> We compared the generator output with the original CELEX word forms, producing a list of mistakes apparently made by the generator, which we then checked by hand. In a number of cases either the CELEX lemmatisation was wrong in that it disagreed with the relevant entry in the Cambridge International Dictionary of English (Procter, 1995), or the output of the generator was correct even though it was not identical to the word form given in CELEX.</Paragraph> <Paragraph position="3"> We did not count these cases as mistakes. We also found that CELEX is inconsistent with respect to consonant doubling. For example, it includes the word form pettifogged, 4 whereas it omits many consonant doubled words that are much more common (according to counts from the BNC). For example, the BNC contains around 850 occurrences of the word form programming tagged as a verb, but this form is not present in CELEX. The form programing does occur in CELEX, but not in the BNC.</Paragraph> <Paragraph position="4"> 4 A rare word, meaning to be overly concerned with small, unimportant details.</Paragraph> <Paragraph position="5"> We did not count these cases as mistakes either.</Paragraph> <Paragraph position="6"> Of the remaining 359 mistakes, 346 concerned word forms that do not occur at all in the 100M words of the BNC. We categorised these as irrelevant for practical applications and so discarded them. Thus the type accuracy of the morphological analyser with respect to the CELEX lexical database is 99.97%. The token accuracy is 99.98% with respect to the 14,825,661 relevant tokens in the BNC (i.e. a rate of two errors per ten thousand words).</Paragraph> <Paragraph position="7"> We tested the processing speed of the generator on a Sun Ultra 10 workstation. 
In order to discount program startup times (which are anyway only of the order of 0.05 seconds) we used input files of 400K and 800K tokens and recorded the difference in timings; we took the averages of 10 runs. Despite its wide coverage the morphological generator is very fast: it generates at a rate of more than 80,000 words per second. 5</Paragraph> </Section> </Section> <Section position="4" start_page="203" end_page="204" type="metho"> <SectionTitle> 3 The Generator in an Applied System </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="203" end_page="204" type="sub_section"> <SectionTitle> 3.1 Text Simplification </SectionTitle> <Paragraph position="0"> The morphological generator forms part of a prototype system for automatic simplification of English newspaper text (Carroll et al., 1999).</Paragraph> <Paragraph position="1"> The goal is to help people with aphasia (a language impairment typically occurring as a result of a stroke or head injury) to better understand English newspaper text. The system comprises two main components: an analysis module which downloads the source newspaper texts from the web and computes syntactic analyses for the sentences in them, and a simplification module which operates on the output of the analyser to improve the comprehensibility of the text. Syntactic simplification (Canning and Tait, 1999) operates on the syntax trees produced in the analysis phase, for example converting sentences in the passive voice to active, and splitting long sentences at appropriate points. 
A subsequent lexical simplification stage (Devlin and Tait, 1998) replaces difficult or rare content words with simpler synonyms.</Paragraph> <Paragraph position="2"> The analysis component contains a morphological analyser, and it is the base forms of words that are passed through the system; this eases the task of the lexical simplification module. The final processing stage in the system is therefore morphological generation, using the generator described in the previous section.</Paragraph> <Paragraph position="3"> 5 It is likely that a modest increase in speed could be obtained by specifying optimisation levels in Flex and gcc that are higher than the defaults.</Paragraph> </Section> <Section position="2" start_page="204" end_page="204" type="sub_section"> <SectionTitle> 3.2 Applied Morphological Generation </SectionTitle> <Paragraph position="0"> We are currently testing the components of the simplification system on a corpus of 1000 news stories downloaded from the Sunderland Echo (a local newspaper in North-East England). In our testing we have found that newly encountered vocabulary only rarely necessitates any modification to the generator (or rather the analyser) source; if the word has regular morphology then it is handled by the rules expressing generalisations. Also, a side-effect of the fact that the generator is derived from the analyser is that the two modules have exactly the same coverage and are guaranteed to stay in step with each other. This is important in the context of an applied system. The accuracy of the generator is quite sufficient for this application; our experience is that typographical mistakes in the original newspaper text are much more common than errors in morphological processing.</Paragraph> </Section> <Section position="3" start_page="204" end_page="204" type="sub_section"> <SectionTitle> 3.3 Orthographic Postprocessing </SectionTitle> <Paragraph position="0"> Some orthographic phenomena span more than one word. These cannot be dealt with in morphological generation since this works strictly a word at a time. We have therefore implemented a final orthographic postprocessing stage. Consider the sentence: 6 (4) *Brian Cookman is the attraction at the King's Arms on Saturday night and he will be back on Sunday night for a acoustic jam session.</Paragraph> <Paragraph position="1"> This is incorrect orthographically because the determiner in the final noun phrase should be an, as in an acoustic jam session. In fact an must be used if the following word starts with a vowel sound, and a otherwise. We achieve this, again using a filter implemented in Flex, with a set of general rules keying off the next word's first letter (having skipped any intervening sentence-internal punctuation), together with a list of exceptions (e.g. heir, unanimous) collected using the pronunciation information in the OALDCE, supplemented by further cases (e.g. unidimensional) found in the BNC. In the case of abbreviations or acronyms (recognised by the occurrence of non-word-initial capital letters and trailing full-stops) we key off the pronunciation of the first letter considered in isolation.</Paragraph> </Section> <Section position="4" start_page="204" end_page="204" type="sub_section"> <SectionTitle> of Sunderland's Vaux Breweries is giving local musicians </SectionTitle> <Paragraph position="0"> a case of the blues&quot; published in the Sunderland Echo on 26 August 1999.</Paragraph> <Paragraph position="1"> Similarly, the orthography of the genitive marker cannot be determined without taking context into account, since it depends on the identity of the last letter of the preceding word. In the sentence in (4) we need only eliminate the space before the genitive marking, obtaining King's Arms. 
But, following the newspaper style guide, if the preceding word ends in s or z we have to 'reduce' the marker as in, for example, Stacey Edwards' skilful fingers.</Paragraph> <Paragraph position="2"> The generation of contractions presents more of a problem. For example, changing he will to he'll would make (4) more idiomatic. But there are cases where this type of contraction is not permissible. Since these cases seem to be dependent on syntactic context (see Section 4 below), and we have syntactic structure from the analysis phase, we are in a good position to make the correct choice. However, we have not yet tackled this issue and currently take the conservative approach of not contracting in any circumstances.</Paragraph> </Section> </Section> </Paper>
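The a/an and genitive rules of Section 3.3 can be illustrated with a minimal sketch. It is written in Python rather than Flex and is not the authors' filter. All function names, the regular expressions, and the assumption that the genitive marker arrives as a separate `'s` token are hypothetical. The exception lists contain only the three example words mentioned in the text, and the treatment of abbreviations and acronyms is omitted.

```python
import re

# Hypothetical, deliberately tiny exception lists. The paper collects its
# real lists from OALDCE pronunciations and the BNC; only the three example
# words it mentions are used here.
TAKES_AN_DESPITE_CONSONANT = {"heir"}
TAKES_A_DESPITE_VOWEL = {"unanimous", "unidimensional"}

def choose_article(next_word):
    """Pick 'a' or 'an' from the next word's first letter, after exceptions."""
    word = next_word.lower()
    if word in TAKES_AN_DESPITE_CONSONANT:
        return "an"
    if word in TAKES_A_DESPITE_VOWEL:
        return "a"
    return "an" if word[:1] in ("a", "e", "i", "o", "u") else "a"

# An article, then whitespace plus any opening punctuation, then the next word.
ARTICLE = re.compile(r'\b([Aa]n?)(\s+["(\[]*)([A-Za-z]+)')

def fix_articles(text):
    """Rewrite each a/an so that it agrees with the word that follows it."""
    def repl(match):
        article = choose_article(match.group(3))
        if match.group(1)[0].isupper():
            article = article.capitalize()
        return article + match.group(2) + match.group(3)
    return ARTICLE.sub(repl, text)

# Assumption: the genitive marker reaches this stage as a separate token 's.
GENITIVE = re.compile(r"(\w+) 's\b")

def attach_genitive(text):
    """Attach the genitive marker, reducing it to ' after a final s or z."""
    def repl(match):
        word = match.group(1)
        return word + ("'" if word[-1] in "sSzZ" else "'s")
    return GENITIVE.sub(repl, text)

def postprocess(text):
    """Apply both orthographic fixes in sequence."""
    return fix_articles(attach_genitive(text))
```

Calling `postprocess` on a string containing `King 's Arms` and `a acoustic` attaches the genitive marker and rewrites the determiner. Contractions are left untouched, matching the conservative choice described above.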