<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1016">
  <Title>Automatic Acquisition of Two-Level Morphological Rules</Title>
  <Section position="7" start_page="107" end_page="109" type="evalu">
    <SectionTitle>
5 Results and Evaluation
</SectionTitle>
    <Paragraph position="0"> Our process works correctly for the examples given in (Antworth, 1990). There were two incorrect segmentations in the twenty-one adjective pairs given on page 106. Both resulted from an incorrect string edit mapping of (un)happy to (un)happily. For the suffix, the sequence ... 0:i 0:l y:y was generated instead of the sequence ... y:0 0:i 0:l 0:y. The reason is that the root word and the inflected form end in the same letter (y), and one NOCHANGE (y:y) has a lower cost than a DELETE (y:0) plus an INSERT (0:y). The acquired segmentation for the 21 pairs, with the suffix segmentation of (un)happily manually corrected, is:</Paragraph>
    <Paragraph position="2"> happier : happy + er
happiest : happy + est
happily : happy + ly
From these segmentations, the morphotactic component (Section 1) required by the morphological analyzer/generator is generated with uncomplicated text-processing routines. Three correct ~ rules, including two gemination rules, resulted for these twenty-one pairs (footnote 5):</Paragraph>
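    <Paragraph> The cost comparison behind the (un)happily mis-segmentation can be illustrated with a minimal Python sketch. It only sums per-operation costs for the two suffix edit sequences quoted earlier, writing the null symbol as 0. The unit costs (NOCHANGE = 0, INSERT = 1, DELETE = 1) and the helper names classify and sequence_cost are assumptions made for illustration; this section does not state the cost values actually used.

```python
# Hypothetical unit costs; the actual cost values are not given in this section.
COSTS = {"NOCHANGE": 0, "INSERT": 1, "DELETE": 1}


def classify(pair):
    """Classify a feasible pair 'source:target', where '0' is the null symbol."""
    source, target = pair.split(":")
    if source == "0":
        return "INSERT"
    if target == "0":
        return "DELETE"
    if source == target:
        return "NOCHANGE"
    raise ValueError("replacement pairs are outside this sketch: " + pair)


def sequence_cost(pairs):
    """Sum the assumed cost of every edit operation in the sequence."""
    return sum(COSTS[classify(p)] for p in pairs)


# Suffix part of the mapping for happy to happily, as quoted in the text.
generated = ["0:i", "0:l", "y:y"]         # what the string edit mapping produced
intended = ["y:0", "0:i", "0:l", "0:y"]   # the linguistically correct mapping

generated_cost = sequence_cost(generated)
intended_cost = sequence_cost(intended)
print(generated_cost, intended_cost)
```

Under these assumed costs the generated sequence costs 2 and the intended one costs 4, so a minimum-cost mapping would prefer the former, which is consistent with the explanation given for the error.</Paragraph>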
    <Paragraph position="4"> To better illustrate the complexity of the rules that can be learned automatically by our process, consider the following set of fourteen Xhosa noun-locative pairs:
Source Word → Target Word
inkosi → enkosini
iinkosi → ezinkosini
ihashe → ehasheni
imbewu → embewini
amanzi → emanzini
ubuchopho → ebucotsheni
ilizwe → elizweni
ilanga → elangeni
ingubo → engubeni
ingubo → engutyeni
indlu → endlini
indlu → endlwini
ikhaya → ekhayeni
ikhaya → ekhaya
(22)
Note that this set contains ambiguity: the locative of ingubo is either engubeni or engutyeni. Our process must learn the necessary two-level rules to map ingubo to engubeni and engutyeni, as well as to map both engubeni and engutyeni in the other direction, i.e. to ingubo. Similarly, indlu and ikhaya each have two different locative forms. Furthermore, the two source words inkosi and iinkosi (the plural of inkosi) differ only by a prefixed i, but they have different locative forms. This small difference between source words indicates how sensitive the acquisition process must be in order to provide the necessary discerning information to a two-level morphological processor. At the same time, our process needs to cope with possibly radical modifications between source and target words. Consider the mapping between ubuchopho and its locative ebucotsheni. Here, the only segments which stay the same from the source to the target word are the three letters -buc-, the letter -o- (the deletion of the first -h- is correct) and the second -h-. The target words are correctly segmented during phase one as:
[Footnote 5: The two-level rule compiler KGEN (developed by Nathan Miles) was used to compile the acquired rules into the state tables required by PC-KIMMO. Both PC-KIMMO and KGEN are available from the Summer Institute of Linguistics.]</Paragraph>
    <Paragraph position="6"> Note that the prefix e+ put target words, while alternative of ekhayeni) From this segmented</Paragraph>
    <Paragraph position="8"> The ⇒ and ⇐ rules of a special pair can be merged into a single ⇔ rule. For example, the four rules above for the special pair +:0 can be merged into</Paragraph>
    <Paragraph position="10"> because both questions become true for the disjuncted environment e:e _ | o:y _ | u:w _ | n:n. The vertical bar ("|") is the traditional two-level notation which indicates the disjunction of two (or more) contexts. The five ~ rules and the single rule of the special pair i:0 in Example 24 can be merged in a similar way. In this instance, the context of the ~ rule (+:0 _) needs to be added to some of the contexts of the ~ rules of i:0. The following ⇔ rule results:</Paragraph>
    <Paragraph position="12"> In this way the 24 rules are reduced to a set of 16 rules which contains only a single ⇔ rule for each special pair. This merged set of 16 two-level rules analyzes and generates the input word pairs 100% correctly. The next step was to show the feasibility of automatically acquiring a minimal rule set for a wide-coverage parser. To get hundreds or even thousands of input pairs, we implemented routines to extract the lemmas ("head words") and their inflected forms from a machine-readable dictionary. In this way we extracted 3935 Afrikaans noun-plural pairs, which served as the input to our process. Afrikaans plurals are almost always derived by adding a suffix (mostly -e or -s) to the singular form. Different sound changes may occur during this process.</Paragraph>
    <Paragraph position="13"> For example (footnote 6), gemination, which indicates the shortening of a preceding vowel, occurs frequently (e.g. kat → katte), as well as consonant insertion (e.g. kas → kaste) and elision (e.g. ampseed → ampsede).</Paragraph>
    <Paragraph position="14"> Several sound changes may occur in the same word. For example, elision, consonant replacement and gemination occur in loof → lowwe.</Paragraph>
    <Paragraph position="15"> Afrikaans (a Germanic language) has borrowed a few words from Latin.</Paragraph>
    <Paragraph position="16"> Some of these words have two plural forms, which introduces ambiguity in the word mappings: one plural is formed with a Latin suffix (-a) (e.g. emetikum → emetika) and one with an indigenous suffix (-s) (emetikum → emetikums).</Paragraph>
    <Paragraph position="17"> Allomorphs occur as well; for example, -ens is an allomorph of the suffix -s in bed + s → beddens.</Paragraph>
    <Paragraph position="18"> During phase one, all but eleven (0.3%) of the 3935 input word pairs were segmented correctly. To facilitate the evaluation of phase two, we define a simple rule as a rule whose environment consists of a single context, in contrast with an environment consisting of two or more contexts disjuncted together. Phase two acquired 531 simple rules for 44 special pairs. Of these 531 simple rules, 500 are ~ rules, nineteen are ⇔ rules and twelve are ~ rules. The average length of the simple rule contexts is 4.2 feasible pairs. Compare this with the average length of the 3935 final input edit sequences, which is 12.6 feasible pairs.
[Footnote 6: All the examples come from the 3935 input word pairs.]</Paragraph>
    <Paragraph position="19"> The 531 simple rules can be reduced to 44 ~ rules (i.e. one rule per special pair) with environments consisting of disjuncted contexts. These 44 ~ rules analyze and generate the 3935 word pairs 100% correctly. The total number of feasible pairs in the 3935 final input edit strings is 49657. In the worst case, all these feasible pairs would have to be present in the rule contexts to accurately model the sound changes which might occur in the input pairs. However, the actual result is much better: our process acquires a two-level rule set which accurately models the sound changes with only 4.5% (2227) of the input feasible pairs.</Paragraph>
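    <Paragraph> The reported figures can be cross-checked with a few lines of Python. This is only an arithmetic sanity check on the numbers quoted in the text; the variable names are illustrative and not part of the described system.

```python
# Counts quoted in the text.
total_pairs = 3935         # Afrikaans noun-plural input pairs
missegmented = 11          # pairs segmented incorrectly in phase one
total_feasible = 49657     # feasible pairs in all final input edit strings
context_feasible = 2227    # feasible pairs that appear in the rule contexts

# Percentage of incorrectly segmented pairs.
missed_pct = round(100 * missegmented / total_pairs, 1)

# Average edit sequence length, in feasible pairs per input pair.
avg_sequence_length = round(total_feasible / total_pairs, 1)

# Share of the input feasible pairs needed in the rule contexts.
context_pct = round(100 * context_feasible / total_feasible, 1)

print(missed_pct, avg_sequence_length, context_pct)
```

The three values round to 0.3, 12.6 and 4.5, matching the percentages and the average length given in the text.</Paragraph>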
    <Paragraph position="20"> To obtain a prediction of the analysis and generation accuracy over unseen words, we divided the 3935 input pairs into five equal sections. Each fifth was held out in turn as test data while a set of two-level rules was learned from the remaining four-fifths. The average recognition accuracy, as well as the average generation accuracy, over the held-out test data is 93.9%.</Paragraph>
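    <Paragraph> The hold-out procedure described above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the text does not say how the five sections were formed, so the interleaved split is an assumption, and learn_suffix and generation_accuracy are toy stand-ins (hypothetical names) for the rule acquisition process and the accuracy measurement. The toy word pairs are synthetic.

```python
def k_fold_splits(pairs, k=5):
    """Yield (train, test) lists, holding out each of the k sections once."""
    # Assumption: sections are formed by interleaving the input pairs.
    folds = [pairs[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [p for i, fold in enumerate(folds) if i != held_out for p in fold]
        yield train, test


def learn_suffix(train):
    """Toy stand-in for rule acquisition: pick the most frequent suffix."""
    counts = {}
    for source, target in train:
        suffix = target[len(source):]
        counts[suffix] = counts.get(suffix, 0) + 1
    return max(counts, key=counts.get)


def generation_accuracy(suffix, test):
    """Fraction of held-out pairs whose target equals source + suffix."""
    hits = sum(1 for source, target in test if source + suffix == target)
    return hits / len(test)


def cross_validate(pairs, learn, score, k=5):
    """Average the held-out score over the k train/test splits."""
    scores = [score(learn(train), test) for train, test in k_fold_splits(pairs, k)]
    return sum(scores) / len(scores)


# Synthetic toy data (not from the paper): every fifth word takes -s, the rest -e.
toy_pairs = [("w%d" % i, "w%d%s" % (i, "s" if i % 5 == 0 else "e")) for i in range(10)]
mean_accuracy = cross_validate(toy_pairs, learn_suffix, generation_accuracy)
print(mean_accuracy)
```

On this toy data the mean held-out accuracy is 0.8, because the one section containing both -s words is mispredicted when it is held out. With 3935 pairs and k = 5, each held-out section would contain 787 pairs.</Paragraph>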
  </Section>
</Paper>