<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0107">
  <Title>Unsupervised Induction of Natural Language Morphology Inflection Classes</Title>
  <Section position="7" start_page="404" end_page="404" type="evalu">
    <SectionTitle>
6 Results and Error Analysis
</SectionTitle>
    <Paragraph position="0"> For each of the search variants described in section 4.2 we executed a by-hand search over the relevant parameters for those settings that optimize the F1 measure (the harmonic mean of recall and precision). The best performing parameter settings are presented in Table 3 while quantitative results using these settings are plotted in Figure 4.</Paragraph>
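The F1 measure used as the optimization target above can be sketched as follows. This is a minimal illustration, not the evaluation code used for the paper: the function name f1_score and the sample values are assumptions chosen for the example.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Returns 0.0 when both inputs are 0 to avoid dividing by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only (not results from the paper): a conservative
# search with high precision and low recall still gets a modest F1.
print(f1_score(0.9, 0.3))
```

Because the harmonic mean is pulled toward the smaller of its two arguments, a variant cannot reach a high F1 by maximizing precision alone.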
    <Paragraph position="1"> Examining the performance of each algorithm (Figure 4) reveals that the simple Vertical only search achieves a high precision at the expense of a low recall measure. The simple Vertical search also gives the smallest fragmentation, which, when combined with the high precision score, indicates a conservative algorithm that selects few CIC's. The parameter settings which achieve the highest F1 for Left Block alone and Right Block alone each produce much higher recall than the simple Vertical search. Right Block Recursive increases precision significantly over simple Right Block and achieves the highest F1 measure of any search variant.</Paragraph>
    <Paragraph position="2"> While Full Horizontal Block also performs well, sharing the values of HORIZ RATIO and HORIZ SIZE forced a compromise between Left Block and Right Block Recursive that did not significantly outperform either algorithm alone.</Paragraph>
    <Paragraph position="3"> Of the 83 unique suffixes in the hand compiled standard inflection classes, 21 did not share a c-stem with any other c-suffix in the Spanish news-wire corpus used for this evaluation--placing an upper limit on recall of 0.75 for the search algorithms presented in this paper.</Paragraph>
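The recall ceiling quoted above follows directly from the two counts in the text. The short calculation below is a sketch that assumes recall is measured over the unique suffixes of the standard inflection classes; the variable names are illustrative.

```python
total_suffixes = 83   # unique suffixes in the hand-compiled standard
unreachable = 21      # suffixes sharing no c-stem with any other c-suffix

# Suffixes that the search algorithms could in principle recover.
reachable = total_suffixes - unreachable
recall_ceiling = reachable / total_suffixes

print(reachable)                  # 62
print(round(recall_ceiling, 2))   # 0.75
```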
    <Paragraph position="4"> Examining the parameter settings that yielded the highest F1 measure for each search variant (Table 3) is also enlightening. Early experiments with Vertical only search clearly demonstrated that a TOP SIZE of two, or restricting the CIC's permitted to be selected to those with at least two adherents, always resulted in better performance than other possible settings. A TOP SIZE of one places no restriction on the adherent size of a CIC, rampantly selecting CIC's, such as the level 10 CIC given at the end of section 4.1, that consist of many c-suffixes that happen to validly concatenate onto a single c-stem--obliterating reasonable precision. Higher settings for TOP SIZE induce a graceful degradation in recall. Thus all experiments reported here used a TOP SIZE of two.</Paragraph>
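The effect of the TOP SIZE restriction can be illustrated with a small filter. This is a hedged sketch, not the search implementation: it assumes a CIC can be represented as a tuple of c-suffixes mapped to the set of its adherent c-stems, and the function name selectable_cics and the toy data are hypothetical.

```python
def selectable_cics(cics, top_size=2):
    """Keep only the CIC's that have at least top_size adherent c-stems."""
    return {
        suffixes: stems
        for suffixes, stems in cics.items()
        if len(stems) >= top_size
    }


# Toy data (hypothetical, not drawn from the evaluation corpus).
toy_cics = {
    ("a", "as", "o", "os"): {"blanc", "roj", "buen"},
    # Many c-suffixes that happen to attach to one c-stem only.
    ("a", "aba", "ada", "ado", "an", "ando", "ar", "aron"): {"habl"},
}

print(sorted(selectable_cics(toy_cics, top_size=2)))
print(len(selectable_cics(toy_cics, top_size=1)))
```

With top_size=1 both toy CIC's remain selectable, including the one supported by a single c-stem; with top_size=2 only the CIC with several adherents survives.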
    <Paragraph position="5"> Beyond TOP SIZE the only parameters available to the basic Vertical algorithm are L1 SIZE and RATIO, which provide only crude means to halt the search of bad paths. In particular, if a level one CIC, C, has more than L1 SIZE adherents, and has some parent which passes the RATIO cutoff, then some ancestor of C will be selected by the algorithm as a good CIC. Hence, the Vertical only algorithm ensures search gets off on the right foot by using the highest values for the L1 SIZE and RATIO parameters of any algorithm variant. Performance falls off quickly above L1 SIZE settings of 192, indicating that this parameter in this algorithm is sensitive to the size of the training corpus.</Paragraph>
    <Paragraph position="6"> In contrast, the horizontal blocking search algorithms have additional parameters available to cull out bad search paths, and can hence afford to use lower (and more stable) values for L1 SIZE and RATIO. Recall that the Left Blocking algorithm discards paths determined to be using a morpheme boundary too far to the right, while the Right Blocking algorithm discards paths using morpheme boundaries too far to the left. Notice that since, as reasoned in section 4.1, adherent count monotonically decreases as morpheme boundary links are followed to the left, if the L1 SIZE cutoff blocks a particular CIC, C, all CIC's to the left of C will also be blocked. From these facts it follows that a large L1 SIZE will reject some paths resulting from morpheme boundaries chosen too far to the left, which would otherwise have been pursued in the Left Blocking algorithm. The Right Blocking algorithm, however, receives no such benefit, and achieves its best performance by maximizing recall with a small L1 SIZE.</Paragraph>
    <Paragraph position="7"> Examining the best performing parameter values for the Right Blocking Recursive algorithm reveals a curious behavior in which low values for L1 SIZE and RATIO allow a permissive vertical search while stringent values of HORIZ RATIO and, particularly, HORIZ SIZE constrain the search. One explanation for these facts might be that following the monotonically increasing chain of CIC adherent sizes along right horizontal links allows the algorithm to make intelligent blocking decisions backed by sufficient data.</Paragraph>
    <Paragraph position="8"> The best performing parameter values for the Full Horizontal Search are a compromise between the well performing values for the Left Blocking and those for the Right Blocking algorithms. This parameter value compromise does not draw benefit from the recursion in the Right Block Recursive algorithm, but instead employs Right Block as a replacement for the relatively higher L1 SIZE parameter in the Left Blocking algorithm.</Paragraph>
    <Paragraph position="9"> It is also interesting to examine CIC's selected by the search algorithms. Table 4 lists all of the CIC's selected by the conservative Vertical search algorithm together with a random sample of CIC's selected by Right Blocking Recursive, the algorithm which reached the highest F1 measure of any algorithm variant.</Paragraph>
    <Paragraph position="10"> Perhaps the most striking feature of Table 4 is the extent to which the CIC's overlap. Very few individual c-suffixes occur in only one CIC. Of all the CIC's in Table 4, only Ø.s and a.as.o.os, both among the CIC's selected by the Vertical algorithm, represent complete inflection classes in the standard IC's. The remaining CIC's are proper subsets of various verbal inflection classes. The overlapping nature of the selected CIC's suggests an additional step, which we do not investigate here, of conflating CIC's into a smaller number of meta-CIC's.</Paragraph>
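The distinction drawn above between a CIC that is a complete inflection class and one that is a proper subset of a class can be checked mechanically. The sketch below assumes CIC's and standard IC's are both plain sets of suffix strings; the function name describe_cic and the deliberately truncated toy classes are hypothetical and do not reproduce the hand-compiled standard.

```python
def describe_cic(cic, standard_ics):
    """Map each standard IC containing the CIC to 'complete' or 'proper subset'."""
    cic = frozenset(cic)
    relation = {}
    for name, ic in standard_ics.items():
        ic = frozenset(ic)
        if cic == ic:
            relation[name] = "complete"
        elif cic.issubset(ic):
            relation[name] = "proper subset"
    return relation


# Toy, heavily truncated classes for illustration only.
toy_standard = {
    "adjective": {"a", "as", "o", "os"},
    "ar": {"a", "amos", "an", "ar", "as", "o"},
}

print(describe_cic({"a", "as", "o", "os"}, toy_standard))
print(describe_cic({"a", "an", "ar"}, toy_standard))
print(describe_cic({"a", "o"}, toy_standard))
```

A CIC such as the last one, which is a subset of more than one class, shows how the overlap among selected CIC's arises.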
    <Paragraph position="11"> The only verbal inflection class for which subsets are able to pass the large L1 SIZE cutoff imposed by the Vertical search algorithm is -ar, the most frequent of the three major inflection classes in Spanish. The Right Blocking Recursive algorithm on the other hand identifies significant portions of all three verbal inflection classes.</Paragraph>
    <Paragraph position="12"> The c-suffixes appearing in italics in Table 4 correspond to no suffix found in any standard IC.</Paragraph>
    <Paragraph position="13"> These alien c-suffixes fall into two categories.</Paragraph>
    <Paragraph position="14"> 1) The c-suffixes aciones, acion, and adores are noun-forming derivational suffixes.</Paragraph>
    <Paragraph position="15"> 2) The remaining c-suffixes were formed by choosing a morpheme boundary too far to the right.</Paragraph>
    <Paragraph position="16"> It is the second type of mistake that the Left Blocking search algorithm was specifically designed to address. Unfortunately, naively combining the Right Blocking Recursive algorithm with the Left Blocking algorithm did not improve performance.</Paragraph>
    <Paragraph position="17"> We expect that by using separate horizontal parameters for left blocking and for right blocking we could combine these two algorithms in a less constrained fashion that would result in better overall performance.</Paragraph>
    <Section position="1" start_page="404" end_page="404" type="sub_section">
      <SectionTitle>
Right Blocking Recursive
</SectionTitle>
      <Paragraph position="0"> [Table 4 (table body not recovered): the CIC's selected by the Vertical search algorithm (left; columns null, ar, er, ir; 23 of 23 selected CIC's) and a sample of CIC's selected by the algorithm with best F1 measure, Right Blocking Recursive (right; columns ar, er, ir; 23 of 204 selected CIC's). For each CIC row, a dot is placed in the columns representing standard IC's for which that CIC is a subset. The c-suffixes in italics are in no standard IC.]</Paragraph>
    </Section>
  </Section>
</Paper>