<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0617"> <Title>Morphology Induction From Term Clusters</Title> <Section position="6" start_page="189" end_page="189" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> We evaluate by taking the highest-ranked trace, using the ordering heuristics described in the previous section, as the system's analysis of a given word. This analysis takes the form of a sequence of hypothetical wordforms, from a putative stem to the target wordform (e.g., decide, decision, decisions). The CELEX morphological database (Baayen et al., 1995) is used to produce a reference analysis, by tracing back from the target wordform through any inflectional affixation, then through successive derivational affixations until a stem is reached. Occasionally, this yields more than one analysis. In such cases, all analyses are retained, and the system's analysis is given the most optimistic score. In other words, if a CELEX analysis is found that matches the system's analysis, it is judged to be correct.</Paragraph> <Section position="1" start_page="189" end_page="189" type="sub_section"> <SectionTitle> 4.1 Results </SectionTitle> <Paragraph position="0"> In evaluating an analysis, we distinguish the following outcomes (ordered from most favorable to least): - Cor. The system's analysis matches CELEX's. - Over. The system's analysis contains all the wordforms in CELEX's, as well as additional wordforms, each of which is a legitimate morph of the CELEX stem.</Paragraph> <Paragraph position="1"> - Under. The system's analysis contains some of the wordforms in CELEX's; it may contain additional wordforms which are legitimate morphs of the CELEX stem. This happens, for example, when the CELEX stem is unknown to the system.</Paragraph> <Paragraph position="2"> Note that we discard any wordforms which are not in CELEX. Depending on the vocabulary size, anywhere from 15% to 30% are missing.
These are often proper nouns.</Paragraph> <Paragraph position="3"> In addition, we measure precision, recall, and F1 as in Schone and Jurafsky (2001). These metrics reflect the algorithm's ability to group known terms which are morphologically related.</Paragraph> <Paragraph position="4"> Groups are formed by collecting all wordforms that, when analyzed, share a root form. We report these numbers as Prec, Rec, and F1.</Paragraph> <Paragraph position="5"> We performed the procedure outlined in Section 3.1 using the n most frequent terms from the Wall Street Journal corpus, for n ranging from 1000 to 20,000. The expense of performing these steps is modest compared with that of collecting term co-occurrence statistics and generating term clusters. Our Perl implementation of this procedure consumes just over two minutes on a lightly loaded 2.5 GHz Intel machine running Linux, given a collection of 10,000 wordforms in 200 clusters.</Paragraph> <Paragraph position="6"> The header of each column in Table 5 displays the size of the vocabulary. The column labeled 10K+1K refers to an experiment designed to assess the ability of the algorithm to process novel terms. For this column, we derived the morphological automaton from the 10,000 most frequent terms, then used it to analyze the next 1000 terms.</Paragraph> <Paragraph position="7"> The surprising precision/recall scores in this column--scores that are high despite an actual degradation in performance--argue for caution in the use and interpretation of the precision/recall metrics in this context. The difficulty of the morphological conflation set task is a function of the size and constituency of a vocabulary. With a small sample of terms relatively low on the Zipf curve, high precision/recall scores mainly reflect the algorithm's ability to determine that most of the terms are not related--a Pyrrhic victory.
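To make the conflation-set metrics concrete, the following sketch scores groupings by pairwise precision, recall, and F1 over morphologically related word pairs. The function names and the input format (a mapping from root form to its set of wordforms) are our own illustrative choices; the exact metric of Schone and Jurafsky may differ in its details.

```python
from itertools import combinations

def related_pairs(groups):
    """All unordered pairs of wordforms that share a group (root)."""
    pairs = set()
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs

def conflation_prf(predicted, gold):
    """Pairwise precision/recall/F1 over morphologically related pairs.

    `predicted` and `gold` each map a root form to the set of
    wordforms grouped under it."""
    p_pairs, g_pairs = related_pairs(predicted), related_pairs(gold)
    if not p_pairs or not g_pairs:
        return 0.0, 0.0, 0.0
    correct = len(p_pairs & g_pairs)
    prec = correct / len(p_pairs)
    rec = correct / len(g_pairs)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Under this pairwise formulation, splitting one true conflation set into two groups costs recall but not precision, while merging unrelated sets costs precision but not recall, which matches the intuition behind the metric.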
Nevertheless, these metrics give us a point of comparison with Schone and Jurafsky (2001) who, using a vocabulary of English words occurring at least 10 times in a 6.7 million-word newswire corpus, report an F1 of 88.1 for conflation sets based only on suffixation, and 84.5 for circumfixation. While a direct comparison would be dubious, the results in Table 5 are comparable to those of Schone and Jurafsky. (Note that we include both prefixation and suffixation in our algorithm and evaluation.) Not surprisingly, precision and recall degrade as the vocabulary size increases. The top rows of the table, however, suggest that performance is reasonable at small vocabulary sizes and robust across the columns, up to 20K, at which point the system increasingly generates incorrect analyses (more on this below).</Paragraph> </Section> <Section position="2" start_page="189" end_page="189" type="sub_section"> <SectionTitle> 4.2 Discussion </SectionTitle> <Paragraph position="0"> A primary advantage of basing the search for affixation patterns on term clusters is that the problem of non-morphological orthographic regularities is greatly mitigated. Nevertheless, as the vocabulary grows, the inadequacy of the simple frequency thresholds we employ becomes clear. In this section, we speculate briefly about how this difficulty might be overcome.</Paragraph> <Paragraph position="1"> At the 20K size, the system identifies and retains a number of non-morphological regularities. Examples are the transforms s/$/e/ and s/$/o/, both of which align members of a name cluster with other members of the same cluster (Clark/Clarke, Brook/Brooke, Robert/Roberto, etc.). As a consequence, the system assigns the analysis tim => time to the word &quot;time&quot;, suggesting that it be placed in the name cluster.</Paragraph> <Paragraph position="2"> There are two ways in which we can attempt to suppress such analyses. One is to adjust parameters so that noise transforms are less likely.
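One concrete form such a parameter adjustment can take is screening candidate transforms by the number of vocabulary stems they match. The following minimal sketch uses sed-style pattern/replacement pairs mirroring the paper's s/$/e/ notation; the function names and the screening loop are hypothetical, not the authors' implementation.

```python
import re

def apply_transform(stem, pattern, replacement):
    """Apply a substitution transform (e.g. s/$/e/) to a stem."""
    return re.sub(pattern, replacement, stem, count=1)

def screen_transforms(candidates, vocabulary, min_stems=3):
    """Keep only transforms supported by at least `min_stems` stems,
    i.e. stems whose transformed image is a different word that is
    also in the vocabulary."""
    kept = {}
    for pattern, replacement in candidates:
        support = [w for w in vocabulary
                   if apply_transform(w, pattern, replacement) in vocabulary
                   and apply_transform(w, pattern, replacement) != w]
        if len(support) >= min_stems:
            kept[(pattern, replacement)] = sorted(support)
    return kept
```

Raising the support threshold makes accidental regularities such as Clark/Clarke require more coincidental matches to survive, at the cost of also discarding some genuine but rare affixation patterns.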
The procedure for acquiring candidate transforms, described in Section 3.2, discards any that match fewer than 3 stems. When we increase this parameter to 5 and run the 20K experiment again, the incorrect rate falls to 0.02 and F1 rises to 0.84. While this does not solve the larger problem of spurious transforms, it does indicate that a more principled way of screening transforms should enhance performance.</Paragraph> <Paragraph position="3"> The other way to improve analyses is to corroborate predictions they make about the constituent wordforms. If the tim => time analysis is correct, then the word &quot;time&quot; should be at home in the name cluster. This is something we can check. Recall that in our framework both terms and clusters are associated with distributions over adjacent terms (or clusters). We can hope to improve precision by discarding analyses that assign a term to a cluster from which it is too distributionally distant. Applying such a filter in the 20K experiment has a similar impact on performance as the transform filter of the previous paragraph, with F1 rising to 0.84 (see footnote 4). Several researchers have established the utility of a filter in which the broader context distributions surrounding two terms are compared, in an effort to ensure that they are semantically compatible (Schone and Jurafsky, 2001; Yarowsky and Wicentowski, 2001). This would constitute a straightforward extension of our framework.</Paragraph> <Paragraph position="4"> Note that the system is often able to produce the correct analysis, but the ordering heuristics described in Section 3.5 cause it to be discarded in favor of an incorrect one. The analyses us => using and use => using are an example, the former being the one favored for the word &quot;using&quot;. Note, though, that our automaton construction procedure discards a potentially useful piece of information--the amount of support each arc receives from the data (the number of stems it matches).
This might be converted into something like a traversal probability and used in ordering analyses.</Paragraph> <Paragraph position="5"> Of course, a further shortcoming of our approach is its inability to account for irregular forms. It shares this limitation with all other approaches based on orthographic similarity (a notable exception is Yarowsky and Wicentowski (2001)). However, there is reason to believe that it could be extended to accommodate at least some irregular forms. We note, for example, the cluster pair 180/185, which is dominated by the transform s/e?$/ed/. Cluster 180 contains words like &quot;make&quot;, &quot;pay&quot;, and &quot;keep&quot;, while Cluster 185 contains &quot;made&quot;, &quot;paid&quot;, and &quot;kept&quot;. Once a strong correspondence of this kind is found between two clusters, we can search for an alignment that covers the orphans in the respective clusters.</Paragraph> <Paragraph position="6"> Footnote 4: Specifically, we take the Hellinger distance between the two distributions, scaled into the range [0, 1], and discard those analyses for which the term is at a distance greater than 0.5 from the proposed cluster.</Paragraph> </Section> </Section> </Paper>