<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0504">
  <Title>The SED heuristic for morpheme discovery: a look at Swahili</Title>
  <Section position="6" start_page="32" end_page="33" type="evalu">
    <SectionTitle>
4 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> In this section, we present three sets of evaluations of the refinements to the SED heuristic described in the preceding section. We used a corpus of 7,180 distinct words occurring in 50,000 running words of a Swahili translation of the Bible obtained from the internet.</Paragraph>
    <Section position="1" start_page="32" end_page="33" type="sub_section">
      <SectionTitle>
4.1 Disambiguating FSAs
</SectionTitle>
      <Paragraph position="0"> To evaluate the effect of disambiguating FSAs, described in section 3.1, we compare the precision and recall of morpheme boundary identification using the SED method with and without the disambiguation procedure. Because higher-ranked FSAs are intrinsically more reliable than lower-ranked ones, Figures 1 and 2 graph precision and recall by rank: for the top 10% of the templates, displayed as the leftmost point; for the top 20% of the templates, displayed as the second point from the left; and so on. We see that disambiguation repairs almost 50% of the previous errors and increases recall by about 10%. With these increases in precision and recall, it is clear that the disambiguating step provides a considerably more accurate morpheme boundary discovery procedure.</Paragraph>
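      <Paragraph position="1"> The boundary-level evaluation described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the evaluation code used for Figures 1 and 2: segmentations are assumed to be written with hyphens between morphemes, the function names are hypothetical, and the two example words are illustrative rather than drawn from the corpus.

```python
def boundary_positions(segmentation):
    """Return the set of internal boundary offsets of a hyphen-segmented word.

    "a-li-soma" gives {1, 3}: boundaries after 1 and after 3 characters.
    """
    positions = set()
    offset = 0
    morphs = segmentation.split("-")
    for morph in morphs[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def precision_recall(predicted, gold):
    """Boundary precision and recall over every word in the gold dict.

    A word missing from `predicted` is treated as unsegmented.
    """
    true_pos = n_pred = n_gold = 0
    for word, gold_seg in gold.items():
        gold_b = boundary_positions(gold_seg)
        pred_b = boundary_positions(predicted.get(word, word))
        true_pos += len(gold_b.intersection(pred_b))
        n_pred += len(pred_b)
        n_gold += len(gold_b)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


# Illustrative data (assumed, not from the paper's corpus).
gold = {"alisoma": "a-li-soma", "watafanya": "wa-ta-fanya"}
predicted = {"alisoma": "a-li-soma", "watafanya": "wata-fanya"}
print(precision_recall(predicted, gold))  # (1.0, 0.75)
```

Here every predicted boundary is correct, but one gold boundary (between wa and ta) is missed, so precision is 1.0 while recall is 0.75.</Paragraph>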
    </Section>
    <Section position="2" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
4.2 Template collapsing
</SectionTitle>
      <Paragraph position="0"> The second refinement discussed above consists of finding pairs of similar templates, collapsing them as appropriate, and thus creating patterns that generate new words that did not participate in the formation of the original templates. These new words may or may not themselves appear in the corpus. We are, however, able to judge their morphological well-formedness by inspection. Table 3 lists all eight templates that are collapsed in this step.</Paragraph>
      <Paragraph position="1"> All of the templates that are collapsed in this step have the same morphological structure (with one very minor exception): they are of the form subject marker + tense marker + stem. The collapsing induced by this procedure correctly creates larger templates of precisely the same structure, generating new words not seen in the corpus that are in fact correct according to our (non-native speaker) inspection. We also submitted the new words to Yahoo to test their "existence" by their presence on the internet, and found, on average, 87% of the words predicted by a template; see the last column in Table 3 for details.</Paragraph>
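      <Paragraph position="2"> The way a collapsed template generates unseen words can be sketched as follows. This is a simplified illustration under stated assumptions: a template is modeled as a list of columns (subject markers, tense markers, stems), collapsing is modeled as a column-wise union, and the similarity test that decides which pairs to collapse (section 3.2) is not reproduced. The two templates are illustrative and are not taken from Table 3; the set `attested` is a hypothetical stand-in for the web search.

```python
from itertools import product


def generate(template):
    """All words a template generates: one choice per column, concatenated."""
    return {"".join(parts) for parts in product(*template)}


def collapse(t1, t2):
    """Merge two templates column by column (assumed simplification)."""
    if len(t1) != len(t2):
        raise ValueError("templates must have the same number of columns")
    return [sorted(set(a).union(b)) for a, b in zip(t1, t2)]


def attested_fraction(words, attested):
    """Fraction of predicted words found in an external word list."""
    return sum(w in attested for w in words) / len(words) if words else 0.0


# Illustrative templates: subject marker + tense marker + stem.
t1 = [["a", "wa"], ["li", "ta"], ["soma", "sema"]]
t2 = [["a", "wa"], ["li", "na"], ["soma", "fanya"]]

merged = collapse(t1, t2)
# Words generated only by the collapsed template, not by either original.
new_words = generate(merged) - generate(t1) - generate(t2)
print(sorted(new_words))  # ['anasema', 'atafanya', 'wanasema', 'watafanya']

# Hypothetical list of forms confirmed by some external source.
attested = {"anasema", "wanasema", "watafanya"}
print(attested_fraction(new_words, attested))  # 0.75
```

The four new words arise because each original template lacks one tense and stem combination that the merged template supplies.</Paragraph>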
    </Section>
    <Section position="3" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
4.3 Reparsing
</SectionTitle>
      <Paragraph position="0"> After the preceding refinements, we obtain a number of robust FSAs, for example, the collapsed templates in Table 3. With them, we then search the corpus for words that can only be partly fitted into these FSAs, and generate the associated stems. Table 4 shows the reparsed words that had not been parsed by earlier templates, along with the newly added stems for some robust FSAs (the four collapsed templates in Table 3). Stems such as anza 'begin' and fanya 'do' are thus added to the first template, and all words derived by prepending a tense marker and a subject marker are indeed valid words. As the words in Table 4 suggest, the reparsing process adds new, common stems to the stem column of the templates, thus making it easier for the collapsing function to find similarities across related templates.</Paragraph>
      <Paragraph position="1"> Footnote to section 4.2: The exception involves the distinct morpheme po, a subordinate clause marker which must ultimately be analyzed as appearing in a distinct template column to the right of the tense markers.</Paragraph>
      <Paragraph position="2"> In future work, we will take the larger templates, populated with more stems, and input them to the collapsing function described in section 3.2.</Paragraph>
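      <Paragraph position="3"> The reparsing step described in this subsection can be sketched as follows. This is a rough approximation under our own assumptions, not the procedure that produced Table 4: a word is considered partly fitted when it begins with a prefix sequence the template generates and the remainder is not yet a known stem. The function name, the `min_stem_len` threshold, and the example words are hypothetical.

```python
from itertools import product


def reparse(words, prefix_columns, known_stems, min_stem_len=3):
    """Find words that fit the prefix columns but end in an unlisted stem.

    Returns a dict mapping each such word to its analysis, a tuple of
    the prefix morphemes followed by the newly proposed stem.
    """
    # Every prefix string the template can produce, e.g. "ali" or "wata".
    prefixes = {"".join(parts): parts for parts in product(*prefix_columns)}
    parsed = {}
    for word in words:
        for prefix, parts in prefixes.items():
            rest = word[len(prefix):]
            if (word.startswith(prefix)
                    and len(rest) >= min_stem_len
                    and rest not in known_stems):
                parsed[word] = parts + (rest,)
    return parsed


# Illustrative template columns and word list (assumed, not from Table 4).
prefix_columns = [["a", "wa"], ["li", "ta"]]
known_stems = {"soma", "sema"}
words = ["alianza", "walifanya", "atasoma", "kitabu"]

result = reparse(words, prefix_columns, known_stems)
new_stems = {analysis[-1] for analysis in result.values()}
print(sorted(new_stems))  # ['anza', 'fanya']
```

A filter this simple would overgenerate on a real corpus, which is one reason to restrict reparsing to robust FSAs as the text does.</Paragraph>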
    </Section>
  </Section>
</Paper>