<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1610">
<Title>Learning Domain-Specific Transfer Rules: An Experiment with Korean to English Translation</Title>
<Section position="6" start_page="1" end_page="1" type="evalu">
<SectionTitle> 5 Evaluation </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="1" end_page="1" type="sub_section">
<SectionTitle> 5.1 Systems Compared </SectionTitle>
<Paragraph position="0"> Babelfish As a first baseline, we used Babelfish from Systran, a commercial large-coverage MT system supporting Korean to English translation. This system was not trained on our corpus. GIZA++/RW As a second baseline, we used an off-the-shelf statistical MT system, consisting of the ISI ReWrite Decoder (Germann et al., 2001) together with a translation model produced by GIZA++ (Och and Ney, 2000) and a language model produced by the CMU Statistical Language Modeling Toolkit (Clarkson and Rosenfeld, 1997). This system was trained on our corpus only.</Paragraph>
<Paragraph position="1"> Lex Only As a third baseline, we used our system with the baseline transfer dictionary as the sole transfer resource.</Paragraph>
<Paragraph position="2"> Lex+Induced We compared the three baseline systems against our complete system, using the baseline transfer dictionary augmented with the induced transfer rules.</Paragraph>
<Paragraph position="3"> We ran each of the four systems on the test set of 50 Korean sentences described in Section 3.2 and compared the resulting translations using the automatic evaluation and the human evaluation described below.</Paragraph>
</Section>
<Section position="2" start_page="1" end_page="1" type="sub_section">
<SectionTitle> 5.2 Automatic Evaluation Results </SectionTitle>
<Paragraph position="0"> For the automatic evaluation, we used the Bleu metric from IBM (Papineni et al., 2001). The Bleu metric combines several modified N-gram precision measures (N = 1 to 4) and uses a brevity penalty to penalize translations that are shorter than the reference sentences.</Paragraph>
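<Paragraph> For reference, Bleu as defined by Papineni et al. (2001) is the brevity-penalized geometric mean of the modified N-gram precisions p_n; the equation below restates the standard definition (the notation is not taken from this paper):
\[
\mathrm{Bleu} = \mathrm{BP} \cdot \exp\Big( \sum_{n=1}^{N} w_n \log p_n \Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\]
with uniform weights w_n = 1/N (here N = 4), c the total length of the candidate translations, and r the effective reference length. Because the geometric mean is zero whenever any p_n is zero, a system with no matching 4-grams receives an overall score of zero.</Paragraph>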
<Paragraph position="1"> Table 5 shows the Bleu N-gram precision scores for each of the four systems. Our system (Lex+Induced) had better precision scores than each of the baseline systems, except in the case of 4-grams, where it slightly trailed Babelfish. The statistical baseline system (GIZA++/RW) performed poorly, as might have been expected given the small amount of training data.</Paragraph>
<Paragraph position="2"> Table 6 shows the Bleu overall precision scores.</Paragraph>
<Paragraph position="3"> Our system (Lex+Induced) improved substantially over both the Lex Only and Babelfish baseline systems. The score for the statistical baseline system (GIZA++/RW) is not meaningful, due to the absence of 4-gram matches.</Paragraph>
</Section>
<Section position="3" start_page="1" end_page="1" type="sub_section">
<SectionTitle> 5.3 Human Evaluation Results </SectionTitle>
<Paragraph position="0"> For the human evaluation, we asked two native English speakers to rank the quality of the translation results produced by the Babelfish, Lex Only, and Lex+Induced systems, with preference given to fidelity over fluency. (The translation results of the statistical system were not yet available when the evaluation was performed.) A rank of 1 was assigned to the best translation, a rank of 2 to the second best, and a rank of 3 to the third, with ties allowed.</Paragraph>
<Paragraph position="1"> Table 7 shows the pairwise comparisons of the three systems. The top section indicates that the Babelfish and Lex Only baseline systems are essentially tied, with neither system preferred more frequently than the other. In contrast, the middle and bottom sections show that our system improves substantially over both baseline systems; most strikingly, our system (Lex+Induced) was preferred nearly 20 percentage points more often than the Babelfish baseline (46% vs. 27%, with ties 27% of the time).</Paragraph>
</Section>
<Section position="4" start_page="1" end_page="1" type="sub_section">
<SectionTitle> Table 7: System Pair Comparison Results </SectionTitle>
<Paragraph position="0">
Babelfish better than Lex Only        37%
Lex Only better than Babelfish        36%
Babelfish same as Lex Only            27%

Babelfish better than Lex+Induced     27%
Lex+Induced better than Babelfish     46%
Babelfish same as Lex+Induced         27%

Lex Only better than Lex+Induced      18%
Lex+Induced better than Lex Only      41%
Lex Only same as Lex+Induced          41%
</Paragraph>
</Section>
</Section>
</Paper>
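As an illustration of how pairwise percentages like those in Table 7 can be tallied from ranked judgments, here is a minimal sketch, assuming per-judge rank data in a simple nested-dict form. The rank values below are invented placeholders, not the judgments from this study.

from itertools import combinations

# Minimal sketch of tallying pairwise preferences from ranked judgments.
# The rank data below is invented for illustration; it is NOT the paper's data.
# ranks[judge][sentence_id][system] = rank assigned (1 = best; ties allowed).
ranks = [
    {"s01": {"Babelfish": 1, "Lex Only": 2, "Lex+Induced": 1},
     "s02": {"Babelfish": 3, "Lex Only": 2, "Lex+Induced": 1}},
    {"s01": {"Babelfish": 2, "Lex Only": 2, "Lex+Induced": 1},
     "s02": {"Babelfish": 1, "Lex Only": 3, "Lex+Induced": 2}},
]

systems = ["Babelfish", "Lex Only", "Lex+Induced"]
counts = {pair: {"a": 0, "b": 0, "tie": 0} for pair in combinations(systems, 2)}

# Each judge's ranking of each sentence contributes one comparison per system pair.
for judge in ranks:
    for sentence_ranks in judge.values():
        for a, b in counts:
            if sentence_ranks[a] < sentence_ranks[b]:   # lower rank = better
                counts[(a, b)]["a"] += 1
            elif sentence_ranks[a] > sentence_ranks[b]:
                counts[(a, b)]["b"] += 1
            else:
                counts[(a, b)]["tie"] += 1

# Report each pair as percentages of all comparisons for that pair,
# mirroring the three-row groups of Table 7.
for (a, b), c in counts.items():
    total = c["a"] + c["b"] + c["tie"]
    print(f"{a} better than {b}: {c['a'] / total:.0%}")
    print(f"{b} better than {a}: {c['b'] / total:.0%}")
    print(f"{a} same as {b}: {c['tie'] / total:.0%}")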