<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1030">
  <Title>Reordering Constraints for Phrase-Based Statistical Machine Translation</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Corpus Statistics
</SectionTitle>
      <Paragraph position="0"> To investigate the effect of reordering constraints, we have chosen two Japanese-English tasks, because the word order in Japanese and English is rather different. The first task is the Basic Travel Expression Corpus (BTEC) task (Takezawa et al., 2002). The corpus statistics are shown in Table 1. This corpus consists of phrasebook entries.</Paragraph>
      <Paragraph position="1"> The second task is the Spoken Language DataBase (SLDB) task (Morimoto et al., 1994). This task consists of transcription of spoken dialogs in the domain of hotel reservation. Here, we use domain-specific training data in addition to the BTEC corpus. The corpus statistics of this additional corpus are shown in Table 2. The development corpus is the same for both tasks.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Evaluation Criteria
</SectionTitle>
      <Paragraph position="0"> WER (word error rate). The WER is computed as the minimum number of substitution, insertion and deletion operations that have to be performed to convert the generated sentence into the reference sentence.</Paragraph>
      <Paragraph position="1"> PER (position-independent word error rate). A shortcoming of the WER is that it requires a perfect word order. The word order of an acceptable sentence can be different from that of the target sentence, so that the WER measure alone could be misleading. The PER compares the words in the two sentences ignoring the word order.</Paragraph>
      <Paragraph position="2"> BLEU. This score measures the precision of unigrams, bigrams, trigrams and fourgrams with respect to a reference translation with a penalty for too short sentences (Papineni et al., 2002). The BLEU score measures accuracy, i.e. large BLEU scores are better.</Paragraph>
      <Paragraph position="3"> NIST. This score is similar to BLEU. It is a weighted n-gram precision in combination with a penalty for too short sentences (Doddington, 2002). The NIST score measures accuracy, i.e. large NIST scores are better.</Paragraph>
      <Paragraph position="4"> Note that for each source sentence, we have as many as 16 references available. We compute all the preceding criteria with respect to multiple references.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 System Comparison
</SectionTitle>
      <Paragraph position="0"> In Table 3 and Table 4, we show the translation results for the BTEC task. First, we observe that the overall quality is rather high on this task. The average length of the used alignment templates is about five source words in all systems. The monotone search (mon) shows already good performance on short sentences with less than 10 words. We conclude that for short sentences the reordering is captured within the alignment templates. On the other hand, the monotone search degrades for long sentences with at least 10 words resulting in a WER of 16:6% for these sentences.</Paragraph>
      <Paragraph position="1"> We present the results for various nonmonotone search variants: the first one is with the IBM constraints (skip) as described in Section 3.1. We allow for skipping one or two phrases. Our experiments showed that if we set the maximum number of phrases to be skipped to three or more the translation results are equivalent to the search without any reordering constraints (free). The results for the ITG constraints as described in Section 3.2 are also presented.</Paragraph>
      <Paragraph position="2"> The unconstrained reorderings improve the total translation quality down to a WER of 11:5%. We see that especially the long sentences benefit from the reorderings resulting in an improvement from 16:6% to 13:8%. Comparing the results for the free reorderings and  for the BTEC task (510 sentences). Sentence lengths: short: &lt; 10 words, long: , 10 words; times in milliseconds per sentence.</Paragraph>
      <Paragraph position="3">  the ITG reorderings, we see that the ITG system always outperforms the unconstrained system. The improvement on the whole test set is statistically significant at the 95% level.1 In Table 5 and Table 6, we show the results for the SLDB task. First, we observe that the overall quality is lower than for the BTEC task. The SLDB task is a spoken language translation task and the training corpus for spoken language is rather small. This is also reflected in the average length of the used alignment templates that is about three source words compared to about five words for the BTEC task.</Paragraph>
      <Paragraph position="4"> The results on this task are similar to the results on the BTEC task. Again, the ITG constraints perform best. Here, the improvement compared to the unconstrained search is statistically significant at the 99% level. Compared to the monotone search, the BLEU score for the ITG constraints improves from 54.4% to 57.1%.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>