<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1033"> <Title>A Hierarchical Phrase-Based Model for Statistical Machine Translation</Title> <Section position="6" start_page="267" end_page="268" type="evalu"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> Our experiments were on Mandarin-to-English translation. We compared a baseline system, the state-of-the-art phrase-based system Pharaoh (Koehn et al., 2003; Koehn, 2004a), against our system. For all three systems we trained the translation model on the FBIS corpus (7.2M+9.2M words); for the language model, we used the SRI Language Modeling Toolkit to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on 155M words of English newswire text, mostly from the Xinhua portion of the Gigaword corpus. We used the 2002 NIST MT evaluation test set as our development set, and the 2003 test set as our test set. Our evaluation metric was BLEU (Papineni et al., 2002), as calculated by the NIST script (version 11a) with its default settings, which is to perform case-insensitive matching of n-grams up to n = 4, and to use the shortest (as opposed to nearest) reference sentence for the brevity penalty. The results of the experiments are summarized in Table 1.</Paragraph> <Section position="1" start_page="267" end_page="267" type="sub_section"> <SectionTitle> 5.1 Baseline </SectionTitle> <Paragraph position="0"> The baseline system we used for comparison was Pharaoh (Koehn et al., 2003; Koehn, 2004a), as publicly distributed. We used the default feature set: language model (same as above), p(f̄ | ē), p(ē | f̄), lexical weighting (both directions), distortion model, word penalty, and phrase penalty. 
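For readers unfamiliar with the metric, the BLEU scoring described above can be sketched as follows. This is an illustrative reimplementation, not the NIST mteval-v11a script itself; all function names are our own. It follows the description in the text: case-insensitive matching of n-grams up to n = 4, and a brevity penalty based on the shortest reference length.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as a multiset."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidates, reference_sets, max_n=4):
    """Corpus-level BLEU sketch: case-insensitive n-gram matching up to
    max_n, with the brevity penalty computed from the *shortest*
    (as opposed to nearest) reference, per the paper's description."""
    match = [0] * max_n   # clipped n-gram matches, per order
    total = [0] * max_n   # candidate n-gram counts, per order
    cand_len = 0
    ref_len = 0
    for cand, refs in zip(candidates, reference_sets):
        cand = cand.lower().split()
        refs = [r.lower().split() for r in refs]
        cand_len += len(cand)
        ref_len += min(len(r) for r in refs)   # shortest reference
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            max_ref = Counter()
            for r in refs:
                for g, c in ngrams(r, n).items():
                    max_ref[g] = max(max_ref[g], c)
            match[n - 1] += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
            total[n - 1] += max(len(cand) - n + 1, 0)
    if 0 in match:
        return 0.0   # any empty n-gram order zeroes the geometric mean
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return bp * math.exp(log_prec)
```

A candidate identical to its reference scores 1.0; a candidate sharing no bigram with any reference scores 0.0 under the geometric mean.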
We ran the trainer with its default settings (maximum phrase length 7), and then used Koehn's implementation of minimum-error-rate training (Och, 2003) to tune the feature weights to maximize the system's BLEU score on our development set, yielding the values shown in Table 2. Finally, we ran the decoder on the test set, pruning the phrase table with b = 100, pruning the chart with b = 100, β = 10^-5, and limiting distortions to 4. These are the default settings, except for the phrase table's b, which was raised from 20, and the distortion limit. Both of these changes, made by Koehn's minimum-error-rate trainer by default, improve performance on the development set.</Paragraph> <Paragraph position="1"> [Figure 2 caption fragment: rules after filtering for the development set; all have X for their left-hand sides.]</Paragraph> </Section> <Section position="2" start_page="267" end_page="268" type="sub_section"> <SectionTitle> 5.2 Hierarchical model </SectionTitle> <Paragraph position="0"> We ran the training process of Section 3 on the same data, obtaining a grammar of 24M rules. When filtered for the development set, the grammar has 2.2M rules (see Figure 2 for examples). We then ran the minimum-error-rate trainer with our decoder to tune the feature weights, yielding the values shown in Table 2. Note that λ_g penalizes the glue rule much less than λ_pp does ordinary rules. This suggests that the model will prefer serial combination of phrases, unless some other factor supports the use of hierarchical phrases (e.g., a better language model score). We then tested our system, using the settings described above.⁴ Our system achieves an absolute improvement of 0.02 over the baseline (7.5% relative), without using any additional training data. 
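The phrase-table pruning mentioned above, keeping at most b translation candidates per source phrase, is histogram pruning. A minimal sketch, assuming a dict-of-lists table representation (the function name and data layout are illustrative, not Pharaoh's internals):

```python
def prune_phrase_table(table, b=100):
    """Histogram pruning: for each source phrase, keep only the b
    highest-scoring translation candidates.  `table` maps a source
    phrase to a list of (target_phrase, score) pairs, higher = better."""
    return {
        src: sorted(cands, key=lambda tc: tc[1], reverse=True)[:b]
        for src, cands in table.items()
    }
```

Threshold pruning (the β parameter above) would additionally discard chart items whose score falls below β times the best score in the same cell; histogram pruning alone suffices to illustrate the b limit.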
This difference is statistically significant (p < 0.01).⁵ See Table 1, which also shows that the relative gain is higher for higher n-grams.</Paragraph> <Paragraph position="1"> Footnote 4: Note that we gave Pharaoh wider beam settings than we used on our own decoder; on the other hand, since our decoder's chart has more cells, its b limits do not need to be as high. Footnote 5 (fragment): ...which uses bootstrap resampling (Koehn, 2004b); it was modified to conform to NIST's current definition of the BLEU brevity penalty.</Paragraph> <Paragraph position="2"> [Table 2 caption fragment: ...to one). Word = word penalty; Phr = phrase penalty. Note that we have inverted the sense of Pharaoh's phrase penalty so that a positive weight indicates a penalty.]</Paragraph> </Section> <Section position="3" start_page="268" end_page="268" type="sub_section"> <SectionTitle> 5.3 Adding a constituent feature </SectionTitle> <Paragraph position="0"> The use of hierarchical structures opens the possibility of making the model sensitive to syntactic structure. Koehn et al. (2003) mention German ⟨es gibt, there is⟩ as an example of a good phrase pair which is not a syntactic phrase pair, and report that favoring syntactic phrases does not improve accuracy. But in our model, the rule (19) X → ⟨es gibt X₁, there is X₁⟩ would indeed respect syntactic phrases, because it builds a pair of Ss out of a pair of NPs. Thus, favoring subtrees in our model that are syntactic phrases might provide a fairer way of testing the hypothesis that syntactic phrases are better phrases.</Paragraph> <Paragraph position="1"> This feature adds a factor to (17),</Paragraph> <Paragraph position="3"> as determined by a statistical tree-substitution-grammar parser (Bikel and Chiang, 2000), trained on the Penn Chinese Treebank, version 3 (250k words). Note that the parser was run only on the test data and not the (much larger) training data. 
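Since equation (17) and the added factor did not survive extraction, the following is only a hedged sketch of how such a constituent feature could enter a log-linear derivation score: each rule span that coincides with a parser constituent contributes a factor exp(λ_c), and every other span contributes a neutral factor of 1. The span representation and all names are illustrative, not the paper's exact formulation.

```python
import math

def constituent_factor(span, constituents, lambda_c=1.0):
    """Contribute exp(lambda_c) when the source span covered by a rule
    coincides with a parser constituent; otherwise contribute 1."""
    return math.exp(lambda_c) if span in constituents else 1.0

def score_with_constituents(base_score, spans, constituents, lambda_c=1.0):
    """Rescore a derivation: the base log-linear score times the
    constituent factor of every rule span (a sketch, not equation 17)."""
    score = base_score
    for span in spans:
        score *= constituent_factor(span, constituents, lambda_c)
    return score
```

With λ_c tuned by minimum-error-rate training alongside the other weights, a positive λ_c rewards derivations whose hierarchical phrases respect parser constituents.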
Rerunning the minimum-error-rate trainer with the new feature yielded the feature weights shown in Table 2.</Paragraph> <Paragraph position="4"> Although the feature improved accuracy on the development set (from 0.314 to 0.322), it gave no statistically significant improvement on the test set.</Paragraph> </Section> </Section> </Paper>