<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2036">
  <Title>References</Title>
  <Section position="3" start_page="3" end_page="3" type="intro">
    <SectionTitle>
2 Experimental Results
</SectionTitle>
    <Paragraph position="0"> The translation system is tested on a Chinese-to-English translation task. The training data come from several news sources. For testing, we use the DARPA/NIST MT 2001 dry-run testing data, which consists of 793 sentences with 20; 333 words arranged in 80 documents.2 The training data is provided by the LDC and labeled by NIST as the Large Data condition for the MT 2002 evaluation. The Chinese sentences are segmented into words.</Paragraph>
    <Paragraph position="1"> The training data contains 23:7 million Chinese and 25:3 million English words.</Paragraph>
    <Paragraph position="2"> Experimental results are presented in Table 1 and Table 2. Table 1 shows the effect of the unigram threshold. The second column shows the number of blocks selected.</Paragraph>
    <Paragraph position="3"> The third column reports the BLEU score (Papineni et al., 2002) along with 95% confidence interval. We use IBM 2We did not use the first 25 documents of the 105-document dry-run test set because they were used as a development test set before the dry-run and were subsequently added to our training data.</Paragraph>
    <Paragraph position="4">  score. The maximum phrase length is 8.</Paragraph>
    <Paragraph position="5">  Model 1 as a baseline model which is similar to our block model: neither model uses distortion or alignment probabilities. The best results are obtained for the N2 and the N3 sets.</Paragraph>
    <Paragraph position="6"> The N3 set uses only 1:22 million blocks in contrast to N2 which has 4:23 million blocks. This indicates that the number of blocks can be reduced drastically without affecting the translation performance significantly. Table 2 shows the effect of the maximum phrase length on the BLEU score for the N2 block set. Including blocks with longer phrases actually helps to improve performance, although length 4 already obtains good results.</Paragraph>
    <Paragraph position="7"> We also ran the N2 on the June 2002 DARPA TIDES Large Data evaluation test set. Six research sites and four commercial off-the-shelf systems were evaluated in Large Data track. A majority of the systems were phrase-based translation systems. For comparison with other sites, we quote the NIST score (Doddington, 2002) on this test set: N2 system scores 7.44 whereas the official top two systems scored 7.65 and 7.34 respectively.</Paragraph>
  </Section>
</Paper>