<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0832">
  <Title>Gaming Fluency: Evaluating the Bounds and Expectations of Segment-based Translation Memory</Title>
  <Section position="5" start_page="178" end_page="180" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="178" end_page="178" type="sub_section">
      <SectionTitle>
4.1 An upper bound for whole-sentence TM
</SectionTitle>
      <Paragraph position="0"> Figure 1 shows the maximum possible bleu score that can an oracle can achieve by selecting the best English-side segment from the parallel text. The upper bound achieved here is a bleu score of 17.7, and this number is higher than the best performing system in the corresponding NIST evaluation.</Paragraph>
      <Paragraph position="1"> Note the log-linear growth in the resulting  bleu score of the TM with increasing database size. As the database is increased by a factor of ten, the TM gains approximately 5 points of bleu. While this trend has a natural limit at 20 orders of magnitude, it is unlikely that this amount of text, let alone parallel text, will be a indexed in the foreseeable future. This rate is more useful in interpolation, giving an idea of how much could be gained from adding to corpora that are smaller than 7.5 million segments.</Paragraph>
    </Section>
    <Section position="2" start_page="178" end_page="178" type="sub_section">
      <SectionTitle>
4.2 The effect of ngram size on Chinese tf-idf retrieval
</SectionTitle>
      <Paragraph position="0"> Figure 2 shows that our best performance is realized when IR queries are composed of cumulative 4-grams (i.e. unigrams + bigrams + trigrams + 4-grams). As hypothesized, while longer sequences are not important in document retrieval in Chinese IR, they convey information that is useful in segment retrieval in the translation memory. For the remainder of the experiments, we restrict ourselves to cumulative 4-gram queries.</Paragraph>
      <Paragraph position="1"> Note that the 4-gram result here (bleu of 5.87) provides the baseline system performance measure as well as the value when the segments are reranked according to tf-idf .</Paragraph>
    </Section>
    <Section position="3" start_page="178" end_page="180" type="sub_section">
      <SectionTitle>
4.3 Upper bounds for tf-idf
</SectionTitle>
      <Paragraph position="0"> Figure 3 gives the n-best list rescoring bounds.</Paragraph>
      <Paragraph position="1"> The upper bound continues to increase up to the top 1000 results. The plateau achieved after 1000 IR results suggests that is little to be gained from further IR engine retrieval.</Paragraph>
      <Paragraph position="2"> Note the log-linear growth in the bleu score the oracle achieves as the n-best list extends on the left side of the figure. As the list length is increased by a factor of ten, the oracle upper bound on performance increases by roughly 3 points of bleu. Of course, for a system to perform as well as the oracle does becomes progressively harder as the n-best list size increases. Comparing this result with the experiment in section 4.1 indicates that making the oracle choose among Chinese source language IR results and limiting its view to the 1000 results given by the IR engine incurs only a minor reduction of the oracle's bleu score, from 17.7 to  16.3. This is one way to measure the impact of crossing this particular language barrier and using IR rather than exhaustive search.</Paragraph>
      <Paragraph position="3">  numbers of translation pairs returned by IR engine, where the optimal segment is chosen from the results by an oracle.</Paragraph>
    </Section>
    <Section position="4" start_page="180" end_page="180" type="sub_section">
      <SectionTitle>
4.4 Using automated MT metrics to pick the best TM sentence
</SectionTitle>
      <Paragraph position="0"> Each metric was run on the top 1000 results from the IR engine, on cumulative 4-gram queries. Each metric was given the (Chinese) evaluation corpus segment as the single reference, and scored the Chinese side of each of the 1000 resulting translation pairs against that reference. The hypothesis document for each metric consisted of the English side of the translation pair with the best score for each segment. These documents were scored with bleu against the reference corpus. Ties (e.g. cases where a metric gave all 1000 pairs the same score) were broken with tf-idf.</Paragraph>
      <Paragraph position="1"> Results of the rescoring experiment run on  picking the best translation from 100 translation pairs returned by the IR engine.</Paragraph>
      <Paragraph position="2"> an n-best list of size 100 are given in Table 3. Choosing from 1000 pairs did not give better results. Choosing from only 10 gave worse results. The random baseline given in the table represents the expected score from choosing randomly among the top 100 IR returns. While the scores of the individual metrics aside from per and bleu reveal no differences, bleu and the combination metric performed better than the individual metrics.</Paragraph>
      <Paragraph position="3"> Surprisingly, tf-idf was outperformed only by bleu and the combination metric. While we hoped to gain much more from n-bestlist rescoring on this task, reaching toward the limits discovered in section 4.3, the combination metric was less than 0.5 bleu points below the lower range of systems that were entered in the NIST 2002 evals. The bleu scores of research systems in that competition roughly ranged between 7 and 15. Of course, each of the segments produced by the TM exhibit perfect fluency.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>