<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-5003">
  <Title>Using Machine Translation Evaluation Techniques to Determine Sentence-level Semantic Equivalence</Title>
  <Section position="6" start_page="21" end_page="22" type="evalu">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
5.1 MSRP Corpus
</SectionTitle>
      <Paragraph position="0"> The results for the jackknifing experiments are shown in Table 2 and the results using the provided training and test sets are shown in Table 3.</Paragraph>
      <Paragraph position="1"> In the tables the rows labeled &amp;quot;PER POS+&amp;quot;, refer to models built using feature vectors made by combining both the PER and POS+ feature vectors. The rows labeled POS refer to models built from the combination of features from the POS+ and POS- models. The rows labeled ALL refer to models built from combining all of the features used in these experiments.</Paragraph>
      <Paragraph position="2"> The results show that decomposing the PER edit distance score into components for each POS tag is not able to better the classification performance of PER. The accuracy (jackknifing) for PER alone was 71.25% and the accuracy for the analogous technique which divides this information in contributions for each POS tag (POS-) was 70.99%. However, when the features from PER and POS- are combined there is an improvement in performance (to 72.71%) indicating that the components for each POS tag are useful, but only in addition to the more primitive feature encoding the total edit distance. Moreover, comparing the results from POS-, POS+ and POS it is clear that there lot to be gained by considering the contributions from both the matching words and the non-matching words. Using both together gives a classification performance of 74.2% whereas using either component in isolation can give a performance no better than 71.5%.</Paragraph>
      <Paragraph position="3"> The one of the worst performing systems was that based on the WER score. However, it is possible that the way the sentences were selected handicapped this system, since only sentences pairs with a word-basedLevenshtein distance of 8 or higher were included in the corpus. Choosing sentence pairs with larger edit distances makes large structural differences more likely, and the  editingeffortneededtocorrectsuchstructuraldifferences may obscure the lexical comparison that this score relies upon.</Paragraph>
      <Paragraph position="4"> The results for the BLEU score were unexpected because the performance degrades as the order of n-gram considered increases. This effect is much less apparent in the NIST scores where the performance degrades but to a lesser extent.</Paragraph>
      <Paragraph position="5"> Paraphrases exhibit variety in their grammatical structure and perhaps changes in word ordering can explain this effect. If so, the geometric mean employed in the BLEU score would make the effect of higher order n-grams considerably more detrimental than with the arithmetic mean used in the NIST score.</Paragraph>
    </Section>
    <Section position="2" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
5.2 PASCAL Challenge Corpus
</SectionTitle>
      <Paragraph position="0"> The results for the PASCAL corpus are given in Table 4. As expected our results are consistent with those of (Perez and Alfonseca, 2005). The 5% overall gain in accuracy may be accounted for by the stemming and synonym extensions to our technique and the fact that we used BLEU1.</Paragraph>
      <Paragraph position="1"> Our approach also differs by being symmetrical over source and reference sentences, however it is not clear whether this would improve performance. The number of test examples for the sub-experiments for each task is low (50 to 150), therefore the results here are likely to be noisy, but it is apparent from our results that the CD task is the most suitable for approaches based on word/n-gram matching. Our POS technique performed well on overall and particularly well on theCDandMTtasks, buttheoverallperformance improvement relative to the other techniques is not as clear-cut. We believe this is due to difficulties arising from the asymmetrical nature of the data, and we explore this in the next section.</Paragraph>
    </Section>
    <Section position="3" start_page="21" end_page="22" type="sub_section">
      <SectionTitle>
5.3 Sentence length similarity
</SectionTitle>
      <Paragraph position="0"> In this experiment we investigate whether there is any advantage to be gained by using these techniques on corpora consisting of sentence pairs of similar length. Both the BLEU and NIST scores use some form of count of the total number of n-grams in the denominator of their n-gram precision formulae. When the sentences differ in length, the total number of n-grams is likely to be large in relation to the number of matching n-grams since this is bounded by the number of n-grams in the shorter sentence. This may result in an increase in the 'noise' in the score due to variations in sentence length similarity, degrading its effectiveness. To address the more general issue of whether sentence length similarity has an impact on the effectiveness of these techniques we  sorted the sentences pairs of the MSRP corpus according to the length difference ratio (LDR) defined in Section 3, and partitioned the sorted corpusintotwo: lowandhighLDR.Wethenselected as many sentences as possible from the corpus such that the training and test sets for each data set (high and low LDR) contained the same numberpositiveandnegativeexamples. Thisgavetwo sets (high and low LDR) of 1008 training examples and 438 test examples, all training and test data consisiting of 50% positive and 50% negative examples. The results are shown in Table 5. Theexperimentalresultsvalidateourconcerns. In all of the cases the performance was higher on the data with low LDR. Moreover, the effect was mostfortheBLEUandNISTscoresforwhichwe have an explanation of the cause.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>