<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1047">
  <Title>Decoding Algorithm in Statistical Machine Translation</Title>
  <Section position="7" start_page="369" end_page="371" type="evalu">
    <SectionTitle>
5 Performance
</SectionTitle>
    <Paragraph position="0"> We tested the performance of the decoders with the scheduling corpus (Suhm et al., 1995). Around 30,000 parallel sentences (400,000 words altogether for both languages) were used to train the IBM model 2 and the simplified model with the EM algorithm. A larger English monolingual corpus with around 0.5 million words was used to train a bigram language model. The lexicon contains 2,800 English and 4,800 German words in morphologically inflected forms. We did not do any preprocessing/analysis of the data as reported in (Brown et al., 1992).</Paragraph>
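The bigram language model mentioned above can be sketched as follows. This is a minimal illustration only: the paper does not describe its estimation details, so the add-one smoothing, the boundary tokens, and the toy corpus are assumptions, not the authors' setup.

```python
from collections import defaultdict

def train_bigram_lm(sentences):
    """Train a bigram language model with add-one smoothing.

    `sentences` is a list of token lists; <s> and </s> mark sentence
    boundaries. Returns a function p(word | prev) over the vocabulary.
    """
    unigram = defaultdict(int)   # counts of each word as a left context
    bigram = defaultdict(int)    # counts of each (prev, word) pair
    vocab = set()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for prev, word in zip(padded, padded[1:]):
            unigram[prev] += 1
            bigram[(prev, word)] += 1
    V = len(vocab)
    def prob(word, prev):
        # Add-one (Laplace) smoothing so unseen bigrams keep mass.
        return (bigram[(prev, word)] + 1) / (unigram[prev] + V)
    return prob

# Toy corpus standing in for the 0.5-million-word English corpus.
corpus = [["i", "have", "a", "meeting"], ["i", "have", "time"]]
p = train_bigram_lm(corpus)
```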
    <Section position="1" start_page="369" end_page="369" type="sub_section">
      <SectionTitle>
5.1 Decoder Success Rate
</SectionTitle>
      <Paragraph position="0"> Table 1 shows the success rate of the three models/decoders. As we mentioned before, the comparison between hypotheses of different sentence lengths made the single-stack search for IBM model 2 fail (return without a result) on a majority of the test sentences. The multi-stack decoder improved on this, and the simplified model/decoder produced an output for all 120 test sentences.</Paragraph>
    </Section>
    <Section position="2" start_page="369" end_page="370" type="sub_section">
      <SectionTitle>
5.2 Translation Accuracy
</SectionTitle>
      <Paragraph position="0"> Unlike the case in speech recognition, it is quite arguable what &quot;accurate translations&quot; means. In speech recognition an output can be compared with the sample transcript of the test data. In machine translation, a sentence may have several legitimate translations, so it is difficult to compare an output from a decoder with a single designated translation. Instead, we used human subjects to judge the machine-made translations. The translations were classified into three categories (see footnote 1).</Paragraph>
      <Paragraph position="1">  1. Correct translations: translations that are grammatical and convey the same meaning as the inputs.</Paragraph>
      <Paragraph position="2"> 2. Okay translations: translations that convey the same meaning but with small grammatical mistakes or translations that convey most but not the entire meaning of the input.</Paragraph>
      <Paragraph position="3"> 3. Incorrect translations: translations that are ungrammatical, convey little meaningful information, or convey information different from the input.</Paragraph>
      <Paragraph position="4"> Examples of correct, okay, and incorrect translations are shown in Table 2.</Paragraph>
      <Paragraph position="5"> Table 3 shows the statistics of the translation results. The accuracy was calculated by crediting a correct translation 1 point and an okay translation 1/2 point.</Paragraph>
      <Paragraph position="6"> There are two different kinds of errors in statistical machine translation. A modeling error occurs when the model assigns a higher score to an incorrect translation than to a correct one. We cannot do anything about this with the decoder. A decoding error, or search error, happens when the search algorithm misses a correct translation with a higher score. (Footnote 1: This is roughly the same as the classification in IBM statistical translation, except that we do not have &quot;legitimate translation that conveys different meaning from the input&quot; -- we did not observe this case in our outputs.) [Table 2 examples; in each triple, the first line is the input German sentence, the second line is the human-made (target) translation for that input, and the third line is the output from the decoder: ich habe ein Meeting von halb zehn bis um zwölf / I have a meeting from nine thirty to twelve / I have a meeting from nine thirty to twelve. versuchen wir sollten es vielleicht mit einem anderen Termin / we might want to try for some other time / we should try another time. ich glaube nicht dass ich noch irgend etwas im Januar frei habe / I do not think I have got anything open in January / I think I will not free in January. ich glaube wir sollten ein weiteres Meeting vereinbaren / I think we have to have another meeting / I think we should fix a meeting. schlagen Sie doch einen Termin vor / why don't you suggest a time / why you an appointment. ich habe Zeit für den Rest des Tages / I am free the rest of it / I have time for the rest of July.]</Paragraph>
      <Paragraph position="7"> When evaluating a decoding algorithm, it would be attractive if we could tell how many errors are caused by the decoder. Unfortunately, this is not attainable. Suppose that we are going to translate a German sentence g, and we know from the sample that e is one of its possible English translations. The decoder outputs an incorrect e' as the translation of g. If the score of e' is lower than that of e, we know that a search error has occurred. On the other hand, if the score of e' is higher, we cannot decide whether it is a modeling error or not, since there may still be other legitimate translations with a score higher than that of e' -- we just do not know what they are.</Paragraph>
      <Paragraph position="8"> Although we cannot distinguish a modeling error from a search error, the comparison between the score of the decoder output and that of a sample translation can still reveal some information about the performance of the decoder. If we know that the decoder can find a sentence with a better score than a &quot;correct&quot; translation, we can be more confident that the decoder is less prone to cause errors. Table 4 shows the comparison between the scores of the outputs from the decoder and the scores of the sample translations when the outputs are incorrect. In most cases, the incorrect outputs have a higher score than the sample translations. Again, we count an &quot;okay&quot; translation as half an error here.</Paragraph>
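The score comparison described above can be expressed as a small diagnostic routine. This is a sketch of the reasoning only; the function name and the use of log-probability scores as plain floats are assumptions:

```python
def diagnose_error(score_output, score_reference):
    """Classify an incorrect decoder output by comparing model scores.

    If the incorrect output scores below the reference translation, the
    search missed a better-scoring candidate: a search error. If it
    scores higher, a modeling error is possible but cannot be confirmed,
    since an unseen legitimate translation might still outscore it.
    """
    if score_output < score_reference:
        return "search error"
    return "inconclusive (possible modeling error)"

# Scores here are hypothetical log-probabilities.
print(diagnose_error(-105.2, -98.7))  # decoder output scored worse
print(diagnose_error(-95.1, -98.7))   # decoder output scored better
```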
      <Paragraph position="9"> This result hints that model deficiencies may be a major source of errors. The models we used here are very simple. With a more sophisticated model, more training data, and possibly some preprocessing, the total error rate is expected to decrease.</Paragraph>
    </Section>
    <Section position="3" start_page="370" end_page="371" type="sub_section">
      <SectionTitle>
5.3 Decoding Speed
</SectionTitle>
      <Paragraph position="0"> Another important issue is the efficiency of the decoder. Figure 3 plots the average number of states being extended by the decoders. It is grouped according to the input sentence length, and evaluated on those sentences on which the decoder succeeded.</Paragraph>
      <Paragraph position="1"> The average number of states being extended in the model 2 single stack search is not available for long sentences, since the decoder failed on most of the long sentences.</Paragraph>
      <Paragraph position="2"> The figure shows that the simplified model/decoder works much more efficiently than the other models/decoders.</Paragraph>
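The aggregation behind Figure 3 -- averaging extended states per decoder, grouped by input sentence length, over succeeded sentences only -- can be sketched as follows. The tuple layout and sample values are assumptions for illustration:

```python
from collections import defaultdict

def mean_states_by_length(runs):
    """Average the number of extended search states, grouped by input
    sentence length, counting only sentences the decoder succeeded on.

    `runs` is a list of (sentence_length, states_extended, succeeded).
    """
    totals = defaultdict(lambda: [0, 0])  # length -> [state sum, count]
    for length, states, succeeded in runs:
        if succeeded:
            totals[length][0] += states
            totals[length][1] += 1
    return {length: s / n for length, (s, n) in totals.items()}

# Hypothetical decoder runs; long-sentence failures drop out of the
# average, which is why some points are missing for the single-stack
# model 2 decoder.
runs = [(5, 120, True), (5, 80, True), (12, 4000, False)]
```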
    </Section>
  </Section>
</Paper>