<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1038"> <Title>Discriminative Training and Maximum Entropy Models for Statistical Machine Translation</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"> We present results on the VERBMOBIL task, a speech translation task in the domain of appointment scheduling, travel planning, and hotel reservation (Wahlster, 1993). Table 1 shows the corpus statistics of this task. We use a training corpus to train the alignment template model and the language models, a development corpus to estimate the model scaling factors, and a test corpus.</Paragraph> <Paragraph position="1"> There does not exist one generally accepted criterion for the evaluation of experimental results in machine translation. Therefore, we use a large variety of different criteria and show that the obtained results improve on most or all of them. In all experiments, we use the following evaluation criteria: + SER (sentence error rate): The SER is computed as the number of generated sentences that do not correspond exactly to one of the reference translations used for the maximum entropy training.</Paragraph> <Paragraph position="2"> + WER (word error rate): The WER is computed as the minimum number of substitution, insertion, and deletion operations that have to be performed to convert the generated sentence into the target sentence.</Paragraph> <Paragraph position="3"> + PER (position-independent WER): A shortcoming of the WER is that it requires a perfect word order. The word order of an acceptable sentence can differ from that of the target sentence, so the WER alone could be misleading. To overcome this problem, we introduce the position-independent word error rate (PER) as an additional measure. It compares the words in the two sentences while ignoring the word order (a code sketch of the WER and PER computations follows this list).</Paragraph> <Paragraph position="4"> + mWER (multi-reference word error rate): For each test sentence, not just a single reference translation is used, as for the WER, but a whole set of reference translations. For each translation hypothesis, the edit distance to the most similar reference sentence is calculated (Niessen et al., 2000).</Paragraph> <Paragraph position="5"> + BLEU score: This score measures the precision of unigrams, bigrams, trigrams, and four-grams with respect to a whole set of reference translations, with a penalty for sentences that are too short (Papineni et al., 2001). Unlike all other evaluation criteria used here, BLEU measures accuracy, i.e. the opposite of error rate; hence, larger BLEU scores are better.</Paragraph> <Paragraph position="6"> + SSER (subjective sentence error rate): For a more detailed analysis, subjective judgments by test persons are necessary. Each translated sentence was judged by a human examiner according to an error scale from 0.0 to 1.0 (Niessen et al., 2000).</Paragraph> <Paragraph position="7"> + IER (information item error rate): The test sentences are segmented into information items.</Paragraph> <Paragraph position="8"> For each of them, the item is counted as correct if the intended information is conveyed and there are no syntactic errors (Niessen et al., 2000).</Paragraph>
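To make the automatic criteria above concrete, here is a minimal sketch of the WER and PER computations for a single hypothesis/reference pair. It assumes whitespace tokenization and single-reference evaluation; the function names and the PER formulation (one common variant) are illustrative and not taken from the paper.

from collections import Counter

def wer(hyp: str, ref: str) -> float:
    """WER: minimum number of word substitutions, insertions and
    deletions turning hyp into ref, divided by the reference length."""
    h, r = hyp.split(), ref.split()
    # One-row Levenshtein dynamic program over words.
    d = list(range(len(r) + 1))
    for i in range(1, len(h) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(r) + 1):
            cur = min(d[j] + 1,                          # deletion
                      d[j - 1] + 1,                      # insertion
                      prev + (h[i - 1] != r[j - 1]))     # substitution / match
            prev, d[j] = d[j], cur
    return d[-1] / len(r)

def per(hyp: str, ref: str) -> float:
    """PER: like WER but ignoring word order; this common formulation
    counts the words of the longer sentence not matched in the other."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    matches = sum((h & r).values())
    return (max(sum(h.values()), sum(r.values())) - matches) / sum(r.values())

print(wer("how are you", "how do you do"))  # 0.5: one substitution, one insertion
print(per("you are how", "how are you"))    # 0.0: same words, order ignored

The multi-reference mWER then follows by taking the minimum of wer over the whole set of reference translations for each hypothesis.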
<Paragraph position="9"> In the following, we present the results of this approach. Table 2 shows the results if we use a direct translation model (Eq. 6).</Paragraph> <Paragraph position="10"> As baseline features, we use a normal word trigram language model and the three component models of the alignment templates. The first row shows the results using only the four baseline features with all model scaling factors set to one. The second row shows the result if we train the model scaling factors. We see a systematic improvement on all error rates. The following three rows show the results if we add the word penalty, an additional class-based five-gram language model, and the conventional dictionary as features. We observe improved error rates for using the word penalty and the class-based language model as additional features. (Table 2 legend: CLM: class-based language model (five-gram); MX: conventional dictionary; the columns report objective criteria [%] and subjective criteria [%].)</Paragraph> <Paragraph position="14"> Figure 3 shows how the sentence error rate (SER) on the test corpus improves during the iterations of the GIS algorithm for maximum entropy training of the alignment templates. We see that the sentence error rate converges after about 4000 iterations. We do not observe significant overfitting.</Paragraph> <Paragraph position="15"> Table 3 shows the resulting normalized model scaling factors. Multiplying each model scaling factor by a constant positive value does not affect the decision rule (a small numeric sketch of this invariance follows the section). We see that adding new features also has an effect on the other model scaling factors.</Paragraph> </Section> </Paper>
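As a complement to the remark on normalized scaling factors, the following is a minimal, self-contained sketch of the log-linear decision rule used in this framework. All feature values and weights below are made-up examples, not numbers from the paper; it only illustrates that rescaling all scaling factors by a positive constant leaves the argmax, and hence the chosen translation, unchanged, which is why Table 3 can report the factors in normalized form.

# Minimal sketch of the log-linear decision rule; the feature values
# (h_m, e.g. log model scores) and the weights below are made-up examples.

def best_translation(candidates, lambdas):
    """Return argmax_e of sum_m lambda_m * h_m(e)."""
    return max(candidates,
               key=lambda e: sum(l * h for l, h in zip(lambdas, e["h"])))

candidates = [
    {"text": "hello world", "h": [-4.2, -1.0, -3.1]},
    {"text": "hi world",    "h": [-3.9, -1.4, -3.0]},
]

lambdas = [0.5, 0.3, 0.2]
scaled = [4 * l for l in lambdas]  # all factors times a positive constant

# The argmax is invariant under positive rescaling of the scaling factors,
# so the translation output is unchanged.
assert (best_translation(candidates, lambdas)["text"]
        == best_translation(candidates, scaled)["text"])

# The factors can therefore be reported in normalized form
# (e.g. summing to one), as in Table 3.
normalized = [l / sum(lambdas) for l in lambdas]
print(normalized)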