<?xml version="1.0" standalone="yes"?>
<Paper uid="J03-1002">
  <Title>A Systematic Comparison of Various Statistical Alignment Models ((c) 2003 Association for Computational Linguistics)</Title>
  <Section position="9" start_page="32" end_page="45" type="evalu">
    <SectionTitle>
6. Experiments
</SectionTitle>
    <Paragraph position="0"> In this section, we present results of experiments involving the Verbmobil and Hansards tasks. The Verbmobil task (Wahlster 2000) is a German-English speech translation task in the domain of appointment scheduling, travel planning, and hotel reservation. The bilingual sentences used in training are correct transcriptions of spoken dialogues. However, they include spontaneous speech effects such as hesitations, false starts, and ungrammatical phrases. The French-English Hansards task consists of the debates in the Canadian parliament. This task has a very large vocabulary of about 100,000 French words and 80,000 English words.</Paragraph>
    <Paragraph position="1">  Statistics for the two corpora are shown in Tables 2 and 3. The number of running words and the vocabularies are based on full-form words and the punctuation marks. We produced smaller training corpora by randomly choosing 500, 2,000 and 8,000 sentences from the Verbmobil task and 500, 8,000, and 128,000 sentences from the Hansards task.</Paragraph>
    <Paragraph position="2"> For both tasks, we manually aligned a randomly chosen subset of the training corpus. From this subset, the first 100 sentences are used as the development corpus to optimize the model parameters that are not trained via the EM algorithm (e.g., the smoothing parameters).</Paragraph>
    <Paragraph position="3"> The remaining sentences are used as the test corpus. We do not use the Blinker annotated corpus described in Melamed (1998), since its domain (the Bible) is very specialized and a different annotation methodology was used.</Paragraph>
    <Paragraph position="4"> The sequence of models used and the number of training iterations used for each model are referred to in the following as the training scheme. Our standard training scheme on Verbmobil is 1^5 H^5 3^3 4^3 6^3: this notation indicates that five iterations of Model 1, five iterations of HMM, three iterations of Model 3, three iterations of Model 4, and three iterations of Model 6 are performed. On Hansards, we use an analogous training scheme. This training scheme typically gives very good results and does not lead to overfitting. We use the slightly modified versions of Model 3 and Model 4 described in Section 3.2 and smooth the fertility and the alignment parameters. In the E-step of the EM algorithm for the fertility-based alignment models, we use the Viterbi alignment and its neighborhood. Unless stated otherwise, no bilingual dictionary is used in training.</Paragraph>
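The bootstrapped training scheme described above can be pictured as a simple driver loop: each model is trained for a fixed number of EM iterations, starting from the parameters of the previous model. The following sketch is ours, not the authors' implementation; the `em_iteration` callback and model names are illustrative assumptions.

```python
# Verbmobil scheme: 5 iterations each of Model 1 and HMM, then 3 each of
# Models 3, 4, and 6 (hypothetical representation of the notation in the text).
SCHEME = [("Model1", 5), ("HMM", 5), ("Model3", 3), ("Model4", 3), ("Model6", 3)]

def run_training_scheme(scheme, em_iteration, corpus, params=None):
    """Run each model's EM iterations in sequence, bootstrapping every model
    from the parameters produced by the previous one (illustrative interface)."""
    history = []
    for model_name, iterations in scheme:
        for _ in range(iterations):
            params = em_iteration(model_name, params, corpus)
            history.append(model_name)
    return params, history
```

The key design point is that no model starts from scratch: Model 3 is initialized from the HMM parameters, Model 4 from Model 3, and so on, which is what "bootstrapping" means in the paragraphs that follow.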
    <Section position="1" start_page="36" end_page="37" type="sub_section">
      <SectionTitle>
6.1 Models and Training Schemes
</SectionTitle>
      <Paragraph position="0"> Tables 4 and 5 compare the alignment quality achieved using various models and training schemes. In general, we observe that the refined models (Models 4, 5, and 6) yield significantly better results than the simple Model 1 or Dice coefficient. Typically, the best results are obtained with Model 6. This holds across a wide range of sizes for the training corpus, from an extremely small training corpus of only 500 sentences up to a training corpus of 1.5 million sentences. The improvement that results from using a larger training corpus is more significant, however, if more refined models are used. Interestingly, even on a tiny corpus of only 500 sentences, alignment error rates under 30% are achieved for all models, and the best models have error rates somewhat under 20%.</Paragraph>
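The alignment error rate (AER) reported in these tables combines the sure (S) and possible (P) links of the manual reference alignments, as defined earlier in the article. A minimal sketch of the computation (the set-of-links representation and function name are ours):

```python
def alignment_error_rate(hypothesis, sure, possible):
    """AER = 1 - (|A intersect S| + |A intersect P|) / (|A| + |S|),
    where S is a subset of P.  All arguments are sets of
    (source_position, target_position) links."""
    a, s, p = set(hypothesis), set(sure), set(possible)
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```

An AER of 0 requires every sure link to be found and every hypothesized link to be at least possible.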
      <Paragraph position="1"> We observe that the alignment quality obtained with a specific model heavily depends on the training scheme that is used to bootstrap the model.</Paragraph>
      <Paragraph position="2"> Figure 3 compares the alignment error rate (in percent) for Model 1 and the Dice coefficient (left: 34K Verbmobil task, right: 128K Hansards task).</Paragraph>
    </Section>
    <Section position="2" start_page="37" end_page="38" type="sub_section">
      <SectionTitle>
6.2 Heuristic Models versus Model 1
</SectionTitle>
      <Paragraph position="0"> We pointed out in Section 2 that from a theoretical viewpoint, the main advantage of statistical alignment models in comparison to heuristic models is the well-founded mathematical theory that underlies their parameter estimation. Tables 4 and 5 show that the statistical alignment models significantly outperform the heuristic Dice coefficient and the heuristic Dice coefficient with competitive linking (Dice+C). Even the simple Model 1 achieves better results than the two Dice coefficient models.</Paragraph>
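The heuristic baseline is the Dice coefficient, dice(e, f) = 2 C(e, f) / (C(e) + C(f)), computed from co-occurrence counts. The sketch below uses sentence-level co-occurrence counting, which is our simplification; the counting granularity is an assumption, not taken from the paper.

```python
from collections import Counter

def dice_scores(bitext):
    """Dice association scores from a sentence-aligned bitext.
    bitext: iterable of (source_sentence, target_sentence) word lists.
    C(e, f) counts sentence pairs in which e and f co-occur."""
    c_e, c_f, c_ef = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in bitext:
        src, tgt = set(src_sent), set(tgt_sent)
        for f in src:
            c_f[f] += 1
        for e in tgt:
            c_e[e] += 1
        for f in src:
            for e in tgt:
                c_ef[(e, f)] += 1
    return {(e, f): 2.0 * n / (c_e[e] + c_f[f]) for (e, f), n in c_ef.items()}
```

Competitive linking (Dice+C) would then greedily pick the highest-scoring link, remove both words from consideration, and repeat.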
      <Paragraph position="1"> It is instructive to analyze the alignment quality obtained in the EM training of Model 1. Figure 3 shows the alignment quality over the iteration numbers of Model 1.</Paragraph>
      <Paragraph position="2"> We see that the first iteration of Model 1 achieves significantly worse results than the Dice coefficient, but by only the second iteration, Model 1 gives better results than the Dice coefficient.</Paragraph>
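One EM iteration of Model 1 re-estimates the lexical translation probabilities t(f | e) from expected counts; this is the standard IBM Model 1 update, sketched here with a dictionary-based parameter layout of our own choosing.

```python
from collections import defaultdict

def model1_em_iteration(t, bitext):
    """One EM iteration for Model 1's lexical probabilities.
    t: dict mapping (e, f) to p(f | e); a NULL word is added on the
    English side.  Returns the re-estimated dictionary (sketch)."""
    count = defaultdict(float)
    total = defaultdict(float)
    for src, tgt in bitext:          # src: source sentence, tgt: English
        tgt = ["NULL"] + tgt
        for f in src:
            z = sum(t[(e, f)] for e in tgt)   # normalizer over all e
            for e in tgt:
                c = t[(e, f)] / z             # posterior responsibility
                count[(e, f)] += c
                total[e] += c
    return {(e, f): count[(e, f)] / total[e] for (e, f) in count}
```

Starting from uniform parameters, even a couple of iterations concentrate probability mass on genuine translation pairs, which matches the observation that Model 1 overtakes the Dice coefficient by the second iteration.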
      <Paragraph position="3">  An important result of these experiments is that the hidden Markov alignment model achieves significantly better results than Model 2. We attribute this to the fact that the HMM is a homogeneous first-order alignment model, and such models are able to better represent the locality and monotonicity properties of natural languages. Both models have the important property of allowing an efficient implementation of the EM algorithm (Section 3). On the largest Verbmobil task, the HMM achieves an improvement of 3.8% over Model 2. On the largest Hansards task, the improvement is 8.7%. Interestingly, this advantage continues to hold after bootstrapping more refined models. On Model 4, the improvement is 1.4% and 4.8%, respectively.</Paragraph>
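The homogeneous first-order property means the HMM transition probability depends only on the jump width i - i', not on the absolute positions. A minimal sketch of this parameterization (the count-table interface is an assumption of ours):

```python
def hmm_jump_prob(jump_counts, i, i_prev, target_len):
    """p(i | i_prev, I) for the homogeneous HMM alignment model:
    proportional to a count c(i - i_prev) that depends only on the
    jump width, renormalized over all target positions 1..I (sketch)."""
    denom = sum(jump_counts.get(ip - i_prev, 0.0)
                for ip in range(1, target_len + 1))
    return jump_counts.get(i - i_prev, 0.0) / denom
```

Because nearby source words tend to align to nearby target words, mass concentrates on small jump widths, which is how the HMM captures the locality and monotonicity properties mentioned above.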
      <Paragraph position="4"> We conclude that it is important to bootstrap the refined alignment models with good initial parameters. Obviously, if we use Model 2 for bootstrapping, we eventually obtain a poor local optimum.</Paragraph>
    </Section>
    <Section position="3" start_page="38" end_page="39" type="sub_section">
      <SectionTitle>
6.4 The Number of Alignments in Training
</SectionTitle>
      <Paragraph position="0"> In Tables 6 and 7, we compare the results obtained by using different numbers of alignments in the training of the fertility-based alignment models. We compare the three different approaches described in Section 3: using only the Viterbi alignment, using in addition the neighborhood of the Viterbi alignment, and using the pegged alignments. To reduce the training time, we restrict the number of pegged alignments by using only those in which Pr(f, a | e) is not much smaller than the probability of the Viterbi alignment. This reduces the training time drastically. For the large Hansards corpus, however, the training time is still unacceptably large. Therefore, we report results for only up to 128,000 training sentences.</Paragraph>
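The neighborhood of the Viterbi alignment consists of all alignments reachable by one move (changing a single a_j) or one swap (exchanging a_j and a_j'). A sketch, using the alignment-vector representation a_1..a_J with 0 denoting the empty (NULL) word:

```python
def neighborhood(alignment, target_len):
    """All alignments differing from the given one by one move or one swap.
    alignment: list of target positions (0 = NULL), one per source position."""
    J = len(alignment)
    neighbors = []
    for j in range(J):                        # moves: re-aim one link
        for i in range(target_len + 1):
            if i != alignment[j]:
                a = list(alignment)
                a[j] = i
                neighbors.append(a)
    for j1 in range(J):                       # swaps: exchange two links
        for j2 in range(j1 + 1, J):
            if alignment[j1] != alignment[j2]:
                a = list(alignment)
                a[j1], a[j2] = a[j2], a[j1]
                neighbors.append(a)
    return neighbors
```

Restricting the E-step to the Viterbi alignment and this neighborhood makes training the fertility-based models tractable, since summing over all alignments is infeasible.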
      <Paragraph position="1"> The effect of pegging strongly depends on the quality of the starting point used for training the fertility-based alignment models. If we use Model 2 as the starting point, we observe a significant improvement when we use the neighborhood alignments and the pegged alignments. If we use only the Viterbi alignment, the results are significantly worse than when the neighborhood of the Viterbi alignment is used as well. If we use the HMM as the starting point, we observe a much smaller effect. We conclude that using more alignments in training is a way to avoid a poor local optimum.</Paragraph>
      <Paragraph position="2"> Table 8 shows the computing time for performing one iteration of the EM algorithm. Using a larger set of alignments increases the training time for Model 4 and Model 5 significantly. Since using the pegged alignments yields only a moderate improvement in performance, all following results are obtained by using the neighborhood of the Viterbi alignment without pegging.</Paragraph>
    </Section>
    <Section position="4" start_page="39" end_page="41" type="sub_section">
      <SectionTitle>
6.5 Effect of Smoothing
</SectionTitle>
      <Paragraph position="0"> Tables 9 and 10 show the effect on the alignment error rate of smoothing the alignment and fertility probabilities. We observe a significant improvement when we smooth the alignment probabilities and a minor improvement when we smooth the fertility probabilities. An analysis of the alignments shows that smoothing the fertility probabilities significantly reduces the frequently occurring problem of rare words acting as &amp;quot;garbage collectors,&amp;quot; that is, aligning with too many words in the other language (Brown, Della Pietra, Della Pietra, Goldsmith, et al. 1993).</Paragraph>
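One standard way to realize such smoothing is linear interpolation with the uniform distribution. The following sketch shows the idea only; the interpolation form and the parameter name `alpha` are our assumptions (the paper says merely that the smoothing parameters are optimized on the development corpus):

```python
def smooth_uniform(p, num_outcomes, alpha):
    """p'(x) = (1 - alpha) * p(x) + alpha / num_outcomes.
    alpha would be tuned on held-out data; alpha = 0 means no smoothing."""
    return {x: (1.0 - alpha) * px + alpha / num_outcomes
            for x, px in p.items()}
```

The effect is that no event gets probability zero and sharp, poorly estimated distributions are pulled toward uniform, which is what delays the overfitting described below.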
      <Paragraph position="1"> Without smoothing, we observe early overfitting: The alignment error rate increases after the second iteration of HMM, as shown in Figure 4 (overfitting on the training data with the hidden Markov alignment model using various smoothing parameters; top: 34K Verbmobil task, bottom: 128K Hansards task). On the Verbmobil task, the best alignment error rate is obtained in the second iteration; on the Hansards task, in the sixth. In subsequent iterations, the alignment error rate increases significantly.</Paragraph>
      <Paragraph position="2"> With smoothing of the alignment parameters, we obtain a lower alignment error rate, overfitting occurs later in the process, and its effect is smaller.</Paragraph>
    </Section>
    <Section position="5" start_page="41" end_page="41" type="sub_section">
      <SectionTitle>
6.6 Alignment Models Depending on Word Classes
</SectionTitle>
      <Paragraph position="0"> Tables 11 and 12 show the effects of including a dependence on word classes in the alignment model, as described in Section 2.3. The word classes are always trained on the same subset of the training corpus as is used for the training of the alignment models. We observe no significant improvement in performance as a result of including dependence on word classes when a small training corpus is used. A possible reason for this lack of improvement is that either the word classes themselves or the resulting large number of alignment parameters cannot be estimated reliably using a small training corpus. When a large training corpus is used, however, there is a clear improvement in performance on both the Verbmobil and the Hansards tasks.</Paragraph>
    </Section>
    <Section position="6" start_page="41" end_page="42" type="sub_section">
      <SectionTitle>
6.7 Using a Conventional Bilingual Dictionary
</SectionTitle>
      <Paragraph position="0"> Tables 13 and 14 show the effect of using a conventional bilingual dictionary in training on the Verbmobil and Hansards tasks, respectively. We compare the two methods for using the dictionary described in Section 3.4. We observe that the method with a fixed μ = 16 gives the best results. The method with a varying μ gives worse results, but this method has one fewer parameter to be optimized on held-out data. On small corpora, there is an improvement of up to 6.7% on the Verbmobil task and 3.2% on the Hansards task, but when a larger training corpus is used, the improvements are reduced to 1.1% and 0.4%, respectively. Interestingly, the amount of the overall improvement contributed by the use of a conventional dictionary is small compared to the improvement achieved through the use of better alignment models.</Paragraph>
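The fixed-weight method of Section 3.4 can be pictured as adding each dictionary entry to the training counts with a fixed weight, so a dictionary pair behaves like μ observed co-occurrences. This is a simplified sketch under that reading; the function interface is ours, not the paper's.

```python
def augment_counts_with_dictionary(cooc_counts, dictionary, mu=16.0):
    """Add each bilingual dictionary entry (e, f) to the co-occurrence
    counts with a fixed weight mu (hypothetical realization of the
    fixed-mu method; mu = 16 is the value reported to work best)."""
    counts = dict(cooc_counts)
    for e, f in dictionary:
        counts[(e, f)] = counts.get((e, f), 0.0) + mu
    return counts
```

On small corpora the dictionary weight dominates the sparse corpus counts, which is consistent with the larger improvements observed there.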
    </Section>
    <Section position="7" start_page="42" end_page="45" type="sub_section">
      <SectionTitle>
6.8 Generalized Alignments
</SectionTitle>
      <Paragraph position="0"> In this section, we compare the results obtained using different translation directions and using the symmetrization methods described in Section 4. Tables 15 and 16 show precision, recall, and alignment error rate for the last iteration of Model 6 for both translation directions. In this experiment, we use the conventional dictionary as well.</Paragraph>
      <Paragraph position="1"> Particularly for the Verbmobil task, with the language pair German-English, we observe that the alignment error rate is much higher with German as the source language than with English as the source language. A possible reason for this difference in the alignment error rates is that the baseline alignment representation as a vector a_1^J allows each source word to be aligned with at most one target word. The effect of merging alignments by forming the intersection, the union, or the refined combination of the Viterbi alignments in both translation directions is shown in Tables 17 and 18. Figure 5 shows the corresponding precision/recall graphs. By using the refined combination, we can increase precision and recall on the Hansards task. The lowest alignment error rate on the Hansards task is obtained by using the intersection method. By forming a union or intersection of the alignments, we can obtain very high recall or precision values on both the Hansards task and the Verbmobil task.</Paragraph>
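The symmetrization operations compared here can be sketched over sets of alignment links. The refined combination below grows the intersection with adjacent links from the union; this is a simplified version of the method described in Section 4, and the adjacency criterion used is our assumption.

```python
def symmetrize(a_sd, a_ds):
    """Combine Viterbi alignments from both translation directions.
    a_sd, a_ds: sets of (source_pos, target_pos) links.  Returns the
    intersection, the union, and a refined combination that iteratively
    adds union links adjacent (incl. diagonally) to links already kept."""
    inter = a_sd & a_ds
    union = a_sd | a_ds
    refined = set(inter)
    added = True
    while added:
        added = False
        for (i, j) in union - refined:
            if any(max(abs(i - i2), abs(j - j2)) == 1 for (i2, j2) in refined):
                refined.add((i, j))
                added = True
    return inter, union, refined
```

The intersection yields high precision (only links both directions agree on), the union high recall, and the refined combination sits between the two, matching the precision/recall behavior reported above.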
    </Section>
    <Section position="8" start_page="45" end_page="45" type="sub_section">
      <SectionTitle>
6.9 Effect of Alignment Quality on Translation Quality
</SectionTitle>
      <Paragraph position="0"> Alignment models similar to those studied in this article have been used as a starting point for refined phrase-based statistical machine translation systems (Alshawi, Bangalore, and Douglas 1998; Och, Tillmann, and Ney 1999; Ney et al. 2000). In Och and Ney (2000), the overall result of the experimental evaluation was that improved alignment quality yields improved subjective quality of the statistical machine translation system as well.</Paragraph>
    </Section>
  </Section>
</Paper>