<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1033">
  <Title>Improvements in Phrase-Based Statistical Machine Translation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Refinements
</SectionTitle>
    <Paragraph position="0"> In this section, we will describe refinements of the phrase-based translation model. First, we will describe two heuristics: word penalty and phrase penalty. Second, we will describe a single-word based lexicon model.</Paragraph>
    <Paragraph position="1"> This will be used to smooth the phrase translation probabilities. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Simple Heuristics
</SectionTitle>
      <Paragraph position="0"> In addition to the baseline model, we use two simple heuristics, namely word penalty and phrase penalty:</Paragraph>
      <Paragraph position="2"> The word penalty feature is simply the target sentence length. In combination with the scaling factor this results in a constant cost per produced target language word. With this feature, we are able to adjust the sentence length. If we set a negative scaling factor, longer sentences are more penalized than shorter ones, and the system will favor shorter translations. Alternatively, by using a positive scaling factors, the system will favor longer translations.</Paragraph>
      <Paragraph position="3"> Similar to the word penalty, the phrase penalty feature results in a constant cost per produced phrase. The phrase penalty is used to adjust the average length of the phrases. A negative weight, meaning real costs per phrase, results in a preference for longer phrases. A positive weight, meaning a bonus per phrase, results in a preference for shorter phrases.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Word-based Lexicon
</SectionTitle>
      <Paragraph position="0"> We are using relative frequencies to estimate the phrase translation probabilities. Most of the longer phrases are seen only once in the training corpus. Therefore, pure relative frequencies overestimate the probability of those phrases. To overcome this problem, we use a word-based lexicon model to smooth the phrase translation probabilities. For a source word f and a target phrase ~e = ei2i1, we use the following approximation:</Paragraph>
      <Paragraph position="2"> This models a disjunctive interaction, also called noisy-OR gate (Pearl, 1988). The idea is that there are multiple independent causes ei2i1 that can generate an event f. It can be easily integrated into the search algorithm. The corresponding feature function is:</Paragraph>
      <Paragraph position="4"> Here, jk and ik denote the final position of phrase number k in the source and the target sentence, respectively, and we define j0 := 0 and i0 := 0.</Paragraph>
      <Paragraph position="5"> To estimate the single-word based translation probabilities p(fje), we use smoothed relative frequencies. The smoothing method we apply is absolute discounting with interpolation:</Paragraph>
      <Paragraph position="7"> This method is well known from language modeling (Ney et al., 1997). Here, d is the nonnegative discounting parameter, fi(e) is a normalization constant and fl is the normalized backing-off distribution. To compute the counts, we use the same word alignment matrix as for the extraction of the bilingual phrases. The symbol N(e) denotes the unigram count of a word e and N(f;e) denotes the count of the event that the target language word e is aligned to the source language word f. If one occurrence of e has N &gt; 1 aligned source words, each of them contributes with a count of 1=N. The formula for fi(e) is:</Paragraph>
      <Paragraph position="9"> This formula is a generalization of the one typically used in publications on language modeling. This generalization is necessary, because the lexicon counts may be fractional whereas in language modeling typically integer counts are used. Additionally, we want to allow discounting values d greater than one. One effect of the discounting parameter d is that all lexicon entries with a count less than d are discarded and the freed probability mass is redistributed among the other entries.</Paragraph>
      <Paragraph position="10"> As backing-off distribution fl(f), we consider two alternatives. The first one is a uniform distribution and the second one is the unigram distribution:</Paragraph>
      <Paragraph position="12"> Here, Vf denotes the vocabulary size of the source language and N(f) denotes the unigram count of a source word f.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Monotone Search
</SectionTitle>
    <Paragraph position="0"> The monotone search can be efficiently computed with dynamic programming. The resulting complexity is linear in the sentence length. We present the formulae for a bigram language model. This is only for notational convenience. The generalization to a higher order language model is straightforward. For the maximization problem in (11), we define the quantity Q(j;e) as the maximum probability of a phrase sequence that ends with the lan- null guage word e and covers positions 1 to j of the source sentence. Q(J + 1;$) is the probability of the optimum translation. The $ symbol is the sentence boundary marker. We obtain the following dynamic programming recursion.</Paragraph>
    <Paragraph position="2"> Here, M denotes the maximum phrase length in the source language. During the search, we store back-pointers to the maximizing arguments. After performing the search, we can generate the optimum translation. The resulting algorithm has a worst-case complexity of O(J C/M C/Ve C/E). Here, Ve denotes the vocabulary size of the target language and E denotes the maximum number of phrase translation candidates for a source language phrase. Using efficient data structures and taking into account that not all possible target language phrases can occur in translating a specific source language sentence, we can perform a very efficient search.</Paragraph>
    <Paragraph position="3"> This monotone algorithm is especially useful for language pairs that have a similar word order, e.g. Spanish-English or French-English.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Corpus Statistics
</SectionTitle>
    <Paragraph position="0"> In the following sections, we will present results on three tasks: Verbmobil, Xerox and Canadian Hansards. Therefore, we will show the corpus statistics for each of these tasks in this section. The training corpus (Train) of each task is used to train a word alignment and then extract the bilingual phrases and the word-based lexicon. The remaining free parameters, e.g. the model scaling factors, are optimized on the development corpus (Dev). The resulting system is then evaluated on the test corpus (Test).</Paragraph>
    <Paragraph position="1"> Verbmobil Task. The first task we will present results on is the German-English Verbmobil task (Wahlster, 2000). The domain of this corpus is appointment scheduling, travel planning, and hotel reservation. It consists of transcriptions of spontaneous speech. Table 1 shows the corpus statistics of this task.</Paragraph>
    <Paragraph position="2"> Xerox task. Additionally, we carried out experiments on the Spanish-English Xerox task. The corpus consists of technical manuals. This is a rather limited domain task.</Paragraph>
    <Paragraph position="3"> Table 2 shows the training, development and test corpus statistics.</Paragraph>
    <Paragraph position="4"> Canadian Hansards task. Further experiments were carried out on the French-English Canadian Hansards  task. This task contains the proceedings of the Canadian parliament. About 3 million parallel sentences of this bilingual data have been made available by the Linguistic Data Consortium (LDC). Here, we use a subset of the data containing only sentences with a maximum length of 30 words. This task covers a large variety of topics, so this is an open-domain corpus. This is also reflected by the large vocabulary size. Table 3 shows the training and test corpus statistics.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Degree of Monotonicity
</SectionTitle>
    <Paragraph position="0"> In this section, we will investigate the effect of the monotonicity constraint. Therefore, we compute how many of the training corpus sentence pairs can be produced with the monotone phrase-based search. We compare this to the number of sentence pairs that can be produced with a nonmonotone phrase-based search. To make these numbers more realistic, we use leaving-one-out. Thus phrases that are extracted from a specific sentence pair are not used to check its monotonicity. With leaving-one-out it is possible that even the nonmonotone search cannot generate a sentence pair. This happens if a sentence pair contains a word that occurs only once in the training corpus. All phrases that might produce this singleton are excluded because of the leaving-one-out principle. Note  that all these monotonicity consideration are done at the phrase level. Within the phrases arbitrary reorderings are allowed. The only restriction is that they occur in the training corpus.</Paragraph>
    <Paragraph position="1"> Table 4 shows the percentage of the training corpus that can be generated with monotone and nonmonotone phrase-based search. The number of sentence pairs that can be produced with the nonmonotone search gives an estimate of the upper bound for the sentence error rate of the phrase-based system that is trained on the given data. The same considerations hold for the monotone search.</Paragraph>
    <Paragraph position="2"> The maximum source phrase length for the Verbmobil task and the Xerox task is 12, whereas for the Canadian Hansards task we use a maximum of 4, because of the large corpus size. This explains the rather low coverage on the Canadian Hansards task for both the nonmonotone and the monotone search.</Paragraph>
    <Paragraph position="3"> For the Xerox task, the nonmonotone search can produce 75:1% of the sentence pairs whereas the monotone can produce 65:3%. The ratio of the two numbers measures how much the system deteriorates by using the monotone search and will be called the degree of monotonicity. For the Xerox task, the degree of monotonicity is 87:0%. This means the monotone search can produce 87:0% of the sentence pairs that can be produced with the nonmonotone search. We see that for the Spanish-English Xerox task and for the French-English Canadian Hansards task, the degree of monotonicity is rather high.</Paragraph>
    <Paragraph position="4"> For the German-English Verbmobil task it is significantly lower. This may be caused by the rather free word order in German and the long range reorderings that are necessary to translate the verb group.</Paragraph>
    <Paragraph position="5"> It should be pointed out that in practice the monotone search will perform better than what the preceding estimates indicate. The reason is that we assumed a perfect nonmonotone search, which is difficult to achieve in practice. This is not only a hard search problem, but also a complicated modeling problem. We will see in the next section that the monotone search will perform very well on both the Xerox task and the Canadian Hansards task.</Paragraph>
    <Paragraph position="6">  deg. of mon. 72.6 87.0 86.3</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Translation Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.1 Evaluation Criteria
</SectionTitle>
      <Paragraph position="0"> So far, in machine translation research a single generally accepted criterion for the evaluation of the experimental results does not exist. Therefore, we use a variety of different criteria.</Paragraph>
      <Paragraph position="1"> + WER (word error rate): The WER is computed as the minimum number of substitution, insertion and deletion operations that have to be performed to convert the generated sentence into the reference sentence.</Paragraph>
      <Paragraph position="2"> + PER (position-independent word error rate): A shortcoming of the WER is that it requires a perfect word order. The word order of an acceptable sentence can be different from that of the target sentence, so that the WER measure alone could be misleading. The PER compares the words in the two sentences ignoring the word order.</Paragraph>
      <Paragraph position="3"> + BLEU score: This score measures the precision of unigrams, bigrams, trigrams and fourgrams with respect to a reference translation with a penalty for too short sentences (Papineni et al., 2001). BLEU measures accuracy, i.e. large BLEU scores are better.</Paragraph>
      <Paragraph position="4"> + NIST score: This score is similar to BLEU. It is a weighted n-gram precision in combination with a penalty for too short sentences (Doddington, 2002). NIST measures accuracy, i.e. large NIST scores are better. For the Verbmobil task, we have multiple references available. Therefore on this task, we compute all the preceding criteria with respect to multiple references. To indicate this, we will precede the acronyms with an m (multiple) if multiple references are used. For the other two tasks, only single references are used.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.2 Translation Systems
</SectionTitle>
      <Paragraph position="0"> In this section, we will describe the systems that were used. On the one hand, we have three different variants of the single-word based model IBM4. On the other hand, we have two phrase-based systems, namely the alignment templates and the one described in this work.</Paragraph>
      <Paragraph position="1"> Single-Word Based Systems (SWB). First, there is a monotone search variant (Mon) that translates each word of the source sentence from left to right. The second variant allows reordering according to the so-called IBM constraints (Berger et al., 1996). Thus up to three words may be skipped and translated later. This system will be denoted by IBM. The third variant implements special German-English reordering constraints. These constraints are represented by a finite state automaton and optimized to handle the reorderings of the German verb group. The abbreviation for this variant is GE. It is only used for the German-English Verbmobil task. This is just an extremely brief description of these systems. For details, see (Tillmann and Ney, 2003).</Paragraph>
      <Paragraph position="2"> Phrase-Based System (PB). For the phrase-based system, we use the following feature functions: a trigram language model, the phrase translation model and the word-based lexicon model. The latter two feature functions are used for both directions: p(fje) and p(ejf).</Paragraph>
      <Paragraph position="3"> Additionally, we use the word and phrase penalty feature functions. The model scaling factors are optimized on the development corpus with respect to mWER similar to (Och, 2003). We use the Downhill Simplex algorithm from (Press et al., 2002). We do not perform the optimization on N-best lists but we retranslate the whole development corpus for each iteration of the optimization algorithm. This is feasible because this system is extremely fast. It takes only a few seconds to translate the whole development corpus for the Verbmobil task and the Xerox task; for details see Section 8. In the experiments, the Downhill Simplex algorithm converged after about 200 iterations. This method has the advantage that it is not limited to the model scaling factors as the method described in (Och, 2003). It is also possible to optimize any other parameter, e.g. the discounting parameter for the lexicon smoothing.</Paragraph>
      <Paragraph position="4"> Alignment Template System (AT). The alignment template system (Och et al., 1999) is similar to the system described in this work. One difference is that the alignment templates are not defined at the word level but at a word class level. In addition to the word-based tri-gram model, the alignment template system uses a class-based fivegram language model. The search algorithm of the alignment templates allows arbitrary reorderings of the templates. It penalizes reorderings with costs that are linear in the jump width. To make the results as comparable as possible, the alignment template system and the phrase-based system start from the same word alignment.</Paragraph>
      <Paragraph position="5"> The alignment template system uses discriminative training of the model scaling factors as described in (Och and Ney, 2002).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.3 Verbmobil Task
</SectionTitle>
      <Paragraph position="0"> We start with the Verbmobil results. We studied smoothing the lexicon probabilities as described in Section 3.2.</Paragraph>
      <Paragraph position="1"> The results are summarized in Table 5. We see that the  uniform smoothing method improves translation quality. There is only a minor improvement, but it is consistent among all evaluation criteria. It is statistically significant at the 94% level. The unigram method hurts performance. There is a degradation of the mWER of 0:9%. In the following, all phrase-based systems use the uniform smoothing method.</Paragraph>
      <Paragraph position="2"> The translation results of the different systems are shown in Table 6. Obviously, the monotone phrase-based system outperforms the monotone single-word based system. The result of the phrase-based system is comparable to the nonmonotone single-word based search with the IBM constraints. With respect to the mPER, the PB system clearly outperforms all single-word based systems. If we compare the monotone phrase-based system with the nonmonotone alignment template system, we see that the mPERs are similar. Thus the lexical choice of words is of the same quality. Regarding the other evaluation criteria, which take the word order into account, the non-monotone search of the alignment templates has a clear advantage. This was already indicated by the low degree of monotonicity on this task. The rather free word order in German and the long range dependencies of the verb group make reorderings necessary.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.4 Xerox task
</SectionTitle>
      <Paragraph position="0"> The translation results for the Xerox task are shown in Table 7. Here, we see that both phrase-based systems clearly outperform the single-word based systems. The PB system performs best on this task. Compared to the AT system, the BLEU score improves by 4.1% absolute.</Paragraph>
      <Paragraph position="1"> The improvement of the PB system with respect to the AT system is statistically significant at the 99% level.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.5 Canadian Hansards task
</SectionTitle>
      <Paragraph position="0"> The translation results for the Canadian Hansards task are shown in Table 8. As on the Xerox task, the phrase-based systems perform better than the single-word based systems. The monotone phrase-based system yields even better results than the alignment template system. This improvement is consistent among all evaluation criteria and it is statistically significant at the 99% level.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
8 Efficiency
</SectionTitle>
    <Paragraph position="0"> In this section, we analyze the translation speed of the phrase-based translation system. All experiments were carried out on an AMD Athlon with 2.2GHz. Note that the systems were not optimized for speed. We used the best performing systems to measure the translation times.</Paragraph>
    <Paragraph position="1"> The translation speed of the monotone phrase-based system for all three tasks is shown in Table 9. For the Xerox task, the translation process takes less than 7 seconds for the whole 10K words test set. For the Verbmobil task, the system is even slightly faster. It takes about 1.6 seconds to translate the whole test set. For the Canadian Hansards task, the translation process is much slower, but the average time per sentence is still less than 1 second.</Paragraph>
    <Paragraph position="2"> We think that this slowdown can be attributed to the large training corpus. The system loads only phrase pairs into memory if the source phrase occurs in the test corpus.</Paragraph>
    <Paragraph position="3"> Therefore, the large test corpus size for this task also affects the translation speed.</Paragraph>
    <Paragraph position="4"> In Fig. 1, we see the average translation time per sentence as a function of the sentence length. The translation times were measured for the translation of the 5432 test sentences of the Canadian Hansards task. We see a clear linear dependency. Even for sentences of thirty words, the translation takes only about 1.5 seconds.</Paragraph>
    <Paragraph position="5">  tion of the sentence length for the Canadian Hansards task (5432 test sentences).</Paragraph>
  </Section>
class="xml-element"></Paper>