<?xml version="1.0" standalone="yes"?>
<Paper uid="J03-1005">
  <Title>(c) 2003 Association for Computational Linguistics. Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation</Title>
  <Section position="4" start_page="112" end_page="118" type="metho">
    <SectionTitle>
3 In Berger et al. (1996), a morphological analysis is carried out and word morphemes are processed
</SectionTitle>
    <Paragraph position="0"> during the search. Here, we process only full-form words.</Paragraph>
    <Paragraph position="1"> [Figure: Illustration of the IBM-style reordering constraint.] The reordering constraints are introduced in the Appendix. An upper bound of O(E ) for the word reordering complexity is given in Tillmann (2001).</Paragraph>
    <Section position="1" start_page="113" end_page="114" type="sub_section">
      <SectionTitle>
3.7 Empirical Complexity Calculations
</SectionTitle>
      <Paragraph position="0"> In order to demonstrate the complexity of the proposed reordering constraints, we have modified our translation algorithm to show, for the different reordering constraints, the overall number of successor states generated by the algorithm given in Table 2. We use a test translation task in which a pseudo-source word x is translated into the identical pseudo-target word x. No actual optimization is carried out; the total number of successors is simply counted as the algorithm proceeds through subsets of increasing cardinality. The complexity differences for the different reordering constraints result from the different numbers of coverage subsets C and corresponding reordering states S allowed. For the different reordering constraints we obtain the following results (the abbreviations MON, GE, EG, and S3 are taken from the Appendix): * MON: For this reordering restriction, a partial hypothesis is always extended by the position l_min(C); hence the number of processed arcs is J.</Paragraph>
      <Paragraph position="1"> * GE, EG: These two reordering constraints are very similar in terms of complexity: The number of word reorderings is heavily restricted in each. Actually, since the distance restrictions (expressed by the variables widthskip and widthmove) apply, the complexity is linear in the length of the input sentence J.</Paragraph>
      <Paragraph position="2"> * S3: The S3 reordering constraint has a complexity close to J  . Since no distance restrictions for the skipped positions apply, the overall search space is significantly larger than for the GE or EG restriction.</Paragraph>
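The counting experiment above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `count_arcs`, `mon`, and `s3` are hypothetical names, and the two predicates only approximate the MON and S3 constraints defined in the Appendix.

```python
def count_arcs(J, allowed):
    """Count the successor arcs generated while extending coverage subsets
    of increasing cardinality; allowed(cov, j) decides whether source
    position j may be covered next, given the covered set cov."""
    frontier = {frozenset()}
    arcs = 0
    for _ in range(J):
        successors = set()
        for cov in frontier:
            for j in range(J):
                if j not in cov and allowed(cov, j):
                    arcs += 1
                    successors.add(cov | {j})
        frontier = successors
    return arcs

def mon(cov, j):
    # MON: only the leftmost uncovered position may be covered next.
    return all(p in cov for p in range(j))

def s3(cov, j, max_skipped=3):
    # S3-style: at most max_skipped uncovered positions to the left of j.
    return sum(1 for p in range(j) if p not in cov) <= max_skipped
```

For J = 8, `count_arcs(8, mon)` processes exactly J arcs, whereas the S3-style predicate generates far more successors, reflecting its larger search space.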
    </Section>
    <Section position="2" start_page="114" end_page="116" type="sub_section">
      <SectionTitle>
3.8 Beam Search Pruning Techniques
</SectionTitle>
      <Paragraph position="0"> To speed up the search, a beam search strategy is used. There is a direct analogy to the data-driven search organization used in continuous-speech recognition (Ney et al. 1992).</Paragraph>
      <Paragraph position="1"> The full DP search algorithm proceeds cardinality-synchronously over subsets of source sentence positions of increasing cardinality. Using the beam search concept, the search can be focused on the most likely hypotheses. The hypotheses Q_e'(e, C, j) are distinguished according to the coverage set C, with two kinds of pruning based on this coverage set: 1. The coverage pruning is carried out separately for each coverage set C. 2. The cardinality pruning is carried out jointly for all coverage sets C with the same cardinality c = c(C).</Paragraph>
      <Paragraph position="2"> After the pruning is carried out, we retain for further consideration only hypotheses with a probability close to the maximum probability. The number of surviving hypotheses is controlled by four kinds of thresholds. For this adjustment, for a source word f at an uncovered source position, we precompute an upper bound p&#x304;(f) for the product of language model and lexicon probability, taking the maximum only over translations e of f that have actually been seen in the training data. Additionally, the observation pruning described below is applied to the possible translations e of a source word f. The upper bound is used in the beam search concept to increase the comparability between hypotheses covering different coverage sets. Even more benefit from the upper bound p&#x304;(f) can be expected if the distortion and the fertility probabilities are taken into account (Tillmann 2001). Using the definition of p&#x304;(f), the modified probability in equation (3) is obtained.</Paragraph>
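A minimal sketch of such a precomputation follows. It assumes the bound is the best product of the lexicon probability p(f|e) and a unigram language model probability p(e) over the translations e of f seen in training; the exact terms of the product are not spelled out in this fragment, so treat them as an assumption.

```python
def precompute_upper_bounds(lexicon, unigram_lm):
    """For each source word f, an optimistic bound on the product of
    lexicon probability p(f|e) and language model probability p(e),
    taken over the candidate translations e seen in training.
    lexicon: f -> {e: p(f|e)}; unigram_lm: e -> p(e)."""
    return {
        f: max(p_fe * unigram_lm.get(e, 0.0) for e, p_fe in candidates.items())
        for f, candidates in lexicon.items()
    }
```

During the search, scores for still-uncovered source positions f can then be filled in with this bound, which makes hypotheses covering different coverage sets comparable.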
      <Paragraph position="4"> For the translation experiments, equation (3) is recursively evaluated over subsets of source positions of equal cardinality. For reasons of brevity, we omit the state description S in equation (3), since no separate pruning according to the states S is carried out.</Paragraph>
      <Paragraph position="5"> The set of surviving hypotheses for each cardinality c is referred to as the beam. The size of the beam for cardinality c depends on the ambiguity of the translation task for that cardinality. To fully exploit the speedup of the DP beam search, the search space is dynamically constructed as described in Tillmann, Vogel, Ney, Zubiaga, and Sawaf (1997), rather than using a static search space. To carry out the pruning, the maximum probabilities with respect to each coverage set C and cardinality c are computed: * Coverage pruning: Hypotheses are distinguished according to the subset of covered positions C. The probability</Paragraph>
      <Paragraph position="7"> to prune active hypotheses. We call this pruning translation pruning: Hypotheses are pruned according to their translation probability. The coverage and the cardinality pruning thresholds are constant for different coverage sets C and cardinalities c. Together with the translation pruning, histogram pruning is carried out: The overall number N(C) of active hypotheses for the coverage set C and the overall number N(c) of active hypotheses for all subsets of a given cardinality may not exceed a given number; again, different numbers are used for coverage and cardinality pruning. The coverage histogram pruning is denoted by n_C, the cardinality histogram pruning by n_c. If the numbers of active hypotheses for each coverage set C and cardinality c, N(C) and N(c), exceed the above thresholds, only the partial hypotheses with the highest translation probabilities are retained (e.g., we may use n_C = 1,000 for the coverage histogram pruning).</Paragraph>
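The combination of threshold and histogram pruning can be sketched as follows; this is an illustrative fragment with assumed names, not the paper's implementation.

```python
def prune(hyps, threshold, histogram_n):
    """hyps are (score, hypothesis) pairs; scores are negative log
    probabilities, so lower is better. Threshold pruning keeps hypotheses
    within `threshold` of the best score; histogram pruning then caps the
    number of survivors at `histogram_n`."""
    if not hyps:
        return []
    hyps = sorted(hyps, key=lambda h: h[0])
    best = hyps[0][0]
    survivors = [h for h in hyps if h[0] <= best + threshold]
    return survivors[:histogram_n]
```

Coverage pruning would apply such a function separately per coverage set C, and cardinality pruning once to all hypotheses of a given cardinality c, each with its own threshold and histogram size (e.g., n_C = 1,000).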
      <Paragraph position="8"> The third type of pruning is observation pruning: The number of target words that may be produced by a source word f is limited. For each source language word f, the list of its possible translations e is sorted according to its translation score, and only the best n_o target words e are hypothesized during the search process (e.g., during the experiments, hypothesizing the best n_o = 50 words was sufficient).</Paragraph>
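Observation pruning amounts to a sort-and-truncate per source word; a minimal sketch, where the per-candidate score is assumed to be a plain probability (the exact ranking term is not given in this fragment):

```python
def observation_prune(candidates, n_o=50):
    """Keep only the n_o best candidate target words e for one source
    word f. candidates: e -> score (higher is better)."""
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return [e for e, _ in ranked[:n_o]]
```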
    </Section>
    <Section position="3" start_page="116" end_page="118" type="sub_section">
      <SectionTitle>
3.9 Beam Search Implementation
</SectionTitle>
      <Paragraph position="0"> In this section, we describe the implementation of the beam search algorithm presented in the previous sections and show how it is applied to the full set of IBM-4 model parameters.</Paragraph>
      <Paragraph position="1"> 3.9.1 Baseline DP Implementation. The implementation described here is similar to that used in beam search speech recognition systems, as presented in Ney et al. (1992). The similarities lie mainly in the following: * The implementation is data driven. Both its time and memory requirements are strictly linear in the number of path hypotheses (disregarding the sorting steps explained in this section).</Paragraph>
      <Paragraph position="2"> * The search procedure is developed to work most efficiently when the input sentences are processed mainly monotonically from left to right.</Paragraph>
      <Paragraph position="3"> The algorithm works cardinality-synchronously, meaning that all the hypotheses that are processed cover subsets of source sentence positions of equal cardinality c.</Paragraph>
      <Paragraph position="4"> * Since full search is prohibitive, we use a beam search concept, as in speech recognition. We use appropriate pruning techniques in connection with our cardinality-synchronous search procedure.</Paragraph>
      <Paragraph position="5"> Table 4 shows a two-list implementation of the search algorithm given in Table 2 in which the beam pruning is included. The two lists are referred to as S and S_new: S is the list of hypotheses that are currently expanded, and S_new is the list of newly generated hypotheses; the search proceeds over subsets of source sentence positions of increasing cardinality. The search starts with S = {($, $, &#x2205;, 0)}, where $ denotes the sentence start symbol for the immediate two predecessor words and &#x2205; denotes the empty coverage set, in which no source position is covered yet. For the initial search state, the position last covered is set to 0. A set S of active hypotheses is expanded for each cardinality c using lexicon model, language model, and distortion model probabilities. The newly generated hypotheses are added to the hypothesis set S_new; for hypotheses that are not distinguished according to our DP approach, only the best partial hypothesis is retained for further consideration. This so-called recombination is implemented as a set of simple lookup and update operations on the set S_new of partial hypotheses. During the partial hypothesis extensions, an anticipated pruning is carried out: Hypotheses are discarded before they are considered for recombination and are never added to S_new. (The anticipated pruning is not shown in Table 4. It is based on the pruning thresholds described in Section 3.8.) After the extension of all partial hypotheses in S, a pruning step is carried out for the hypotheses in the newly generated set S_new. The pruning is based on two simple sorting steps on the list of partial hypotheses S_new. (Instead of sorting the partial hypotheses, we might have used hashing.) First, the partial hypotheses are sorted according to their translation scores (within the implementation, all probabilities are converted into translation scores by taking the negative logarithm -log(.)). 
Cardinality pruning can then be carried out simply by running down the list of hypotheses, starting with the maximum-probability hypothesis, and applying the cardinality thresholds. Then, the partial hypotheses are sorted a second time according to their coverage set C and their translation score. After this sorting step, all partial hypotheses that cover the same subset of source sentence positions are located in consecutive fragments in the overall list of partial hypotheses. Coverage pruning is carried out in a single run over the list of partial hypotheses: For each fragment corresponding to the same coverage set C, the coverage pruning threshold is applied. The partial hypotheses that survive the two pruning stages are then written into the so-called bookkeeping array (Ney et al. 1992). For the next expansion step, the set S is set to the newly generated list of hypotheses. Finally, the target translation is constructed from the bookkeeping array.</Paragraph>
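The two-list organization with recombination can be sketched as follows. This is a toy model under stated assumptions: `extend` and `prune_fn` stand in for the actual lexicon/LM/distortion scoring and for the pruning of Section 3.8, and the DP state is simplified to (coverage set, last target word).

```python
def two_list_search(J, extend, prune_fn):
    """Cardinality-synchronous two-list sketch. S holds hypotheses of
    cardinality c; S_new collects their extensions of cardinality c + 1.
    Recombination keeps, per DP state (coverage set, last word), only the
    lowest-cost hypothesis; scores are negative log probabilities."""
    S = [(0.0, "$", frozenset())]          # (score, last word, coverage)
    bookkeeping = []
    for c in range(J):
        S_new = {}
        for hyp in S:
            for score, word, cov in extend(hyp):
                key = (cov, word)          # hypotheses sharing this state
                if key not in S_new or score < S_new[key][0]:
                    S_new[key] = (score, word, cov)   # recombination
        S = prune_fn(list(S_new.values()))
        bookkeeping.append(S)
    return bookkeeping
```

With a toy `extend` that covers any free position at unit cost, the final beam contains one recombined hypothesis per last-covered position, all with the same total cost.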
      <Paragraph position="6"> The beam search approach can be carried out using the full set of IBM-4 parameters. (More details can be found in Tillmann [2001] or in the cited papers.) First, the full set of IBM-4 parameters does not make the simplifying assumption given in Section 3.1, namely, that source and target sentences are of equal length: Either a target word e may be aligned with several source words (its fertility is greater than one) or a single source word may produce zero, one, or two target words, as described in Berger et al. (1996), or both. Zero target words are generated if f is aligned to the &amp;quot;null&amp;quot; word; in the two-word case, the second target word is scored by the language model probability, and no lexicon probability is used. During the experiments, we restrict ourselves to triples of target words</Paragraph>
      <Paragraph position="8"> (e, e', e'') actually seen in the training data. This approach is used for the French-to-English translation experiments presented in this article.</Paragraph>
      <Paragraph position="9"> Another approach for mapping a single source language word to several target language words involves preprocessing by the word-joining algorithm given in Tillmann (2001), which is similar to the approach presented in Och, Tillmann, and Ney (1999). Target words are joined during a training phase, and several joined target language words are dealt with as a new lexicon entry. This approach is used for the German-to-English translation experiments presented in this article.</Paragraph>
      <Paragraph position="10"> In order to deal with the IBM-4 fertility parameters within the DP-based concept, we adopt the distinction between open and closed hypotheses given in Berger et al. (1996). A hypothesis is said to be open if it is to be aligned with more source positions than it currently is (i.e., at least two). Otherwise it is called closed. The difference between open and closed is used to process the input sentence one position at a time (for details see Tillmann 2001). The word reordering restrictions and the beam search pruning techniques are directly carried over to the full set of IBM-4 parameters, since they are based on restrictions on the coverage vectors C only.</Paragraph>
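The open/closed distinction can be sketched as a tiny state in the hypothesis record; field and function names here are illustrative assumptions, not the paper's data structures.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Hyp:
    """An open hypothesis still expects its newest target word to be
    aligned with further source positions (fertility not yet exhausted);
    a closed one does not."""
    last_word: str
    pending: int              # source positions still to be aligned

    @property
    def is_open(self):
        return self.pending > 0

def cover_next_position(hyp):
    """Cover one more source position for the current target word."""
    assert hyp.is_open
    return replace(hyp, pending=hyp.pending - 1)
```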
      <Paragraph position="11"> To ensure its correctness, the implementation was tested by carrying out forced alignments on 500 German-to-English training sentence pairs. In a forced alignment, the search is constrained to produce the given target sentence; the language model probability is divided out, and the resulting probability is compared to the Viterbi probability as obtained by the training procedure. For 499 training sentences the Viterbi alignment probability as obtained by the forced-alignment search was exactly the same as the one produced by the training procedure. In one case the forced-alignment search did obtain a better Viterbi probability than the training procedure.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="118" end_page="121" type="metho">
    <SectionTitle>
4. Experimental Results
</SectionTitle>
    <Paragraph position="0"> Translation experiments are carried out for the translation directions German to English and English to German (Verbmobil task) and for the translation directions French to English and English to French (Canadian Hansards task). Section 4.1 reports on the performance measures used. Section 4.2 shows translation results for the Verbmobil task. Sections 4.2.1 and 4.2.2 describe that task and the preprocessing steps applied.</Paragraph>
    <Paragraph position="1"> In Sections 4.2.3 through 4.2.5, the efficiency of the beam search pruning techniques is shown for German-to-English translation, as the most detailed experiments are conducted for that direction. Section 4.2.6 gives translation results for the translation direction English to German. In Section 4.3, translation results for the Canadian Hansards task are reported.</Paragraph>
    <Section position="1" start_page="119" end_page="120" type="sub_section">
      <SectionTitle>
4.1 Performance Measures for Translation Experiments
</SectionTitle>
      <Paragraph position="0"> To measure the performance of the translation methods, we use three types of automatic and easy-to-use measures of the translation errors. Additionally, a subjective evaluation involving human judges is carried out (Niessen et al. 2000). The following evaluation criteria are employed: * WER (word error rate): The WER is computed as the minimum number of substitution, insertion, and deletion operations that have to be performed to convert the generated string into the reference target string.</Paragraph>
      <Paragraph position="1"> This performance criterion is widely used in speech recognition. The minimum is computed using a DP algorithm and is typically referred to as edit or Levenshtein distance.</Paragraph>
      <Paragraph position="2"> * mWER (multireference WER): We use the Levenshtein distance between the automatic translation and several reference translations as a measure of the translation errors. For example, on the Verbmobil TEST-331 test set, an average of six reference translations per automatic translation are available. The Levenshtein distance between the automatic translation and each of the reference translations is computed, and the minimum Levenshtein distance is taken. The resulting measure, the mWER, is more robust than the WER, which takes into account only a single reference translation.</Paragraph>
      <Paragraph position="3"> * PER (position-independent word error rate): In the case in which only a single reference translation per sentence is available, we introduce as an additional measure the position-independent word error rate (PER). This measure compares the words in the two sentences without taking the word order into account. Words in the reference translation that have no counterpart in the translated sentence are counted as substitution errors. Depending on whether the translated sentence is longer or shorter than the reference translation, the remaining words result in either insertion (if the translated sentence is longer) or deletion (if the translated sentence is shorter) errors. The PER is guaranteed to be less than or equal to the WER. The PER is more robust than the WER since it ignores translation errors due to different word order in the translated and reference sentences.</Paragraph>
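The three automatic measures above can be made concrete with a short sketch; the function names are illustrative, but the definitions follow the text (WER as Levenshtein distance, mWER as the minimum over references, PER as a bag-of-words match with PER <= WER).

```python
from collections import Counter

def wer(hyp, ref):
    """Word error rate: Levenshtein distance between token lists,
    computed with the standard DP recurrence over one reused row."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            cur = min(d[j] + 1,            # insertion
                      d[j - 1] + 1,        # deletion
                      prev + (h != r))     # substitution / match
            prev, d[j] = d[j], cur
    return d[-1]

def mwer(hyp, refs):
    """Multireference WER: minimum Levenshtein distance over all refs."""
    return min(wer(hyp, ref) for ref in refs)

def per(hyp, ref):
    """Position-independent WER: unmatched reference words count as
    substitutions; the length difference yields insertions or deletions."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return max(len(hyp), len(ref)) - matches
```

For example, a translation that only permutes the reference words has PER 0 but a positive WER.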
      <Paragraph position="4"> * SSER (subjective sentence error rate): For a more fine-grained evaluation of the translation results and to check the validity of the automatic evaluation measures, subjective judgments by test persons are carried out (Niessen et al. 2000). A fixed scale for the error count per sentence is used in these subjective evaluations. Each translated sentence is judged by a human examiner according to this error scale; several human judges may be involved in judging the same translated sentence. Subjective evaluation is carried out only for the Verbmobil TEST-147 test set.</Paragraph>
    </Section>
    <Section position="2" start_page="120" end_page="121" type="sub_section">
      <SectionTitle>
4.2 Verbmobil Translation Experiments
</SectionTitle>
      <Paragraph position="0"> The experiments are carried out on the Verbmobil task (Wahlster 2000). In that task, the goal is the translation of spontaneous speech in face-to-face situations for an appointment scheduling domain. We carry out experiments for both translation directions: German to English and English to German. Although the Verbmobil task is still a limited-domain task, it is rather difficult: first, in terms of vocabulary size, namely, about 5,000 words or more for each of the two languages; second, the syntactic structures of the sentences are rather unrestricted. Although the ultimate goal of the Verbmobil project is the translation of spoken language, the input used for the translation experiments reported on in this article is mainly the (more or less) correct orthographic transcription of the spoken sentences. Thus, the effects of spontaneous speech are present in the corpus; the effect of speech recognition errors, however, is not covered. The corpus consists of 58,073 training pairs; its characteristics are given in Table 5. For the translation experiments, a trigram language model with a perplexity of 28.1 is used. The following two test corpora are used for the translation experiments: TEST-331: This test set consists of 331 test sentences. Only automatic evaluation is carried out on this test corpus: The WER and the mWER are computed. For each test sentence in the source language there is a range of acceptable reference translations (six on average) provided by a human translator, who is asked to produce word-to-word translations wherever it is possible. Some of the reference sentences are obtained by correcting automatic translations of the test sentences that are produced using the approach presented in this article with different reordering constraints. The other part is produced from the source sentences without looking at any of their translations. 
The TEST-331 test set is used as held-out data for parameter optimization (for the language model scaling factor and for the distortion model scaling factor). Furthermore, the beam search experiments in which the effect of the different pruning thresholds is demonstrated are carried out on the TEST-331 test set.</Paragraph>
      <Paragraph position="1"> TEST-147: The second, separate test set consists of 147 test sentences. Translation results are given in terms of mWER and SSER. No parameter optimization  Tillmann and Ney DP Beam Search for Statistical MT is carried out on the TEST-147 test set; the parameter values as obtained from the experiments on the TEST-331 test set are used.</Paragraph>
      <Paragraph position="2"> The following preprocessing steps are carried out: Categorization: We use some categorization, which consists of replacing a single word by a category. The only words that are replaced by a category label are proper nouns denoting German cities. Using the new labeled corpus, all probability models are trained anew. To produce translations in the &amp;quot;normal&amp;quot; language, the categories are translated by rule and are inserted into the target sentence.</Paragraph>
      <Paragraph position="3"> Word joining: Target language words are joined using a method similar to the one described in Och, Tillmann, and Ney (1999). Words are joined to handle cases like the German compound noun &amp;quot;Zahnarzttermin&amp;quot; for the English &amp;quot;dentist's appointment,&amp;quot; because a single word has to be mapped to two or more target words. The word joining is applied only to the target language words; the source language sentences remain unchanged. During the search process several joined target language words may be generated by a single source language word.</Paragraph>
      <Paragraph position="4"> Manual lexicon: To account for unseen words in the test sentences and to obtain a greater number of focused translation probabilities p(f|e), we use a bilingual German-English dictionary. For each word e in the target vocabulary, we create a list of source translations f according to this dictionary. The dictionary translation probability p_dic(f|e) is distributed uniformly over this list; its denominator is the number of source words listed as translations of the target word e. The dictionary probability p_dic(f|e) is linearly combined with the automatically trained translation probabilities p_aut(f|e) to obtain smoothed probabilities p(f|e):</Paragraph>
      <Paragraph position="6"> For the translation experiments, the value of the interpolation parameter is fixed at &#x3BB; = 0.5.</Paragraph>
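The interpolation of the manual-lexicon step can be sketched in a few lines; the function name and the dictionary-keyed representation are assumptions for illustration.

```python
def smoothed_prob(f, e, p_dic, p_aut, lam=0.5):
    """Linear interpolation of the dictionary probability with the
    automatically trained one:
    p(f|e) = lam * p_dic(f|e) + (1 - lam) * p_aut(f|e).
    At lam = 0.5 the two components are weighted equally."""
    return lam * p_dic.get((f, e), 0.0) + (1.0 - lam) * p_aut.get((f, e), 0.0)
```

Because either component defaults to zero for unseen pairs, dictionary entries contribute probability mass for source words never seen in the training corpus.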
      <Paragraph position="7"> In speech recognition, a language model scaling factor a_LM of about 15 is applied.</Paragraph>
      <Paragraph position="10"> This scaling factor is employed because the language model probabilities are more reliably estimated than the acoustic probabilities. Following this use of a language model scaling factor in speech recognition, such a factor is introduced into statistical MT, too, and the optimization criterion in equation (1) is modified accordingly. The effect of the language model scaling factor a_LM is studied on the TEST-331 test set; a minimum mWER is obtained for a suitable setting of a_LM.</Paragraph>
    </Section>
  </Section>
</Paper>