<?xml version="1.0" standalone="yes"?>
<Paper uid="J04-4002">
  <Title>The Alignment Template Approach to Statistical Machine Translation</Title>
  <Section position="5" start_page="427" end_page="431" type="metho">
    <SectionTitle>
4. Translation Model
</SectionTitle>
    <Paragraph position="0"> To describe our translation model based on the alignment templates described in the previous section in a formal way, we first decompose both the source sentence f</Paragraph>
    <Paragraph position="2"> Note that there are a large number of possible segmentations of a sentence pair into K phrase pairs. In the following, we will describe the model for a specific segmentation.</Paragraph>
    <Paragraph position="3"> Eventually, however, a model can be described in which the specific segmentation is not known when new text is translated. Hence, as part of the overall search process (Section 5), we also search for the optimal segmentation.</Paragraph>
    <Paragraph position="4"> To allow possible reordering of phrases, we introduce an alignment on the phrase level p  permutation of the phrase positions 1, ..., K and indicates that the phrases ~e</Paragraph>
    <Paragraph position="6"> are translations of one another. We assume that for the translation between these phrases a specific alignment template z</Paragraph>
    <Paragraph position="8"> Hence, our model has the following hidden variables:</Paragraph>
    <Paragraph position="10"> Figure 5 gives an example of the word alignment and phrase alignment of a German-English sentence pair.</Paragraph>
    <Paragraph position="11"> We describe our model using a log-linear modeling approach. Hence, all knowledge sources are described as feature functions that include the given source language string f  Computational Linguistics Volume 30, Number 4 Figure 5 Example of segmentation of German sentence and its English translation into alignment templates.</Paragraph>
    <Paragraph position="12"> Figure 6 Dependencies in the alignment template model.  Och and Ney The Alignment Template Approach to Statistical Machine Translation</Paragraph>
    <Section position="1" start_page="429" end_page="431" type="sub_section">
      <SectionTitle>
4.1 Feature Functions
</SectionTitle>
      <Paragraph position="0"> use the probability p(z | ~ f) defined in Section 3. We establish a corresponding feature function by multiplying the probability of all used alignment templates and taking the logarithm:</Paragraph>
      <Paragraph position="2"> in the source language sentence and j</Paragraph>
      <Paragraph position="4"> is the position of the last word of that alignment template.</Paragraph>
      <Paragraph position="5"> Note that this feature function requires that a translation of a new sentence be composed of a set of alignment templates that covers both the source sentence and the produced translation. There is no notion of &amp;quot;empty phrase&amp;quot; that corresponds to the &amp;quot;empty word&amp;quot; in word-based statistical alignment models. The alignment on the phrase level is actually a permutation, and no insertions or deletions are allowed.</Paragraph>
      <Paragraph position="6">  4.1.2 Word Selection. For scoring the use of target language words, we use a lexicon probability p(e  |f), which is estimated using relative frequencies as described in Section 3.2. The target word e depends on the aligned source words. If we denote the resulting word alignment matrix by A := A</Paragraph>
      <Paragraph position="8"> and the predicted word class for word</Paragraph>
      <Paragraph position="10"> , then the feature function h WRD is defined as follows:</Paragraph>
      <Paragraph position="12"> which is constrained to predict only words that are in the predicted word class E</Paragraph>
      <Paragraph position="14"> A disadvantage of this model is that the word order is ignored in the translation model. The translations the day after tomorrow or after the day tomorrow for the German word &amp;quot;ubermorgen receive an identical probability. Yet the first one should obtain a significantly higher probability. Hence, we also include a dependence on the word positions in the lexicon model p(e  |f , i, j):</Paragraph>
      <Paragraph position="16"> and on the number of the preceding English words aligned with</Paragraph>
      <Paragraph position="18"> . This model distinguishes the positions within a phrasal translation. The number of parameters of p(e  |f , i, j) is significantly higher than that of p(e  |f) alone. Hence, there is a data estimation problem especially for words that rarely occur. Therefore, we linearly interpolate the models p(e  |f) and p(e  |f , i, j).</Paragraph>
      <Paragraph position="19">  very often a monotone alignment is a correct alignment. Hence, the feature function  is defined to equal J. The above-stated sum includes k = K + 1 to include the distance from the end position of the last phrase to the end of sentence.</Paragraph>
      <Paragraph position="20"> The sequence of K = 6 alignment templates in Figure 5 corresponds to the following sum of seven jump distances: 0 + 0 + 1 + 3 + 2 + 0 + 0 = 6.  dard backing-off word-based trigram language model (Ney, Generet, and Wessel 1995):</Paragraph>
      <Paragraph position="22"> The use of the language model feature in equation (18) helps take long-range dependencies better into account.</Paragraph>
      <Paragraph position="23">  also use as a feature the number of produced target language words (i.e., the length of the produced target language sentence):</Paragraph>
      <Paragraph position="25"> Without this feature, we typically observe that the produced sentences tend to be too short.</Paragraph>
      <Paragraph position="26">  conventional lexicon co-occur in the given sentence pair. Therefore, the weight for the provided conventional dictionary can be learned:  The intuition is that the conventional dictionary LEX is more reliable than the automatically trained lexicon and therefore should get a larger weight.  used is that we can add numerous features that deal with specific problems of the baseline statistical MT system. Here, we will restrict ourselves to the described set of features. Yet we could use grammatical features that relate certain grammatical dependencies of source and target language. For example, using a function k(*) that counts how many arguments the main verb of a sentence has in the source or target sentence, we can define the following feature, which has a nonzero value if the verb in each of the two sentences has the same number of arguments:</Paragraph>
      <Paragraph position="28"> In the same way, we can introduce semantic features or pragmatic features such as the dialogue act classification.</Paragraph>
      <Paragraph position="29">  Och and Ney The Alignment Template Approach to Statistical Machine Translation</Paragraph>
    </Section>
    <Section position="2" start_page="431" end_page="431" type="sub_section">
      <SectionTitle>
4.2 Training
</SectionTitle>
      <Paragraph position="0"> For the three different tasks on which we report results, we use two different training approaches. For the Verbmobil task, we train the model parameters l</Paragraph>
      <Paragraph position="2"> according to the maximum class posterior probability criterion (equation (4)). For the French-English Hansards task and the Chinese-English NIST task, we simply tune the model parameters by coordinate descent on held-out data with respect to the automatic evaluation metric employed, using as a starting point the model parameters obtained on the Verbmobil task. Note that this tuning depends on the starting point of the model parameters and is not guaranteed to converge to the global optimum on the training data. As a result, this approach is limited to a very small number of model parameters. An efficient algorithm for performing this tuning for a larger number of model parameters can be found in Och (2003).</Paragraph>
      <Paragraph position="3"> A standard approach to training the log-linear model parameters of the maximum class posterior probability criterion is the GIS (Generalized Iterative Scaling) algorithm (Darroch and Ratcliff 1972). To apply this algorithm, we have to solve various practical problems. The renormalization needed in equation (3) requires a sum over many possible sentences, for which we do not know of an efficient algorithm. Hence, we approximate this sum by extracting a large set of highly probable sentences as a sample from the space of all possible sentences (n-best approximation). The set of considered sentences is computed by means of an appropriately extended version of the search algorithm described in Section 5.</Paragraph>
      <Paragraph position="4"> Using an n-best approximation, we might face the problem that the parameters trained with the GIS algorithm yield worse translation results even on the training corpus. This can happen because with the modified model scaling factors, the n-best list can change significantly and can include sentences that have not been taken into account in training. Using these sentences, the new model parameters might perform worse than the old model parameters. To avoid this problem, we proceed as follows.</Paragraph>
      <Paragraph position="5"> In a first step, we perform a search, compute an n-best list, and use this n-best list to train the model parameters. Second, we use the new model parameters in a new search and compute a new n-best list, which is combined with the existing n-best list. Third, using this extended n-best list, new model parameters are computed. This process is iterated until the resulting n-best list does not change. In this algorithm, convergence is guaranteed, as in the limit the n-best list will contain all possible translations. In practice, the algorithm converges after five to seven iterations. In our experiments this final n-best list contains about 500-1000 alternative translations.</Paragraph>
      <Paragraph position="6"> We might have the problem that none of the given reference translations is part of the n-best list because the n-best list is too small or because the search algorithm performs pruning which in principle limits the possible translations that can be produced given a certain input sentence. To solve this problem, we define as reference translation for maximum-entropy training each sentence that has the minimal number of word errors with respect to any of the reference translations in the n-best list. More details of the training procedure can be found in Och and Ney (2002).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="431" end_page="439" type="metho">
    <SectionTitle>
5. Search
</SectionTitle>
    <Paragraph position="0"> In this section, we describe an efficient search architecture for the alignment template model.</Paragraph>
    <Section position="1" start_page="431" end_page="432" type="sub_section">
      <SectionTitle>
5.1 General Concept
</SectionTitle>
      <Paragraph position="0"> In general, the search problem for statistical MT even using only Model 1 of Brown et al. (1993) is NP-complete (Knight 1999). Therefore, we cannot expect to develop  Computational Linguistics Volume 30, Number 4 efficient search algorithms that are guaranteed to solve the problem without search errors. Yet for practical applications it is acceptable to commit some search errors (Section 6.1.2). Hence, the art of developing a search algorithm lies in finding suitable approximations and heuristics that allow an efficient search without committing too many search errors.</Paragraph>
      <Paragraph position="1"> In the development of the search algorithm described in this section, our main aim is that the search algorithm should be efficient. It should be possible to translate a sentence of reasonable length within a few seconds of computing time. We accept that the search algorithm sometimes results in search errors, as long as the impact on translation quality is minor. Yet it should be possible to reduce the number of search errors by increasing computing time. In the limit, it should be possible to search without search errors. The search algorithm should not impose any principal limitations. We also expect that the search algorithm be able to scale up to very long sentences with an acceptable computing time.</Paragraph>
      <Paragraph position="2"> To meet these aims, it is necessary to have a mechanism that restricts the search effort. We accomplish such a restriction by searching in a breadth-first manner with pruning: beam search. In pruning, we constrain the set of considered translation candidates (the &amp;quot;beam&amp;quot;) only to the promising ones. We compare in beam search those hypotheses that cover different parts of the input sentence. This makes the comparison of the probabilities problematic. Therefore, we integrate an admissible estimation of the remaining probabilities to arrive at a complete translation (Section 5.6) Many of the other search approaches suggested in the literature do not meet the described aims: * Neither optimal A* search (Och, Ueffing, and Ney 2001) nor optimal integer programming (Germann et al. 2001) for statistical MT allows efficient search for long sentences.</Paragraph>
      <Paragraph position="3"> * Greedy search algorithms (Wang 1998; Germann et al. 2001) typically commit severe search errors (Germann et al. 2001).</Paragraph>
      <Paragraph position="4"> * Other approaches to solving the search problem obtain polynomial time algorithms by assuming monotone alignments (Tillmann et al. 1997) or imposing a simplified recombination structure (Niessen et al. 1998).</Paragraph>
      <Paragraph position="5"> Others make simplifying assumptions about the search space (Garc'ia-Varea, Casacuberta, and Ney 1998; Garc'ia-Varea et al. 2001), as does the original IBM stack search decoder (Berger et al. 1994). All these simplifications ultimately make the search problem simpler but introduce fundamental search errors.</Paragraph>
      <Paragraph position="6"> In the following, we describe our search algorithm based on the concept of beam search, which allows a trade-off between efficiency and quality by adjusting the size of the beam. The search algorithm can be easily adapted to other phrase-based translation models. For single-word-based search in MT, a similar algorithm has been described in Tillmann and Ney (2003).</Paragraph>
    </Section>
    <Section position="2" start_page="432" end_page="433" type="sub_section">
      <SectionTitle>
5.2 Search Problem
</SectionTitle>
      <Paragraph position="0"> Putting everything together and performing search in maximum approximation, we obtain the following decision rule:</Paragraph>
      <Paragraph position="2"> Och and Ney The Alignment Template Approach to Statistical Machine Translation Using the four feature functions AT, AL, WRD, and LM, we obtain the following decision rule:</Paragraph>
      <Paragraph position="4"> Here, we have grouped the contributions of the various feature functions into those for each word (from LM and WRD, expression (24)), those for every alignment template (from AT and AL, expression (25)), and those for the end of sentence (expression (26)), which includes a term log p(EOS  |e</Paragraph>
      <Paragraph position="6"> ) for the end-of-sentence language model probability.</Paragraph>
      <Paragraph position="7"> To extend this decision rule for the word penalty (WP) feature function, we simply obtain an additional term l WP for each word. The class-based 5-gram language model (CLM) can be included like the trigram language model. Note that all these feature functions decompose nicely into contributions for each produced target language word or for each covered source language word. This makes it possible to develop an efficient dynamic programming search algorithm. Not all feature functions have this nice property: For the conventional lexicon feature function (LEX), we obtain an additional term in our decision rule which depends on the full sentence. Therefore, this feature function will not be integrated in the dynamic programming search but instead will be used to rerank the set of candidate translations produced by the search.</Paragraph>
    </Section>
    <Section position="3" start_page="433" end_page="435" type="sub_section">
      <SectionTitle>
5.3 Structure of Search Space
</SectionTitle>
      <Paragraph position="0"> We have to structure the search space in a suitable way to search efficiently. In our search algorithm, we generate search hypotheses that correspond to prefixes of target language sentences. Each hypothesis is the translation of a part of the source language sentence. A hypothesis is extended by appending one target word. The set of all hypotheses can be structured as a graph with a source node representing the sentence start, goal nodes representing complete translations, and intermediate nodes representing partial translations. There is a directed edge between hypotheses n  . Each edge has associated costs resulting from the contributions of all feature functions. Finally, our search problem can be reformulated as finding the optimal path through this graph. In the first step, we determine the set of all source phrases in ~ f for which an applicable alignment template exists. Every possible application of an alignment template  Computational Linguistics Volume 30, Number 4 If the source sentence contains words that have not been seen in the training data, we introduce a new alignment template that performs a one-to-one translation of each of these words by itself.</Paragraph>
      <Paragraph position="1"> In the second step, we determine a set of probable target language words for each target word position in the alignment template instantiation. Only these words are then hypothesized in the search. We call this selection of highly probable words observation pruning (Tillmann and Ney 2000). As a criterion for a word e at position i in the alignment template instantiation, we use</Paragraph>
      <Paragraph position="3"> In our experiments, we hypothesize only the five best-scoring words.</Paragraph>
      <Paragraph position="4"> A decision is a triple d =(Z, e, l) consisting of an alignment template instantiation Z, the generated word e, and the index l of the generated word in Z. A hypothesis n corresponds to a valid sequence of decisions d</Paragraph>
      <Paragraph position="6"> . The possible decisions are as follows: 1. Start a new alignment template: d</Paragraph>
      <Paragraph position="8"> ,1). In this case, the index l = 1. This decision can be made only if the previous decision d i[?]1 finished an alignment template and if the newly chosen alignment template instantiation does not overlap with any previously chosen alignment template instantiation. The resulting decision score corresponds to the contribution of the LM and the WRD features (expression (24)) for the produced word and the contribution of AL and AT features (expression (25)) for the started alignment template.  2. Extend an alignment template: d</Paragraph>
      <Paragraph position="10"> , l). This decision can be made only if the previous decision uses the same alignment template instantiation and has as index l [?] 1: d</Paragraph>
      <Paragraph position="12"> decision score corresponds to the contribution of the LM and the WRD features (expression (24)).</Paragraph>
      <Paragraph position="13"> 3. Finish the translation of a sentence: d</Paragraph>
      <Paragraph position="15"> =(EOS, EOS, 0). In this case, the hypothesis is marked as a goal hypothesis. This decision is possible only if the previous decision d i[?]1 finished an alignment template and if the alignment template instantiations completely cover the input sentence. The resulting decision score corresponds to the contribution of expression (26).</Paragraph>
      <Paragraph position="16">  . The sum of the decision scores is equal to the corresponding score described in expressions (24)-(26). A straightforward representation of all hypotheses would be the prefix tree of all possible sequences of decisions. Obviously, there would be a large redundancy in this search space representation, because there are many search nodes that are indistinguishable in the sense that the subtrees following these search nodes are identical. We can recombine these identical search nodes; that is, we have to maintain only the most probable hypothesis (Bellman 1957).</Paragraph>
      <Paragraph position="17"> In general, the criterion for recombining a set of nodes is that the hypotheses can be distinguished by neither language nor translation model. In performing recombination,  Algorithm for breadth-first search with pruning.</Paragraph>
      <Paragraph position="18"> we obtain a search graph instead of a search tree. The exact criterion for performing recombination for the alignment templates is described in Section 5.5.</Paragraph>
    </Section>
    <Section position="4" start_page="435" end_page="435" type="sub_section">
      <SectionTitle>
5.4 Search Algorithm
</SectionTitle>
      <Paragraph position="0"> Theoretically, we could use any graph search algorithm to search the optimal path in the search space. We use a breadth-first search algorithm with pruning. This approach offers very good possibilities for adjusting the trade-off between quality and efficiency.</Paragraph>
      <Paragraph position="1"> In pruning, we always compare hypotheses that have produced the same number of target words.</Paragraph>
      <Paragraph position="2"> Figure 7 shows a structogram of the algorithm. As the search space increases exponentially, it is not possible to explicitly represent it. Therefore, we represent the search space implicitly, using the functions Extend and Recombine. The function Extend produces new hypotheses extending the current hypothesis by one word. Some hypotheses might be identical or indistinguishable by the language and translation models. These are recombined by the function Recombine. We expand the search space such that only hypotheses with the same number of target language words are recombined.</Paragraph>
      <Paragraph position="3"> In the pruning step, we use two different types of pruning. First, we perform pruning relative to the score ^ Q of the current best hypothesis. We ignore all hypotheses that have a probability lower than log(t</Paragraph>
      <Paragraph position="5"> Q, where t p is an adjustable pruning parameter. This type of pruning can be performed when the hypothesis extensions are computed. Second, in histogram pruning (Steinbiss, Tran, and Ney 1994), we maintain only the best N p hypotheses. The two pruning parameters t p and N p have to be optimized with respect to the trade-off between efficiency and quality.</Paragraph>
    </Section>
    <Section position="5" start_page="435" end_page="436" type="sub_section">
      <SectionTitle>
5.5 Implementation
</SectionTitle>
      <Paragraph position="0"> In this section, we describe various issues involved in performing an efficient implementation of a search algorithm for the alignment template approach.</Paragraph>
      <Paragraph position="1"> A very important design decision in the implementation is the representation of a hypothesis. Theoretically, it would be possible to represent search hypotheses only by the associated decision and a back-pointer to the previous hypothesis. Yet this would be a very inefficient representation for the implementation of the operations  Computational Linguistics Volume 30, Number 4 that have to be performed in the search. The hypothesis representation should contain all information required to perform efficiently the computations needed in the search but should contain no more information than that, to keep the memory consumption small.</Paragraph>
      <Paragraph position="2"> In search, we produce hypotheses n, each of which contains the following information: null  1. e: the final target word produced 2. h: the state of the language model (to predict the following word) 3. c = c J 1 : the coverage vector representing the already covered positions of the source sentence (c j = 1 means the position j is covered, c j = 0 means the position j is not covered) 4. Z: a reference to the alignment template instantiation that produced the final target word 5. l: the position of the final target word in the alignment template instantiation 6. Q(n): the accumulated score of all previous decisions 7. n prime : a reference to the previous hypothesis Using this representation, we can perform the following operations very efficiently: * Determining whether a specific alignment template instantiation can be used to extend a hypothesis. To do this, we check whether the positions of the alignment template instantiation are still free in the hypothesis coverage vector.</Paragraph>
      <Paragraph position="3"> * Checking whether a hypothesis is final. To do this, we determine  whether the coverage vector contains no uncovered position. Using a bit vector as representation, the operation to check whether a hypothesis is final can be implemented very efficiently.</Paragraph>
      <Paragraph position="4">  ))alignment template instantiation finished We compare in beam search those hypotheses that cover different parts of the input sentence. This makes the comparison of the probabilities problematic. Therefore, we integrate an admissible estimation of the remaining probabilities to arrive at a complete translation. Details of the heuristic function for the alignment templates are provided in the next section.</Paragraph>
    </Section>
    <Section position="6" start_page="436" end_page="439" type="sub_section">
      <SectionTitle>
5.6 Heuristic Function
</SectionTitle>
      <Paragraph position="0"> To improve the comparability of search hypotheses, we introduce heuristic functions.</Paragraph>
      <Paragraph position="1"> A heuristic function estimates the probabilities of reaching the goal node from a certain  Och and Ney The Alignment Template Approach to Statistical Machine Translation search node. An admissible heuristic function is always an optimistic estimate; that is, for each search node, the product of edge probabilities of reaching a goal node is always equal to or smaller than the estimated probability. For an A*-based search algorithm, a good heuristic function is crucial to being able to translate long sentences. For a beam search algorithm, the heuristic function has a different motivation. It is used to improve the scoring of search hypotheses. The goal is to make the probabilities of all hypotheses more comparable, in order to minimize the chance that the hypothesis leading to the optimal translation is pruned away.</Paragraph>
      <Paragraph position="2"> Heuristic functions for search in statistical MT have been used in Wang and Waibel (1997) and Och, Ueffing, and Ney (2001). Wang and Waibel (1997) have described a simple heuristic function for Model 2 of Brown et al. (1993) that was not admissible. Och, Ueffing, and Ney (2001) have described an admissible heuristic function for Model 4 of Brown et al. (1993) and an almost-admissible heuristic function that is empirically obtained.</Paragraph>
      <Paragraph position="3"> We have to keep in mind that a heuristic function is helpful only if the overhead introduced in computing the heuristic function is more than compensated for by the gain obtained through a better pruning of search hypotheses. The heuristic functions described in the following are designed such that their computation can be performed efficiently.</Paragraph>
      <Paragraph position="4"> The basic idea for developing a heuristic function for an alignment model is that all source sentence positions that have not been covered so far still have to be translated to complete the sentence. If we have an estimation r X (j) of the optimal score for translating position j, then the value of the heuristic function R X (n) for a node n can be inferred by summing over the contribution for every position j that is not in the coverage vector c(n) (here X denotes different possibilities to choose the heuristic function):</Paragraph>
      <Paragraph position="6"> The situation in the case of the alignment template approach is more complicated, as not every word is translated alone, but typically the words are translated in context.</Paragraph>
      <Paragraph position="7"> Therefore, the basic quantity for the heuristic function in the case of the alignment template approach is a function r(Z) that assigns to every alignment template instantiation Z a maximal probability. Using r(Z), we can induce a position-dependent heuristic function r(j):</Paragraph>
      <Paragraph position="9"> Here, J(Z) denotes the number of source language words produced by the alignment template instantiation Z and j(Z) denotes the position of the first source language word. It can be easily shown that if r(Z) is admissible, then r(j) is also admissible. We have to show that for all nonoverlapping sequences Z</Paragraph>
      <Paragraph position="11"> Here, k(j) denotes the phrase index k that includes the target language word position j.</Paragraph>
      <Paragraph position="12"> In the following, we develop various heuristic functions r(Z) of increasing complexity.</Paragraph>
      <Paragraph position="13"> The simplest realization of a heuristic function r(Z) takes into account only the prior probability of an alignment template instantiation:</Paragraph>
      <Paragraph position="15"> The lexicon model can be integrated as follows:  Here, we assume a trigram language model. In general, it is necessary to maximize over all possible different language model histories. We can also combine the language model and the lexicon model into one heuristic function:  To include the phrase alignment probability in the heuristic function, we compute the minimum sum of all jump widths that is needed to complete the translation. This sum can be computed efficiently using the algorithm shown in Figure 8. Then, an admissible heuristic function for the jump width is obtained by</Paragraph>
      <Paragraph position="17"> Combining all the heuristic functions for the various models, we obtain as final heuristic function for a search hypothesis n</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="439" end_page="445" type="metho">
    <SectionTitle>
6. Results
6.1 Results on the Verbmobil Task
</SectionTitle>
    <Paragraph position="0"> We present results on the Verbmobil task, which is a speech translation task in the domain of appointment scheduling, travel planning, and hotel reservation (Wahlster 2000). Table 2 shows the corpus statistics for this task. We use a training corpus, which is used to train the alignment template model and the language models, a development corpus, which is used to estimate the model scaling factors, and a test corpus. On average, 3.32 reference translations for the development corpus and 5.14 reference translations for the test corpus are used.</Paragraph>
    <Paragraph position="1"> A standard vocabulary had been defined for the various speech recognizers used in Verbmobil. However, not all words of this vocabulary were observed in the training corpus. Therefore, the translation vocabulary was extended semiautomatically by adding about 13,000 German-English entries from an online bilingual lexicon available on the Web. The resulting lexicon contained not only word-word entries, but also multi-word translations, especially for the large number of German compound words. To counteract the sparseness of the training data, a couple of straightforward rule-based preprocessing steps were applied before any other type of processing:  Computational Linguistics Volume 30, Number 4 So far, in machine translation research there is no generally accepted criterion for the evaluation of experimental results. Therefore, we use various criteria. In the following experiments, we use: * WER (word error rate)/mWER (multireference word error rate): The WER is computed as the minimum number of substitution, insertion, and deletion operations that have to be performed to convert the generated sentence into the target sentence. In the case of the multireference word error rate for each test sentence, not just a single reference translation is used, as for the WER, but a whole set of reference translations. For each translation hypothesis, the edit distance to the most similar sentence is calculated (Niessen et al. 2000).</Paragraph>
    <Paragraph position="2"> * PER (position-independent WER): A shortcoming of the WER is the fact that it requires a perfect word order. An acceptable sentence can have a word order that is different from that of the target sentence, so the WER measure alone could be misleading. To overcome this problem, we introduce as an additional measure the position-independent word error rate. This measure compares the words in the two sentences, ignoring the word order.</Paragraph>
    <Paragraph position="3"> * BLEU (bilingual evalutation understudy) score: This score measures the precision of unigrams, bigrams, trigrams, and 4-grams with respect to a whole set of reference translations, with a penalty for too-short sentences (Papineni et al. 2001). Unlike all other evaluation criteria used here, BLEU measures accuracy, that is, the opposite of error rate. Hence, the larger BLEU scores, the better.</Paragraph>
    <Paragraph position="4"> In the following, we analyze the effect of various system components: alignment template length, search pruning, and language model n-gram size. A systematic evaluation of the alignment template system comparing it with other translation approaches (e.g., rule-based) has been performed in the Verbmobil project and is described in Tessiore and von Hahn (2000). There, the alignment-template-based system achieved a significantly larger number of &amp;quot;approximately correct&amp;quot; translations than the competing translation systems (Ney, Och, and Vogel 2001).</Paragraph>
    <Paragraph position="5">  the maximum length of the alignment templates in the source language. Typically, it is necessary to restrict the alignment template length to keep memory requirements low. We see that using alignment templates with only one or two words in the source languages results in very bad translation quality. Yet using alignment templates with lengths as small as three words yields optimal results.</Paragraph>
    <Paragraph position="6">  of beam search pruning and of the heuristic function. We use the following criteria: * Number of search errors: A search error occurs when the search algorithm misses the most probable translation and produces a translation which is less probable. As we typically cannot efficiently compute the probability of the optimal translation, we cannot efficiently compute the number of search errors. Yet we can compute a lower bound on the number of search errors by comparing the translation  Effect of pruning parameter t p and heuristic function on search efficiency for direct-translation model (N p = 50,000).</Paragraph>
    <Paragraph position="7"> no heuristic function AT+WRD +LM +AL time search time search time search time search  114.6 34 119.2 5 146.2 2 75.2 0 found under specific pruning thresholds with the best translation that we have found using very conservative pruning thresholds.</Paragraph>
    <Paragraph position="8"> * Average translation time per sentence: Pruning is used to adjust the trade-off between efficiency and quality. Hence, we present the average time needed to translate one sentence of the test corpus.</Paragraph>
    <Paragraph position="9"> * Translation quality (mWER, BLEU): Typically, a sentence can have many different correct translations. Therefore, a search error does not necessarily result in poorer translation quality. It is even possible that a search error can improve translation quality. Hence, we analyze the effect of search on translation quality, using the automatic evaluation criteria mWER and BLEU.</Paragraph>
    <Paragraph position="10"> Tables 4 and 5 show the effect of the pruning parameter t  . In all four tables, we provide the results for using no heuristic functions and three variants of an increasingly informative heuristic function. The first is an estimate of the alignment template and the lexicon probability (AT+WRD), the second adds an estimate of the language model (+LM) probability, and the third also adds the alignment probability (+AL). These heuristic functions are described in Section 5.6.</Paragraph>
    <Paragraph position="11"> Without a heuristic function, even more than a hundred seconds per sentence cannot guarantee search-error-free translation. We draw the conclusion that a good heuristic function is very important to obtaining an efficient search algorithm.  In addition, the search errors have a more severe effect on the error rates if we do not use a heuristic function. If we compare the error rates in Table 7, which correspond to about 55 search errors in Table 6, we obtain an mWER of 36.7% (53 search errors) using no heuristic function and an mWER of 32.6% (57 search errors) using the combined heuristic function. The reason is that without a heuristic function, often the &amp;quot;easy&amp;quot; part of the input sentence is translated first. This yields severe reordering errors.</Paragraph>
    <Paragraph position="12"> 6.1.3 Effect of the Length of the Language Model History. In this work, we use only n-gram-based language models. Ideally, we would like to take into account long-range dependencies. Yet long n-grams are seen rarely and are therefore rarely used on unseen data. Therefore, we expect that extending the history length will at some point not improve further translation quality.</Paragraph>
    <Paragraph position="13"> Table 8 shows the effect of the length of the language model history on translation quality. We see that the language model perplexity improves from 4,781 for a unigram model to 29.9 for a trigram model. The corresponding translation quality improves from an mWER of 45.9% to an mWER of 31.8%. The largest effect seems to come from taking into account the bigram dependence, which achieves an mWER of 32.9%. If we perform log-linear interpolation of a trigram model with a class-based 5-gram model, we observe an additional small improvement in translation quality to an mWER of 30.9%.</Paragraph>
    <Section position="1" start_page="443" end_page="443" type="sub_section">
      <SectionTitle>
6.2 Results on the Hansards Task
</SectionTitle>
      <Paragraph position="0"> The Hansards task involves the proceedings of the Canadian parliament, which are kept by law in both French and English. About three million parallel sentences of this bilingual data have been made available by the Linguistic Data Consortium (LDC).</Paragraph>
      <Paragraph position="1"> Here, we use a subset of the data containing only sentences of up to 30 words. Table 9 shows the training and test corpus statistics.</Paragraph>
      <Paragraph position="2"> The results for French to English and for English to French are shown in Table 10.</Paragraph>
      <Paragraph position="3"> Because of memory limitations, the maximum alignment template length has been restricted to four words. We compare here against the single-word-based search for Model 4 described in Tillmann (2001). We see that the alignment template approach obtains significantly better results than the single-word-based search.</Paragraph>
    </Section>
    <Section position="2" start_page="443" end_page="445" type="sub_section">
      <SectionTitle>
6.3 Results on Chinese-English
</SectionTitle>
      <Paragraph position="0"> Various statistical, example-based, and rule-based MT systems for a Chinese-English news domain were evaluated in the NIST 2002 MT evaluation.</Paragraph>
      <Paragraph position="1">  template approach described in this article, we participated in these evaluations. The problem domain is the translation of Chinese news text into English. Table 11 gives an overview on the training and test data. The English vocabulary consists of full-form words that have been converted to lowercase letters. The number of sentences has been artificially increased by adding certain parts of the original training material more than once to the training corpus, in order to give larger weight to those parts of the training corpus that consist of high-quality aligned Chinese news text and are therefore expected to be especially helpful for the translation of the test data.</Paragraph>
    </Section>
    <Section position="3" start_page="445" end_page="445" type="sub_section">
      <SectionTitle>
Results of the Chinese-English NIST MT evaluation, June 2002, large data track (NIST-09 score: larger values are better)
</SectionTitle>
      <Paragraph position="0"> larger values are better).</Paragraph>
      <Paragraph position="1"> System NIST-09 score Alignment template approach 7.65 Competing research systems 5.03-7.34 Best of six commercial off-the-shelf systems 6.08 The Chinese language poses special problems because the boundaries of Chinese words are not marked. Chinese text is provided as a sequence of characters, and it is unclear which characters have to be grouped together to obtain entities that can be interpreted as words. For statistical MT, it would be possible to ignore this fact and treat the Chinese characters as elementary units and translate them into English. Yet preliminary experiments showed that the existing alignment models produce better results if the Chinese characters are segmented in a preprocessing step into single words. We use the LDC segmentation tool.</Paragraph>
      <Paragraph position="2">  For the English corpus, the following preprocessing steps are applied. First, the corpus is tokenized; it is then segmented into sentences, and all uppercase characters are converted to lowercase. As the final evaluation criterion does not distinguish case, it is not necessary to deal with the case information.</Paragraph>
      <Paragraph position="3"> Then, the preprocessed Chinese and English corpora are sentence aligned in which the lengths of the source and target sentences are significantly different. From the resulting corpus, we automatically replace translations. In addition, only sentences with less than 60 words in English and Chinese are used.</Paragraph>
      <Paragraph position="4"> To improve the translation of Chinese numbers, we use a categorization of Chinese number and date expressions. For the statistical learning, all number and date expressions are replaced with one of two generic symbols, $number or $date. The number and date expressions are subjected to a rule-based translation by simple lexicon lookup. The translation of the number and date expressions is inserted into the output using the alignment information. For Chinese and English, this categorization is implemented independently of the other language.</Paragraph>
      <Paragraph position="5"> To evaluate MT quality on this task, NIST made available the NIST-09 evaluation tool. This tool provides a modified BLEU score by computing a weighted precision of n-grams modified by a length penalty for very short translations. Table 12 shows the results of the official evaluation performed by NIST in June 2002. With a score of 7.65, the results obtained were statistically significantly better than any other competing approach. Differences in the NIST score larger than 0.12 are statistically significant at the 95% level. We conclude that the developed alignment template approach is also applicable to unrelated language pairs such as Chinese-English and that the developed statistical models indeed seem to be largely language-independent. Table 13 shows various example translations.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="445" end_page="445" type="metho">
    <SectionTitle>
7. Conclusions
</SectionTitle>
    <Paragraph position="0"> We have presented a framework for statistical MT for natural languages which is more general than the widely used source-channel approach. It allows a baseline MT</Paragraph>
  </Section>
class="xml-element"></Paper>