File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/j04-4002_intro.xml
Size: 13,028 bytes
Last Modified: 2025-10-06 14:02:15
<?xml version="1.0" standalone="yes"?> <Paper uid="J04-4002"> <Title>The Alignment Template Approach to Statistical Machine Translation</Title> <Section position="4" start_page="420" end_page="427" type="intro"> <SectionTitle> 3. Learning Translation Lexica </SectionTitle> <Paragraph position="0"> In this section, we describe methods for learning the single-word and phrase-based translation lexica that are the basis of the machine translation system described in Section 4. First, we introduce the basic concepts of statistical alignment models, which are used to learn word alignments. Then, we describe how these alignments can be used to learn bilingual phrasal translations.</Paragraph> <Section position="1" start_page="421" end_page="421" type="sub_section"> <SectionTitle> 3.1 Statistical Alignment Models </SectionTitle> <Paragraph position="0"> In statistical alignment models Pr(f_1^J, a_1^J | e_1^I), a hidden alignment a_1^J is introduced that describes a mapping from each source position j to a target position a_j; the value a_j = 0 (the empty word) is used to account for source words that are not aligned with any target word. In general, the statistical model depends on a set of unknown parameters θ that is learned from training data. To express the dependence of the model on the parameter set, we use the following notation:

\[ \Pr(f_1^J, a_1^J \mid e_1^I) = p_{\theta}(f_1^J, a_1^J \mid e_1^I) \]

A detailed description of different specific statistical alignment models can be found in Brown et al. (1993) and Och and Ney (2003). Here, we use the hidden Markov model (HMM) alignment model (Vogel, Ney, and Tillmann 1996) and Model 4 of Brown et al. (1993) to compute the word alignment for the parallel training corpus.</Paragraph> <Paragraph position="1"> To train the unknown parameters θ, we are given a parallel training corpus consisting of S sentence pairs {(f_s, e_s) : s = 1, ..., S}. The parameters are determined by maximizing the likelihood of this corpus:

\[ \hat{\theta} = \operatorname*{argmax}_{\theta} \prod_{s=1}^{S} \sum_{a} p_{\theta}(f_s, a \mid e_s) \]

This optimization can be performed using the expectation maximization (EM) algorithm (Dempster, Laird, and Rubin 1977). For a given sentence pair, there is a large number of possible alignments. The alignment \hat{a}_1^J that has the highest probability (under a certain model) is also called the Viterbi alignment (of that model):

\[ \hat{a}_1^J = \operatorname*{argmax}_{a_1^J} p_{\hat{\theta}}(f_1^J, a_1^J \mid e_1^I) \]

A detailed comparison of the quality of these Viterbi alignments for various statistical alignment models against human-made word alignments can be found in Och and Ney (2003).</Paragraph> </Section> <Section position="2" start_page="421" end_page="423" type="sub_section"> <SectionTitle> 3.2 Symmetrization </SectionTitle> <Paragraph position="0"> The baseline alignment model does not allow a source word to be aligned with two or more target words. Lexical correspondences like the German compound word Zahnarzttermin for dentist's appointment therefore cause problems, because a single source word must be mapped onto two or more target words. As a consequence, the resulting Viterbi alignment of the standard alignment models has a systematic loss in recall. Here, we describe various methods for performing a symmetrization of our directed statistical alignment models by applying a heuristic postprocessing step that combines the alignments in both translation directions (source to target, target to source). Figure 2 shows an example of a symmetrized alignment.</Paragraph> <Paragraph position="1"> [Figure 2. Example of a (symmetrized) word alignment (Verbmobil task).]</Paragraph> <Paragraph position="2"> To solve this problem, we train in both translation directions. For each sentence pair, we compute two Viterbi alignments a_1^J and b_1^I. Let A_1 = {(a_j, j) | a_j > 0} and A_2 = {(i, b_i) | b_i > 0} denote the sets of alignments in the two Viterbi alignments. To increase the quality of the alignments, we can combine (symmetrize) A_1 and A_2 into one alignment matrix A by intersection, by union, or by a refined method. In the refined method, the intersection A = A_1 ∩ A_2 is determined first. The elements of this intersection result from both Viterbi alignments and are therefore very reliable. Then, we extend the alignment A iteratively by adding alignments (i, j) occurring only in the alignment A_1 or in the alignment A_2 if neither f_j nor e_i has an alignment in A, or if the alignment (i, j) has a horizontal neighbor (i - 1, j), (i + 1, j) or a vertical neighbor (i, j - 1), (i, j + 1) that is already in A and the extended set does not contain alignments with both horizontal and vertical neighbors.</Paragraph> <Paragraph position="3"> Obviously, the intersection yields an alignment consisting of only one-to-one alignments, with a higher precision and a lower recall. The union yields a higher recall and a lower precision of the combined alignment. The refined method is often able to improve both precision and recall compared to the nonsymmetrized alignments. Whether a higher precision or a higher recall is preferred depends on the final application of the word alignment. For the purpose of statistical MT, a higher recall seems to be more important. Therefore, we use the union or the refined combination method to obtain a symmetrized alignment matrix.</Paragraph>
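<Paragraph position="4"> As an illustration of these combination methods, the following is a minimal Python sketch, not code from the paper: it assumes each directed Viterbi alignment has already been converted to a set of (i, j) pairs, where i is a target position and j a source position, and all function names are illustrative.

```python
def _neighbors(a, i, j):
    """Flags (has_horizontal, has_vertical) for link (i, j) within alignment set a."""
    return ((i - 1, j) in a or (i + 1, j) in a,
            (i, j - 1) in a or (i, j + 1) in a)

def _is_valid(a):
    """True if no link in a has both a horizontal and a vertical neighbor."""
    return not any(h and v for h, v in (_neighbors(a, i, j) for i, j in a))

def intersect(a1, a2):
    """Intersection: only one-to-one links; high precision, low recall."""
    return a1 & a2

def union(a1, a2):
    """Union of both directed alignments: high recall, low precision."""
    return a1 | a2

def refined(a1, a2):
    """Refined combination: grow the intersection with links from either direction."""
    a = set(a1 & a2)
    candidates = (a1 | a2) - a
    changed = True
    while changed:
        changed = False
        for i, j in sorted(candidates - a):
            f_j_free = all(jj != j for _, jj in a)  # source word f_j unaligned in A
            e_i_free = all(ii != i for ii, _ in a)  # target word e_i unaligned in A
            h, v = _neighbors(a, i, j)
            if (f_j_free and e_i_free) or ((h or v) and _is_valid(a | {(i, j)})):
                a.add((i, j))
                changed = True
    return a

# Toy example: two directed alignments over a short sentence pair.
a1 = {(0, 0), (1, 1), (3, 3), (3, 4)}
a2 = {(0, 0), (1, 1), (2, 2), (3, 3)}
print(sorted(refined(a1, a2)))  # [(0, 0), (1, 1), (2, 2), (3, 3), (3, 4)]
```

The fixed-point loop reflects the iterative extension described above; the order in which candidate links are visited can matter in rare tie cases, so this sketch should be read as an approximation of the heuristic rather than a reference implementation.</Paragraph>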
<Paragraph position="5"> The resulting symmetrized alignments are then used to train single-word-based translation lexica p(e | f) by computing relative frequencies, using the count N(e, f) of how many times e and f are aligned divided by the count N(f) of how many times the word f occurs:

\[ p(e \mid f) = \frac{N(e, f)}{N(f)} \]

</Paragraph> </Section> <Section position="3" start_page="423" end_page="424" type="sub_section"> <SectionTitle> 3.3 Bilingual Contiguous Phrases </SectionTitle> <Paragraph position="0"> In this section, we present a method for learning relationships between whole phrases of m source language words and n target language words. This algorithm, which will be called phrase-extract, takes as input a general word alignment matrix (Section 3.2). The output is a set of bilingual phrases.</Paragraph> <Paragraph position="1"> In the following, we describe the criterion that defines the set of phrases that is consistent with the word alignment matrix:

\[ BP(f_1^J, e_1^I, A) = \Big\{ \big(f_j^{j+m}, e_i^{i+n}\big) : \forall (i', j') \in A \colon j \le j' \le j + m \leftrightarrow i \le i' \le i + n \ \wedge \ \exists (i', j') \in A \colon j \le j' \le j + m \wedge i \le i' \le i + n \Big\} \]

Hence, the set of all bilingual phrases that are consistent with the alignment is constituted by all bilingual phrase pairs in which all words within the source language phrase are aligned only with the words of the target language phrase and the words of the target language phrase are aligned only with the words of the source language phrase. Note that we require that at least one word in the source language phrase be aligned with at least one word of the target language phrase. As a result, there are no empty source or target language phrases that would correspond to the &quot;empty word&quot; of the word-based statistical alignment models.</Paragraph> <Paragraph position="2"> These phrases can be computed straightforwardly by enumerating all possible phrases in one language and checking whether the aligned words in the other language are consecutive, with the possible exception of words that are not aligned at all. Figure 3 gives the algorithm phrase-extract that computes the phrases. The algorithm takes into account possibly unaligned words at the boundaries of the source or target language phrases.</Paragraph> <Paragraph position="3"> [Figure 3. Algorithm phrase-extract for extracting phrases from a word-aligned sentence pair. Here quasi-consecutive(TP) is a predicate that tests whether the set of words TP is consecutive, with the possible exception of words that are not aligned.]</Paragraph>
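<Paragraph position="4"> Since Figure 3 itself is not reproduced here, the following Python sketch illustrates the extraction idea under the same (i, j) pair convention as above. It checks the consistency criterion by projecting each source span onto the target side, and it omits the extension over unaligned boundary words that the full algorithm performs:

```python
def phrase_extract(source, target, alignment, max_len=7):
    """Extract bilingual phrases consistent with a word alignment.

    source, target: token lists; alignment: set of (i, j) pairs with
    i indexing target words and j indexing source words (0-based)."""
    phrases = set()
    J = len(source)
    for j1 in range(J):
        for j2 in range(j1, min(j1 + max_len, J)):
            # Target positions aligned to the source span [j1, j2].
            tp = {i for (i, j) in alignment if j1 <= j <= j2}
            if not tp:
                continue  # at least one alignment link is required
            i1, i2 = min(tp), max(tp)
            # Consistency: no target word in [i1, i2] may be aligned
            # to a source word outside [j1, j2].
            if any(i1 <= i <= i2 and not j1 <= j <= j2 for (i, j) in alignment):
                continue
            phrases.add((' '.join(source[j1:j2 + 1]),
                         ' '.join(target[i1:i2 + 1])))
    return phrases

# Toy example with a hypothetical alignment (target index, source index).
src = 'ja , ich denke mal'.split()
tgt = 'yes , I think'.split()
links = {(0, 0), (1, 1), (2, 2), (3, 3), (3, 4)}
print(sorted(phrase_extract(src, tgt, links)))
```

In this toy run the span denke mal is extracted with the single target word think, while denke alone is not, because think is also linked to mal; this is exactly the behavior the consistency criterion prescribes.</Paragraph>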
<Paragraph position="5"> Table 1 shows the bilingual phrases containing between two and seven words that result from the application of this algorithm to the alignment of Figure 2.</Paragraph> <Paragraph position="6"> Table 1. Examples of two- to seven-word bilingual phrases obtained by applying the algorithm phrase-extract to the alignment of Figure 2.

ja , | yes ,
ja , ich | yes , I
ja , ich denke mal | yes , I think
ja , ich denke mal , | yes , I think ,
ja , ich denke mal , also | yes , I think , well
, ich denke mal | , I think
, ich denke mal , | , I think ,
, ich denke mal , also | , I think , well
, ich denke mal , also wir | , I think , well we
ich denke mal | I think
ich denke mal , | I think ,
ich denke mal , also | I think , well
ich denke mal , also wir | I think , well we
ich denke mal , also wir wollten | I think , well we plan to
denke mal , | think ,
denke mal , also | think , well
denke mal , also wir | think , well we
denke mal , also wir wollten | think , well we plan to
, also | , well
, also wir | , well we
, also wir wollten | , well we plan to
also wir | well we
also wir wollten | well we plan to
wir wollten | we plan to
in unserer | in our
in unserer Abteilung | in our department
in unserer Abteilung ein neues Netzwerk | a new network in our department
in unserer Abteilung ein neues Netzwerk aufbauen | set up a new network in our department
unserer Abteilung | our department
ein neues | a new
ein neues Netzwerk | a new network
ein neues Netzwerk aufbauen | set up a new network
neues Netzwerk | new network</Paragraph> <Paragraph position="7"> It should be emphasized that this restriction to consecutive phrases limits the expressive power. If a consecutive phrase in one language is translated into two or three nonconsecutive phrases in the other language, there is no corresponding bilingual phrase pair learned by this approach. In principle, this approach to learning phrases from a word-aligned corpus could be extended straightforwardly to handle nonconsecutive phrases in source and target language as well. Informal experiments have shown that allowing for nonconsecutive phrases significantly increases the number of extracted phrases and especially increases the percentage of wrong phrases. Therefore, we consider only consecutive phrases.</Paragraph> </Section> <Section position="4" start_page="424" end_page="427" type="sub_section"> <SectionTitle> 3.4 Alignment Templates </SectionTitle> <Paragraph position="0"> In the following, we add generalization capability to the bilingual phrase lexicon by replacing words with word classes and also by storing the alignment information for each phrase pair. These generalized and alignment-annotated phrase pairs are called alignment templates. Formally, an alignment template z is a triple (F_1^{J'}, E_1^{I'}, \tilde{A}) that describes the alignment \tilde{A} between a source class sequence F_1^{J'} and a target class sequence E_1^{I'}. The alignment information is represented as a matrix of binary elements: a matrix element with value 1 means that the words at the corresponding positions are aligned, and the value 0 means that the words are not aligned. If a source word is not aligned with a target word, then it is aligned with the empty word e_0, which is placed at the imaginary position i = 0.</Paragraph> <Paragraph position="1"> The classes used in F_1^{J'} and E_1^{I'} are automatically trained bilingual classes obtained with the method described in Och (1999) and constitute a partition of the vocabulary of source and target language. In general, we are not limited to disjoint classes as long as each specific instance of a word is disambiguated, that is, uniquely belongs to a specific class. In the following, we use the class function C to map words to their classes. Hence, it would be possible to employ parts of speech or semantic categories instead of the automatically trained word classes used here.</Paragraph> <Paragraph position="2"> The use of classes instead of the words themselves has the advantage of better generalization. For example, if there exist classes in source and target language that contain town names, it is possible that an alignment template learned using a specific town name can be generalized to other town names.</Paragraph> <Paragraph position="3"> In the following, \tilde{e} and \tilde{f} denote target and source phrases, respectively. To train the probability of applying an alignment template p(z = (F_1^{J'}, E_1^{I'}, \tilde{A}) | \tilde{f}), we use an extended version of the algorithm phrase-extract from Section 3.3. All bilingual phrases that are consistent with the alignment are extracted together with the alignment within each bilingual phrase. Thus, we obtain a count N(z) of how often an alignment template occurred in the aligned training corpus. The probability of using an alignment template to translate a specific source language phrase \tilde{f} is then estimated by relative frequency:

\[ p\big(z = (F_1^{J'}, E_1^{I'}, \tilde{A}) \mid \tilde{f}\big) = \frac{N(z) \cdot \delta\big(F_1^{J'}, C(\tilde{f})\big)}{N\big(C(\tilde{f})\big)} \]

</Paragraph> <Paragraph position="4"> [Figure 4. Examples of alignment templates obtained in training.]</Paragraph>
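<Paragraph position="5"> The following Python sketch shows one way to obtain the counts N(z) by extending the phrase_extract sketch from Section 3.3. The class mapping word_class and all function names are illustrative assumptions, and the internal alignment is stored here as a set of phrase-relative (i, j) pairs rather than as an explicit binary matrix:

```python
from collections import Counter

def extract_templates(source, target, alignment, word_class, max_len=7):
    """Count alignment templates: class sequences plus their internal alignment.

    word_class: assumed dict mapping every source and target word to a class id."""
    template_counts, class_counts = Counter(), Counter()
    J = len(source)
    for j1 in range(J):
        for j2 in range(j1, min(j1 + max_len, J)):
            tp = {i for (i, j) in alignment if j1 <= j <= j2}
            if not tp:
                continue
            i1, i2 = min(tp), max(tp)
            if any(i1 <= i <= i2 and not j1 <= j <= j2 for (i, j) in alignment):
                continue  # not consistent with the alignment
            f_classes = tuple(word_class[w] for w in source[j1:j2 + 1])
            e_classes = tuple(word_class[w] for w in target[i1:i2 + 1])
            # Internal alignment, re-indexed relative to the phrase pair.
            a_tilde = frozenset((i - i1, j - j1) for (i, j) in alignment
                                if i1 <= i <= i2 and j1 <= j <= j2)
            z = (f_classes, e_classes, a_tilde)
            template_counts[z] += 1
            class_counts[f_classes] += 1
    return template_counts, class_counts

def template_prob(z, template_counts, class_counts):
    """Relative-frequency estimate of applying template z to a source phrase
    whose class sequence equals z[0]."""
    return template_counts[z] / class_counts[z[0]]
```

Because templates are keyed by class sequences rather than by the words themselves, a template learned from one town name applies to any phrase whose words fall into the same classes, which is precisely the generalization motivated above.</Paragraph>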
<Paragraph position="6"> To reduce the memory requirement of the alignment templates, we compute these probabilities only for phrases up to a certain maximal length in the source language. Depending on the size of the corpus, the maximal length in the experiments is between four and seven words. In addition, we remove alignment templates that have a probability lower than a certain threshold. In the experiments, we use a threshold of 0.01.</Paragraph> <Paragraph position="7"> It should be emphasized that this algorithm for computing aligned phrase pairs and their associated probabilities is very easy to implement. The joint translation model suggested by Marcu and Wong (2002) tries to learn phrases as part of a full EM algorithm, which leads to very large memory requirements and a rather complicated training algorithm. A comparison of the two approaches can be found in Koehn, Och, and Marcu (2003).</Paragraph> </Section> </Section> </Paper>