Symmetric Word Alignments for Statistical Machine Translation

2 Statistical Word Alignment Models

In this section, we give an overview of the commonly used statistical word alignment techniques. They are based on the source-channel approach to statistical machine translation (Brown et al., 1993). We are given a source language sentence $f_1^J := f_1 \ldots f_j \ldots f_J$ which has to be translated into a target language sentence $e_1^I := e_1 \ldots e_i \ldots e_I$. Among all possible target language sentences, we choose the sentence with the highest probability:

$$\hat{e}_1^I = \operatorname*{argmax}_{e_1^I} \left\{ Pr(e_1^I) \cdot Pr(f_1^J \mid e_1^I) \right\}$$

This decomposition into two knowledge sources allows for an independent modeling of the target language model $Pr(e_1^I)$ and the translation model $Pr(f_1^J \mid e_1^I)$. Into the translation model, the word alignment $A$ is introduced as a hidden variable:

$$Pr(f_1^J \mid e_1^I) = \sum_{A} Pr(f_1^J, A \mid e_1^I)$$

Usually, the alignment is restricted in the sense that each source word is aligned to at most one target word, i.e. $A = a_1^J$. The alignment may contain the connection $a_j = 0$ to the 'empty' word $e_0$ to account for source sentence words that are not aligned to any target word at all. A detailed description of the popular translation/alignment models IBM-1 to IBM-5 (Brown et al., 1993), as well as the Hidden Markov alignment model (HMM) (Vogel et al., 1996), can be found in (Och and Ney, 2003). Model 6 is a loglinear combination of the IBM-4, IBM-1, and HMM alignment models.

A Viterbi alignment $\hat{A}$ of a specific model is an alignment for which the following equation holds:

$$\hat{A} = \operatorname*{argmax}_{A} Pr(f_1^J, A \mid e_1^I)$$

3 State Occupation Probabilities

The training of all alignment models is done using the EM algorithm. In the E-step, the counts for each sentence pair $(f_1^J, e_1^I)$ are calculated. Here, we present this calculation using the example of the HMM. For its lexicon parameters, the marginal probability of a target word $e_i$ occurring at target sentence position $i$ as the translation of the source word $f_j$ at source sentence position $j$ is estimated with the following sum:

$$p_j(i, f_1^J \mid e_1^I) = \sum_{a_1^J : \, a_j = i} Pr(f_1^J, a_1^J \mid e_1^I)$$

This value represents the likelihood of aligning $f_j$ to $e_i$ via every possible alignment $A = a_1^J$ that includes the alignment connection $a_j = i$.

By normalizing over the target sentence positions, we arrive at the state occupation probability:

$$p_j(i \mid f_1^J, e_1^I) = \frac{p_j(i, f_1^J \mid e_1^I)}{\sum_{i'=0}^{I} p_j(i', f_1^J \mid e_1^I)}$$

In the M-step of the EM training, the state occupation probabilities are aggregated for all words in the source and target vocabularies by taking the sum over all training sentence pairs. After proper renormalization, the lexicon probabilities $p(f \mid e)$ are determined.

Similarly, the training can be performed in the inverse (target-to-source) direction, yielding the state occupation probabilities $p_i(j \mid e_1^I, f_1^J)$.

The negated logarithms of the state occupation probabilities

$$-\log p_j(i \mid f_1^J, e_1^I) \qquad (1)$$

can be viewed as the costs of aligning the source word $f_j$ with the target word $e_i$. Thus, the word alignment task can be formulated as the task of finding a mapping between the source and the target words such that each source and each target position is covered and the total costs of the alignment are minimal.
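To make the E-step quantities concrete, here is a minimal sketch (not the paper's implementation) of how state occupation probabilities and the costs of Equation 1 can be computed for an IBM-1-style lexicon model, where the sum over all alignments factorizes and the posterior reduces to a normalization over target positions; the lexicon entries, function names, and the smoothing floor are illustrative assumptions.

```python
import numpy as np

def state_occupation_probs(t_table, src_words, tgt_words, floor=1e-10):
    """E-step posteriors p_j(i | f_1^J, e_1^I) for an IBM-1-style lexicon model.

    t_table   : dict mapping (f, e) -> t(f|e); unseen pairs receive a small floor.
    tgt_words : target sentence with the 'empty' word at position 0.
    Returns an (I+1) x J matrix whose columns sum to one.
    """
    I1, J = len(tgt_words), len(src_words)
    post = np.zeros((I1, J))
    for j, f in enumerate(src_words):
        scores = np.array([t_table.get((f, e), floor) for e in tgt_words])
        post[:, j] = scores / scores.sum()   # normalize over target positions i
    return post

def alignment_costs(post):
    """Local alignment costs c_ij = -log p_j(i | f_1^J, e_1^I)  (Equation 1)."""
    return -np.log(post)

# Toy example with a hypothetical lexicon table:
t_table = {("das", "the"): 0.7, ("haus", "house"): 0.8,
           ("das", "NULL"): 0.1, ("haus", "NULL"): 0.05}
post = state_occupation_probs(t_table, ["das", "haus"], ["NULL", "the", "house"])
costs = alignment_costs(post)
```

For the HMM or the fertility models, only the computation of the posteriors changes (Baum-Welch, or a sum over a set of promising alignments); the normalization and the negative-log costs are the same.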
Using state occupation probabilities for word alignment modeling results in a number of advantages. First of all, when these probabilities are calculated with the models IBM-1, IBM-2 and the HMM, the EM algorithm is exact, i.e. the summation over all alignments is performed efficiently in the E-step. For the HMM this is done using the Baum-Welch algorithm (Baum, 1972). So far, no efficient algorithm is known for computing the sum over all alignments in the fertility models IBM-3 to IBM-5. Therefore, this sum is approximated using a subset of promising alignments (Och and Ney, 2000). In both cases, the resulting estimates are more precise than the ones obtained by the maximum approximation, i.e. by considering only the Viterbi alignment.

Instead of using the state occupation probabilities from only one training direction as costs (Equation 1), we can interpolate the state occupation probabilities from the source-to-target and the target-to-source training for each pair (i, j) of positions in a sentence pair $(f_1^J, e_1^I)$. This improves the estimation of the local alignment costs. Having such symmetrized costs, we can employ graph alignment algorithms (cf. Section 4) to produce reliable alignment connections which include many-to-one and one-to-many alignment relationships. The presence of both relationship types characterizes a symmetric alignment that can potentially improve the translation results (Figure 1 shows an example of a symmetric alignment).

[Figure 1: Example of a symmetric alignment (Verbmobil task, spontaneous speech).]

Another important advantage is the efficiency of the graph algorithms used to determine the final symmetric alignment. They will be discussed in Section 4.

4 Alignment Algorithms

In this section, we describe the alignment extraction algorithms. We assume that for each sentence pair $(f_1^J, e_1^I)$ we are given a cost matrix $C$. (For notational convenience, we omit the dependency on the sentence pair $(f_1^J, e_1^I)$ in this section.) The elements of this matrix $c_{ij}$ are the local costs that result from aligning source word $f_j$ to target word $e_i$. For a given alignment $A \subseteq I \times J$, we define the costs of this alignment $c(A)$ as the sum of the local costs of all aligned word pairs:

$$c(A) = \sum_{(i,j) \in A} c_{ij} \qquad (2)$$

Now, our task is to find the alignment with the minimum costs. Obviously, the empty alignment always has a cost of zero and would be optimal. To avoid this, we introduce additional constraints. The first constraint is source sentence coverage: each source word has to be aligned to at least one target word or, alternatively, to the empty word. The second constraint is target sentence coverage: similarly, each target word has to be aligned to at least one source word or to the empty word.

Enforcing only the source sentence coverage, the minimum cost alignment is a mapping from source positions $j$ to target positions $a_j$, including zero for the empty word. Each target position $a_j$ can be computed as:

$$a_j = \operatorname*{argmin}_{i} \, c_{ij}$$

This means that in each column we choose the row with the minimum costs. This method resembles the common IBM models in the sense that the IBM models are also a mapping from source positions to target positions; therefore, it is comparable to the IBM models for the source-to-target direction. Similarly, if we enforce only the target sentence coverage, the minimum cost alignment is a mapping from target positions $i$ to source positions $b_i$; here, we have to choose in each row the column with the minimum costs. The complexity of these algorithms is in $O(I \cdot J)$.
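As an illustration of these one-sided alignments, the following NumPy sketch picks, per source position, the cheapest target position (with row 0 reserved for the empty word), and vice versa; the function names are ours, not the paper's.

```python
import numpy as np

def o_mwec_source_coverage(costs):
    """Enforce source coverage only: for each source position j (column),
    choose the target position a_j (row) with minimum cost.
    Row 0 of the cost matrix is assumed to represent the 'empty' word e_0."""
    return costs.argmin(axis=0)          # shape (J,), values in 0..I

def o_mwec_target_coverage(costs):
    """Enforce target coverage only: for each real target position i,
    choose the source position b_i (column) with minimum cost
    (row 0, the empty word, needs no coverage and is skipped)."""
    return costs[1:].argmin(axis=1)      # shape (I,), values in 0..J-1

# Both procedures take a single pass over the cost matrix, i.e. O(I * J).
```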
The algorithms for determining such a non-symmetric alignment are rather simple. A more interesting case arises if we enforce both constraints, i.e. each source word as well as each target word has to be aligned at least once. Even in this case, we can find the global optimum in polynomial time.

The task is to find a symmetric alignment $A$ for which the costs $c(A)$ are minimal (Equation 2). This task is equivalent to finding a minimum-weight edge cover (MWEC) in a complete bipartite graph. (An edge cover of a graph $G$ is a set of edges $E'$ such that each node of $G$ is incident to at least one edge in $E'$.) The two node sets of this bipartite graph correspond to the source sentence positions and the target sentence positions, respectively. The costs of an edge are the elements of the cost matrix $C$.

To solve the minimum-weight edge cover problem, we reduce it to the maximum-weight bipartite matching problem. As described in (Keijsper and Pendavingh, 1998), this reduction is linear in the graph size. For the maximum-weight bipartite matching problem, well-known algorithms exist, e.g. the Hungarian method, whose complexity is in $O((I + J) \cdot I \cdot J)$. We will call the solution of the minimum-weight edge cover problem with the Hungarian method "the MWEC algorithm". In contrast, we will refer to the algorithm enforcing either source sentence coverage or target sentence coverage as the one-sided minimum-weight edge cover algorithm (o-MWEC).

The cost matrix of a sentence pair $(f_1^J, e_1^I)$ can be computed as a weighted linear interpolation of various cost types $h_m$:

$$c_{ij} = \sum_{m} \lambda_m \, h_m(i, j)$$

In our experiments, we use the negated logarithms of the state occupation probabilities as described in Section 3. To obtain a more symmetric estimate of the costs, we interpolate both the source-to-target and the target-to-source direction (thus the state occupation probabilities are interpolated loglinearly). Because the alignments determined in the source-to-target training may substantially differ in quality from those produced in the target-to-source training, we use an interpolation weight $\alpha$:

$$c_{ij} = -\left( \alpha \cdot \log p_j(i \mid f_1^J, e_1^I) + (1 - \alpha) \cdot \log p_i(j \mid e_1^I, f_1^J) \right)$$

Additional feature functions can be included to compute $c_{ij}$; for example, one could make use of a bilingual word or phrase dictionary.
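To show how the pieces of this section fit together, here is a sketch that builds the symmetrized cost matrix by loglinear interpolation of the two training directions with weight $\alpha$ and then extracts a full edge cover. For simplicity it uses SciPy's linear_sum_assignment (a Hungarian-style solver) plus a greedy completion for positions left uncovered; this is a simplified stand-in for the exact reduction of the MWEC problem to maximum-weight bipartite matching described above, and all names are our own.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def symmetrized_costs(post_s2t, post_t2s, alpha=0.5):
    """c_ij = -(alpha * log p_j(i|f,e) + (1 - alpha) * log p_i(j|e,f)).

    post_s2t, post_t2s: I x J matrices of state occupation probabilities
    from the two training directions, both indexed by (i, j)."""
    return -(alpha * np.log(post_s2t) + (1.0 - alpha) * np.log(post_t2s))

def edge_cover(costs):
    """Symmetric alignment covering every source and every target position.

    A minimum-cost assignment covers all positions on the smaller side;
    every position still uncovered on the larger side is then linked to its
    cheapest counterpart.  This approximates the exact MWEC solution and is
    meant only as an illustration of the edge-cover idea."""
    I, J = costs.shape
    rows, cols = linear_sum_assignment(costs)      # min-cost bipartite matching
    links = set(zip(rows.tolist(), cols.tolist()))
    for i in set(range(I)) - set(rows.tolist()):   # uncovered target positions
        links.add((i, int(costs[i].argmin())))
    for j in set(range(J)) - set(cols.tolist()):   # uncovered source positions
        links.add((int(costs[:, j].argmin()), j))
    return sorted(links)
```

Setting $\alpha$ closer to 1 trusts the source-to-target direction more; the resulting links may contain both one-to-many and many-to-one connections, as required for a symmetric alignment.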
To apply the methods described in this section, we made two assumptions: first, the costs of an alignment can be computed as the sum of local costs; second, the features have to be static, in the sense that we have to fix the costs before aligning any word. Therefore, we cannot apply dynamic features such as the IBM-4 distortion model in a straightforward way.

One way to overcome these restrictions lies in using the state occupation probabilities; for IBM-4, for example, they contain the distortion model to some extent.