Distortion Models For Statistical Machine Translation

3 Measuring Word Order Similarity Between Two Languages

In this section, we propose a simple, novel method for measuring word order similarity (or difference) between any given language pair. The method is based on word alignments and the BLEU metric.

We assume that we have word alignments for a set of sentence pairs. We first reorder the words in the target sentence (e.g., English when translating from Arabic to English) according to the order in which they are aligned to the source words, as shown in Table 1. If a target word is not aligned, we assume that it is aligned to the same source word as the preceding aligned target word.

Once the reordered target (here English) sentences are generated, we measure the distortion between the language pair by computing the BLEU score between the original target and the reordered target, treating the original target as the reference. (The BLEU scores reported throughout this paper are case-sensitive. The number of references used is also reported; e.g., BLEUr1n4c means 1 reference, up to 4-grams considered, case-sensitive.)

Table 2 shows these scores for Arabic-English and Chinese-English. The word alignments we use were annotated manually by human annotators for both language pairs. The Arabic-English test set is the NIST MT Evaluation 2003 test set. It contains 663 segments (i.e., sentences); the Arabic side consists of 16,652 tokens and the English side of 19,908 tokens. The Chinese-English test set contains 260 segments; the Chinese side is word-segmented and consists of 4,319 tokens, and the English side of 5,525 tokens.

As the BLEU scores in Table 2 suggest, Arabic-English has more word order differences than Chinese-English. The difference in n-gPrec (the n-gram precision as defined in BLEU) is larger for smaller values of n, which suggests that Arabic-English has more local word order differences than Chinese-English.
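The reordering step is mechanical enough to sketch in code. Below is a minimal Python sketch of the measurement, assuming 0-based (source_pos, target_pos) alignment links with at most one link per target word; the function name `reorder_target` and the stable tie-breaking by original target order are our own choices, not from the paper. BLEU between the original targets (as references) and the reordered targets can then be computed with any standard BLEU implementation (the paper uses case-sensitive BLEUr1n4c).

```python
from typing import List, Tuple

def reorder_target(target: List[str], links: List[Tuple[int, int]]) -> List[str]:
    """Reorder target words into source-word order (Section 3).

    links: (source_pos, target_pos) pairs, 0-based, at most one link
    per target word. An unaligned target word is treated as aligned
    to the same source word as the preceding aligned target word.
    """
    t2s = {t: s for s, t in links}

    # Assign every target position a source position, filling
    # unaligned words from the preceding aligned word.
    src_pos, prev = [], 0
    for j in range(len(target)):
        prev = t2s.get(j, prev)
        src_pos.append(prev)

    # Stable sort keeps the original order of target words that are
    # aligned to the same source position.
    order = sorted(range(len(target)), key=lambda j: src_pos[j])
    return [target[j] for j in order]

# Toy example: target word 0 aligned to source position 2, target
# word 1 to source position 0 -> reordered as ['e1', 'e0'].
print(reorder_target(["e0", "e1"], [(2, 0), (0, 1)]))
```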
4 Proposed Distortion Model

The distortion model we propose consists of three components: outbound, inbound, and pair distortion.

Intuitively, our distortion models attempt to capture the order in which source words need to be translated. For instance, the outbound distortion component attempts to capture what is typically translated immediately after the word that has just been translated: do we tend to translate words that precede it or succeed it, and which word position should be translated next? Our distortion parameters are estimated directly from word alignments by simple counting over alignment links in the training data. Any aligner, such as those of (Al-Onaizan et al., 1999) or (Vogel et al., 1996), can be used to obtain word alignments. For the results reported in this paper, word alignments were obtained using the maximum-posterior word aligner described in (Ge, 2004).

We will illustrate the components of our model with a partial word alignment. Let us assume that our source sentence is (f10, f250, f300), our target sentence is (e410, e20), and their word alignment is a = ((f10, e410), (f300, e20)). (In addition to the regular words in the source and target sentences, we also assume that the start symbols in the source and target are aligned, and similarly for the end symbols; these special symbols are omitted in our example for ease of presentation.) Word alignment a can be rewritten as a_1 = 1 and a_2 = 3 (i.e., the second target word is aligned to the third source word). From this partial alignment we increase the counts for the following outbound, inbound, and pair distortions: P_o(d = +2 | f10), P_i(d = +2 | f300), and P_p(d = +2 | f10, f300).

Formally, our distortion model components are defined as follows:

$$P_o(d \mid f_i) = \frac{C_o(d \mid f_i)}{\sum_{d'} C_o(d' \mid f_i)}, \quad
P_i(d \mid f_i) = \frac{C_i(d \mid f_i)}{\sum_{d'} C_i(d' \mid f_i)}, \quad
P_p(d \mid f_i, f_j) = \frac{C_p(d \mid f_i, f_j)}{\sum_{d'} C_p(d' \mid f_i, f_j)}$$

where f_i is a foreign word (i.e., Arabic in our case), d is the step size, and C(d | f_i) is the observed count of this parameter over all word alignments in the training data. The value of d, in theory, ranges from -max to +max (where max is the maximum source sentence length observed), but in practice only a small number of these step sizes are observed in the training data and hence have non-zero values.

In order to use these probability distributions in our decoder, they are turned into costs. The outbound distortion cost is defined as:

$$C_o(d \mid f_i) = -\log\big(\alpha \, P_o(d \mid f_i) + (1 - \alpha) \, P_s(d)\big)$$

where P_s(d) is a smoothing distribution and α is a linear-mixture parameter that is set empirically. The inbound and pair costs (C_i(d | f_i) and C_p(d | f_i, f_j)) can be defined in a similar fashion.

So far, our distortion cost is defined in terms of words, not phrases. Therefore, we need to generalize the distortion cost in order to use it in a phrase-based decoder. This generalization is defined in terms of the internal word alignment within phrases (we used the Viterbi word alignment). We illustrate this with an example. Suppose the last position translated in the source sentence so far is n and we are to cover a source phrase p = "wlAyp wA$nTn" that begins at position m in the source sentence. Also, suppose that our phrase dictionary provided the translation "Washington State", with internal word alignment a = (a_1 = 2, a_2 = 1) (i.e., a = (<Washington, wA$nTn>, <State, wlAyp>)). Then the outbound phrase cost is defined as:

$$C_o(p, n, m, a) = C_o\big(d = m + a_1 - 1 - n \mid f_n\big) + \sum_{i=2}^{l} C_o\big(d = a_i - a_{i-1} \mid f_{a_{i-1}}\big)$$

where l is the length of the target phrase, a is the internal word alignment, f_n is the source word at position n (in the sentence), and f_{a_i} is the source word aligned to the i-th word on the target side of the phrase (not the sentence).

The inbound and pair distortion costs (i.e., C_i(p, n, m, a) and C_p(p, n, m, a)) can be defined in a similar fashion.

The above distortion costs are used in conjunction with the other cost components used in our decoder. The ultimate word order choice is influenced by both the language model cost and the distortion cost.
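Since the parameters are estimated by simple counting over alignment links, the estimation fits in a few lines. The Python sketch below is ours, not the paper's code; the corpus format (pairs of source tokens and 0-based alignment links, at most one link per target word) and the function name `estimate_distortion` are assumptions. It counts the step size between the source positions of consecutively translated target words and normalizes the counts into the relative-frequency distributions P_o, P_i, and P_p defined above.

```python
from collections import defaultdict

def estimate_distortion(corpus):
    """Count outbound/inbound/pair distortion steps over alignment
    links (Section 4) and normalize into relative frequencies.

    corpus: iterable of (source_tokens, links) pairs, where links is
    a list of (source_pos, target_pos), 0-based.
    """
    counts = {
        "outbound": defaultdict(lambda: defaultdict(int)),
        "inbound": defaultdict(lambda: defaultdict(int)),
        "pair": defaultdict(lambda: defaultdict(int)),
    }

    for source, links in corpus:
        # Visit links in target order: the order in which source
        # words actually get translated.
        ordered = sorted(links, key=lambda link: link[1])
        for (s_prev, _), (s_next, _) in zip(ordered, ordered[1:]):
            d = s_next - s_prev  # step size in the source sentence
            counts["outbound"][source[s_prev]][d] += 1
            counts["inbound"][source[s_next]][d] += 1
            counts["pair"][source[s_prev], source[s_next]][d] += 1

    def normalize(table):
        # P(d | context) = C(d | context) / sum_d' C(d' | context)
        return {ctx: {d: c / sum(cs.values()) for d, c in cs.items()}
                for ctx, cs in table.items()}

    return {name: normalize(t) for name, t in counts.items()}

# The paper's example: source (f10, f250, f300), target (e410, e20),
# a_1 = 1 and a_2 = 3 yield a +2 step out of f10 and into f300.
model = estimate_distortion([(["f10", "f250", "f300"], [(0, 0), (2, 1)])])
print(model["outbound"]["f10"])        # {2: 1.0}
print(model["pair"]["f10", "f300"])    # {2: 1.0}
```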
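The conversion to costs and the phrase-level generalization can be sketched the same way. This is a sketch under our reading of the reconstructed equations above: the smoothing distribution P_s is caller-supplied (its exact form is not given in this excerpt), source positions are 1-based to match the paper's notation, and all names are hypothetical.

```python
import math
from functools import partial

def outbound_word_cost(P_out, P_s, alpha, d, f):
    """Word-level cost C_o(d|f) = -log(alpha*P_o(d|f) + (1-alpha)*P_s(d)).
    P_s is a smoothing distribution; alpha is set empirically."""
    p = P_out.get(f, {}).get(d, 0.0)
    return -math.log(alpha * p + (1.0 - alpha) * P_s(d))

def outbound_phrase_cost(word_cost, source, n, m, a):
    """Phrase-level outbound cost from the phrase-internal alignment.

    source: source sentence tokens; positions are 1-based as in the
    paper, so source[k - 1] is the word at position k.
    n: last translated source position; m: start of the phrase.
    a: a[i] is the 1-based position *within the phrase* of the source
    word aligned to the (i+1)-th target word.
    word_cost(d, f): the word-level cost above, partially applied.
    """
    # Jump from position n to the source word aligned to the first
    # target word of the phrase (sentence position m + a[0] - 1).
    total = word_cost(m + a[0] - 1 - n, source[n - 1])
    # Then step through the remaining target words left to right,
    # conditioning each step on the previously covered source word.
    for i in range(1, len(a)):
        f_prev = source[m + a[i - 1] - 2]
        total += word_cost(a[i] - a[i - 1], f_prev)
    return total

# Usage on the paper's example: phrase "wlAyp wA$nTn" at m = 2,
# translated as "Washington State" with a = (2, 1), after n = 1.
# The uniform P_s stand-in below is illustrative only.
P_out = {"f10": {2: 1.0}}
wc = partial(outbound_word_cost, P_out, lambda d: 1e-4, 0.9)
print(outbound_phrase_cost(wc, ["f10", "wlAyp", "wA$nTn"], n=1, m=2, a=[2, 1]))
```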