<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1022"> <Title>Minimum Bayes-Risk Decoding for Statistical Machine Translation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Translation Loss Functions </SectionTitle> <Paragraph position="0"> We now introduce translation loss functions to measure the quality of automatically generated translations. Suppose we have a sentence F in a source language for which we have generated an automatic translation E' with word-to-word alignment A' relative to F. The word-to-word alignment A' specifies the words in the source sentence F that are aligned to each word in the translation E'. We wish to compare this automatic translation with a reference translation E with word-to-word alignment A relative to F.</Paragraph> <Paragraph position="1"> We will now present a three-tier hierarchy of translation loss functions of the form L((E', A'), (E, A); F) that measure (E', A') against (E, A). These loss functions make use of different levels of information from word strings, MT alignments, and syntactic structure from parse-trees of both the source and target strings, as illustrated in the following table.</Paragraph> <Paragraph position="2"> We start with an example of two competing English translations for a Chinese sentence (in Pinyin without tones), with their word-to-word alignments, in Figure 1.</Paragraph> <Paragraph position="3"> The reference translation for the Chinese sentence with its word-to-word alignment is shown in Figure 2.
In this section, we will show the computation of the different loss functions for this example.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Lexical Loss Functions </SectionTitle> <Paragraph position="0"> The first class of loss functions uses no information about word alignments or parse-trees, so that L((E', A'), (E, A); F) can be reduced to L(E, E'). We consider three loss functions in this category: the BLEU score (Papineni et al., 2001), Word Error Rate, and Position-independent Word Error Rate (Och, 2002). Another example of a loss function in this class is the MT-eval metric introduced in Melamed et al. (2003). A loss function of this type depends only on information from word strings.</Paragraph> <Paragraph position="1"> The BLEU score (Papineni et al., 2001) computes the geometric mean of the precision of n-grams of various lengths (n in {1, ..., N}) between a hypothesis and a reference translation, and includes a brevity penalty (B(E, E') ≤ 1) if the hypothesis is shorter than the reference. We use N = 4.</Paragraph> <Paragraph position="3"> The score has the form BLEU(E, E') = B(E, E') exp{ (1/N) sum_{n=1}^{N} log p_n(E, E') }, where p_n(E, E') is the precision of n-grams in the hypothesis E'. The BLEU score is zero if any of the n-gram precisions p_n(E, E') is zero for that sentence pair. We note that 0 ≤ BLEU(E, E') ≤ 1. We derive a loss function from the BLEU score as L_BLEU(E, E') = 1 - BLEU(E, E').</Paragraph> <Paragraph position="5"> Word Error Rate (WER) is computed as the ratio of the string-edit distance between the reference and the hypothesis word strings to the number of words in the reference.
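As a concrete illustration of these lexical losses, here is a minimal Python sketch of sentence-level BLEU (without smoothing), WER, and PER. The function names and the bag-of-words shortcut used for PER are our own simplifications for exposition, not the scoring tools used in the paper.

```python
import math
from collections import Counter

def ngram_counts(words, n):
    """Multiset of n-grams occurring in a word sequence."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(ref, hyp, max_n=4):
    """Sentence-level BLEU: geometric mean of 1..max_n gram precisions
    times a brevity penalty; zero if any precision is zero."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngram_counts(hyp, n), ngram_counts(ref, n)
        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        if matches == 0:
            return 0.0
        log_prec += math.log(matches / sum(hyp_ngrams.values())) / max_n
    # brevity penalty B(E, E') applies only when the hypothesis is shorter
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1.0 - len(ref) / len(hyp))
    return bp * math.exp(log_prec)

def edit_distance(ref, hyp):
    """Levenshtein distance between two word sequences (single-row DP)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            prev, row[j] = row[j], min(row[j] + 1,       # deletion
                                       row[j - 1] + 1,   # insertion
                                       prev + (r != h))  # substitution
    return row[len(hyp)]

def wer(ref, hyp):
    """Word Error Rate: string-edit distance over reference length."""
    return edit_distance(ref, hyp) / len(ref)

def per(ref, hyp):
    """Position-independent WER via a bag-of-words formulation: the
    minimum edits needed to reach some permutation of the reference."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)
```

The corresponding loss values used in the paper's hierarchy would then be 1 - bleu(ref, hyp), wer(ref, hyp), and per(ref, hyp).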
String-edit distance is measured as the minimum number of edit operations needed to transform one word string into the other.</Paragraph> <Paragraph position="6"> Position-independent Word Error Rate (PER) measures the minimum number of edit operations needed to transform a word string into any permutation of the other word string. The PER score (Och, 2002) is then computed as the ratio of this distance to the number of words in the reference word string.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Target Language Parse-Tree Loss Functions </SectionTitle> <Paragraph position="0"> The second class of translation loss functions uses information only from the parse-trees of the two translations, so that L((E', A'), (E, A); F) reduces to L(T_E, T_E'), where T_E and T_E' are the parse-trees of the reference and the hypothesis. This loss function has no access to any information from the source sentence or the word alignments.</Paragraph> <Paragraph position="1"> Examples of such loss functions are tree-edit distances between parse-trees, string-edit distances between event representations of parse-trees (Tang et al., 2002), and tree kernels (Collins and Duffy, 2002). The computation of tree-edit distance involves an unconstrained alignment of the two English parse-trees. We can simplify this problem once we have a third parse-tree (for the Chinese sentence) with node-to-node alignment relative to the two English trees. We will introduce such a loss function in the next section.
We did not perform experiments involving this class of loss functions, but mention them for completeness in the hierarchy of loss functions.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Bilingual Parse-Tree Loss Functions </SectionTitle> <Paragraph position="0"> The third class of loss functions uses information from word strings, alignments, and parse-trees in both languages, and can be described by L((E', A'), (E, A); F) = L((T_E', A'), (T_E, A); T_F). We will now describe one such loss function using the example in Figures 1 and 2. Figure 3 shows a tree-to-tree mapping between the source (Chinese) parse-tree and the parse-trees of its reference translation and two competing hypothesis (English) translations.</Paragraph> <Paragraph position="2"> [Figure 3 word strings: "the first two months of this year guangdong 's high-tech products 3.76 billion US dollars"; the source sentence "jin-nian qian liangyue guangdong gao xinjishu chanpin chukou sanqidianliuyi meiyuan"; and "the first two months of this year guangdong exported high-tech products 3.76 billion US dollars".]</Paragraph> <Paragraph position="4"> Words in the Chinese (English) sentence shown as unaligned are aligned to the NULL word in the English (Chinese) sentence.</Paragraph> <Paragraph position="5"> We first assume that a node n in the source tree T_F can be mapped to a node v in T_E (and a node v' in T_E') using the word alignment A (and A' respectively). We denote the subtree of T_E rooted at node v by t_v, and the subtree of T_E' rooted at node v' by t_v'.</Paragraph> <Paragraph position="7"/> We will now describe a simple procedure that makes use of the word alignment A to construct a node-to-node alignment between nodes in the source tree T_F and the target tree T_E.
For each node n in the source tree T_F, we consider the subtree t_n rooted at n. We first read off the source word sequence corresponding to the leaves of t_n. We next consider the subset of words in the target sentence that are aligned to any word in this source word sequence, and select the leftmost and rightmost words from this subset. We locate the leaf nodes corresponding to these two words in the target parse-tree T_E, and obtain their closest common ancestor node v in T_E. This procedure gives us a mapping from a node n in T_F to a node v in T_E, and this mapping associates one subtree t_n in T_F to one subtree t_v in T_E.</Paragraph> <Paragraph position="9"/> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3.2 Loss Computation between Aligned Parse-Trees </SectionTitle> <Paragraph position="0"> Given the subtree alignments between T_F and T_E, and between T_F and T_E', we first identify the subset N_F of nodes in T_F for which we can identify a corresponding node in both T_E and T_E'.</Paragraph> <Paragraph position="2"> The Bilingual Parse-Tree (BiTree) Loss Function can then be computed as L_BiTree((E', A'), (E, A); F) = sum over n in N_F of d(t_n, t_n'), where t_n and t_n' are the subtrees of T_E and T_E' aligned to node n, and d is a distance measure between subtrees. Different loss functions can be obtained through particular choices of d.</Paragraph> <Paragraph position="4"> In our experiments, we used a 0/1 loss function between sub-trees t and t'.</Paragraph> <Paragraph position="6"> d(t, t') = 0 if t = t', and 1 otherwise. (2) We note that other tree-to-tree distance measures can also be used to compute d; e.g., the distance function could compare whether the subtrees t and t' have the same headword/non-terminal tag.</Paragraph> <Paragraph position="7"> The Bitree loss function measures the distance between two trees in terms of distances between their corresponding subtrees. In this way, we replace the string-to-string (Levenshtein) alignments (for WER) or n-gram matches (for BLEU/PER) with subtree-to-subtree alignments.
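The alignment procedure and the 0/1 Bitree loss just described can be sketched in Python as follows. The tree class, the dict-based word alignment (source position to list of target positions), and the use of (position, word) pairs at the leaves are our own illustrative choices, not the authors' implementation.

```python
class Node:
    """Parse-tree node; leaf labels are (position, word) pairs."""
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

    def leaves(self):
        if not self.children:
            return [self.label]
        return [leaf for c in self.children for leaf in c.leaves()]

    def nodes(self):
        yield self
        for c in self.children:
            yield from c.nodes()

def words(tree):
    return [w for _, w in tree.leaves()]

def cover(tree, lo, hi):
    """Closest node in 'tree' whose leaf span covers positions lo..hi."""
    for child in tree.children:
        pos = [p for p, _ in child.leaves()]
        if lo >= pos[0] and pos[-1] >= hi:
            return cover(child, lo, hi)
    return tree

def project(src_node, align, tgt_tree):
    """Map a source subtree to a target subtree: take the leftmost and
    rightmost target words aligned to its yield and return their closest
    common ancestor (None if no word is aligned)."""
    tgt = sorted(p for s, _ in src_node.leaves() for p in align.get(s, ()))
    return cover(tgt_tree, tgt[0], tgt[-1]) if tgt else None

def d01(t, u):
    """0/1 subtree distance: 0 for identical label and word yield, else 1."""
    return 0 if t.label == u.label and words(t) == words(u) else 1

def bitree_loss(src, align_ref, ref, align_hyp, hyp):
    """Sum of d01 over source nodes that project into both target trees;
    the Bitree Error Rate is loss / pairs."""
    loss = pairs = 0
    for n in src.nodes():
        t_ref, t_hyp = project(n, align_ref, ref), project(n, align_hyp, hyp)
        if t_ref is not None and t_hyp is not None:
            loss += d01(t_ref, t_hyp)
            pairs += 1
    return loss, pairs
```

A hypothesis with the same words but a different parse thus incurs loss at every source node whose projected subtrees disagree, which is exactly the behavior the Bitree loss is designed to capture.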
The Bitree Error Rate (in %) is computed as the ratio of the Bitree Loss function to the number of nodes in the set N_F.</Paragraph> <Paragraph position="8"> The complete node-to-node alignment between the parse-tree of the source (Chinese) sentence and the parse-trees of its reference translation and the two hypothesis translations (English) is given in Table 1. Each row in this table shows the alignment between a node in the Chinese parse-tree and nodes in the reference and the two hypothesis parse-trees. The computation of the Bitree Loss function and the Bitree Error Rate is presented in the last two rows of the table.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Comparison of Loss Functions </SectionTitle> <Paragraph position="0"> In Table 2 we compare the various translation loss functions for the example from Figure 1. The two hypothesis translations are very similar at the word level, and the BLEU score, PER, and WER are therefore identical. However, we observe that the sentences differ substantially in their syntactic structure (as seen from the parse-trees in Figure 3), and to a lesser extent in their word-to-word alignments (Figure 1) to the source sentence. The first hypothesis translation is parsed as a sentence while the second translation is parsed as a noun phrase. The Bitree loss function, which depends on both the parse-trees and the word-to-word alignments, is therefore very different for the two translations (Table 2).
While string-based metrics such as BLEU, WER, and PER are insensitive to the syntactic structure of the translations, the Bitree Loss is able to measure this aspect of translation quality and assigns different scores to the two translations.</Paragraph> <Paragraph position="1"> We provide this example to show how a loss function that makes use of syntactic structure from source and target parse-trees can capture properties of translations that string-based loss functions are unable to measure.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Minimum Bayes-Risk Decoding </SectionTitle> <Paragraph position="0"> Statistical Machine Translation (Brown et al., 1990) can be formulated as a mapping of a word sequence F in a source language to a word sequence E' in the target language that has a word-to-word alignment A' relative to F. Given the source sentence F, the MT decoder δ(F) produces a target word string E' with word-to-word alignment A'. Relative to a reference translation E with word alignment A, the decoder performance is measured as L((E, A), δ(F)). Our goal is to find the decoder that has the best performance over all translations. This is measured through the Bayes-Risk R(δ) = E_{P(E, A, F)}[ L((E, A), δ(F)) ].</Paragraph> <Paragraph position="2"> The expectation is taken under the true distribution P(E, A, F) that describes translations of human quality. Given a loss function and a distribution, it is well known that the decision rule that minimizes the Bayes-Risk is given by (Bickel and Doksum, 1977; Goel and Byrne, 2000): δ(F) = argmin_{(E', A')} sum_{(E, A)} L((E, A), (E', A')) P(E, A | F). (3)</Paragraph> <Paragraph position="4"/> We shall refer to the decoder given by this equation as the Minimum Bayes-Risk (MBR) decoder.
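Restricted to a fixed list of candidate translations scored by a model, the decision rule of Equation 3 can be sketched as below. The list-of-(hypothesis, score) interface and the zero-one example loss are illustrative assumptions, not the paper's actual system.

```python
def mbr_decode(nbest, loss):
    """Minimum Bayes-Risk selection over an N-best list: Equation 3
    with the sum over translations restricted to the list.

    nbest: list of (hypothesis, model_score) pairs; scores are
           unnormalized probabilities and are renormalized here.
    loss:  loss(reference_like, candidate) returning a float.
    """
    z = sum(score for _, score in nbest)
    posteriors = [score / z for _, score in nbest]

    def expected_loss(candidate):
        # average the loss of 'candidate' against every hypothesis in
        # the list, weighted by that hypothesis's posterior probability
        return sum(p * loss(ref, candidate)
                   for (ref, _), p in zip(nbest, posteriors))

    best, _ = min(nbest, key=lambda pair: expected_loss(pair[0]))
    return best

# Under a 0/1 loss, MBR reduces to the MAP decoder (highest posterior).
def zero_one(ref, cand):
    return 0.0 if ref == cand else 1.0
```

Plugging in the lexical or Bitree losses of Section 2 for loss yields the consensus rescoring procedure: the expected loss of each candidate is computed against all likely translations, and the candidate with minimum expected loss is selected.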
The MBR decoder can be thought of as selecting a consensus translation: for each sentence F, Equation 3 selects the translation that is closest on average to all of the likely translations and alignments. The closeness is measured under the loss function of interest.</Paragraph> <Paragraph position="5"> This optimal decoder has the difficulties of search (minimization) and of computing the expectation under the true distribution. In practice, we will consider the space of translations to be an N-best list of translation alternatives generated under a baseline translation model. Of course, we do not have access to the true distribution over translations. We therefore use statistical translation models (Och, 2002) to approximate the distribution P(E, A | F) over the N-best list. This is a rescoring procedure that searches for consensus under a given loss function. The posterior probability of each hypothesis in the N-best list is derived from the joint probability assigned by the baseline translation model.</Paragraph> <Paragraph position="6"> The conventional Maximum A Posteriori (MAP) decoder can be derived as a special case of the MBR decoder by considering a loss function that assigns an equal cost (say 1) to all misclassifications. Under the 0/1 loss function,</Paragraph> <Paragraph position="8"> L((E, A), (E', A')) = 0 if (E', A') = (E, A), and 1 otherwise, the decoder of Equation 3 reduces to the MAP decoder</Paragraph> <Paragraph position="10"> δ_MAP(F) = argmax_{(E', A')} P(E', A' | F). This illustrates why we are interested in MBR decoders based on other loss functions: the MAP decoder is optimal with respect to a loss function that is very harsh. It does not distinguish between different types of translation errors, and good translations receive the same penalty as poor translations.</Paragraph> </Section> </Paper>