<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1607"> <Title>Phrasetable Smoothing for Statistical Machine Translation</Title> <Section position="4" start_page="53" end_page="53" type="metho"> <SectionTitle> 2 Phrase-based Statistical MT </SectionTitle> <Paragraph position="0"> Given a source sentence s, our phrase-based SMT system tries to find the target sentence t̂ that is the most likely translation of s. To make search more efficient, we use the Viterbi approximation and seek the most likely combination of t and its alignment a with s, rather than just the most likely t:</Paragraph> <Paragraph position="1"> (t̂, â) = argmax_{t,a} p(t, a|s), (1)</Paragraph> <Paragraph position="2"> where a = (s̃1, t̃1, j1), ..., (s̃K, t̃K, jK); t̃k are target phrases such that t = t̃1 ... t̃K; s̃k are source phrases such that s = s̃j1 ... s̃jK; and s̃k is the translation of the kth target phrase t̃k.</Paragraph> <Paragraph position="3"> ferent phrasetables are used in parallel, and when rules are used to translate certain classes of entities. To model p(t, a|s), we use a standard loglinear approach:</Paragraph> <Paragraph position="4"> p(t, a|s) ∝ exp( Σi λi fi(s, t, a) ),</Paragraph> <Paragraph position="5"> where each fi(s, t, a) is a feature function, and the weights λi are set using Och's algorithm (Och, 2003) to maximize the system's BLEU score (Papineni et al., 2001) on a development corpus. The features used in this study are: the length of t; a single-parameter distortion penalty on phrase reordering in a, as described in (Koehn et al., 2003); phrase translation model probabilities; and trigram language model probabilities log p(t), using Kneser-Ney smoothing as implemented in the SRILM toolkit (Stolcke, 2002).</Paragraph> <Paragraph position="6"> Phrase translation model probabilities are features of the form:</Paragraph> <Paragraph position="7"> log p(s|t, a) = Σ_{k=1..K} log p(s̃k|t̃k),</Paragraph> <Paragraph position="8"> i.e., we assume that the phrases s̃k specified by a are conditionally independent, and depend only on their aligned phrases t̃k. The &quot;forward&quot; phrase probabilities p(t̃|s̃) are not used as features, but only as a filter on the set of possible translations: for each source phrase s̃ that matches some ngram in s, only the 30 top-ranked translations t̃ according to p(t̃|s̃) are retained.</Paragraph> <Paragraph position="9"> To derive the joint counts c(s̃, t̃) from which p(s̃|t̃) and p(t̃|s̃) are estimated, we use the phrase induction algorithm described in (Koehn et al., 2003), with symmetrized word alignments generated using IBM model 2 (Brown et al., 1993).</Paragraph> </Section> <Section position="5" start_page="53" end_page="56" type="metho"> <SectionTitle> 3 Smoothing Techniques </SectionTitle> <Paragraph position="0"> Smoothing involves some recipe for modifying conditional distributions away from pure relative-frequency estimates made from joint counts, in order to compensate for data sparsity. In the spirit of (Hastie et al., 2001, figure 2.11, pg. 38), smoothing can be seen as a way of combining the relative-frequency estimate, which is a model with high complexity, high variance, and low bias, with another model of lower complexity, lower variance, and high bias, in the hope of obtaining better performance on new data. There are two main ingredients in all such recipes: some probability distribution that is smoother than relative frequencies (i.e., one that has fewer parameters and is thus less complex), and some technique for combining that distribution with relative-frequency estimates. We will now discuss both of these choices: the distribution for carrying out smoothing and the combination technique. In this discussion, we use p̃(·) to denote relative-frequency distributions.</Paragraph>
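To make the baseline concrete before turning to the smoothing schemes themselves, here is a minimal sketch (in Python; the function and variable names are illustrative and not taken from the system described here) of the unsmoothed relative-frequency estimate p̃(s̃|t̃) = c(s̃, t̃) / Σ_s̃' c(s̃', t̃) computed from joint phrase counts.

from collections import defaultdict

def relative_freq(joint_counts):
    """Unsmoothed estimate p~(s|t) = c(s,t) / sum over s' of c(s',t).

    joint_counts maps (source_phrase, target_phrase) pairs to counts;
    the result maps the same pairs to conditional probabilities.
    """
    totals = defaultdict(float)  # c(t): total count of each target phrase
    for (s, t), c in joint_counts.items():
        totals[t] += c
    return {(s, t): c / totals[t] for (s, t), c in joint_counts.items()}

# Toy usage: two source phrases observed with the same target phrase.
counts = {("chat noir", "black cat"): 3, ("chat", "black cat"): 1}
print(relative_freq(counts))  # {('chat noir', 'black cat'): 0.75, ('chat', 'black cat'): 0.25}

All of the schemes below can be read as recipes for moving these estimates away from the raw ratios.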
<Section position="1" start_page="54" end_page="55" type="sub_section"> <SectionTitle> Choice of Smoothing Distribution </SectionTitle> <Paragraph position="0"> One can distinguish between two approaches to smoothing phrasetables. Black-box techniques do not look inside phrases but instead treat them as atomic objects: that is, both the s̃ and the t̃ in the expression p(s̃|t̃) are treated as units about which nothing is known except their counts. In contrast, glass-box methods break phrases down into their component words.</Paragraph> <Paragraph position="1"> The black-box approach, which is the simpler of the two, has received little attention in the SMT literature. An interesting aspect of this approach is that it allows one to implement phrasetable smoothing techniques that are analogous to LM smoothing techniques, by treating the problem of estimating p(s̃|t̃) as if it were the problem of estimating a bigram conditional probability. In this paper, we give experimental results for phrasetable smoothing techniques analogous to Good-Turing, Fixed-Discount, Kneser-Ney, and Modified Kneser-Ney LM smoothing.</Paragraph> <Paragraph position="2"> Glass-box methods for phrasetable smoothing have been described by other authors: see section 3.3. These authors decompose p(s̃|t̃) into a set of lexical distributions p(s|t̃) by making independence assumptions about the words s in s̃. The other possibility, which is similar in spirit to ngram LM lower-order estimates, is to combine estimates made by replacing words in t̃ with wildcards, as proposed in section 3.4.</Paragraph> <SectionTitle> Choice of Combination Technique </SectionTitle> <Paragraph position="3"> Although we explored a variety of black-box and glass-box smoothing distributions, we only tried two combination techniques: linear interpolation, which we used for black-box smoothing, and loglinear interpolation, which we used for glass-box smoothing.</Paragraph> <Paragraph position="4"> For black-box smoothing, we could have used a backoff scheme or an interpolation scheme. Backoff schemes have the form:</Paragraph> <Paragraph position="5"> p(s̃|t̃) = ph(s̃|t̃) if c(s̃, t̃) > τ, and pb(s̃|t̃) otherwise,</Paragraph> <Paragraph position="6"> where ph(s̃|t̃) is a higher-order distribution, pb(s̃|t̃) is a smooth backoff distribution, and τ is a threshold above which counts are considered reliable. Typically, τ = 1 and ph(s̃|t̃) is a version of p̃(s̃|t̃) modified to reserve some probability mass for unseen events.</Paragraph> <Paragraph position="7"> Interpolation schemes have the general form:</Paragraph> <Paragraph position="8"> p(s̃|t̃) = α p̃(s̃|t̃) + β pb(s̃|t̃),</Paragraph> <Paragraph position="9"> where α and β are combining coefficients. As noted in (Chen and Goodman, 1998), a key difference between interpolation and backoff is that the former approach uses information from the smoothing distribution to modify p̃(s̃|t̃) for higher-frequency events, whereas the latter uses it only for low-frequency events (most often zero-frequency events). Since better prediction of unseen (zero-count) events has no direct impact on phrasetable smoothing (only seen events are represented in the phrasetable, and thus hypothesized during decoding), interpolation seemed a more suitable approach.</Paragraph>
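To make the contrast concrete, the sketch below (Python; the coefficient values and helper names are hypothetical, chosen only for illustration) shows how the two schemes treat a single seen phrase pair: backoff leaves estimates for reliably counted pairs untouched, whereas interpolation mixes every estimate with the smoothing distribution.

def backoff_estimate(count, rel_freq, smooth, threshold=1):
    """Backoff: use the higher-order estimate when the joint count exceeds the
    threshold, and fall back to the smoothing distribution otherwise."""
    return rel_freq if count > threshold else smooth

def interpolated_estimate(rel_freq, smooth, alpha=0.5, beta=0.5):
    """Linear interpolation: combine the relative-frequency estimate with the
    smoothing distribution for all events, not just low-count ones."""
    return alpha * rel_freq + beta * smooth

# A pair seen 5 times: backoff keeps its relative frequency unchanged,
# while interpolation pulls it toward the smoothing distribution.
print(backoff_estimate(5, rel_freq=0.75, smooth=0.25))    # 0.75
print(interpolated_estimate(rel_freq=0.75, smooth=0.25))  # 0.5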
<Paragraph position="10"> For combining relative-frequency estimates with glass-box smoothing distributions, we employed loglinear interpolation. This is the traditional approach for glass-box smoothing (Koehn et al., 2003; Zens and Ney, 2004). To illustrate the difference between linear and loglinear interpolation, consider combining two Bernoulli distributions p1(x) and p2(x) using each method:</Paragraph> <Paragraph position="11"> plin(x) = α p1(x) + (1 - α) p2(x), and ploglin(x) ∝ p1(x)^α p2(x)^β.</Paragraph> <Paragraph position="12"> Setting p2(x) = 0.5 to simulate uniform smoothing gives ploglin(x) = p1(x)^α / (p1(x)^α + q1(x)^α), where q1(x) = 1 - p1(x). This is actually less smooth than the original distribution p1(x): it preserves the extreme values 0 and 1, and makes intermediate values more extreme. On the other hand,</Paragraph> <Paragraph position="13"> plin(x) = α p1(x) + (1 - α)/2</Paragraph> <Paragraph position="14"> has the opposite properties: it moderates extreme values and tends to preserve intermediate values.</Paragraph> <Paragraph position="15"> An advantage of loglinear interpolation is that we can tune the loglinear weights so as to maximize the true objective function, for instance BLEU; recall that our translation model is itself loglinear, with weights set to minimize errors. In fact, a limitation of the experiments described in this paper is that the loglinear weights for the glass-box techniques were optimized for BLEU using Och's algorithm (Och, 2003), while the linear weights for the black-box techniques were set heuristically. Obviously, this gives the glass-box techniques an advantage when the different smoothing techniques are compared using BLEU! Implementing an algorithm for optimizing linear weights according to BLEU is high on our list of priorities.</Paragraph> <Paragraph position="16"> The preceding discussion implicitly assumes a single set of counts c(s̃, t̃) from which conditional distributions are derived. But, as phrases of different lengths are likely to have different statistical properties, it might be worthwhile to break down the global phrasetable into separate phrasetables for each value of |t̃| for the purposes of smoothing. Any such strategy that does not split up the set {s̃ : c(s̃, t̃) > 0} for any fixed t̃ can be applied to any smoothing scheme. This is another idea we are eager to try soon.</Paragraph> <Paragraph position="17"> We now describe the individual smoothing schemes we have implemented. Four of them are black-box techniques: Good-Turing and three fixed-discount techniques (fixed-discount interpolated with a unigram distribution, Kneser-Ney fixed-discount, and modified Kneser-Ney fixed-discount). Two of them are glass-box techniques: Zens-Ney &quot;noisy-or&quot; and Koehn-Och-Marcu IBM smoothing. Our experiments tested not only these individual schemes, but also some loglinear combinations of a black-box technique with a glass-box technique.</Paragraph> </Section> <Section position="2" start_page="55" end_page="55" type="sub_section"> <SectionTitle> 3.1 Good-Turing </SectionTitle> <Paragraph position="0"> Good-Turing smoothing is a well-known technique (Church and Gale, 1991) in which observed counts c are modified according to the formula:</Paragraph> <Paragraph position="1"> cg = (c + 1) n_{c+1} / n_c, (2)</Paragraph> <Paragraph position="2"> where cg is a modified count value used to replace c in subsequent relative-frequency estimates, and nc is the number of events having count c. An intuitive motivation for this formula is that it approximates relative-frequency estimates made by successively leaving out each event in the corpus, and then averaging the results (Nádas, 1985).</Paragraph> <Paragraph position="3"> A practical difficulty in implementing Good-Turing smoothing is that the nc are noisy for large c. For instance, there may be only one phrase pair that occurs exactly c = 347,623 times in a large corpus, and no pair that occurs c = 347,624 times, leading to cg(347,623) = 0, clearly not what is intended.</Paragraph>
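As a toy illustration of equation (2) (Python; a deliberately naive version, not the implementation used in this work), the sketch below applies the raw Good-Turing adjustment to a handful of joint counts; note how a missing count-of-counts value immediately yields a zero adjusted count, which is exactly the noisy-nc problem just described.

from collections import Counter

def good_turing_counts(joint_counts):
    """Raw Good-Turing: replace each observed count c by cg = (c + 1) * n_{c+1} / n_c,
    where n_c is the number of phrase pairs observed exactly c times."""
    n = Counter(joint_counts.values())       # count-of-counts n_c
    return {pair: (c + 1) * n[c + 1] / n[c]  # n[c + 1] is 0 when no pair has count c + 1
            for pair, c in joint_counts.items()}

# Toy counts: three pairs seen once, two seen twice, one seen three times.
counts = {("a", "x"): 1, ("b", "x"): 1, ("c", "y"): 1,
          ("d", "y"): 2, ("e", "z"): 2, ("f", "z"): 3}
print(good_turing_counts(counts))
# c = 1 -> 2 * n2/n1 = 4/3; c = 2 -> 3 * n3/n2 = 1.5; c = 3 -> 4 * n4/n3 = 0,
# since no pair occurs 4 times: the highest count is wiped out.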
<Paragraph position="4"> Our solution to this problem is based on the technique described in (Church and Gale, 1991). We first take the log of the observed (c, nc) values, and then use a linear least-squares fit to log nc as a function of log c. To ensure that the result stays close to the reliable values of nc for large c, the error terms are weighted by c, i.e., c (log nc - log n′c)^2, where n′c are the fitted values. Our implementation pools all counts c(s̃, t̃) together to obtain the n′c (we have not yet tried separate counts based on the length of t̃, as discussed above). It follows directly from (2) that the total count mass assigned to unseen phrase pairs is cg(0) n0 = n1, which we approximate by n′1. This mass is distributed among contexts t̃ in proportion to c(t̃), giving final estimates:</Paragraph> <Paragraph position="5"/> </Section> <Section position="3" start_page="55" end_page="56" type="sub_section"> <SectionTitle> 3.2 Fixed-Discount Methods </SectionTitle> <Paragraph position="0"> Fixed-discount methods subtract a fixed discount D from all non-zero counts, and distribute the resulting probability mass according to a smoothing distribution (Kneser and Ney, 1995). We use an interpolated version of fixed-discount proposed by (Chen and Goodman, 1998) rather than the original backoff version. For phrase pairs with non-zero counts, this distribution has the general form:</Paragraph> <Paragraph position="1"> p(s̃|t̃) = (c(s̃, t̃) - D) / Σ_{s̃'} c(s̃', t̃) + α(t̃) pb(s̃|t̃),</Paragraph> <Paragraph position="2"> where pb(s̃|t̃) is the smoothing distribution. Normalization constraints fix the value of α(t̃):</Paragraph> <Paragraph position="3"> α(t̃) = D n1+(·, t̃) / Σ_{s̃'} c(s̃', t̃),</Paragraph> <Paragraph position="4"> where n1+(·, t̃) is the number of phrases s̃ for which c(s̃, t̃) > 0.</Paragraph> <Paragraph position="5"> We experimented with two choices for the smoothing distribution pb(s̃|t̃). The first is a plain unigram p(s̃), and the second is the Kneser-Ney lower-order distribution:</Paragraph> <Paragraph position="6"> pb(s̃) = n1+(s̃, ·) / Σ_{s̃'} n1+(s̃', ·),</Paragraph> <Paragraph position="7"> i.e., the proportion of unique target phrases that s̃ is associated with, where n1+(s̃, ·) is defined analogously to n1+(·, t̃). Intuitively, the idea is that source phrases that co-occur with many different target phrases are more likely to appear in new contexts.</Paragraph> <Paragraph position="8"> For both the unigram and Kneser-Ney smoothing distributions, we used a discounting coefficient derived by (Ney et al., 1994) on the basis of a leave-one-out analysis: D = n1/(n1 + 2n2). For the Kneser-Ney smoothing distribution, we also tested the &quot;Modified Kneser-Ney&quot; extension suggested in (Chen and Goodman, 1998), in which specific coefficients Dc are used for small count values c up to a maximum of three (i.e., D3 is used for c ≥ 3). For c = 2 and c = 3, we used the formulas given in that paper.</Paragraph> </Section>
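Pulling the formulas of this subsection together, here is a simplified sketch (Python; an illustration under the definitions above, not the paper's implementation, and it assumes some counts of 1 or 2 exist so that the discount is well defined) of interpolated fixed-discount smoothing with the Kneser-Ney lower-order distribution and the Ney et al. discount D = n1/(n1 + 2n2).

from collections import defaultdict, Counter

def kneser_ney_phrasetable(joint_counts):
    """Interpolated fixed-discount smoothing of p(s|t):
    p(s|t) = (c(s,t) - D) / c(t) + alpha(t) * pb(s), with
    alpha(t) = D * n1+(., t) / c(t) and pb(s) proportional to n1+(s, .)."""
    n = Counter(joint_counts.values())
    D = n[1] / (n[1] + 2 * n[2])            # Ney et al. (1994) discount
    c_t = defaultdict(float)                # c(t)
    types_t = defaultdict(int)              # n1+(., t): distinct s with c(s, t) > 0
    types_s = defaultdict(int)              # n1+(s, .): distinct t with c(s, t) > 0
    for (s, t), c in joint_counts.items():
        c_t[t] += c
        types_t[t] += 1
        types_s[s] += 1
    total_types = sum(types_s.values())
    pb = {s: k / total_types for s, k in types_s.items()}  # Kneser-Ney lower-order pb(s)
    smoothed = {}
    for (s, t), c in joint_counts.items():
        alpha = D * types_t[t] / c_t[t]
        smoothed[(s, t)] = (c - D) / c_t[t] + alpha * pb[s]
    return smoothed

Only seen pairs are returned, in keeping with the observation above that unseen events are never hypothesized during decoding; a unigram variant would simply replace pb(s̃) with the marginal relative frequency of s̃.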
<Section position="4" start_page="56" end_page="56" type="sub_section"> <SectionTitle> 3.3 Lexical Decomposition </SectionTitle> <Paragraph position="0"> The two glass-box techniques that we considered involve decomposing source phrases with independence assumptions. The simplest approach assumes that all source words are conditionally independent, so that:</Paragraph> <Paragraph position="1"> p(s̃|t̃) = ∏_j p(sj|t̃).</Paragraph> <Paragraph position="2"> We implemented two variants for p(sj|t̃) that are described in previous work. (Zens and Ney, 2004) describe a &quot;noisy-or&quot; combination:</Paragraph> <Paragraph position="3"> p(sj|t̃) = 1 - p(s̄j|t̃) = 1 - ∏_i (1 - p(sj|ti)),</Paragraph> <Paragraph position="4"> where p(s̄j|t̃) is the probability that sj is not in the translation of t̃, and p(sj|ti) is a lexical probability. (Zens and Ney, 2004) obtain p(sj|ti) from smoothed relative-frequency estimates in a word-aligned corpus. Our implementation simply uses IBM1 probabilities, which obviate further smoothing. The noisy-or combination stipulates that sj should not appear in s̃ if it is not the translation of any of the words in t̃. The complement of this, proposed in (Koehn et al., 2005), is to say that sj should appear in s̃ if it is the translation of at least one of the words in t̃:</Paragraph> <Paragraph position="5"> p(sj|t̃) = Σ_{i∈Aj} p(sj|ti) / |Aj|,</Paragraph> <Paragraph position="6"> where Aj is a set of likely alignment connections for sj. In our implementation of this method, we assumed that Aj = {1, ..., Ĩ}, i.e., the set of all connections, and used IBM1 probabilities for p(s|t).</Paragraph> </Section> <Section position="5" start_page="56" end_page="56" type="sub_section"> <SectionTitle> 3.4 Lower-Order Combinations </SectionTitle> <Paragraph position="0"> We mentioned earlier that LM ngrams have a naturally-ordered sequence of smoothing distributions, obtained by successively dropping the last word in the context. For phrasetable smoothing, because no word in t̃ is a priori less informative than any other, there is no exact parallel to this technique. However, it is clear that estimates made by replacing particular target (conditioning) words with wildcards will be smoother than the original relative frequencies. A simple scheme for combining them is just to average:</Paragraph> <Paragraph position="1"> p(s̃|t1, ..., tĨ) = (1/Ĩ) Σ_{i=1..Ĩ} p(s̃|t1, ..., t_{i-1}, *, t_{i+1}, ..., tĨ).</Paragraph> <Paragraph position="2"> One might also consider progressively replacing the least informative remaining word in the target phrase (using tf-idf or a similar measure).</Paragraph> <Paragraph position="3"> The same idea could be applied in reverse, by replacing particular source (conditioned) words with wildcards. We have not yet implemented this new glass-box smoothing technique, but it has considerable appeal. The idea is similar in spirit to Collins' backoff method for prepositional phrase attachment (Collins and Brooks, 1995).</Paragraph> </Section> </Section> <Section position="6" start_page="56" end_page="57" type="metho"> <SectionTitle> 4 Related Work </SectionTitle> <Paragraph position="0"> As mentioned previously, (Chen and Goodman, 1998) give a comprehensive survey and evaluation of smoothing techniques for language modeling. As also mentioned previously, there is relatively little published work on smoothing for statistical MT. For the IBM models, alignment probabilities need to be smoothed for combinations of sentence lengths and positions not encountered in training data (García-Varea et al., 1998). Moore (2004) has found that smoothing to correct overestimated IBM1 lexical probabilities for rare words can improve word-alignment performance. Langlais (2005) reports negative results for synonym-based smoothing of IBM2 lexical probabilities prior to extracting phrases for phrase-based SMT.</Paragraph> <Paragraph position="1"> For phrase-based SMT, the use of smoothing to avoid zero probabilities during phrase induction is reported in (Marcu and Wong, 2002), but no details are given. As described above, (Zens and Ney, 2004) and (Koehn et al., 2005) use two different variants of glass-box smoothing (which they call &quot;lexical smoothing&quot;) over the phrasetable, and combine the resulting estimates with pure relative-frequency ones in a loglinear model. Finally, (Cettolo et al., 2005) describes the use of Witten-Bell smoothing (a black-box technique) for phrasetable counts, but does not give a comparison to other methods. As Witten-Bell is reported by (Chen and Goodman, 1998) to be significantly worse than Kneser-Ney smoothing, we have not yet tested this method.</Paragraph> </Section> </Paper>