<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1607"> <Title>Phrasetable Smoothing for Statistical Machine Translation</Title> <Section position="3" start_page="0" end_page="53" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Smoothing is an important technique in statistical NLP, used to deal with perennial data sparseness and empirical distributions that overfit the training corpus. Surprisingly, however, it is rarely mentioned in statistical Machine Translation. In particular, state-of-the-art phrase-based SMT relies on a phrasetable--a large set of ngram pairs over the source and target languages, along with their translation probabilities. This table, which may contain tens of millions of entries, and phrases of up to ten words or more, is an excellent candidate for smoothing. Yet very few publications describe phrasetable smoothing techniques in detail.</Paragraph> <Paragraph position="1"> In this paper, we provide the first systematic study of smoothing methods for phrase-based SMT. Although we introduce a few new ideas, most methods described here were devised by others; the main purpose of this paper is not to invent new methods, but to compare methods. In experiments over many language pairs, we show that smoothing yields small but consistent gains in translation performance. We feel that this paper only scratches the surface: many other combinations of phrasetable smoothing techniques remain to be tested.</Paragraph> <Paragraph position="2"> We define a phrasetable as a set of source phrases (ngrams) ~s and their translations ~t, along with associated translation probabilities p(~s|~t) and p(~t|~s). These conditional distributions are derived from the joint frequencies c(~s,~t) of source/target phrase pairs observed in a word-aligned parallel corpus.</Paragraph> <Paragraph position="3"> Traditionally, maximum-likelihood estimation from relative frequencies is used to obtain conditional probabilities (Koehn et al., 2003), eg, p(~s|~t) = c(~s,~t)/summationtext~s c(~s,~t) (since the estimation problems for p(~s|~t) and p(~t|~s) are symmetrical, we will usually refer only to p(~s|~t) for brevity).</Paragraph> <Paragraph position="4"> The most obvious example of the overfitting this causes can be seen in phrase pairs whose constituent phrases occur only once in the corpus.</Paragraph> <Paragraph position="5"> These are assigned conditional probabilities of 1, higher than the estimated probabilities of pairs for which much more evidence exists, in the typical case where the latter have constituents that co-occur occasionally with other phrases. During decoding, overlapping phrase pairs are in direct competition, so estimation biases such as this one in favour of infrequent pairs have the potential to significantly degrade translation quality.</Paragraph> <Paragraph position="6"> An excellent discussion of smoothing techniques developed for ngram language models (LMs) may be found in (Chen and Goodman, 1998; Goodman, 2001). Phrasetable smoothing differs from ngram LM smoothing in the following ways: * Probabilities of individual unseen events are not important. 
<Paragraph position="6"> An excellent discussion of smoothing techniques developed for ngram language models (LMs) may be found in (Chen and Goodman, 1998; Goodman, 2001). Phrasetable smoothing differs from ngram LM smoothing in the following ways:
* Probabilities of individual unseen events are not important. Because the decoder only proposes phrase translations that are in the phrasetable (i.e., those with non-zero count), it never requires estimates for pairs $\tilde{s},\tilde{t}$ having $c(\tilde{s},\tilde{t}) = 0$. However, probability mass is reserved for the set of unseen translations, implying that probability mass is subtracted from the seen translations.</Paragraph>
<Paragraph position="7"> * There is no obvious lower-order distribution for backoff. One of the most important techniques in ngram LM smoothing is to combine estimates made using the previous $n-1$ words with those made using only the previous $n-i$ words, for $i = 2 \ldots n$. This relies on the fact that closer words are more informative, a property with no direct analog in phrasetable smoothing.</Paragraph>
<Paragraph position="8"> * The predicted objects are word sequences (in another language). This contrasts with LM smoothing, where they are single words, and makes them less amenable to decomposition for smoothing purposes.</Paragraph>
<Paragraph position="9"> We propose various ways of dealing with these special features of the phrasetable smoothing problem, and evaluate their performance within a phrase-based SMT system.</Paragraph>
<Paragraph position="10"> The paper is structured as follows: section 2 gives a brief description of our phrase-based SMT system; section 3 presents the smoothing techniques used; section 4 reviews previous work; section 5 gives experimental results; and section 6 concludes and discusses future work.</Paragraph>
</Section>
</Paper>