
<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1002">
  <Title>Do we need phrases? Challenging the conventional wisdom in Statistical Machine Translation</Title>
  <Section position="3" start_page="0" end_page="10" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> The last several years have seen phrasal statistical machine translation (SMT) systems outperform word-based approaches by a wide margin (Koehn 2003). Unfortunately the use of phrases in SMT is beset by a number of difficult theoretical and practical problems, which we attempt to characterize below. Recent research into syntax-based SMT (Quirk and Menezes 2005; Chiang 2005) has produced promising results in addressing some of the problems; research motivated by other statistical models has helped to address others (Banchs et al. 2005). We refine and unify two threads of research in an attempt to address all of these problems simultaneously.</Paragraph>
    <Paragraph position="1"> Such an approach proves both theoretically more desirable and empirically superior.</Paragraph>
    <Paragraph position="2"> In brief, Phrasal SMT systems employ phrase pairs automatically extracted from parallel corpora. To translate, a source sentence is first partitioned into a sequence of phrases I = s1...sI. Each source phrase si is then translated into a target phrase ti. Finally the target phrases are permuted, and the translation is read off in order. Beam search is used to approximate the optimal translation. We refer the reader to Keohn et al.</Paragraph>
    <Paragraph position="3"> (2003) for a detailed description. Unless otherwise noted, the following discussion is generally applicable to Alignment Template systems (Och and Ney 2004) as well.</Paragraph>
    <Section position="1" start_page="0" end_page="9" type="sub_section">
      <SectionTitle>
1.1. Advantages of phrasal SMT
Non-compositionality
</SectionTitle>
      <Paragraph position="0"> Phrases capture the translations of idiomatic and other non-compositional fixed phrases as a unit, side-stepping the need to awkwardly reconstruct them word by word. While many words can be translated into a single target word, common everyday phrases such as the English password translating as the French mot de passe cannot be easily subdivided. Allowing such translations to be first class entities simplifies translation implementation and improves translation quality.</Paragraph>
      <Paragraph position="1"> Local re-ordering Phrases provide memorized re-ordering decisions.</Paragraph>
      <Paragraph position="2"> As previously noted, translation can be conceptually divided into two steps: first, finding a set of phrase pairs that simultaneously covers the source side and provides a bag of translated target phrases; and second, picking an order for those target phrases. Since phrase pairs consist of memorized substrings of the training data, they are very likely to produce correct local reorderings. null Contextual information Many phrasal translations may be easily subdivided into word-for-word translation, for instance the English phrase the cabbage may be translated word-for-word as le chou. However we note that la is also a perfectly reasonable word-for-word translation of the, yet la chou is not a grammatical French string. Even when a phrase appears compositional, the incorporation of contextual information often improves translation  quality. Phrases are a straightforward means of capturing local context.</Paragraph>
      <Paragraph position="3"> 1.2. Theoretical problems with phrasal SMT Exact substring match; no discontiguity Large fixed phrase pairs are effective when an exact match can be found, but are useless otherwise. The alignment template approach (where phrases are modeled in terms of word classes instead of specific words) provides a solution at the expense of truly fixed phrases. Neither phrasal SMT nor alignment templates allow discontiguous translation pairs.</Paragraph>
      <Paragraph position="4"> Global re-ordering Phrases do capture local reordering, but provide no global re-ordering strategy, and the number of possible orderings to be considered is not lessened significantly. Given a sentence of n words, if the average target phrase length is 4 words (which is unusually high), then the re-ordering space is reduced from n! to only (n/4)!: still impractical for exact search in most sentences. Systems must therefore impose some limits on phrasal reordering, often hard limits based on distance as in Koehn et al. (2003) or some linguistically motivated constraint, such as ITG (Zens and Ney, 2004). Since these phrases are not bound by or even related to syntactic constituents, linguistic generalizations (such as SVO becoming SOV, or prepositions becoming postpositions) are not easily incorporated into the movement models.</Paragraph>
      <Paragraph position="5"> Probability estimation To estimate the translation probability of a phrase pair, several approaches are used, often concurrently as features in a log-linear model. Conditional probabilities can be estimated by maximum likelihood estimation. Yet the phrases most likely to contribute important translational and ordering information--the longest ones--are the ones most subject to sparse data issues.</Paragraph>
      <Paragraph position="6"> Alternately, conditional phrasal models can be constructed from word translation probabilities; this approach is often called lexical weighting (Vogel et al. 2003). This avoids sparse data issues, but tends to prefer literal translations where the word-for-word probabilities are high Furthermore most approaches model phrases as bags of words, and fail to distinguish between local re-ordering possibilities.</Paragraph>
    </Section>
    <Section position="2" start_page="9" end_page="10" type="sub_section">
      <SectionTitle>
Partitioning limitation
</SectionTitle>
      <Paragraph position="0"> A phrasal approach partitions the sentence into strings of words, making several questionable assumptions along the way. First, the probability of the partitioning is never considered. Long phrases tend to be rare and therefore have sharp probability distributions. This adds an inherent bias toward long phrases with questionable MLE probabilities (e.g. 1/1 or 2/2). 1 Second, the translation probability of each phrase pair is modeled independently. Such an approach fails to model any phenomena that reach across boundaries; only the target language model and perhaps whole-sentence bag of words models cross phrase boundaries. This is especially important when translating into languages with agreement phenomena. Often a single phrase does not cover all agreeing modifiers of a headword; the uncovered modifiers are biased toward the most common variant rather than the one agreeing with its head. Ideally a system would consider overlapping phrases rather than a single partitioning, but this poses a problem for generative models: when words are generated multiple times by different phrases, they are effectively penalized.</Paragraph>
      <Paragraph position="1"> 1.3. Practical problem with phrases: size In addition to the theoretical problems with phrases, there are also practical issues. While phrasal systems achieve diminishing returns due</Paragraph>
      <Paragraph position="3"> The Alignment Template method incorporates a loose partitioning probability by instead estimating the probability as (in the special case where each word has a unique class):</Paragraph>
      <Paragraph position="5"> Note that these counts could differ significantly. Picture a source phrase that almost always translates into a discontiguous phrase (e.g. English not becoming French ne ... pas), except for the rare occasion where, due to an alignment error or odd training data, it translates into a contiguous phrase (e.g. French ne parle pas). Then the first probability formulation of ne parle pas given not would be unreasonably high. However, this is a partial fix since it again suffers from data sparsity problems, especially on longer templates where systems hope to achieve the best benefits from phrases.</Paragraph>
      <Paragraph position="6">  to sparse data, one does see a small incremental benefit with increasing phrase lengths. Given that storing all of these phrases leads to very large phrase tables, many research systems simply limit the phrases gathered to those that could possibly influence some test set. However, this is not feasible for true production MT systems, since the data to be translated is unknown.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>