<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1033">
  <Title>A Hierarchical Phrase-Based Model for Statistical Machine Translation</Title>
  <Section position="2" start_page="0" end_page="264" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The alignment template translation model (Och and Ney, 2004) and related phrase-based models advanced the previous state of the art by moving from words to phrases as the basic unit of translation. Phrases, which can be any substring and not necessarily phrases in any syntactic theory, allow these models to learn local reorderings, translation of short idioms, or insertions and deletions that are sensitive to local context. They are thus a simple and powerful mechanism for machine translation.</Paragraph>
    <Paragraph position="1"> The basic phrase-based model is an instance of the noisy-channel approach (Brown et al., 1993), in which the translation of a French sentence f into an English sentence e is modeled as: arg max_e P(e | f ) = arg max_e P(e) P( f | e). (Throughout this paper, we follow the convention of Brown et al. of designating the source and target languages as &amp;quot;French&amp;quot; and &amp;quot;English,&amp;quot; respectively. The variables f and e stand for source and target sentences; f_i^j stands for the substring of f from position i to position j inclusive, and similarly for e_i^j.)</Paragraph>
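The noisy-channel decision rule above can be sketched in a few lines of Python. The language-model and translation-model probabilities below are invented toy numbers for illustration, not values from the paper:

```python
import math

# Noisy-channel decoding sketch: pick the English candidate e that
# maximizes P(e) * P(f|e), i.e. the sum of log probabilities.
lm = {"Australia is": 0.02, "is Australia": 0.001}        # toy P(e)
tm = {("Aozhou shi", "Australia is"): 0.3,
      ("Aozhou shi", "is Australia"): 0.3}                # toy P(f|e)

def best_translation(f, candidates):
    return max(candidates,
               key=lambda e: math.log(lm[e]) + math.log(tm[(f, e)]))

print(best_translation("Aozhou shi", ["Australia is", "is Australia"]))
# -> Australia is  (the language model breaks the tie here)
```

With equal channel probabilities, the language model alone decides between the two word orders, which is exactly the division of labor the noisy-channel factorization is meant to provide.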
    <Paragraph position="3"> The translation model P( f | e) &amp;quot;encodes&amp;quot; e into f by the following steps: 1. segment e into phrases ē1 ... ēI, typically with a uniform distribution over segmentations; 2. reorder the ēi according to some distortion model; 3. translate each of the ēi into French phrases according to a model P( f̄ | ē) estimated from the training data.</Paragraph>
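The three-step generative story can be sketched as follows. The phrase table, the chosen segmentation, and the permutation are all illustrative toy inputs, not the model's learned distributions:

```python
# Sketch of the generative story behind P(f|e):
#   step 1: a segmentation of e into phrases (given here as input),
#   step 2: a reordering of those phrases (a permutation of indices),
#   step 3: phrase-by-phrase translation via a (toy) phrase table.
phrase_table = {
    ("with",): ["yu"],
    ("North", "Korea"): ["Bei", "Han"],
}

def encode(e_phrases, order):
    reordered = [e_phrases[i] for i in order]     # step 2: distortion
    f = []
    for ph in reordered:                          # step 3: translation
        f.extend(phrase_table[tuple(ph)])
    return f

print(encode([["with"], ["North", "Korea"]], order=[1, 0]))
# -> ['Bei', 'Han', 'yu']
```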
    <Paragraph position="4"> Other phrase-based models model the joint distribution P(e, f ) (Marcu and Wong, 2002) or make P(e) and P( f | e) into features of a log-linear model (Och and Ney, 2002). But the basic architecture of phrase segmentation (or generation), phrase reordering, and phrase translation remains the same.</Paragraph>
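The log-linear formulation can be sketched as a weighted sum of feature functions; with h_lm = log P(e), h_tm = log P(f|e), and unit weights it recovers the noisy-channel model. The feature names and weights below are invented for illustration:

```python
import math

# Log-linear model score (Och and Ney, 2002 style):
#   score(e, f) = sum_i lambda_i * h_i(e, f)
def loglinear_score(features, weights):
    return sum(weights[name] * value for name, value in features.items())

# Toy features: log-probabilities plus an extra length feature, which
# a pure noisy-channel model could not express.
features = {"lm": math.log(0.02), "tm": math.log(0.3), "length": 2.0}
weights = {"lm": 1.0, "tm": 1.0, "length": 0.1}
score = loglinear_score(features, weights)
```

The attraction of this formulation is that arbitrary extra features (here, a hypothetical length feature) can be added without changing the decoder.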
    <Paragraph position="5"> Phrase-based models can robustly perform translations that are localized to substrings that are common enough to have been observed in training. But Koehn et al. (2003) find that phrases longer than three words improve performance little, suggesting that data sparseness takes over for longer phrases. Above the phrase level, these models typically have a simple distortion model that reorders phrases independently of their content (Och and Ney, 2004; Koehn et al., 2003), or not at all (Zens and Ney, 2004; Kumar et al., 2005).</Paragraph>
    <Paragraph position="6"> But it is often desirable to capture translations whose scope is larger than a few consecutive words. Consider the following Chinese sentence: (1) Aozhou shi yu Bei Han you bangjiao de shaoshu guojia zhiyi 'Australia is one of the few countries that have diplomatic relations with North Korea' If we count zhiyi, lit. 'of-one,' as a single token, then translating this sentence correctly into English requires reversing a sequence of five elements. When we run a phrase-based system, Pharaoh (Koehn et al., 2003; Koehn, 2004a), on this sentence (using the experimental setup described below), we get the following phrases with translations: (4) [Aozhou] [shi] [yu] [Bei Han] [you] [bangjiao]1 [de shaoshu guojia zhiyi] [Australia] [is] [dipl. rels.]1 [with] [North Korea] [is] [one of the few countries] where we have used subscripts to indicate the reordering of phrases. The phrase-based model is able to order &amp;quot;diplomatic...Korea&amp;quot; correctly (using phrase reordering) and &amp;quot;one...countries&amp;quot; correctly (using a phrase translation), but does not accomplish the necessary inversion of those two groups. A lexicalized phrase-reordering model like the one used in ISI's system (Och et al., 2004) might be able to learn a better reordering, but simpler distortion models will probably not.</Paragraph>
    <Paragraph position="7"> We propose a solution to these problems that does not interfere with the strengths of the phrase-based approach, but rather capitalizes on them: since phrases are good for learning reorderings of words, we can use them to learn reorderings of phrases as well. In order to do this we need hierarchical phrases that consist of both words and subphrases. For example, a hierarchical phrase pair that might help with the above example is: (5) &lt;yu 1 you 2 , have 2 with 1 &gt; where 1 and 2 are placeholders for subphrases. This would capture the fact that Chinese PPs almost always modify VP on the left, whereas English PPs usually modify VP on the right. Because it generalizes over possible prepositional objects and direct objects, it acts both as a discontinuous phrase pair and as a phrase-reordering rule. Thus it is considerably more powerful than a conventional phrase pair. Similarly, (6) &lt; 1 de 2 , the 2 that 1 &gt; would capture the fact that Chinese relative clauses modify NPs on the left, whereas English relative clauses modify on the right; and (7) &lt; 1 zhiyi, one of 1 &gt; would render the construction zhiyi in English word order. These three rules, along with some conventional phrase pairs, suffice to translate the sentence correctly: (8) [Aozhou] [shi] [[[yu [Bei Han]1 you [bangjiao]2] de [shaoshu guojia]3] zhiyi] [Australia] [is] [one of [the [few countries]3 that [have [dipl. rels.]2 with [North Korea]1]]] The system we describe below uses rules like this, and in fact is able to learn them automatically from a bitext without syntactic annotation. It translates the above example almost exactly as we have shown, the only error being that it omits the word 'that' from (6) and therefore (8).</Paragraph>
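Applying rules (5)-(7), plus a few conventional phrase pairs, can be sketched as a small recursive rewriting procedure. The bracketing of the input is taken directly from derivation (8); finding that bracketing is the decoder's job in the real system, and the rule encoding here (integers as subphrase placeholders) is only an illustrative data structure:

```python
# Hierarchical phrase pairs as (source pattern -> target pattern),
# with integers standing in for the numbered subphrase gaps.
rules = {
    ("yu", 1, "you", 2): ("have", 2, "with", 1),   # rule (5)
    (1, "de", 2): ("the", 2, "that", 1),           # rule (6)
    (1, "zhiyi"): ("one", "of", 1),                # rule (7)
    ("Bei", "Han"): ("North", "Korea"),            # conventional pairs
    ("bangjiao",): ("diplomatic", "relations"),
    ("shaoshu", "guojia"): ("few", "countries"),
}

def translate(node):
    """node: list of terminal strings and nested sub-derivations."""
    kids = [x for x in node if isinstance(x, list)]
    key, gap = [], 0
    for x in node:                     # replace subtrees with gap numbers
        if isinstance(x, str):
            key.append(x)
        else:
            gap += 1
            key.append(gap)
    out = []
    for x in rules[tuple(key)]:        # expand the matching rule's target
        if isinstance(x, str):
            out.append(x)
        else:
            out.extend(translate(kids[x - 1]))
    return out

# The NP from derivation (8), already bracketed:
src = [[["yu", ["Bei", "Han"], "you", ["bangjiao"]],
        "de", ["shaoshu", "guojia"]], "zhiyi"]
print(" ".join(translate(src)))
# -> one of the few countries that have diplomatic relations with North Korea
```

Note how each rule both translates its lexical material and inverts its subphrases, which is precisely the combined phrase-pair/reordering behavior claimed above.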
    <Paragraph position="8"> These hierarchical phrase pairs are formally productions of a synchronous context-free grammar (defined below). A move to synchronous CFG can be seen as a move towards syntax-based MT; however, we make a distinction here between formally syntax-based and linguistically syntax-based MT. A system like that of Yamada and Knight (2001) is both formally and linguistically syntax-based: formally because it uses synchronous CFG, linguistically because the structures it is defined over are (on the English side) informed by syntactic theory (via the Penn Treebank). Our system is formally syntax-based in that it uses synchronous CFG, but not necessarily linguistically syntax-based, because it induces a grammar from a parallel text without relying on any linguistic annotations or assumptions; the result sometimes resembles a syntactician's grammar but often does not. In this respect it resembles Wu's bilingual bracketer (Wu, 1997), but ours uses a different extraction method that allows more than one lexical item in a rule, in keeping with the phrase-based philosophy. Our extraction method is basically the same as that of Block (2000), except we allow more than one nonterminal symbol in a rule, and use a more sophisticated probability model.</Paragraph>
    <Paragraph position="9"> In this paper we describe the design and implementation of our hierarchical phrase-based model, and report on experiments that demonstrate that hierarchical phrases indeed improve translation.</Paragraph>
  </Section>
</Paper>