<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1001">
  <Title>A Projection Extension Algorithm for Statistical Machine Translation</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Various papers use phrase-based translation systems (Och et al., 1999; Marcu and Wong, 2002; Yamada and Knight, 2002) that have shown to improve translation quality over single-word based translation systems introduced in (Brown et al., 1993). In this paper, we present a similar system with a much simpler set of model parameters. Specifically, we compute the probability of a block sequence a0a2a1a3 . A block a0 is a pair consisting of a contiguous source and a contiguous target phrase. The block sequence  a4 target and source phrases. The example is actual decoder output and the English translation is slightly incorrect.</Paragraph>
    <Paragraph position="1"> probability a5a7a6a9a8 a0a10a1a3a12a11 is decomposed into conditional probabilities using the chain rule:  We try to find the block sequence that maximizes</Paragraph>
    <Paragraph position="3"> a0a37a1a3a36a11 . The model proposed is a joint model as in (Marcu and Wong, 2002), since target and source phrases are generated jointly. The approach is illustrated in Figure 1. The source phrases are given on the a50 -axis and the target phrases are given on the a51 -axis. During block decoding a bijection between source and target phrases is generated. The two types of parameters in Eq 1 are defined as: a52 Block unigram model a53a55a54a42a56a10a57a59a58 : We compute unigram probabilities for the blocks. The blocks are simpler than the alignment templates (Och et al., 1999) in that they do not have an internal structure.</Paragraph>
    <Paragraph position="4"> a52 Trigram language model: the probability a53a55a54a42a56a60a57a62a61a56a37a57a25a63a65a64a10a58 between adjacent blocks is computed as the probability of the first target word in the target clump of a56a10a57 given the final two words of the target clump of a56a28a57a25a63a65a64 . The exponent a66 is set in informal experiments to be a67a69a68a71a70 . No other parameters such as distortion probabilities are used.</Paragraph>
    <Paragraph position="5"> To select blocks a56 from training data, we compute unigram block co-occurrence counts a72a73a54a42a56a28a58 . a72a73a54a42a56a74a58 cannot be computed for all blocks in the training data: we would obtain hundreds of millions of blocks. The blocks are restricted by an underlying word alignment. In this paper, we present a block generation algorithm similar to the one in (Och et al., 1999) in full detail: source intervals are projected into target intervals under a restriction derived from a high-precision word alignment. The projection yields a set of high-precision block links. These block links are further extended using a high-recall word alignment. The block extension algorithm is shown to improve translation performance significantly. The system is tested on a Chinese-English (CE) and an Arabic-English (AE) translation task.</Paragraph>
    <Paragraph position="6"> The paper is structured as follows: in Section 2, we present the baseline block generation algorithm.</Paragraph>
    <Paragraph position="7"> The block extension approach is described in Section 2.1. Section 3 describes a DP-based decoder using blocks. Experimental results are presented in Section 4.</Paragraph>
  </Section>
class="xml-element"></Paper>