<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1002">
  <Title>Machine Translation</Title>
  <Section position="4" start_page="10" end_page="10" type="metho">
    <SectionTitle>
2. Previous work
</SectionTitle>
    <Paragraph position="0"> 2.1. Delayed phrase construction To avoid the major practical problem of phrasal SMT--namely large phrase tables, most of which are not useful to any one sentence--one can instead construct phrase tables on the fly using an indexed form of the training data (Zhang and Vogel 2005; Callison-Burch et al. 2005).</Paragraph>
    <Paragraph position="1"> However, this does not relieve any of the theoretical problems with phrase-based SMT.</Paragraph>
  </Section>
  <Section position="5" start_page="10" end_page="13" type="metho">
    <SectionTitle>
2.2. Syntax-based SMT
</SectionTitle>
    <Paragraph position="0"> Two recent systems have attempted to address the contiguity limitation and global re-ordering problem using syntax-based approaches.</Paragraph>
    <Paragraph position="1"> Hierarchical phrases Recent work in the use of hierarchical phrases (Chiang 2005) improves the ability to capture linguistic generalizations, and also removes the limitation to contiguous phrases. Hierarchical phrases differ from standard phrases in one important way: in addition to lexical items, a phrase pair may contain indexed placeholders, where each index must occur exactly once on each side. Such a formulation leads to a formally syntax-based translation approach, where translation is viewed as a parallel parsing problem over a grammar with one non-terminal symbol.</Paragraph>
    <Paragraph position="2"> This approach significantly outperforms a phrasal SMT baseline in controlled experimentation.</Paragraph>
    <Paragraph position="3"> Hierarchical phrases do address the need for non-contiguous phrases and suggest a powerful ordering story in the absence of linguistic information, although this reordering information is bound in a deeply lexicalized form. Yet they do not address the phrase probability estimation problem; nor do they provide a means of modeling phenomena across phrase boundaries.</Paragraph>
    <Paragraph position="4"> The practical problems with phrase-based translation systems are further exacerbated, since the number of translation rules with up to two non-adjacent non-terminals in a 1-1 monotone sentence pair of n source and target words is O(n6), as compared to O(n2) phrases.</Paragraph>
    <Paragraph position="5"> Treelet Translation Another means of extending phrase-based translation is to incorporate source language syntactic information. In Quirk and Menezes (2005) we presented an approach to phrasal SMT based on a parsed dependency tree representation of the source language. We use a source dependency parser and project a target dependency tree using a word-based alignment, after which we extract tree-based phrases ('treelets') and train a tree-based ordering model. We showed that using treelets and a tree-based ordering model results in significantly better translations than a leading phrase-based system (Pharaoh, Koehn 2004), keeping all other models identical.</Paragraph>
    <Paragraph position="6"> Like the hierarchical phrase approach, treelet translation succeeds in improving the global re-ordering search and allowing discontiguous phrases, but does not solve the partitioning or estimation problems. While we found our treelet system more resistant to degradation at smaller phrase sizes than the phrase-based system, it nevertheless suffered significantly at very small phrase sizes. Thus it is also subject to practical problems of size, and again these problems are exacerbated since there are potentially an exponential number of treelets.</Paragraph>
    <Paragraph position="7"> 2.3. Bilingual n-gram channel models To address on the problems of estimation and partitioning, one recent approach transforms channel modeling into a standard sequence modeling problem (Banchs et al. 2005). Consider the following aligned sentence pair in Figure 1a. In such a well-behaved example, it is natural to consider the problem in terms of sequence models. Picture a generative process that produces a sentence pair in left to right, emitting a pair of words in lock step. Let M = &lt; m1, ..., mn &gt; be a sequence of word pairs mi = &lt; s, t &gt;. Then one can generatively model the probability of an aligned sentence pair using techniques from n-gram language modeling:</Paragraph>
    <Paragraph position="9"> When an alignment is one-to-one and monotone, this definition is sufficient. However alignments are seldom purely one-to-one and monotone in practice; Figure 1b displays common behavior such as one-to-many alignments, inserted words, and non-monotone translation. To address these problems, Banchs et al. (2005) suggest defining tuples such that:  (1) the tuple sequence is monotone, (2) there are no word alignment links between two distinct tuples, (3) each tuple has a non-NULL source side, which may require that target words aligned to NULL are joined with their following word, and (4) no smaller tuples can be extracted without  violating these constraints.</Paragraph>
    <Paragraph position="10"> Note that M is now a sequence of phrase pairs instead of word pairs. With this adjusted definition, even Figure 1b can be generated using the same process using the following tuples:</Paragraph>
    <Paragraph position="12"> There are several advantages to such an approach. First, it largely avoids the partitioning problem; instead of segmenting into potentially large phrases, the sentence is segmented into much smaller tuples, most often pairs of single words. Furthermore the failure to model a partitioning probability is much more defensible when the partitions are much smaller. Secondly, n-gram language model probabilities provide a robust means of estimating phrasal translation probabilities in context that models interactions between all adjacent tuples, obviating the need for overlapping mappings.</Paragraph>
    <Paragraph position="13"> These tuple channel models still must address practical issues such as model size, though much work has been done to shrink language models with minimal impact to perplexity (e.g. Stolcke 1998), which these models could immediately leverage. Furthermore, these models do not address the contiguity problem or the global reordering problem.</Paragraph>
    <Paragraph position="14"> 3. Translation by MTUs In this paper, we address all four theoretical problems using a novel combination of our syntactically-informed treelet approach (Quirk and Menezes 2005) and a modified version of bilingual n-gram channel models (Banchs et al.</Paragraph>
    <Paragraph position="15"> 2005). As in our previous work, we first parse the sentence into a dependency tree. After this initial parse, we use a global search to find a candidate that maximizes a log-linear model, where these candidates consist of a target word sequence annotated with a dependency structure, a word alignment, and a treelet decomposition.</Paragraph>
    <Paragraph position="16"> We begin by exploring minimal translation units and the models that concern them.</Paragraph>
    <Section position="1" start_page="11" end_page="12" type="sub_section">
      <SectionTitle>
3.1. Minimal Translation Units
</SectionTitle>
      <Paragraph position="0"> Minimal Translation Units (MTUs) are related to the tuples of Banchs et al. (2005), but differ in several important respects. First, we relieve the restriction that the MTU sequence be monotone.</Paragraph>
      <Paragraph position="1"> This prevents spurious expansion of MTUs to incorporate adjacent context only to satisfy monotonicity. In the example, note that the previous algorithm would extract the tuple &lt;following example, exemple suivant&gt; even though the translations are mostly independent. Their partitioning is also context dependent: if the sentence did not contain the words following or suivant, then &lt; example, exemple &gt; would be a single MTU. Secondly we drop the requirement that no MTU have a NULL source side. While some insertions can be modeled in terms of adjacent words, we believe more robust models  can be obtained if we consider insertions as (a) Monotone aligned sentence pair (b) More common non-monotone aligned sentence pair  independent units. In the end our MTUs are defined quite simply as pairs of source and target word sets that follow the given constraints: (1) there are no word alignment links between distinct MTUs, and (2) no smaller MTUs can be extracted without  violating the previous constraint.</Paragraph>
      <Paragraph position="2"> Since our word alignment algorithm is able to produce one-to-one, one-to-many, many-to-one, one-to-zero, and zero-to-one translations, these act as our basic units. As an example, let us consider example (1) once again. Using this new algorithm, the MTUs would be:</Paragraph>
      <Paragraph position="4"> A finer grained partitioning into MTUs further reduces the data sparsity and partitioning issues associated with phrases. Yet it poses issues in modeling translation: given a sequence of MTUs that does not have a monotone segmentation, how do we model the probability of an aligned translation pair? We propose several solutions, and use each in a log-linear combination of models.</Paragraph>
      <Paragraph position="5"> First, one may walk the MTUs in source order, ignoring insertion MTUs altogether. Such a model is completely agnostic of the target word order; instead of generating an aligned source target pair, it generates a source sentence along with a bag of target phrases. This approach expends a great deal of modeling effort in regenerating the source sentence, which may not be altogether desirable, though it does condition on surrounding translations. Also, it can be evaluated on candidates before orderings are considered. This latter property may be useful in two-stage decoding strategies where translations are considered before orderings.</Paragraph>
      <Paragraph position="6"> Secondly, one may walk the MTUs in target order, ignoring deletion MTUs. Where the sourceorder MTU channel model expends probability mass generating the source sentence, this model expends a probability mass generating the target sentence and therefore may be somewhat redundant with the target language model.</Paragraph>
      <Paragraph position="7"> Finally, one may walk the MTUs in dependency tree order. Let us assume that in addition to an aligned source-target candidate pair, we have a dependency parse of the source side. Where the past models conditioned on surface adjacent MTUs, this model conditions on tree adjacent MTUs. Currently we condition only on the ancestor chain, where parent1(m) is the parent MTU of m, parent2(m) is the grandparent of m, and so on:  This model hopes to capture information completely distinct from the other two models, such as translational preferences contingent on the head, even in the presence of long distance dependencies. Note that it generates unordered dependency tree pairs.</Paragraph>
      <Paragraph position="8"> All of these models can be trained from a parallel corpus that has been word aligned and the source side dependency parsed. We walk through each sentence extracting MTUs in source, target, and tree order. Standard n-gram language modeling tools can be used to train MTU language models.</Paragraph>
    </Section>
    <Section position="2" start_page="12" end_page="13" type="sub_section">
      <SectionTitle>
3.2. Decoding
</SectionTitle>
      <Paragraph position="0"> We employ a dependency tree-based beam search decoder to search the space of translations. First the input is parsed into a dependency tree  structure. For each input node in the dependency tree, an n-best list of candidates is produced. Candidates consist of a target dependency tree along with a treelet and word alignment. The decoder generally assumes phrasal cohesion: candidates covering a substring (not subsequence) of the input sentence produce a potential substring (not subsequence) of the final translation. In addition to allowing a DP / beam decoder, this allows us to evaluate string-based models (such as the target language model and the source and target order MTU n-gram models) on partial candidates. This decoder is unchanged from our previous work: the MTU n-gram models are simply incorporated as feature functions in the log-linear combination. In the experiments section the MTU models are referred to as model set (1).</Paragraph>
      <Paragraph position="1">  We use word probability tables p(t  |s) and p(s  |t) estimated by IBM Model 1 (Brown et al. 1993).</Paragraph>
      <Paragraph position="2"> Such models can be built over phrases if used in a phrasal decoder or over treelets if used in a treelet decoder. These models are referred to as set (2). Word-based models A target language model using modified Kneser-Ney smoothing captures fluency; a word count feature offsets the target LM preference for shorter selections; and a treelet/phrase count helps bias toward translations using fewer phrases.</Paragraph>
      <Paragraph position="3"> These models are referred to as set (3).</Paragraph>
      <Paragraph position="4">  As in Quirk and Menezes (2005), we include a linguistically-informed order model that predicts the head-relative position of each node independently, and a tree-based bigram target language model; these models are referred to as set (4).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="13" end_page="14" type="metho">
    <SectionTitle>
4. Experimental setup
</SectionTitle>
    <Paragraph position="0"> We evaluate the translation quality of the system using the BLEU metric (Papineni et al., 02) under a variety of configurations. As an additional baseline, we compare against a phrasal SMT decoder, Pharaoh (Koehn et al. 2003).</Paragraph>
    <Section position="1" start_page="13" end_page="13" type="sub_section">
      <SectionTitle>
4.1. Data
</SectionTitle>
      <Paragraph position="0"> Two language pairs were used for this comparison: English to French, and English to Japanese. The data was selected from technical software documentation including software manuals and product support articles; Table 4.1 presents the major characteristics of this data.</Paragraph>
    </Section>
    <Section position="2" start_page="13" end_page="14" type="sub_section">
      <SectionTitle>
4.2. Training
</SectionTitle>
      <Paragraph position="0"> We parsed the source (English) side of the corpora using NLPWIN, a broad-coverage rule-based parser able to produce syntactic analyses at varying levels of depth (Heidorn 2002). For the purposes of these experiments we used a dependency tree output with part-of-speech tags and unstemmed surface words. Word alignments were produced by GIZA++ (Och and Ney 2003) with a standard training regimen of five iterations of Model 1, five iterations of the HMM Model, and five iterations of Model 4, in both directions.</Paragraph>
      <Paragraph position="1"> These alignments were combined heuristically as described in our previous work.</Paragraph>
      <Paragraph position="2"> We then projected the dependency trees and used the aligned dependency tree pairs to extract treelet translation pairs, train the order model, and train MTU models. The target language models were trained using only the target side of the corpus. Finally we trained model weights by maximizing BLEU (Och 2003) and set decoder optimization parameters (n-best list size, timeouts  etc) on a development test set of 200 held-out sentences each with a single reference translation. Parameters were individually estimated for each distinct configuration.</Paragraph>
      <Paragraph position="3"> Pharaoh The same GIZA++ alignments as above were used in the Pharaoh decoder (Koehn 2004). We used the heuristic combination described in (Och and Ney 2003) and extracted phrasal translation pairs from this combined alignment as described in (Koehn et al., 2003). Aside from MTU models and syntactic models (Pharaoh uses its own ordering approach), the same models were used: MLE and lexical weighting channel models, target LM, and phrase and word count. Model weights were also trained following Och (2003).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>