<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3124">
  <Title>Microsoft Research Treelet Translation System: NAACL 2006 Europarl Evaluation</Title>
  <Section position="3" start_page="0" end_page="160" type="metho">
    <SectionTitle>
2. System Details
</SectionTitle>
    <Paragraph position="0"> A brief word on notation: s and t represent source and target lexical nodes; S and T represent source and target trees; s and t represent source and target treelets (connected subgraphs of the dependency tree). The expression [?]t[?] T refers to all the lexical items in the target language tree T and |T| refers to the count of lexical items in T. We use subscripts to indicate selected words: Tn represents the nth lexical item in an in-order traversal of T.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1. Training
</SectionTitle>
      <Paragraph position="0"> We use the broad coverage dependency parser NLPWIN [3] to obtain source language dependency trees, and we use GIZA++ [4] to produce word alignments. The GIZA++ training regimen and parameters are tuned to optimize BLEU [5] scores on held-out data. Using the word alignments, we follow a set of dependency tree projection heuristics [1] to construct target dependency trees, producing a word-aligned parallel dependency tree corpus. Treelet translation pairs are extracted by enumerating all source treelets (to a maximum size) aligned to a target treelet.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="158" type="sub_section">
      <SectionTitle>
2.2. Decoding
</SectionTitle>
      <Paragraph position="0"> We use a tree-based decoder, inspired by dynamic programming. It searches for an approximation of  the n-best translations of each subtree of the input dependency tree. Translation candidates are composed from treelet translation pairs extracted from the training corpus. This process is described in more detail in [1].</Paragraph>
    </Section>
    <Section position="3" start_page="158" end_page="160" type="sub_section">
      <SectionTitle>
2.3. Models
</SectionTitle>
      <Paragraph position="0"> 2.3.1. Channel models We employ several channel models: a direct maximum likelihood estimate of the probability of target given source, as well as an estimate of source given target and target given source using the word-based IBM Model 1 [6]. For MLE, we use absolute discounting to smooth the probabilities: PMLEt[?]s=cs ,t[?]l cs,* Here, c represents the count of instances of the treelet pair [?]s, t[?] in the training corpus, and l is determined empirically.</Paragraph>
      <Paragraph position="1"> For Model 1 probabilities we compute the sum over all possible alignments of the treelet without normalizing for length. The calculation of source given target is presented below; target given source is calculated symmetrically.</Paragraph>
      <Paragraph position="2">  number of theoretical problems, such as the ad hoc estimation of phrasal probability, the failure to model the partition probability, and the tenuous connection between the phrases and the underlying word-based alignment model. In string-based SMT systems, these problems are outweighed by the key role played by phrases in capturing &amp;quot;local&amp;quot; order. In the absence of good global ordering models, this has led to an inexorable push towards longer and longer phrases, resulting in serious practical problems of scale, without, in the end, obviating the need for a real global ordering story.</Paragraph>
      <Paragraph position="3"> In [13] we discuss these issues in greater detail and also present our approach to this problem. Briefly, we take as our basic unit the Minimal Translation Unit (MTU) which we define as a set of source and target word pairs such that there are no word alignment links between distinct MTUs, and no smaller MTUs can be extracted without violating the previous constraint. In other words, these are the minimal non-compositional phrases. We then build models based on n-grams of MTUs in source string, target string and source dependency tree order. These bilingual n-gram models in combination with our global ordering model allow us to use shorter phrases without any loss in quality, or alternately to improve quality while keeping phrase size constant.</Paragraph>
      <Paragraph position="4"> As an example, consider the aligned sentence pair in Figure 1. There are seven MTUs:</Paragraph>
      <Paragraph position="6"> We can then predict the probability of each MTU in the context of (a) the previous MTUs in source order, (b) the previous MTUs in target order, or (c) the ancestor MTUs in the tree. We consider all of these traversal orders, each acting as a separate feature function in the log linear combination. For source and target traversal order we use a trigram model, and a bigram model for tree order.</Paragraph>
      <Paragraph position="7"> 2.3.3. Target language models We use both a surface level trigram language model and a dependency-based bigram language model [7], similar to the bilexical dependency modes used in some English Treebank parsers (e.g. [8]).</Paragraph>
      <Paragraph position="9"> Ptrisurf is a Kneser-Ney smoothed trigram language model trained on the target side of the training corpus, and Pbilex is a Kneser-Ney smoothed we 2 should 1 follow the 2 Rio 1 agenda +1 hemos 1 de +1 cumplir el 1 programa +1 de 1 Rio +1  bigram language model trained on target language dependencies extracted from the aligned parallel dependency tree corpus.</Paragraph>
      <Paragraph position="10">  The order model assigns a probability to the position (pos) of each target node relative to its head based on information in both the source and target trees:</Paragraph>
      <Paragraph position="12"> Here, position is modeled in terms of closeness to the head in the dependency tree. The closest pre-modifier of a given head has position -1; the closest post-modifier has a position 1. Figure 1 shows an example dependency tree pair annotated with head-relative positions.</Paragraph>
      <Paragraph position="13"> We use a small set of features reflecting local information in the dependency tree to model P(pos  (t,parent(t))  |S, T): * Lexical items of t and parent(t), the parent of t in the dependency tree.</Paragraph>
      <Paragraph position="14"> * Lexical items of the source nodes aligned to t and head(t).</Paragraph>
      <Paragraph position="15"> * Part-of-speech (&amp;quot;cat&amp;quot;) of the source nodes  aligned to the head and modifier.</Paragraph>
      <Paragraph position="16"> * Head-relative position of the source node aligned to the source modifier.</Paragraph>
      <Paragraph position="17"> These features along with the target feature are gathered from the word-aligned parallel dependency tree corpus and used to train a statistical model. In previous versions of the system, we trained a decision tree model [9]. In the current version, we explored log-linear models. In addition to providing a different way of combining information from multiple features, log-linear models allow us to model the similarity among different classes (target positions), which is advantageous for our task.</Paragraph>
      <Paragraph position="18"> We implemented a method for automatic selection of features and feature conjunctions in the log-linear model. The method greedily selects feature conjunction templates that maximize the accuracy on a development set. Our feature selection study showed that the part-of-speech labels of the source nodes aligned to the head and the modifier and the head-relative position of the source node corresponding to the modifier were the most important features. It was useful to concatenate the part-of-speech of the source head with every feature. This effectively achieves learning of separate movement models for each source head category. Lexical information on the pairs of head and dependent in the source and target was also very useful.</Paragraph>
      <Paragraph position="19"> To model the similarity among different target classes and to achieve pooling of data across similar classes, we added multiple features of the target position. These features let our model know, for example, that position -5 looks more like position -6 than like position 3. We added a feature &amp;quot;positive&amp;quot;/&amp;quot;negative&amp;quot; which is shared by all positive/negative positions. We also added a feature looking at the displacement of a position in the target from the corresponding position in the source and features which group the target positions into bins. These features of the target position are combined with features of the input. This model was trained on the provided parallel corpus. As described in Section 2.1 we parsed the source sentences, and projected target dependencies. Each head-modifier pair in the resulting target trees constituted a training instance for the order model.</Paragraph>
      <Paragraph position="20"> The score computed by the log-linear order model is used as a single feature in the overall log-linear combination of models (see Section 1), whose parameters were optimized using MaxBLEU [2]. This order model replaced the decision tree-based model described in [1].</Paragraph>
      <Paragraph position="21"> We compared the decision tree model to the log-linear model on predicting the position of a modifier using reference parallel sentences, independent of the full MT system. The decision tree achieved per decision accuracy of 69% whereas the log-linear model achieved per decision accuracy of 79%.1 In the context of the full MT system, however, the new order model provided a more modest improvement in the BLEU score of 0.39%.</Paragraph>
      <Paragraph position="22"> 2.3.5. Other models We include two pseudo-models that help balance certain biases inherent in our other models.</Paragraph>
      <Paragraph position="23"> * Treelet count. This feature is a count of treelets used to construct the candidate. It acts as a bias toward translations that use a smaller number of treelets; hence toward larger sized treelets incorporating more context.</Paragraph>
      <Paragraph position="24"> * Word count. We also include a count of the words in the target sentence. This feature  helps to offset the bias of the target language model toward shorter sentences.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="160" end_page="160" type="metho">
    <SectionTitle>
3. Discussion
</SectionTitle>
    <Paragraph position="0"> We participated in the English to Spanish track, using the supplied bilingual data only. We used only the target side of the bilingual corpus for the target language model, rather than the larger supplied language model. We did find that increasing the target language order from 3 to 4 had a noticeable impact on translation quality. It is likely that a larger target language corpus would have an impact, but we did not explore this.</Paragraph>
    <Paragraph position="0"> We found that the addition of bilingual n-gram based models had a substantial impact on translation quality. Adding these models raised BLEU scores about 0.8%, but anecdotal evidence suggests that human-evaluated quality rose by much more than the BLEU score difference would suggest. In general, we felt that in this corpus, due to the great diversity in translations for the same source language words and phrases, and given just one reference translation, BLEU score correlated rather poorly with human judgments. This was borne out in the human evaluation of the final test results. Humans ranked our system first and second, in-domain and out-of-domain respectively, even though it was in the middle of a field of ten systems by BLEU score. Furthermore, n-gram channel models may provide greater robustness. While our BLEU score dropped 3.61% on out-of-domain data, the average BLEU score of the other nine competing systems dropped 5.11%.</Paragraph>
  </Section>
class="xml-element"></Paper>