<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1012">
  <Title>Extensions to HMM-based Statistical Word Alignment Models</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Part of Speech Tags in a Translation
Model
</SectionTitle>
    <Paragraph position="0"> Augmenting the model a0a25a1a4a3 a5a7a15a8a10 a11a7a26a13 with part of speech tag information leads to the following equations. We use a3 a5a7 , a10 a11a7 or vector notation e, f to denote English and French strings. (a27 and a28 represent the lengths of the French and English strings respectively.) Let us define eT and fT as possible POS tag sequences of the sentences e and f. We can rewrite the string translation probability a0a2a1a4a3 a5a7a15a8a10 a11a7a14a13 as follows (using Bayes rule to give the last line):</Paragraph>
    <Paragraph position="2"> If we also assume that the taggers in both languages generate a single tag sequence for each sentence then the equation for machine translation by the noisy channel model simplifies to</Paragraph>
    <Paragraph position="4"> This is the decomposition of the string translation probability into a language model and translation model. In this paper we only address the translation model and assume that there exists a one-to-one alignment from target to source words. Therefore, a29a2a1a19a32a51a40a42a32a44a43a45a8a30a73a40a42a30a74a43a14a13a75a33 a34a49a76a77a29a2a1a19a32a51a40a42a32a44a43a25a40a42a78a79a8a30a41a40a42a30a50a43a14a13 One possible way to rewrite a29a25a1a4a10 a11a7a53a40a80a10a41a81 a11a7a45a40a42a82 a11a7a83a8</Paragraph>
    <Paragraph position="6"> Here each a82a9a22 gives the index of the word a3a15a102a44a103 to which a10a23a22 is aligned. The models we present in this paper will differ in the decompositions of alignment probabilities, tag translation and word translation probabilities in Eqn. 1. Section 3 describes the baseline model in more detail. Section 4 illustrates examples where the baseline model performs poorly. Section 5 presents our extensions and Section 6 presents experimental results.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Baseline Model
</SectionTitle>
    <Paragraph position="0"> Translation of French and English sentences shows a strong localization effect. Words close to each other in the source language remain close in the translation. Furthermore, most of the time the alignment shows monotonicity. This means that pairwise alignments stay close to the diagonal line of the a1a105a104a6a40a107a106a108a13 plane. It has been shown (Vogel et al., 1996; Och et al., 1999; Och and Ney, 2000a) that HMM based alignment models are effective at capturing such localization. null We use as a baseline the model presented by (Och and Ney, 2000a). A basic bigram HMM-based model gives us</Paragraph>
    <Paragraph position="2"> In this HMM model,2 alignment probabilities are independent of word position and depend only on jump width (a82a6a22a115a114a116a82a9a22a23a24 a7 ).3 The Och and Ney (2000a) model includes refinements including special treatment of a jump to Null and smoothing with a uniform prior which we also included in our initial model. As in their model we set the probability for jump from any state to Null to a fixed value (a17 a33a118a117a54a119 ) which we estimated from held-out data.</Paragraph>
    <Paragraph position="4"> as output.</Paragraph>
    <Paragraph position="5"> 3In order for the model not to be deficient, we normalize the jump probabilities at each EM step so that jumping outside of the borders of the sentence is not possible.</Paragraph>
    <Paragraph position="6"> NULL in itaddition , serious threat to confederation en pourrait uneconstituerelle serieuse menace pour la confederation et,outre</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Alignment Irregularities
</SectionTitle>
    <Paragraph position="0"> Although the baseline Hidden Markov alignment model successfully generates smooth alignments, there are a fair number of alignment examples where pairwise match shows local irregularities. One instance of this is the transition of the NP a139 JJ NN rule to NP a139 NN JJ from English to French. We can list two main reasons why word translation probabilities may not catch such irregularities to monotonicity. First, it may be the case that both the English adjective and noun are words that are unknown. In this case the translation probabilities will be close to each other after smoothing. Second, the adjective-noun pair may consist of words that are frequently seen together in English. National reserve and Canadian parliament, are examples of such pairs. As a result there will be an indirect association between the English noun and the translation of the English adjective. In both cases, word translation probabilities will not be differentiating enough and alignment probabilities become the dominating factor to determine where a10 a22 aligns.</Paragraph>
    <Paragraph position="1"> Figure 1 illustrates how our baseline HMM model makes an alignment mistake of this sort. The table in the figure displays alignment and translation probabilities of two competing alignments (namely Aln1 and Aln2) for the last three words. In both alignments, the shown a10 a22 and a3 a102a44a103 are periods at the end of the French and English sentences. The first alignment maps nationale to national and unit'e to unity. (i.e. a3a16a102a46a103a107a131 a110 a33 national and a3a21a102a44a103a42a131a6a133 =unity). The second alignment maps both nationale and unit'e to unity (i.e. a3a16a102a46a103a107a131 a110 a33 unity and a3a16a102a46a103a107a131a47a133a140a33 unity). Starting from the unity-unit'e alignment, the jump width sequences a141 (a82a9a22a23a24 a7 a114a77a82a15a22a23a24a73a142 ), (a82a9a22a14a114a77a82a9a22a23a24 a7 )a143 for Aln1 and Aln2 are a141 a114a145a144 , 2a143 and a141 0, 1a143 respectively. The table shows that the gain from use of monotonic alignment probabilities dominates over the lowered word translation probability. Although national and nationale are strongly correlated according to the translation probabilities, jump widths of a114a140a144 and 2 are less probable than jump widths of 0 and 1.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Extensions
</SectionTitle>
    <Paragraph position="0"> In this section we describe our improvements on the HMM model. We present evaluation results in Section 6 after describing the technical details of our models here.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 POS Tags for Translation Probabilities
</SectionTitle>
      <Paragraph position="0"> Our model with part of speech tags for translation probabilities uses the following simplification of the translation probability shown in Eqn. 1.4</Paragraph>
      <Paragraph position="2"> In this model we introduce tag translation probabilities as an extra factor to Eqn. 2. Intuitively the role of this factor is to boost the translation probabilities for words of parts of speech that can often be translations of each other. Thus this probability distribution provides prior knowledge of the possible translations of a word based only on its part of speech.</Paragraph>
      <Paragraph position="3"> However, P(a10a41a81 a22 a8a3a122a81 a102a44a103 ) should not be too sharp or 4Since we are only concerned with alignment here and not generation of candidate translations the factor P(a147</Paragraph>
      <Paragraph position="5"> be ignored and we omit it from the equations for the rest of the paper.</Paragraph>
      <Paragraph position="6"> it will dominate the alignment probabilities and the probabilities a29a25a1a4a10a52a8a3a21a13 . We use the following linear interpolation to smooth tag translation probabilities:</Paragraph>
      <Paragraph position="8"> T is the size of the French tag set and a148 is set to be 0.1 in our experiments. The tag translation model is so heavily smoothed with a uniform distribution because in EM the tag translation probabilities quickly become very sharp and can easily overrule the alignment and word translation probabilities. The Results section shows that the addition of this factor reduces the alignment error rate, with the improvement being especially large when the training data size is small.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Tag Sequences for Jump Probabilities
</SectionTitle>
      <Paragraph position="0"> This section describes an extension to the bigram HMM model that uses source and target language tag sequences as conditioning information when predicting the alignment of target language words.</Paragraph>
      <Paragraph position="1"> In the decomposition of the joint probability a29a2a1a19a32a47a40a42a32a46a43a70a40a42a78a60a8a30a73a40a42a30a74a43a26a13 shown in Eqn. 1 the factor for alignment probabilities is</Paragraph>
      <Paragraph position="3"> A bigram HMM model assumes independence of a82a9a22 from anything but the previous alignment position a82a9a22a23a24 a7 and the length of the English sentence. Brown et al. (1993) and Och et al. (1999) variably condition this probability on the English word in position a82a9a22a23a24 a7 and/or the French word in position a104 . As conditioning directly on words would yield a large number of parameters and would be impractical, they cluster the words automatically into bilingual word classes.</Paragraph>
      <Paragraph position="4"> The question arises then whether we would have larger gains by conditioning on the part of speech tags of those words or even more words around the alignment position. For example, if we use the following conditioning information:</Paragraph>
      <Paragraph position="6"> we could model probabilities of transpositions and insertion of function words in the target language that have no corresponding words in the source language (a3a16a102a46a103 is Null) similarly to the channel operations of the (Yamada and Knight, 2001) syntax-based statistical translation model. Since the syntactic knowledge provided by POS tags is quite limited, this is a crude model of transpositions and Null insertions at the preterminal level. However we could still expect that it would help in modeling local word order variations. For example, in the sentence J'aime la chute 'I love the fall' the probability of aligning a10a23a22a156a33 la (a10a41a81a98a22a146a33 DT) to the will be boosted by knowing a3a122a81a157a102a44a103a42a131 a110 a33 VBP and a3a122a81a31a102a46a103a107a131 a110a44a155 a7 a33 DT. Similarly, in the sentence J'aime des chiens 'I love dogs' the probability of aligning a10a21a22a158a33 la to Null will be increased by knowing a3a122a81 a102a46a103a107a131 a110 a33 VBP and</Paragraph>
      <Paragraph position="8"> a33 NNS. VBP followed by NNS crudely conducts the information that the verb is followed by a noun phrase which does not include a determiner.</Paragraph>
      <Paragraph position="9"> We conducted a series of experiments where the alignment probabilities are conditioned on different subsets of the part of speech tags</Paragraph>
      <Paragraph position="11"> In order to be able to condition on a10a41a81a12a22a15a40a80a10a41a81a98a22 a155 a7 when generating an alignment position for a10a15a22 , we have to change the generative model for the sentence f and its tag sequence fT to generate the part of speech tags for the French words before choosing alignment positions for them. The French POS tags could be generated for example from a prior distribution a29a2a1a4a10a41a81a41a22a21a13 or from the previous French tags as in an HMM for part-of-speech tagging. The generative model becomes: a29a2a1a19a32a51a40a42a32a44a43a159a8a30a56a13a115a33</Paragraph>
      <Paragraph position="13"> This model makes the assumption that target words are independent of their tags given the corresponding source word and models only the dependence of alignment positions on part of speech tags.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Modeling Fertility
</SectionTitle>
      <Paragraph position="0"> A major advantage of the IBM models 3-5 over the HMM alignment model is the presence of a model of source word fertility. Thus knowledge that some words translate as phrases in the target language is incorporated in the model.</Paragraph>
      <Paragraph position="1"> The HMM model has no memory, apart from the previous alignment, about how many words it has aligned to a source word. Yet even this memory is not used to decide whether to generate more words from a given English word. The decision to generate again (to make a jump of size 0) is independent of the word and is estimated over all words in the corpus.</Paragraph>
      <Paragraph position="2"> We extended the HMM model to decide whether to generate more words from the previous English word a3a16a102a44a103a42a131 a110 or to move on to a different word depending on the identity of the English word a3a6a102a44a103a42a131 a110 . We introduced a factor a29a2a1 staya8a3 a102a44a103a42a131 a110 a13 where the boolean random variable stay depends on the English word a10a23a22a23a24 a7 aligned to. Since in most cases words with fertility greater than one generate words that are consecutive in the target language, this extension approximates fertility modeling. More specifically, the baseline model (i.e., Eqn. 2) is changed as follows:  a13 in Eqn. 5 is the Kronecker delta function. Basically, the new alignment probabilities</Paragraph>
      <Paragraph position="4"> a13 state that a jump width of zero depends on the English word. If we define the fertility of a word as the number of consecutive words from the target language it generates, then the probability distribution for the fertility of an English word e according to this model is geometric with a probability of success a144a85a114a166a29a2a1 staya8a3a15a13 . The expectation isa7  to the real fertility distribution may not be very good, this approximation improves alignment accuracy in practice.</Paragraph>
      <Paragraph position="5"> Sparsity is a problem in estimating stay probabilities P(staya8a3a16a102a46a103a107a131 a110 ). We use the probability of a jump of size zero from the baseline model as our prior to do smoothing as follows:</Paragraph>
      <Paragraph position="7"> a90 in this equation is the alignment probability from the baseline model with zero jump distance.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Translation Model for Null
</SectionTitle>
      <Paragraph position="0"> As originally proposed by Brown et al. (1993), words in the target sentence for which there are no corresponding English words are assumed to be generated by the special English word Null. Null appears in every English sentence and often serves to generate syntactic elements in the target language that are missing in the source. A probability distribution a29a2a1a4a10a52a8Nulla13 for generation probabilities of the Null is re-estimated from a training corpus.</Paragraph>
      <Paragraph position="1"> Modeling a Null word has proven problematic. It has required many special fixes to keep models from aligning everything to Null or to keep them from aligning nothing to Null (Och and Ney, 2000b). This might stem from the problem that the Null is responsible for generating syntactic elements of the target language as well as generating words that make the target language sentence more idiomatic and stylistic. The intuition for our model of translation probabilities for target words that do not have corresponding source words is that these words are generated from the special English Null and also from the next word in the target language by a mixture model. The pair la conf'ed'eration in Figure 1 is an example of such case where conf'ed'eration contributes extra information in generation of la. The formula for the probability of a target word given that it does not have a corresponding aligning word in the source is:  pus using EM. The dependence of a French word on the next French word requires a change in the generative model to first propose alignments for all words in the French sentence and to then generate the French words given their alignments, starting from the end of the sentence and going towards the beginning. For the new model there is an efficient dynamic programming algorithm for computations in EM similar to the forward-backward algorithm. The probability  These can be computed recursively and used for efficient computation of posteriors in EM.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>