<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1066">
  <Title>Improving IBM Word-Alignment Model 1</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Problems with Model 1
</SectionTitle>
    <Paragraph position="0"> Model 1 clearly has many shortcomings as a model of translation. Some of these are structural limitations, and cannot be remedied without making the model significantly more complicated. Some of the major structural limitations include: * (Many-to-one) Each word in the target sentence can be generated by at most one word in the source sentence. Situations in which a phrase in the source sentence translates as a single word in the target sentence are not wellmodeled. null * (Distortion) The position of any word in the target sentence is independent of the position of the corresponding word in the source sentence, or the positions of any other source language words or their translations. The tendency for a contiguous phrase in one language to be translated as a contiguous phrase in another language is not modeled at all.</Paragraph>
    <Paragraph position="1"> * (Fertility) Whether a particular source word is selected to generate the target word for a given position is independent of which or how many other target words the same source word is selected to generate.</Paragraph>
    <Paragraph position="2"> These limitations of Model 1 are all well known, they have been addressed in other word-alignment models, and we will not discuss them further here. Our concern in this paper is with two other problems with Model 1 that are not deeply structural, and can be addressed merely by changing how the parameters of Model 1 are estimated.</Paragraph>
    <Paragraph position="3"> The first of these nonstructural problems with Model 1, as standardly trained, is that rare words in the source language tend to act as &amp;quot;garbage collectors&amp;quot; (Brown et al., 1993b; Och and Ney, 2004), aligning to too many words in the target language. This problem is not unique to Model 1, but anecdotal examination of Model 1 alignments suggests that it may be worse for Model 1, perhaps because Model 1 lacks the fertility and distortion parameters that may tend to mitigate the problem in more complex models.</Paragraph>
    <Paragraph position="4"> The cause of the problem can be easily understood if we consider a situation in which the source sentence contains a rare word that only occurs once in our training data, plus a frequent word that has an infrequent translation in the target sentence. Suppose the frequent source word has the translation present in the target sentence only 10% of the time in our training data, and thus has an estimated translation probability of around 0.1 for this target word. Since the rare source word has no other occurrences in the data, EM training is free to assign whatever probability distribution is required to maximize the joint probability of this sentence pair. Even if the rare word also needs to be used to generate its actual translation in the sentence pair, a relatively high joint probability will be obtained by giving the rare word a probability of 0.5 of generating its true translation and 0.5 of spuriously generating the translation of the frequent source word. The probability of this incorrect alignment will be higher than that obtained by assigning a probability of 1.0 to the rare word generating its true translation, and generating the true translation of the frequent source word with a probability of 0.1. The usual fix for over-fitting problems of this type in statistical NLP is to smooth the probability estimates involved in some way.</Paragraph>
    <Paragraph position="5"> The second nonstructural problem with Model 1 is that it seems to align too few target words to the null source word. Anecdotal examination of Model 1 alignments of English source sentences with French target sentences reveals that null word alignments rarely occur in the highest probability alignment, despite the fact that French sentences often contain function words that do not correspond directly to anything in their English translation. For example, English phrases of the form  The structure of Model 1 again suggests why we should not be surprised by this problem. As normally defined, Model 1 hypothesizes only one null word per sentence. A target sentence may contain many words that ideally should be aligned to null, plus some other instances of the same word that should be aligned to an actual source language word. For example, we may have an English/French sentence pair that contains two instances of of in the English sentence, and five instances of de in the French sentence. Even if the null word and of have the same initial probabilty of generating de, in iterating EM, this sentence is going to push the model towards estimating a higher probabilty that of generates de and a lower estimate that the null word generates de. This happens because there are are two instances of of in the source sentence and only one hypothetical null word, and Model 1 gives equal weight to each occurrence of each source word. In effect, of gets two votes, but the null word gets only one. We seem to need more instances of the null word for Model 1 to assign reasonable probabilities to target words aligning to the null word.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Smoothing Translation Counts
</SectionTitle>
    <Paragraph position="0"> We address the nonstructural problems of Model 1 discussed above by three methods. First, to address the problem of rare words aligning to too many words, at each interation of EM we smooth all the translation probability estimates by adding virtual counts according to a uniform probability distribution over all target words. This prevents the model from becoming too confident about the translation probabilities for rare source words on the basis of very little evidence. To estimate the smoothed probabilties we use the following formula:</Paragraph>
    <Paragraph position="2"> tion. We could take |V  |simply to be the total number of distinct words observed in the target language training, but we know that the target language will have many words that we have never observed. We arbitrarily chose |V  |to be 100,000, which is somewhat more than the total number of distinct words in our target language training data. The value of n is empirically optimized on annotated development test data.</Paragraph>
    <Paragraph position="3"> This sort of &amp;quot;add-n&amp;quot; smoothing has a poor reputation in statistical NLP, because it has repeatedly been shown to perform badly compared to other methods of smoothing higher-order n-gram models for statistical language modeling (e.g., Chen and Goodman, 1996). In those studies, however, add-n smoothing was used to smooth bigram or trigram models. Add-n smoothing is a way of smoothing with a uniform distribution, so it is not surprising that it performs poorly in language modeling when it is compared to smoothing with higher order models; e.g, smoothing trigrams with bigrams or smoothing bigrams with unigrams. In situations where smoothing with a uniform distribution is appropriate, it is not clear that add-n is a bad way to do it. Furthermore, we would argue that the word translation probabilities of Model 1 are a case where there is no clearly better alternative to a uniform distribution as the smoothing distribution. It should certainly be better than smoothing with a unigram distribution, since we especially want to benefit from smoothing the translation probabilities for the rarest words, and smoothing with a unigram distribution would assume that rare words are more likely to translate to frequent words than to other rare words, which seems counterintuitive.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Adding Null Words to the Source
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Sentence
</SectionTitle>
      <Paragraph position="0"> We address the lack of sufficient alignments of target words to the null source word by adding extra null words to each source sentence. Mathematically, there is no reason we have to add an integral number of null words, so in fact we let the number of null words in a sentence be any positive number. One can make arguments in favor of adding the same number of null words to every sentence, or in favor of letting the number of null words be proportional to the length of the sentence. We have chosen to add a fixed number of null words to each source sentence regardless of length, and will leave for another time the question of whether this works better or worse than adding a number of null words proportional to the sentence length.</Paragraph>
      <Paragraph position="1"> Conceptually, adding extra null words to source sentences is a slight modification to the structure of Model 1, but in fact, we can implement it without any additional model parameters by the simple expedient of multiplying all the translation probabilities for the null word by the number of null words per sentence. This multiplication is performed during every iteration of EM, as the translation probabilities for the null word are re-estimated from the corresponding expected counts. This makes these probabilities look like they are not normalized, but Model 1 can be applied in such a way that the translation probabilities for the null word are only ever used when multiplied by the number of null words in the sentence, so we are simply using the null word translation parameters to keep track of this product pre-computed. In training a version of Model 1 with only one null word per sentence, the parameters have their normal interpretation, since we are multiplying the standard probability estimates by 1.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="1" type="metho">
    <SectionTitle>
6 Initializing Model 1 with Heuristic
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
Parameter Estimates
</SectionTitle>
      <Paragraph position="0"> Normally, the translation probabilities of Model 1 are initialized to a uniform distribution over the target language vocabulary to start iterating EM. The unspoken justification for this is that EM training of Model 1 will always converge to the same set of parameter values from any set of initial values, so the intial values should not matter. But this is only the case if we want to obtain the parameter values at convergence, and we have strong reasons to believe that these values do not produce the most accurate sentence alignments. Even though EM will head towards those values from any initial position in the parameter space, there may be some starting points we can systematically find that will take us closer to the optimal parameter values for alignment accuracy along the way.</Paragraph>
      <Paragraph position="1"> To test whether a better set of initial parameter estimates can improve Model 1 alignment accuracy, we use a heuristic model based on the log-likelihood-ratio (LLR) statistic recommended by Dunning (1993). We chose this statistic because it has previously been found to be effective for automatically constructing translation lexicons (e.g., Melamed, 2000; Moore, 2001). In our application, the statistic can be defined by the following formula:</Paragraph>
      <Paragraph position="3"> In this formula t and s mean that the corresponding words occur in the respective target and source sentences of an aligned sentence pair, !t and !s mean that the corresponding words do not occur in the respective sentences, t? and s? are variables ranging over these values, and C(t?,s?) is the observed joint count for the values of t? and s?. All the probabilities in the formula refer to maximum likelihood estimates.</Paragraph>
      <Paragraph position="4">  These LLR scores can range in value from 0 to N *log(2), where N is the number of sentence pairs in the training data. The LLR score for a pair of words is high if the words have either a strong positive association or a strong negative association. Since we expect translation pairs to be positively associated, we discard any negatively associated word pairs by requiring that p(t,s) &gt;p(t) * p(s).</Paragraph>
      <Paragraph position="5"> To use LLR scores to obtain initial estimates for the translation probabilities of Model 1, we have to somehow transform them into numbers that range from 0 to 1, and sum to no more than 1 for all the target words associated with each source word. We know that words with high LLR scores tend to be translations, so we want high LLR scores to correspond to high probabilities, and low LLR scores to correspond to low probabilities. The simplest approach would be to divide each LLR score by the sum of the scores for the source word of the pair, which would produce a normalized conditional probability distribution for each source word.</Paragraph>
      <Paragraph position="6"> Doing this, however, would discard one of the major advantages of using LLR scores as a measure of word association. All the LLR scores for rare words tend to be small; thus we do not put too much confidence in any of the hypothesized word associations for such words. This is exactly the property needed to prevent rare source words from becoming garbage collectors. To maintain this property, for each source word we compute the sum of the  This is not the form in which the LLR statistic is usually presented, but it can easily be shown by basic algebra to be equivalent to [?]l in Dunning's paper. See Moore (2004) for details.</Paragraph>
      <Paragraph position="7"> LLR scores over all target words, but we then divide every LLR score by the single largest of these sums. Thus the source word with the highest LLR score sum receives a conditional probability distribution over target words summing to 1, but the corresponding distribution for every other source word sums to less than 1, reserving some probability mass for target words not seen with that word, with more probability mass being reserved the rarer the word.</Paragraph>
      <Paragraph position="8"> There is no guarantee, of course, that this is the optimal way of discounting the probabilities assigned to less frequent words. To allow a wider range of possibilities, we add one more parameter to the model by raising each LLR score to an empirically optimized exponent before summing the resulting scores and scaling them from 0 to 1 as described above. Choosing an exponent less than 1.0 decreases the degree to which low scores are discounted, and choosing an exponent greater than 1.0 increases degree of discounting.</Paragraph>
      <Paragraph position="9"> We still have to define an initialization of the translation probabilities for the null word. We cannot make use of LLR scores because the null word occurs in every source sentence, and any word occuring in every source sentence will have an LLR score of 0 with every target word, since p(t|s)= p(t) in that case. We could leave the distribution for the null word as the uniform distribution, but we know that a high proportion of the words that should align to the null word are frequently occuring function words. Hence we initialize the distribution for the null word to be the unigram distribution of target words, so that frequent function words will receive a higher probability of aligning to the null word than rare words, which tend to be content words that do have a translation. Finally, we also effectively add extra null words to every sentence in this heuristic model, by multiplying the null word probabilities by a constant, as described in Section 5.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="1" end_page="1" type="metho">
    <SectionTitle>
7 Training and Evaluation
</SectionTitle>
    <Paragraph position="0"> We trained and evaluated our various modifications to Model 1 on data from the bilingual word alignment workshop held at HLT-NAACL 2003 (Mihalcea and Pedersen, 2003). We used a subset of the Canadian Hansards bilingual corpus supplied for the workshop, comprising 500,000 English-French sentences pairs, including 37 sentence pairs designated as &amp;quot;trial&amp;quot; data, and 447 sentence pairs designated as test data. The trial and test data had been manually aligned at the word level, noting particular pairs of words either as &amp;quot;sure&amp;quot; or &amp;quot;possible&amp;quot; alignments, as described by Och and Ney (2003).</Paragraph>
    <Paragraph position="1"> To limit the number of translation probabilities that we had to store, we first computed LLR association scores for all bilingual word pairs with a positive association (p(t,s) &gt;p(t)*p(s)), and discarded from further consideration those with an LLR score of less that 0.9, which was chosen to be just low enough to retain all the &amp;quot;sure&amp;quot; word alignments in the trial data. This resulted in 13,285,942 possible word-to-word translation pairs (plus 66,406 possible null-word-to-word pairs).</Paragraph>
    <Paragraph position="2"> For most models, the word translation parameters are set automatically by EM. We trained each variation of each model for 20 iterations, which was enough in almost all cases to discern a clear minimum error on the 37 sentence pairs of trial data, and we chose as the preferred iteration the one with the lowest alignment error rate on the trial data. The other parameters of the various versions of Model 1 described in Sections 4-6 were optimized with respect to alignment error rate on the trial data using simple hill climbing. All the results we report for the 447 sentence pairs of test data use the parameter values set to their optimal values for the trial data.</Paragraph>
    <Paragraph position="3"> We report results for four principal versions of Model 1, trained using English as the source language and French as the target language: * The standard model is initialized using uniform distributions, and trained without smoothing using EM, for a number of iterations optimized on the trial data.</Paragraph>
    <Paragraph position="4"> * The smoothed model is like the standard model, but with optimized values of the null-word weight and add-n parameter.</Paragraph>
    <Paragraph position="5"> * The heuristic model simply uses the initial heuristic estimates of the translation parameter values, with an optimized LLR exponent and null-word weight, but no EM re-estimation.</Paragraph>
    <Paragraph position="6"> * The combined model initializes the translation parameter values with the heuristic estimates, using the LLR exponent and null-word weight from the optimal heuristic model, and applies EM using optimized values of the null-word weight and add-n parameters. The null-word weight used during EM is optimized separately from the null-word weight used in the initial heuristic parameter estimates.</Paragraph>
    <Paragraph position="7"> We also performed ablation experiments in which we ommitted each applicable modification in turn from each principal version of Model 1, to observe the effect on alignment error. All non-EM-trained parameters were re-optimized on the trial data for each version of Model 1 tested, with the exception  that the value of the LLR exponent and initial null-word weight in the combined model were carried over from the heuristic model.</Paragraph>
  </Section>
class="xml-element"></Paper>