<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1047">
  <Title>Decoding Algorithm in Statistical Machine Translation</Title>
  <Section position="4" start_page="366" end_page="368" type="metho">
    <SectionTitle>
2 Stack Decoding Algorithm
</SectionTitle>
    <Paragraph position="0"> Stack decoders are widely used in speech recognition systems. The basic algorithm can be described as following:  1. Initialize the stack with a null hypothesis. 2. Pop the hypothesis with the highest score off the stack, name it as current-hypothesis. 3. if current-hypothesis is a complete sentence, output it and terminate.</Paragraph>
    <Paragraph position="1"> 4. extend current-hypothesis by appending a word in the lexicon to its end. Compute the score of the new hypothesis and insert it into the stack. Do this for all the words in the lexicon. null 5. Go to 2.</Paragraph>
    <Section position="1" start_page="366" end_page="367" type="sub_section">
      <SectionTitle>
2.1 Scoring the hypotheses
</SectionTitle>
      <Paragraph position="0"> In stack search for statistical machine translation, a hypothesis H includes (a) the length l of the source sentence, and (b) the prefix words in the sentence. Thus a hypothesis can be written as H = l : ere2.. &amp;quot;ek, which postulates a source sentence of length l and its first k words. The score of H, fit, consists of two parts: the prefix score gH for ele2&amp;quot;&amp;quot; ek and the heuristic score hH for the part ek+lek+2&amp;quot;-et that is yet to be appended to H to complete the sentence.</Paragraph>
      <Paragraph position="1">  (3) can be used to assess a hypothesis. Although it was obtained from the alignment model, it would be easier for us to describe the scoring method if we interpret the last expression in the equation in the following way: each word el in the hypothesis contributes the amount e t(gj \[ ei)a(i l J, l, m) to the probability of the target sentence word gj. For each hypothesis H = l : el,e2,-&amp;quot;,ek, we use SH(j) to denote the probability mass for the target word gl contributed by the words in the hypothesis:</Paragraph>
      <Paragraph position="3"> Extending H with a new word will increase Sn(j),l &lt; j &lt; m.</Paragraph>
      <Paragraph position="4"> To make the score additive, the logarithm of the probability in (3) was used. So the prefix score contributed by the translation model is :~'\]~=0 log St/(j). Because our objective is to maximize P(e, g), we have to include as well the logarithm of the language model probability of the hypothesis in the score, therefore we have</Paragraph>
      <Paragraph position="6"> here N is the order of the ngram language model.</Paragraph>
      <Paragraph position="7"> The above g-score gH of a hypothesis H = l : ele?...ek can be calculated from the g-score of its</Paragraph>
      <Paragraph position="9"> A practical problem arises here. For a many early stage hypothesis P, Sp(j) is close to 0. This causes problems because it appears as a denominator in (5) and the argument of the log function when calculating gp. We dealt with this by either limiting the translation probability from the null word (Brown  et al., 1993) at the hypothetical 0-position(Brown et al., 1993) over a threshold during the EM training, or setting SHo (j) to a small probability 7r instead of 0 for the initial null hypothesis H0. Our experiments show that lr = 10 -4 gives the best result.</Paragraph>
      <Paragraph position="10">  To guarantee an optimal search result, the heuristic function must be an upper-bound of the score for all possible extensions ek+le/c+2...et(Nilsson, 1971) of a hypothesis. In other words, the benefit of extending a hypothesis should never be underestimated. Otherwise the search algorithm will conclude prematurely with a non-optimal hypothesis.</Paragraph>
      <Paragraph position="11"> On the other hand, if the heuristic function over-estimates the merit of extending a hypothesis too much, the search algorithm will waste a huge amount of time after it hits a correct result to safeguard the optimality.</Paragraph>
      <Paragraph position="12"> To estimate the language model score h LM of the unrealized part of a hypothesis, we used the negative of the language model perplexity PPtrain on the training data as the logarithm of the average probability of predicting a new word in the extension from a history. So we have</Paragraph>
      <Paragraph position="14"> Here is the motivation behind this. We assume that the perplexity on training data overestimates the likelihood of the forthcoming word string on average. However, when there are only a few words to be extended (k is close to 1), the language model probability for the words to be extended may be much higher than the average. This is why the constant term C was introduced in (6). When k &lt;&lt; l, -(l-k)PPtrain is the dominating term in (6), so the heuristic language model score is close to the average. This can avoid overestimating the score too much. As k is getting closer to l, the constant term C plays a more important role in (6) to avoid underestimating the language model score. In our experiments, we used C = PPtrain +log(Pmax), where Pm== is the maximum ngram probability in the language model.</Paragraph>
      <Paragraph position="15"> To estimate the translation model score, we introduce a variable va(j), the maximum contribution to the probability of the target sentence word gj from any possible source language words at any position between i and l:</Paragraph>
      <Paragraph position="17"> here LE is the English lexicon.</Paragraph>
      <Paragraph position="18"> Since vit (j) is independent of hypotheses, it only needs to be calculated once for a given target sentence. null When k &lt; 1, the heuristic function for the hypothesis H = 1 : ele2 -..e/c, is</Paragraph>
      <Paragraph position="20"> where log(v(k+l)t(j))- logSg(j)) is the maximum increasement that a new word can bring to the likelihood of the j-th target word.</Paragraph>
      <Paragraph position="21"> When k = l, since no words can be appended to the hypothesis, it is obvious that h~ = O.</Paragraph>
      <Paragraph position="22"> This heuristic function over-estimates the score of the upcoming words. Because of the constraints from language model and from the fact that a position in a source sentence cannot be occupied by two different words, normally the placement of words in those unfilled positions cannot maximize the likelihood of all the target words simultaneously.</Paragraph>
    </Section>
    <Section position="2" start_page="367" end_page="367" type="sub_section">
      <SectionTitle>
2.2 Pruning and aborting search
</SectionTitle>
      <Paragraph position="0"> Due to physical space limitation, we cannot keep all hypotheses alive. We set a constant M, and whenever the number of hypotheses exceeds M, the algorithm will prune the hypotheses with the lowest scores. In our experiments, M was set to 20,000.</Paragraph>
      <Paragraph position="1"> There is time limitation too. It is of little practical interest to keep a seemingly endless search alive too long. So we set a constant T, whenever the decoder extends more than T hypotheses, it will abort the search and register a failure. In our experiments, T was set to 6000, which roughly corresponded to 2 and half hours of search effort.</Paragraph>
    </Section>
    <Section position="3" start_page="367" end_page="368" type="sub_section">
      <SectionTitle>
2.3 Multi-Stack Search
</SectionTitle>
      <Paragraph position="0"> The above decoder has one problem: since the heuristic function overestimates the merit of extending a hypothesis, the decoder always prefers hypotheses of a long sentence, which have a better chance to maximize the likelihood of the target words. The decoder will extend the hypothesis with large I first, and their children will soon occupy the stack and push the hypotheses of a shorter source sentence out of the stack. If the source sentence is a short one, the decoder will never be able to find it, for the hypotheses leading to it have been pruned permanently.</Paragraph>
      <Paragraph position="1"> This &amp;quot;incomparable&amp;quot; problem was solved with multi-stack search(Magerman, 1994). A separate stack was used for each hypothesized source sentence length 1. We do compare hypotheses in different stacks in the following cases. First, we compare a complete sentence in a stack with the hypotheses in other stacks to safeguard the optimality of search result; Second, the top hypothesis in a stack is compared with that of another stack. If the difference is greater than a constant ~, then the less probable one will not be extended. This is called soft-pruning, since whenever the scores of the hypotheses in other stacks go down, this hypothesis may revive.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="368" end_page="369" type="metho">
    <SectionTitle>
3 Stack Search with a Simplified Model
</SectionTitle>
    <Paragraph position="0"> Model In the IBM translation model 2, the alignment parameters depend on the source and target sentence length I and m. While this is an accurate model, it causes the following difficulties: 1. there are too many parameters and therefore too few trainingdata per parameter. This may not be a problem when massive training data are available. However, in our application, this is a severe problem. Figure 1 plots the length distribution for the English and German sentences. When sentences get longer, there are fewer training data available.</Paragraph>
    <Paragraph position="1"> 2. the search algorithm has to make multiple hypotheses of different source sentence length. For each source sentence length, it searches through almost the same prefix words and finally settles on a sentence length. This is a very time consuming process and makes the decoder very inefficient.</Paragraph>
    <Paragraph position="2"> To solve the first problem, we adjusted the count for the parameter a(i \[ j, l, m) in the EM parameter estimation by adding to it the counts for the parameters a(i l j, l', m'), assuming (l, m) and (1', m') are close enough. The closeness were measured in</Paragraph>
    <Paragraph position="4"> .... , ....... .......</Paragraph>
    <Paragraph position="5"> .... . ....... .......</Paragraph>
    <Paragraph position="6"> -:,&amp;quot; ....... ....... :,'' ...... ....... ....... ...</Paragraph>
    <Paragraph position="7"> ...# ....... ~...~..# ....... #..~ .:.. ....... C/..~...~..~...~ ....... ...~ 1' 1  source/target sentence length. The dark dot at the intersection (l, m) corresponds to the set of counts for the alignment parameters a(. \[ o,l, m) in the EM estimation. The adjusted counts are the sum of the counts in the neighboring sets residing inside the circle centered at (1, m) with radius r. We took r = 3 in our experiment.</Paragraph>
    <Paragraph position="8"> Euclidean distance (Figure 2). So we have</Paragraph>
    <Paragraph position="10"> where ~(i I J, l, m) is the adjusted count for the parameter a(i I J, 1, m), c(i I J, l, m; e, g) is the expected count for a(i I J, l, m) from a paired sentence (e g), and c(ilj, l,m;e,g) = 0 when lel l, or Igl C/ m, or i &gt; l, or j &gt; m.</Paragraph>
    <Paragraph position="11"> Although (9) can moderate the severity of the first data sparse problem, it does not ease the second inefficiency problem at all. We thus made a radical change to (9) by removing the precondition that (l, m) and (l', m') must be close enough. This results in a simplified translation model, in which the alignment parameters are independent of the sentence length 1 and m:</Paragraph>
    <Paragraph position="13"> here i,j &lt; Lm, and L,n is the maximum sentence length allowed in the translation system. A slight change to the EM algorithm was made to estimate the parameters.</Paragraph>
    <Paragraph position="14"> There is a problem with this model: given a sentence pair g and e, when the length of e is smaller than Lm, then the alignment parameters do not sum</Paragraph>
    <Paragraph position="16"> We deal with this problem by padding e to length Lm with dummy words that never gives rise to any word in the target of the channel.</Paragraph>
    <Paragraph position="17"> Since the parameters are independent of the source sentence length, we do not have to make an  assumption about the length in a hypothesis. Whenever a hypothesis ends with the sentence end symbol &lt;/s&gt; and its score is the highest, the decoder reports it as the search result. In this case, a hypothesis can be expressed as H = el,e2,...,ek, and IHI is used to denote the length of the sentence prefix of the hypothesis H, in this case, k.</Paragraph>
    <Section position="1" start_page="369" end_page="369" type="sub_section">
      <SectionTitle>
3.1 Heuristics
</SectionTitle>
      <Paragraph position="0"> Since we do not make assumption of the source sentence length, the heuristics described above can no longer be applied. Instead, we used the following</Paragraph>
      <Paragraph position="2"> here h~ is the heuristics for the hypothesis that extend H with n more words to complete the source sentence (thus the final source sentence length is \[H\[ + n.) Pp(x \[ y) is the eoisson distribution of the source sentence length conditioned on the target sentence length. It is used to calculate the mean of the heuristics over all possible source sentence length, m is the target sentence length. The parameters of the Poisson distributions can be estimated from training data.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="369" end_page="369" type="metho">
    <SectionTitle>
4 Implementation
</SectionTitle>
    <Paragraph position="0"> Due to historical reasons, stack search got its current name. Unfortunately, the requirement for search states organization is far beyond what a stack and its push pop operations can handle. What we really need is a dynamic set which supports the following operations:  1. INSERT: to insert a new hypothesis into the set.</Paragraph>
    <Paragraph position="1"> 2. DELETE: to delete a state in hard pruning. 3. MAXIMUM: to find the state with the best score to extend.</Paragraph>
    <Paragraph position="2"> 4. MINIMUM: to find the state to be pruned.  We used the Red-Black tree data structure (Cormen, Leiserson, and Rivest, 1990) to implement the dynamic set, which guarantees that the above operations take O(log n) time in the worst case, where n is the number of search states in the set.</Paragraph>
  </Section>
class="xml-element"></Paper>