<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1030">
  <Title>Reordering Constraints for Phrase-Based Statistical Machine Translation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Alignment Template Approach
</SectionTitle>
    <Paragraph position="0"> In this section, we give a brief description of the translation system, namely the alignment template approach. The key elements of this translation approach (Och et al., 1999) are the alignment templates. These are pairs of source andtargetlanguagephraseswithanalignment within the phrases. The alignment templates are build at the level of word classes. This improves the generalization capability of the alignment templates.</Paragraph>
    <Paragraph position="1"> We use maximum entropy to train the model scaling factors (Och and Ney, 2002).</Paragraph>
    <Paragraph position="2"> As feature functions we use a phrase translationmodel as wellas awordtranslation model. Additionally, we use two language model feature functions: a word-based trigram model and a class-based five-gram model. Furthermore, we use two heuristics, namely the word penalty and the alignment template penalty.</Paragraph>
    <Paragraph position="3"> To model the alignment template reorderings, we use a feature function that penalizes re-orderings linear in the jump width.</Paragraph>
    <Paragraph position="4"> A dynamic programming beam search algorithm is used to generate the translation hypothesis with maximum probability. This search algorithm allows for arbitrary reorderings at the level of alignment templates. Within the alignment templates, the reordering is learned in training and kept fix during the search process. There are no constraints on the reorderings within the alignment templates. null This is only a brief description of the alignment template approach. For further details, see (Och et al., 1999; Och and Ney, 2002).</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Reordering Constraints
</SectionTitle>
    <Paragraph position="0"> Although unconstrained reordering looks perfect from a theoretical point of view, we find that in practice constrained reordering shows J uncovered positioncovered position uncovered position for extension  with k = 3, i.e. up to three positions may be skipped.</Paragraph>
    <Paragraph position="1"> better performance. The possible advantages of reordering constraints are: 1. The search problem is simplified. As a result there are fewer search errors.</Paragraph>
    <Paragraph position="2"> 2. Unconstrained reordering is only helpful if we are able to estimate the reordering probabilities reliably, which is unfortunately not the case.</Paragraph>
    <Paragraph position="3"> In this section, we will describe two variants of reordering constraints. The first constraints are based on the IBM constraints for single-word based translation models. The second constraints are based on ITGs. In the following, we will use the term &amp;quot;phrase&amp;quot; to mean either a sequence of words or a sequence of word classes as used in the alignment templates.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 IBM Constraints
</SectionTitle>
      <Paragraph position="0"> In this section, we describe restrictions on the phrase reordering in spirit of the IBM constraints (Berger et al., 1996).</Paragraph>
      <Paragraph position="1"> First, we briefly review the IBM constraints at the word level. The target sentence is produced word by word. We keep a coverage vector to mark the already translated (covered) source positions. The next target word has to be the translation of one of the first k uncovered, i.e. not translated, source positions. The IBM constraints are illustrated in Figure 1.</Paragraph>
      <Paragraph position="2"> For further details see e.g. (Tillmann and Ney, 2003).</Paragraph>
      <Paragraph position="3"> For the phrase-based translation approach, we use the same idea. The target sentence is produced phrase by phrase. Now, we allow skipping of up to k phrases. If we set k = 0, we obtain a search that is monotone at the phrase level as a special case.</Paragraph>
      <Paragraph position="4"> The search problem can be solved using dynamic programming. We define a auxiliary function Q(j;S;e). Here, the source position j is the first unprocessed source position; with unprocessed, we mean this source position is neither translated nor skipped. We use the set S = f(jn;ln)jn = 1;:::;Ng to keep track of the skipped source phrases with lengths ln and starting positions jn. We show the formulae for a bigram language model and use the target language word e to keep track of the language model history. The symbol $ is used to mark the sentence start and the sentence end. The extension to higher-order n-gram language models is straightforward. We use M to denote the maximum phrase length in the source language. We obtain the following dynamic programming equations:</Paragraph>
      <Paragraph position="6"> In the recursion step, we have distinguished three cases: in the first case, we translate the next source phrase. This is the same expansion that is done in monotone search. In the second case, we translate a previously skipped phrase and in the third case we skip a source phrase. For notational convenience, we have omitted one constraint in the preceding equations: the final word of the target phrase ~e is the new language model state e (using a bi-gram language model).</Paragraph>
      <Paragraph position="7"> Now, we analyze the complexity of this algorithm. Let E denote the vocabulary size of the target language and let ~E denote the maximum number of phrase translation candidates for a given source phrase. Then, JC/(JC/M)kC/E is an upper bound for the size of the Q-table.</Paragraph>
      <Paragraph position="8"> Once we have fixed a specific element of this table, the maximization steps can be done in O(E C/ ~E C/ (M + k ! 1) + (k ! 1)). Therefore, the complexity of this algorithm is in O(JC/(JC/M)kC/EC/(EC/ ~EC/(M+k!1)+(k!1))).</Paragraph>
      <Paragraph position="9"> Assuming k &lt; M, this can be simplified to  inverted concatenation of two consecutive blocks.</Paragraph>
      <Paragraph position="10"> setting k = 0 results in a search algorithm that is monotone at the phrase level.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 ITG Constraints
</SectionTitle>
      <Paragraph position="0"> In this section, we describe the ITG constraints (Wu, 1995; Wu, 1997). Here, we interpret theinput sentenceasasequenceof blocks.</Paragraph>
      <Paragraph position="1"> In the beginning, each alignment template is a block of its own. Then, the reordering process can be interpreted as follows: we select two consecutive blocks and merge them to a single block by choosing between two options: either keep the target phrases in monotone order or invert the order. This idea is illustrated in Figure 2. The dark boxes represent the two blocks to be merged. Once two blocks are merged, they are treated as a single block and they can be only merged further as a whole. It is not allowed to merge one of the subblocks again.</Paragraph>
      <Paragraph position="2">  The ITG constraints allow for a polynomial-time search algorithm. It is based on the following dynamic programming recursion equations. During the search a table Qjl;jr;eb;et is constructed. Here, Qjl;jr;eb;et denotes the probability of the best hypothesis translating the source words from position jl (left) to position jr (right) which begins with the target language word eb (bottom) and ends with the word et (top). This is illustrated in Figure 3.</Paragraph>
      <Paragraph position="3"> The initialization is done with the phrase-based model described in Section 2. We introduce a new parameter pm (m^= monotone), which denotes the probability of a monotone combination of two partial hypotheses. Here, we formulate the recursion equation for a bi-gram language model, but of course, the same method can also be applied for a trigram lan-</Paragraph>
      <Paragraph position="5"/>
      <Paragraph position="7"> The resulting algorithm is similar to the CYKparsing algorithm. It has a worst-case complexity of O(J3 C/E4). Here, J is the length of the source sentence and E is the vocabulary size of the target language.</Paragraph>
      <Paragraph position="8">  For the ITG constraints a dynamic programming search algorithm exists as described in the previous section. It would be more practical with respect to language model recombination to have an algorithm that generates the target sentence word by word or phrase by phrase. The idea is to start with the beam search decoder for unconstrained search and modify it in such a way that it will produce only reorderings that do not violate the ITG constraints. Now, we describe one way to obtain such a decoder. It has been pointed out in (Zens and Ney, 2003) that the ITG constraints can be characterized as follows: a re-ordering violates the ITG constraints if and only if it contains (3;1;4;2) or (2;4;1;3) as a subsequence. This means, if we select four columns and the corresponding rows from the alignment matrix and we obtain one of the two patternsillustrated inFigure4, thisreordering cannot be generated with the ITG constraints.</Paragraph>
      <Paragraph position="9"> Now, we have to modify the beam search decoder such that it cannot produce these two patterns. We implement this in the following way. During the search, we have a coverage vector cov of the source sentence available for each partial hypothesis. A coverage vec- null patterns that violate the ITG constraints.</Paragraph>
      <Paragraph position="10"> tor is a binary vector marking the source sentence words that have already been translated (covered). Additionally, we know the current source sentence position jc and a candidate source sentence position jn to be translated next.</Paragraph>
      <Paragraph position="11"> To avoid the patterns in Figure 4, we have to constrain the placement of the third phrase, because once we have placed the first three phrases we also have determined the position of the fourth phrase as the remaining uncovered position. Thus, we check the following constraints:</Paragraph>
      <Paragraph position="13"> The constraints in Equations 1 and 2 enforce the following: imagine, we traverse the coverage vector cov from the current position jc to the position to be translated next jn. Then, it is not allowed to move from an uncovered position to a covered one.</Paragraph>
      <Paragraph position="14"> Now, we sketch the proof that these constraints are equivalent to the ITG constraints. It is easy to see that the constraint in Equation 1 avoids the pattern on the left-hand side in Figure 4. To be precise: after placing the first two phrases at (b,1) and (d,2), it avoids the placement of the third phrase at (a,3).</Paragraph>
      <Paragraph position="15"> Similarly, the constraint in Equation 2 avoid the pattern on the right-hand side in Figure 4. Therefore, if we enforce the constraints in Equation 1 and Equation 2, we cannot violate the ITG constraints.</Paragraph>
      <Paragraph position="16"> We still have to show that we can generate all the reorderings that do not violate the ITG constraints. Equivalently, we show that any reordering that violates the constraints in Equation 1 or Equation 2 will also violate the ITG constraints. It is rather easy to see that any reordering that violates the constraint in  Equation 1 will generate the pattern on the left-hand side in Figure 4. The conditions to violate Equation 1 are the following: the new candidate position jn is to the left of the current position jc, e.g. positions (a) and (d). Somewhere in between there has to be an covered position j whose successor position j +1 is uncovered, e.g. (b) and (c). Therefore, any reordering that violates Equation 1 generates the pattern on the left-hand side in Figure 4, thus it violates the ITG constraints.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>