<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1019">
  <Title>Minimum Bayes-Risk Word Alignments of Bilingual Texts</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Word-to-Word Bitext Alignment
</SectionTitle>
    <Paragraph position="0"> We will study the problem of aligning an English sentence to a French sentence and we will use the word alignment of the IBM statistical translation models (Brown et al., 1993).</Paragraph>
    <Paragraph position="1"> Let and denote a pair of translated English and French sentences. An English word is defined as an ordered pair , where the index refers to the position of the word in the English sentence; is the vocabulary of English; and the word at position is the NULL word to which &amp;quot;spurious&amp;quot; French words may be aligned. Similarly, a French word is written as .</Paragraph>
    <Paragraph position="2"> An alignment between and is defined to be a sequence where . Under the alignment , the French word is connected to the English word . For every alignment , we define a link set defined as whose elements are given by the alignment links .</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Alignment Loss Functions
</SectionTitle>
    <Paragraph position="0"> In this section we introduce loss functions to measure the quality of automatically produced alignments. Suppose we wish to compare an automatically produced alignment to a reference alignment , which we assume was produced by a competent translator. We will define various loss functions that measure the quality of relative to through their link sets and .</Paragraph>
    <Paragraph position="1"> The desirable qualities in translation are fluency Association for Computational Linguistics.</Paragraph>
    <Paragraph position="2"> Language Processing (EMNLP), Philadelphia, July 2002, pp. 140-147. Proceedings of the Conference on Empirical Methods in Natural and adequacy. We assume here that both word sequences are fluent and adequate translations but that the word and phrase correspondences are unknown.</Paragraph>
    <Paragraph position="3"> It is these correspondences that we wish to determine and evaluate automatically.</Paragraph>
    <Paragraph position="4"> We now present two general classes of loss functions that measure alignment quality. In subsequent sections, we will give specific examples of these and show how to construct decoders that are optimized for each loss function.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Alignment Error
</SectionTitle>
      <Paragraph position="0"> The Alignment Error Rate (AER) introduced by Och and Ney (2000b) measures the fraction of links by which the automatic alignment differs from the reference alignment. Links to the NULL word are ignored. This is done by defining modified link sets for the reference alignment and the automatic alignment .</Paragraph>
      <Paragraph position="1"> The reference annotation procedure allowed the human transcribers to identify which links in they judged to be unambiguous. In addition to the reference alignment, this gives a set of sure links (S) which is a subset of .</Paragraph>
      <Paragraph position="2"> AER is defined as (Och and Ney, 2000b)</Paragraph>
      <Paragraph position="4"> Since our modeling techniques require loss functions rather than error rates, we introduce the Alignment Error loss function (2) We consider error rates to be &amp;quot;normalized&amp;quot; loss functions. We also note that, unlike AER, does not distinguish between ambiguous and unambiguous links. However, if a decoder generates an alignment for which is zero, the AER is also zero. Therefore if AER is the metric of interest, we will design alignment procedures to minimize .</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Generalized Alignment Error
</SectionTitle>
      <Paragraph position="0"> We are interested in extending the Alignment Error loss function to incorporate various linguistic features into the measurement of alignment quality.</Paragraph>
      <Paragraph position="1"> The Generalized Alignment Error loss is defined as</Paragraph>
      <Paragraph position="3"> Here we have introduced the word-to-word distance measure which compares the links and as a function of the words in the translation. refers to all loss functions that have the form of Equation 3. Specific loss functions are determined through the choice of . To see the value in this, suppose is a verb in the French sentence and that it is aligned in the reference alignment to , the verb in the English sentence. If our goal is to ensure verb alignment, then can be constructed to penalize any link in the automatic alignment in which is not a verb. We will later give examples of distances in which is based on Part-of-Speech (POS) tags, parse tree distances, and automatically determined word clusters. We note that the can almost be reduced to , except for the treatment of NULL in the English sentence.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Minimum Bayes-Risk Decoding For
Automatic Word Alignment
</SectionTitle>
    <Paragraph position="0"> We present the Minimum Bayes-Risk alignment formulation and derive MBR alignment procedures under the loss functions of Section 3.</Paragraph>
    <Paragraph position="1"> Given a translated pair of English-French sentences , the decoder produces an alignment . Relative to a reference alignment , the decoder performance is measured as . Our goal is to find the decoder that has the best performance over all translated sentences. This is measured through the Bayes Risk . The expectation is taken with respect to the true distribution that describes &amp;quot;human quality&amp;quot; alignments of translations as they are found in bitext. Given a loss function and a probability distribution, it is well known that the decision rule which minimizes the Bayes Risk is given by the following expression (Bickel and Doksum, 1977; Goel and Byrne, 2000).</Paragraph>
    <Paragraph position="2"> (5) Several modeling assumptions have been made to obtain this form of the decoder. We do not have access to the true distribution over translations. We therefore use statistical MT models to approximate . We furthermore assume that the space of alignment alternatives can be restricted to an alignment lattice , which is a compact representation of the most likely word alignments of the sentence pair under the baseline models.</Paragraph>
    <Paragraph position="3"> It is clear from Equation 5 that the MBR decoder is determined by the loss function. The Sentence Alignment Error refers to the loss function that gives a penalty of 1 for any errorful alignment: , where is the indicator function of the set . The MBR decoder under this loss can easily be seen to be the Maximum Likelihood (ML) alignment under the MT models: . This illustrates why we are interested in MBR decoders based on other loss functions: the ML decoder is optimal with respect to a loss function that is overly harsh. It does not distinguish between different types of alignment errors and good alignments receive the same penalty as poor alignments. Moreover, such a harsh penalty is particularly inappropriate when unambiguous word-to-word alignments cannot be provided in all cases even by human translators who produce the reference alignments. The AER makes an explicit distinction between ambiguous and unambiguous word alignments. Ideally, the decoder should be able to do so as well. Motivated by this, the MBR hypothesis can be thought of as the consensus hypothesis under a particular loss function: Equation 5 selects the hypothesis that is, in an average sense, close to the other likely hypotheses. In this way, ambiguity can be reduced by selecting the hypothesis that is &amp;quot;most similar&amp;quot; to the collection of most likely competing hypotheses.</Paragraph>
    <Paragraph position="4"> We now describe the alignment lattice (Section 4.1) and introduce the lattice based probabilities required for the MBR alignment (Section 4.2). The derivation of the MBR alignment under the AE and GAE loss functions is presented in Sections 4.3 and 4.4.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Alignment Lattice
</SectionTitle>
      <Paragraph position="0"> The lattice is represented as a Weighted Finite State Transducer (WFST) (Mohri et al., 2000) with a finite set of states , a set of transition labels , an initial state , the set of final states , and a finite set of transitions . A transition in this WFST is given by where is the starting state, is the ending state, is the alignment link and is the weight. For an English sentence of length and a French sentence of length , we define as .</Paragraph>
      <Paragraph position="1"> A complete path through the WFST is a sequence of transitions given by such that and . Each complete path defines an alignment link set .</Paragraph>
      <Paragraph position="2"> When we write , we mean that is derived from a complete path through . This allows us to use alignment models in which the probability of an alignment can be written as a sum over alignment link weights, i.e. .</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Alignment Link Posterior Probability
</SectionTitle>
      <Paragraph position="0"> We first introduce the lattice transition posterior probability of each transition in the</Paragraph>
      <Paragraph position="2"> where is if and otherwise. The lattice transition posterior probability is the sum of the posterior probabilities of all lattice paths passing through the transition . This can be computed very efficiently with a forward-backward algorithm on the alignment lattice (Wessel et al., 1998). is the posterior probability of an alignment link set which can be written as</Paragraph>
      <Paragraph position="4"> We now define the alignment link posterior probability for a link (8) where . This is the probability that any two words are aligned given all the alignments in the lattice .</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 MBR Alignment Under
</SectionTitle>
      <Paragraph position="0"> In this section we derive MBR alignment under the Alignment Error loss function (Equation 2). The optimal decoder has the form (Equation 5) (9) The summation is equal to If is the subset of transitions ( ) that do not contain links with the NULL word, we can simplify the bracketed term as</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 MBR Alignment Under
</SectionTitle>
      <Paragraph position="0"> We now derive MBR alignment under the Generalized Alignment Error loss function (Equation 3). The optimal decoder has the form (Equation 5) (12) The summation can be rewritten as where and .</Paragraph>
      <Paragraph position="1"> We can simplify the bracketed term as where and .</Paragraph>
      <Paragraph position="2"> The MBR alignment (Equation 12) can be found in terms of the modified link weight for each align-</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.5 MBR Alignment Using WFST Techniques
</SectionTitle>
      <Paragraph position="0"> The MBR alignment procedures under the and loss functions begin with a WFST that contains the alignment probabilities as described in Section 4.1. To build the MBR decoder for each loss function the weights on the transitions ( ) of the WFST are modified according to either Equation 11 ( ) or Equation 13 ( ). Once the weights are modified, the search procedure for the MBR alignment is the same in each case. The search is carried out using a shortest-path algorithm (Mohri et al., 2000).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Word Alignment Experiments
</SectionTitle>
    <Paragraph position="0"> We present here examples of Generalized Alignment Error loss functions based on three types of linguistic features and show how they can be incorporated into a statistical MT system to obtain automatic alignments.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Syntactic Distances From Parse-Trees
</SectionTitle>
      <Paragraph position="0"> Suppose a parser is available that generates a parse-tree for the English sentence. Our goal is to construct an alignment loss function that incorporates features from the parse. One way to do this is to define a graph distance (14) Here and are the parse-tree leaf nodes corresponding to the English words and . This quantity is computed as the sum of the distances from each node to their closest common ancestor. It gives a syntactic distance between any pair of English words based on the parse-tree. This distance has been used to measure word association for information retrieval (Mittendorfer and Winiwarter, 2001). It reflects how strongly the words and are bound together by the syntactic structure of the English sentence as determined by the parser. Figure 1 shows the parse tree for an English sentence in the test data with the pairwise syntactic distances between the English words corresponding to the leaf nodes.</Paragraph>
      <Paragraph position="2"> pairwise syntactic distances between words.</Paragraph>
      <Paragraph position="3"> To obtain these distances, Ratnaparkhi's part-of-speech (POS) tagger (Ratnaparkhi, 1996) and Collins' parser (Collins, 1999) were used to obtain parse trees for the English side of the test corpus.</Paragraph>
      <Paragraph position="4"> With defined as in Equation 14, the Generalized</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Distances Derived From Part-of-Speech Labels
</SectionTitle>
      <Paragraph position="0"> Suppose a Part-of-Speech(POS) tagger is available to tag each word in the English sentence. If POS denotes the POS of the English word , we can define the word-to-word distance measure (Equation 4) as</Paragraph>
      <Paragraph position="2"> was used to obtain POS tags for each word in the English sentence. With specified by Equation 15, the Generalized Alignment Error loss function (Equation 3) is called the Part-Of-Speech Distance ( ).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Automatic Word Cluster Distances
</SectionTitle>
      <Paragraph position="0"> Suppose we are working in a language for which parsers and POS taggers are not available. In this situation we might wish to construct the loss functions based on word classes determined by automatic clustering procedures. If specifies the word cluster for the English word , then we define the distance (16) In our experiments we obtained word clusters for English words using a statistical learning procedure (Kneser and Ney, 1991) where the total number of word classes is restricted to be 100. With as defined in Equation 16, the Generalized Alignment Error loss function (Equation 3) is called the Auto-</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 IBM-3 Word Alignment Models
</SectionTitle>
      <Paragraph position="0"> Since the true distribution over alignments is not known, we used the IBM-3 statistical translation model (Brown et al., 1993) to approximate . This model is specified through four components: Fertility probabilities for words; Fertility probabilities for NULL; Word Translation probabilities; and Distortion probabilities. We used a modified version of the IBM-3 distortion model (Knight and Al-Onaizan, 1998) in which each of the possible permutations of the French sentence is equally likely. The IBM-3 models were trained on a subset of the Canadian Hansards French-English data which consisted of 50,000 parallel sentences (Och and Ney, 2000b). The vocabulary size was 18,499 for English and 24,198 for French. The GIZA++ toolkit (Och and Ney, 2000a) was used for training the IBM-3 models (as in (Och and Ney, 2000b)).</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.5 Word Alignment Lattice Generation
</SectionTitle>
      <Paragraph position="0"> We obtained word alignments under the modified IBM-3 models using the finite state translation framework introduced by Knight and Al-Onaizan (1998). The finite state operations were carried out using the AT&amp;T Finite State Machine Toolkit (Mohri et al., 2001; Mohri et al., 2000).</Paragraph>
      <Paragraph position="1"> The WFST framework involves building a transducer for each constituent of the IBM-3 Alignment Models: the word fertility model ; the NULL fertility model ; and the word translation model (Section 5.4). For each sentence pair we also built a finite state acceptor that accepts the English sentence and another acceptor which accepts all legal permutations of the French sentence. The alignment lattice for the sentence pair was then obtained by the following weighted finite state composition . In practice, the WFST obtained by the composition was pruned to a maximum of 10,000 states using a likelihood based pruning operation. In terms of AT&amp;T Finite State Toolkit shell commands, these operations are given as: fsmcompose E M fsmcompose - N fsmcompose - T fsmcompose - F fsmprune -n 10000 The finite state composition and pruning were performed using lazy implementations of algorithms provided in AT&amp;T Finite State libraries (Mohri et al., 2000). This made the computation efficient because even though five WFSTs are composed into a potentially huge transducer, only a small portion of it is actually searched during the pruning used to generate the final lattice.</Paragraph>
      <Paragraph position="2"> A heavily pruned alignment lattice for a sentence-pair from the test data is shown in Figure 2. For clarity of presentation, each alignment link in the lattice is shown as an ordered pair where and are the English and French words on the link. For each sentence, we also computed the lattice path with the highest probability . This gives the ML alignment under the statistical MT models that will give our baseline performance under the various loss functions.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.6 Performance Under The Alignment Error
Rates
</SectionTitle>
      <Paragraph position="0"> Our unseen test data consisted of 207 French-English sentence pairs from the Hansards corpus (Och and Ney, 2000b). These sentence pairs had at most 16 words in the French sentence; this restriction on the sentence length was necessary to control the memory requirements of the composition.</Paragraph>
      <Paragraph position="1">  In the previous sections we introduced a total of four loss functions: , , and . Using either Equation 11 or 13, an MBR decoder can be constructed for each. These decoders are called MBR-AE, MBR-PTSD, MBR-POSD, and MBR-AWCD, respectively.</Paragraph>
      <Paragraph position="2">  The performance of the four decoders was measured with respect to the alignments provided by human experts (Och and Ney, 2000b). The first evaluation metric used was the Alignment Error Rate (Equation 1). We also evaluated each decoder under the Generalized Alignment Error Rates (GAER). These are defined as: (17) There are six variants of GAER. These arise when is specified by , or . There are two versions of each of these: one version is sensitive only to sure (S) links. The other version considers all (A) links in the reference alignment. We therefore have the following six Generalized Alignment Error Rates: PTSD-S, POSD-S, AWCD-S, and PTSD-A, POSD-A, AWCD-A. We say we have a matched condition when the same loss function is used in both the error rate and the decoder design.  The performance of the decoders under various loss functions is given in Table 1. We observe that in none of these experiments was the ML decoder found to be optimal. In all instances, the MBR decoder tuned for each loss function was the best performing decoder under the corresponding error rate. In particular, we note that alignment performance as measured under the AER metric can be improved by using MBR instead of ML alignment. This demonstrates the value of finding decoding procedures matched to the performance criterion of interest. null We observe some affinity among the loss functions. In particular, the ML decoder performs better under the AER than any of the MBR-GAE decoders. This is because the loss, for which the ML decoder is optimal, is closer to the loss than any of the loss functions. The NULL symbol is treated quite differently under and , and this leads to a large mismatch between the MBR-GAE decoders and the AER metric. Similarly, the performance of the MBR-POS decoder degrades significantly under the AWCD-S and AWCD-A metrics. Since there are more word clusters (100) than POS tags (55), the MBR-POS decoder is therefore incapable of producing hypotheses that can match the word clusters used in the AWCD metrics.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>