<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0812"> <Title>Improved HMM Alignment Models for Languages with Scarce Resources</Title> <Section position="3" start_page="0" end_page="84" type="intro"> <SectionTitle> 2 HMMs and Word Alignment </SectionTitle> <Paragraph position="0"> The objective of word alignment is to discover the word-to-word translational correspondences in a bilingual corpus of S sentence pairs, which we denote {(f^(s), e^(s)) : s ∈ [1, S]}. Each sentence pair (f, e) = (f_1^M, e_1^N) consists of a sentence f in one language and its translation e in the other, with lengths M and N, respectively. By convention we refer to e as the English sentence and f as the French sentence. Correspondences in a sentence pair are represented by a set of links between words. A link (f_j, e_i) denotes a correspondence between the i-th word e_i of e and the j-th word f_j of f.</Paragraph> <Paragraph position="1"> Many alignment models arise from the conditional distribution P(f|e). We can decompose this by introducing the hidden alignment variable a = a_1^M. Each element of a takes on a value in the range [1, N]. The value of a_i determines a link between the i-th French word f_i and the a_i-th English word e_{a_i}. This representation introduces an asymmetry into the model because it constrains each French word to correspond to exactly one English word, while each English word is permitted to correspond to an arbitrary number of French words. Although the resulting set of links may still be relatively accurate, we can symmetrize by combining it with the set produced by applying the complementary model P(e|f) to the same data (Och and Ney, 2000b). Making a few independence assumptions, we arrive at the decomposition in Equation 1.[1]</Paragraph> <Paragraph position="2"> P(f|e) = Σ_{a_1^M} Π_{i=1}^{M} d(a_i | a_{i−1}) · t(f_i | e_{a_i}) (1) </Paragraph> <Paragraph position="3"> We refer to d(a_i | a_{i−1}) as the distortion model and t(f_i | e_{a_i}) as the translation model. 
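As a concrete illustration, the decomposition P(f|e) = Σ_a Π_i d(a_i | a_{i−1}) · t(f_i | e_{a_i}) can be evaluated in O(M·N²) with the standard HMM forward algorithm rather than by enumerating all N^M alignments. The following is a minimal sketch with toy uniform parameters; it is illustrative only, not the authors' implementation.

```python
def alignment_likelihood(f, e, d, t):
    """Compute P(f|e) = sum over alignments a of
    prod_i d(a_i | a_{i-1}) * t(f_i | e_{a_i})
    using the HMM forward algorithm in O(M * N^2)."""
    M, N = len(f), len(e)
    # alpha[j]: total probability of the prefix f_1..f_i with a_i = j
    # d(j, None) stands in for the (omitted) start distribution
    alpha = [d(j, None) * t(f[0], e[j]) for j in range(N)]
    for i in range(1, M):
        alpha = [sum(alpha[k] * d(j, k) for k in range(N)) * t(f[i], e[j])
                 for j in range(N)]
    return sum(alpha)

# Toy uniform parameters, purely illustrative:
d = lambda j, k: 0.5      # uniform distortion over N = 2 English positions
t = lambda fw, ew: 1 / 3  # uniform translation over a 3-word French vocabulary
p = alignment_likelihood(["le", "chat"], ["the", "cat"], d, t)  # -> 1/9
```

In practice the same trellis supports the forward-backward recursions needed for EM parameter estimation and the Viterbi recursion for extracting the best alignment.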
Conveniently, Equation 1 is in the form of an HMM, so we can apply standard algorithms for HMM parameter estimation and maximization.</Paragraph> <Paragraph position="4"> This approach was proposed in Vogel et al. (1996) and subsequently improved (Och and Ney, 2000a; Toutanova et al., 2002).</Paragraph> <Section position="1" start_page="83" end_page="83" type="sub_section"> <SectionTitle> 2.1 The Tree Distortion Model </SectionTitle> <Paragraph position="0"> Equation 1 is adequate in practice, but we can improve it. Numerous parameterizations have been proposed for the distortion model. In our surface distortion model, it depends only on the distance a_i − a_{i−1} and an automatically determined word class C(e_{a_{i−1}}), as shown in Equation 2. It is similar to the model of Och and Ney (2000a). The word class C(e_{a_{i−1}}) is assigned using an unsupervised approach (Och, 1999).</Paragraph> <Paragraph position="1"> d(a_i | a_{i−1}) = p(a_i − a_{i−1} | C(e_{a_{i−1}})) (2) </Paragraph> <Paragraph position="2"> The surface distortion model can capture local movement, but it cannot capture movement of structures or the behavior of long-distance dependencies across translations. The intuitive appeal of capturing richer information has inspired numerous alignment models (Wu, 1995; Yamada and Knight, 2001; Cherry and Lin, 2003). However, we would like to retain the simplicity and good performance of the HMM model.</Paragraph> <Paragraph position="3"> We introduce a distortion model which depends on the tree distance t(e_i, e_k) = (w, x, y) between each pair of English words e_i and e_k. Given a dependency parse of e_1^N, w and x represent the respective number of dependency links separating e_i and e_k from their closest common ancestor node in the parse tree.[2] The final element y = {1 if i > k; 0 otherwise} is simply a binary indicator of the linear relationship of the two words within the surface string. Tree distance is illustrated in Figure 2.</Paragraph> <Paragraph position="4"> [Footnote 1: We ignore the sentence length probability p(M|N), which is not relevant to word alignment. We also omit discussion of HMM start and stop probabilities, and normalization of t(f_i | e_{a_i}), although we find in practice that attention to these details can be beneficial.]</Paragraph> <Paragraph position="5"> [Figure 1 caption fragment: "... the Romanian-English development set."]</Paragraph> <Paragraph position="7"> In our tree distortion model, we condition on the tree distance and the part of speech T(e_{a_{i−1}}), giving us Equation 3.</Paragraph> <Paragraph position="8"> d(a_i | a_{i−1}) = p(t(e_{a_i}, e_{a_{i−1}}) | T(e_{a_{i−1}})) (3) </Paragraph> <Paragraph position="9"> Since both the surface distortion and tree distortion models represent p(a_i | a_{i−1}), we can combine them using linear interpolation, as in Equation 4.</Paragraph> <Paragraph position="10"> p(a_i | a_{i−1}) = λ_{C,T} · p_surface(a_i | a_{i−1}) + (1 − λ_{C,T}) · p_tree(a_i | a_{i−1}) (4) </Paragraph> <Paragraph position="11"> The λ_{C,T} parameters can be initialized from a uniform distribution and trained with the other parameters using EM. In principle, any number of alternative distortion models could be combined within this framework.</Paragraph> </Section> <Section position="2" start_page="83" end_page="84" type="sub_section"> <SectionTitle> 2.2 Improving Initialization </SectionTitle> <Paragraph position="0"> Our HMM produces reasonable results if we draw our initial parameter estimates from a uniform distribution.</Paragraph> <Paragraph position="1"> However, we can do better. We estimate the initial translation probability t(f_j | e_i) from the smoothed log-likelihood ratio LLR(e_i, f_j)^{φ_1} computed over sentence cooccurrences. Since this method works well, we also apply LLR(e_i, f_j) in a single reestimation step, shown in Equation 5.</Paragraph> <Paragraph position="3"> In reestimation, LLR(f|e) is computed from the expected counts of f and e produced by the EM algorithm. This is similar to Moore (2004); as in that work, |V| = 100,000, and φ_1, φ_2, and n are estimated on development data.</Paragraph> <Paragraph position="4"> We can also use an improved initial estimate for distortion. 
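The tree distance t(e_i, e_k) = (w, x, y) described above can be computed directly from a head-index encoding of the dependency parse. The following is a minimal sketch of that definition; the function name and encoding are ours, not the paper's.

```python
def tree_distance(heads, i, k):
    """Tree distance t(e_i, e_k) = (w, x, y) over a dependency parse.

    heads[n] is the index of word n's head (-1 for the root).
    w and x are the numbers of dependency links separating e_i and
    e_k, respectively, from their closest common ancestor; y is 1 if
    i > k and 0 otherwise (surface order of the two words).
    """
    def ancestors(n):
        # Chain from n up to the root, including n itself.
        chain = [n]
        while heads[n] != -1:
            n = heads[n]
            chain.append(n)
        return chain

    chain_i, chain_k = ancestors(i), ancestors(k)
    # First ancestor of i that is also an ancestor of k is the
    # closest common ancestor.
    common = next(n for n in chain_i if n in chain_k)
    w, x = chain_i.index(common), chain_k.index(common)
    y = 1 if i > k else 0
    return (w, x, y)

# Toy parse of "the cat sat": the -> cat, cat -> sat, sat = root.
heads = [1, 2, -1]
```

For example, tree_distance(heads, 0, 2) returns (2, 0, 0): "the" is two dependency links below "sat", which is its own closest common ancestor, and "the" precedes "sat" in the surface string.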
Consider a simple distortion model p(a_i | a_i − a_{i−1}).</Paragraph> <Paragraph position="5"> We expect this distribution to have a maximum near P(a_i | 0), because we know that words tend to retain their locality across translation. Rather than wait for this to occur, we use an initial estimate for the distortion model, given in Equation 6.</Paragraph> <Paragraph position="6"> [Table caption: For each corpus, the last row shown represents the results that were actually submitted. Note that for English-Hindi, our self-reported results in the unlimited task are slightly lower than the original results submitted for the workshop, which contained an error.]</Paragraph> <Paragraph position="8"> We choose Z to normalize the distribution. We must optimize α on a development set. This distribution has a maximum when |a_i − a_{i−1}| ∈ {−1, 0, 1}. Although we could reasonably choose any of these three values as the maximum for the initial estimate, we found in development that the maximum of the surface distortion distribution varied with C(e_{a_{i−1}}), although it was always in the range [−1, 2].</Paragraph> </Section> <Section position="3" start_page="84" end_page="84" type="sub_section"> <SectionTitle> 2.3 Does NULL Matter in Asymmetric Alignment? </SectionTitle> <Paragraph position="0"> Och and Ney (2000a) introduce a NULL-alignment capability to the HMM alignment model. This allows any word f_j to link to a special NULL word, by convention denoted e_0, instead of one of the words e_1^N. A link (f_j, e_0) indicates that f_j does not correspond to any word in e. 
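The intersection-based symmetrization of Och and Ney (2000b), which the discussion of NULL alignment turns on, amounts to keeping only the links supported by both asymmetric models. A minimal sketch, using a hypothetical dict encoding of the two directional alignments (None marks a NULL link); this is illustrative, not the authors' code:

```python
def intersect(align_fe, align_ef):
    """Symmetrize two asymmetric word alignments by intersection.

    align_fe maps each French position j to an English position
    (or None for a NULL link); align_ef maps each English position i
    to a French position (or None).  Returns the set of (j, i) links
    proposed by both directions.
    """
    links_fe = {(j, i) for j, i in align_fe.items() if i is not None}
    links_ef = {(j, i) for i, j in align_ef.items() if j is not None}
    return links_fe & links_ef

# Toy example: the French->English direction over-links (high recall),
# the English->French direction is sparser.
fe = {0: 0, 1: 1, 2: 1}
ef = {0: 0, 1: 1}
```

Here intersect(fe, ef) keeps only {(0, 0), (1, 1)}: the doubtful link (2, 1) proposed in one direction is discarded, which is precisely how intersection improves precision without adding recall.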
This improved alignment performance in the absence of symmetrization, presumably because it allows the model to be conservative when evidence for an alignment is lacking.</Paragraph> <Paragraph position="1"> We hypothesize that NULL alignment is unnecessary for asymmetric alignment models when we symmetrize using intersection-based methods (Och and Ney, 2000b).</Paragraph> <Paragraph position="2"> The intuition is simple: if we don't permit NULL alignments, then we expect to produce a high-recall, low-precision alignment; the intersection of two such alignments should mainly improve precision, resulting in a high-recall, high-precision alignment. If we allow NULL alignments, we may be able to produce a high-precision, low-recall asymmetric alignment, but symmetrization by intersection will not improve recall.</Paragraph> </Section> </Section> </Paper>