<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1012">
  <Title>A Probability Model to Improve Word Alignment</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Probability Model
</SectionTitle>
    <Paragraph position="0"> In this section we describe our probability model.</Paragraph>
    <Paragraph position="1"> To do so, we will first introduce some necessary notation. Let E be an English sentence e1,e2,...,em and let F be a French sentence f1,f2,...,fn. We define a link l(ei,fj) to exist if ei and fj are a translation [...] participates in no other links. If e occurs in E x times and f occurs in F y times, we say that e and f co-occur xy times in this sentence pair.</Paragraph>
    <Paragraph position="2"> We define the alignment problem as finding the alignment A that maximizes P(A|E,F). This corresponds to finding the Viterbi alignment in the IBM translation systems. Those systems model P(F,A|E), which when maximized is equivalent to maximizing P(A|E,F). We propose here a system which models P(A|E,F) directly, using a different decomposition of terms.</Paragraph>
    <Paragraph position="3"> In the IBM models of translation, alignments exist as artifacts of which English words generated which French words. Our model does not state that one sentence generates the other. Instead it takes both sentences as given, and uses the sentences to determine an alignment. An alignment A consists of t links {l1,l2,...,lt}, where each lk = l(eik,fjk) for some ik and jk. We will refer to consecutive subsets of A as l_i^j = {l_i, l_{i+1}, ..., l_j}. Given this notation,</Paragraph>
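    <Paragraph position="4"> The display that follows this sentence is not preserved in this copy. A reconstruction, assuming only the chain rule applied to the links in order and matching the term P(lk|E,F,l_1^{k-1}) that the next paragraph factors:

```latex
P(A \mid E,F) = P(l_1^t \mid E,F) = \prod_{k=1}^{t} P(l_k \mid E,F,l_1^{k-1})
```
</Paragraph>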
    <Paragraph position="5"> At this point, we must factor P(lk|E,F,l_1^{k-1}) to make computation feasible. Let Ck = {E,F,l_1^{k-1}} represent the context of lk. Note that both the context Ck and the link lk imply the occurrence of eik and fjk. We can rewrite P(lk|Ck) as:</Paragraph>
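    <Paragraph position="6"> The derivation itself is not preserved here. One way to reach the factored form discussed in the next paragraph, assuming only Bayes' rule and the observation that lk and Ck each imply the co-occurrence of eik and fjk, is:

```latex
\begin{gather*}
P(l_k \mid C_k) = \frac{P(l_k, C_k)}{P(C_k)} = \frac{P(C_k \mid l_k)\, P(l_k, e_{i_k}, f_{j_k})}{P(C_k, e_{i_k}, f_{j_k})} \\
= \frac{P(C_k \mid l_k)\, P(l_k \mid e_{i_k}, f_{j_k})\, P(e_{i_k}, f_{j_k})}{P(C_k \mid e_{i_k}, f_{j_k})\, P(e_{i_k}, f_{j_k})} \\
= P(l_k \mid e_{i_k}, f_{j_k}) \times \frac{P(C_k \mid l_k)}{P(C_k \mid e_{i_k}, f_{j_k})}
\end{gather*}
```
</Paragraph>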
    <Paragraph position="7"> Here P(lk|eik,fjk) is the link probability given a co-occurrence of the two words, which is similar in spirit to Melamed's explicit noise model (Melamed, 2000). This term depends only on the words involved directly in the link. The ratio P(Ck|lk) / P(Ck|eik,fjk) modifies the link probability, providing context-sensitive information.</Paragraph>
    <Paragraph position="8"> Up until this point, we have made no simplifying assumptions in our derivation. Unfortunately, Ck = {E,F,l_1^{k-1}} is too complex to estimate context probabilities directly. Suppose FTk is a set of context-related features such that P(lk|Ck) can be approximated by P(lk|eik,fjk,FTk). Let C′k = {eik,fjk} ∪ FTk. P(lk|C′k) can then be decomposed using the same derivation as above.</Paragraph>
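    <Paragraph position="9"> The two-line display is not preserved here. Assuming the same steps as the preceding derivation with C′k in place of Ck, it would read:

```latex
\begin{gather*}
P(l_k \mid C'_k) = P(l_k \mid e_{i_k}, f_{j_k}) \times \frac{P(C'_k \mid l_k)}{P(C'_k \mid e_{i_k}, f_{j_k})} \\
= P(l_k \mid e_{i_k}, f_{j_k}) \times \frac{P(FT_k \mid l_k)}{P(FT_k \mid e_{i_k}, f_{j_k})}
\end{gather*}
```
</Paragraph>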
    <Paragraph position="10"> In the second line of this derivation, we can drop eik and fjk from C′k, leaving only FTk, because they are implied by the events on which the probabilities are conditioned. Now, we are left with the task of approximating P(FTk|lk) and P(FTk|eik,fjk).</Paragraph>
    <Paragraph position="11"> To do so, we will assume that the features ft ∈ FTk are conditionally independent of one another given either lk or (eik,fjk). This allows us to approximate the alignment probability P(A|E,F) as follows:</Paragraph>
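    <Paragraph position="12"> The display is not preserved here. Under the conditional independence assumption just stated, each set-level term P(FTk|.) factors into a product over its member features, which would give:

```latex
P(A \mid E,F) \approx \prod_{k=1}^{t} \left( P(l_k \mid e_{i_k}, f_{j_k}) \times \prod_{ft \in FT_k} \frac{P(ft \mid l_k)}{P(ft \mid e_{i_k}, f_{j_k})} \right)
```
</Paragraph>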
    <Paragraph position="13"> In any context, only a few features will be active. The inner product is understood to be only over those features ft that are present in the current context. This approximation will cause P(A|E,F) to no longer be a well-behaved probability distribution, though as in Naive Bayes, it can be an excellent estimator for the purpose of ranking alignments.</Paragraph>
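    <Paragraph> As an illustration of how such a score might be evaluated, the following Python sketch sums log link probabilities and log feature ratios over the links in order. The function name, the dictionary layouts (keyed by word pair, or by feature and word pair) and the active_features callback are assumptions of this sketch, not part of the model description; working in log space is a choice made here to avoid underflow.

```python
import math

def alignment_log_score(links, link_prob, feat_given_link, feat_given_cooc, active_features):
    """Return the log of the approximate alignment score.

    links: ordered list of (e, f) word pairs.
    link_prob[(e, f)]: P(l | e, f).
    feat_given_link[(ft, e, f)]: P(ft | l); feat_given_cooc[(ft, e, f)]: P(ft | e, f).
    active_features(previous_links, e, f): features active in this context (hypothetical callback).
    """
    total = 0.0
    for k, (e, f) in enumerate(links):
        total += math.log(link_prob[(e, f)])
        # Only features present in the current context contribute a ratio.
        for ft in active_features(links[:k], e, f):
            total += math.log(feat_given_link[(ft, e, f)])
            total -= math.log(feat_given_cooc[(ft, e, f)])
    return total
```

Because the score is used only to rank alignments, comparing log scores is equivalent to comparing the products themselves.</Paragraph>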
    <Paragraph position="14"> If we have an aligned training corpus, the probabilities needed for the above equation are quite easy to obtain. Link probabilities can be determined directly from |lk| (link counts) and |eik,fjk| (co-occurrence counts). For any co-occurring pair of words (eik,fjk), we check whether it has the feature ft. If it does, we increment the count of |ft,eik,fjk|. If this pair is also linked, then we increment the count of |ft,lk|. Note that our definition of FTk allows for features that depend on previous links. For this reason, when determining whether or not a feature is present in a given context, one must impose an ordering on the links. This ordering can be arbitrary as long as the same ordering is used in training and probability evaluation. A simple solution would be to order links according to their French words. We choose to order links according to the link probability P(lk|eik,fjk) as it has the intuitive appeal of allowing more certain links to provide context for others.</Paragraph>
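    <Paragraph> The counting procedure above can be sketched in Python as follows. This is an illustration under stated assumptions: the corpus layout, the function names and the active_features callback are hypothetical, null words are ignored, and the question of link ordering is delegated to the callback.

```python
from collections import Counter

def collect_counts(corpus, active_features):
    """corpus: iterable of (E, F, links); links is a list of (i, j) token index pairs."""
    cooc, linked = Counter(), Counter()
    feat_cooc, feat_link = Counter(), Counter()
    for E, F, links in corpus:
        link_set = set(links)
        for i, e in enumerate(E):
            for j, f in enumerate(F):
                cooc[(e, f)] += 1  # every token pair is one co-occurrence
                is_linked = (i, j) in link_set
                if is_linked:
                    linked[(e, f)] += 1
                for ft in active_features(E, F, links, i, j):
                    feat_cooc[(ft, e, f)] += 1
                    if is_linked:
                        feat_link[(ft, e, f)] += 1
    return cooc, linked, feat_cooc, feat_link

def estimate(cooc, linked, feat_cooc, feat_link):
    """Turn counts into P(l|e,f), P(ft|l) and P(ft|e,f) by relative frequency."""
    link_prob = {pair: linked[pair] / cooc[pair] for pair in linked}
    p_ft_link = {key: n / linked[key[1:]] for key, n in feat_link.items()}
    # Feature entries are kept only for word pairs that were linked at least once.
    p_ft_cooc = {key: n / cooc[key[1:]] for key, n in feat_cooc.items() if key[1:] in linked}
    return link_prob, p_ft_link, p_ft_cooc
```

Iterating over all token pairs reproduces the co-occurrence definition given earlier: a word occurring x times in E and a word occurring y times in F are counted xy times.</Paragraph>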
    <Paragraph position="15"> We store probabilities in two tables. The first table stores link probabilities P(lk|eik,fjk). It has an entry for every word pair that was linked at least once in the training corpus. Its size is the same as the translation table in the IBM models. The second table stores feature probabilities, P(ft|lk) and P(ft|eik,fjk). For every linked word pair, this table has two entries for each active feature. In the worst case this table will be of size 2 × |FT| × |E| × |F|. In practice, it is much smaller as most contexts activate only a small number of features.</Paragraph>
    <Paragraph position="16"> In the next subsection we will walk through a simple example of this probability model in action. We will describe the features used in our implementation of this model in Section 3.2.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 An Illustrative Example
</SectionTitle>
      <Paragraph position="0"> Figure 1 shows an aligned corpus consisting of one sentence pair. Suppose that we are concerned with only one feature ft that is active for eik and fjk if an adjacent pair is an alignment, i.e., l(e_{i_k-1}, f_{j_k-1}) ∈ l_1^{k-1} or l(e_{i_k+1}, f_{j_k+1}) ∈ l_1^{k-1}. This example would produce the probability tables shown in Table 1.</Paragraph>
      <Paragraph position="1"> Note how ft is active for the (a,v) link, and is not active for the (b,u) link. This is due to our selected ordering. Table 1 allows us to calculate the probability of this alignment as:</Paragraph>
      <Paragraph position="3"/>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Word-Alignment Algorithm
</SectionTitle>
    <Paragraph position="0"> In this section, we describe a word-alignment algorithm guided by the alignment probability model derived above. In designing this algorithm we have selected constraints, features and a search method in order to achieve high performance. The model, however, is general, and could be used with any instantiation of the above three factors. This section will describe and motivate the selection of our constraints, features and search method.</Paragraph>
    <Paragraph position="1"> The input to our word-alignment algorithm consists of a pair of sentences E and F, and the dependency tree TE for E. TE allows us to make use of features and constraints that are based on linguistic intuitions.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Constraints
</SectionTitle>
      <Paragraph position="0"> The reader will note that our alignment model as described above has very few factors to prevent undesirable alignments, such as having all French words align to the same English word. To guide the model to correct alignments, we employ two constraints to limit our search for the most probable alignment.</Paragraph>
      <Paragraph position="1"> The first constraint is the one-to-one constraint (Melamed, 2000): every word (except the null words e0 and f0) participates in exactly one link.</Paragraph>
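      <Paragraph> A minimal Python sketch of a one-to-one check that a search could apply before adding a link. The index convention (0 reserved for the null words e0 and f0) and the function name are assumptions of this sketch. During search a word may still be unlinked, so the check enforces "at most one link" for every non-null word.

```python
NULL = 0  # index reserved for the null words e0 and f0 (assumed convention)

def keeps_one_to_one(links, new_link):
    """True if adding new_link leaves every non-null word in at most one link."""
    i, j = new_link
    for a, b in links:
        if i != NULL and a == i:
            return False  # English word i is already linked
        if j != NULL and b == j:
            return False  # French word j is already linked
    return True
```

For example, with links = [(1, 1)], the candidate (2, 2) is accepted and (1, 2) is rejected, while any number of words may be linked to a null word.</Paragraph>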
      <Paragraph position="2"> The second constraint, known as the cohesion constraint (Fox, 2002), uses the dependency tree (Mel'čuk, 1987) of the English sentence to restrict possible link combinations. Given the dependency tree TE, the alignment can induce a dependency tree for F (Hwa et al., 2002). The cohesion constraint requires that this induced dependency tree does not have any crossing dependencies. The details about how the cohesion constraint is implemented are outside the scope of this paper (footnote 3). Here we will use a simple example to illustrate the effect of the constraint. Consider the partial alignment in Figure 2. When the system attempts to link of and de, the new link will induce the dotted dependency, which crosses a previously induced dependency between service and données. Therefore, of and de will not be linked.</Paragraph>
      <Paragraph position="3"> [Figure 2: partial alignment of "the status of the data service" and "l'état du service de données".]</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Features
</SectionTitle>
      <Paragraph position="0"> In this section we introduce two types of features that we use in our implementation of the probability model described in Section 2. (Footnote 3: The algorithm for checking the cohesion constraint is presented in a separate paper which is currently under review.)</Paragraph>
      <Paragraph position="1"> The first feature type fta concerns surrounding links. It has been observed that words close to each other in the source language tend to remain close to each other in the translation (Vogel et al., 1996; Ker and Chang, 1997). To capture this notion, for any word pair (ei,fj), if a link l(ei′,fj′) exists where i − 2 ≤ i′ ≤ i + 2 and j − 2 ≤ j′ ≤ j + 2, then we say that the feature fta(i − i′, j − j′, ei′) is active for this context. We refer to these as adjacency features. [Figure 3: partial alignment for the English sentence "the host discovers all the devices".]</Paragraph>
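      <Paragraph> The adjacency feature definition can be sketched in Python as below. The function name, the tuple encoding of a feature and the choice to skip the candidate pair itself are assumptions of this sketch; offsets are computed as i − i′ and j − j′, following the definition as written.

```python
def adjacency_features(E, links, i, j, window=2):
    """Return one fta(i - i', j - j', e_i') tuple per existing link (i', j') inside the window.

    E: list of English words; links: existing (i', j') index pairs.
    """
    feats = []
    for ip, jp in links:
        if (ip, jp) == (i, j):
            continue  # the candidate pair does not serve as its own context
        if abs(i - ip) > window or abs(j - jp) > window:
            continue  # outside the two-word window on either side
        feats.append(("fta", i - ip, j - jp, E[ip]))
    return feats
```

Each returned tuple can serve as a key into the feature probability table, so that different offsets and different neighbouring English words are treated as distinct features.</Paragraph>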
      <Paragraph position="2"> The second feature type ftd uses the English parse tree to capture regularities among grammatical relations between languages. For example, when dealing with French and English, the location of the determiner with respect to its governor (footnote 4) is never swapped during translation, while the location of adjectives is swapped frequently. For any word pair (ei,fj), let ei′ be the governor of ei, and let rel be the relationship between them. If a link l(ei′,fj′) exists, then we say that the feature ftd(j − j′, rel) is active for this context. We refer to these as dependency features.</Paragraph>
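      <Paragraph> A corresponding Python sketch for dependency features. The representation of the dependency tree as two dictionaries (governor index and relation label per English word) and the function name are assumptions of this sketch.

```python
def dependency_features(governor, relation, links, i, j):
    """Return ftd(j - j', rel) for each link (i', j') whose English side is the governor of e_i.

    governor: dict mapping a word index to its governor's index (the root is absent).
    relation: dict mapping a word index to the label of the arc to its governor.
    """
    g = governor.get(i)
    if g is None:
        return []  # the root has no governor, so no dependency feature applies
    return [("ftd", j - jp, relation[i]) for ip, jp in links if ip == g]
```

With a determiner at English position 1 whose governor at position 2 is linked to French position 2, pairing the determiner with French position 1 yields ("ftd", -1, "det").</Paragraph>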
      <Paragraph position="3"> Take for example Figure 3 which shows a partial alignment with all links completed except for those involving 'the'. Given this sentence pair and English parse tree, we can extract features of both types to assist in the alignment of the1. The word pair (the1,l') will have an active adjacency feature fta(+1,+1,host) as well as a dependency feature ftd(−1,det). These two features will work together to increase the probability of this correct link. In contrast, the incorrect link (the1,les) will have only ftd(+3,det), which will work to lower the link probability, since most determiners are located before their governors.</Paragraph>
      <Paragraph position="4"> Footnote 4: The parent node in the dependency tree.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Search
</SectionTitle>
      <Paragraph position="0"> Due to our use of constraints, when seeking the highest probability alignment, we cannot rely on a method such as dynamic programming to (implicitly) search the entire alignment space. Instead, we use a best-first search algorithm (with constant beam and agenda size) to search our constrained space of possible alignments. A state in this space is a partial alignment. A transition is defined as the addition of a single link to the current state. Any link which would create a state that does not violate any constraint is considered to be a valid transition. Our start state is the empty alignment, where all words in E and F are linked to null. A terminal state is a state in which no more links can be added without violating a constraint. Our goal is to find the terminal state with highest probability.</Paragraph>
      <Paragraph position="1"> For the purposes of our best-first search, non-terminal states are evaluated according to a greedy completion of the partial alignment. We build this completion by adding valid links in the order of their unmodified link probabilities P(l|e,f) until no more links can be added. The score the state receives is the probability of its greedy completion. These completions are saved for later use (see Section 4.2).</Paragraph>
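      <Paragraph> The greedy completion step might look like the following Python sketch. The names, the representation of links as index pairs and the is_valid callback are assumptions; a single pass over the ranked candidates is sufficient only if a link that is invalid now stays invalid as more links are added, which holds for the one-to-one constraint and is assumed here for any other constraint.

```python
def greedy_completion(partial, candidates, link_prob, is_valid):
    """Extend a partial alignment with valid links in decreasing order of P(l|e,f).

    partial: list of links already in the state.
    candidates: iterable of links that could still be added.
    link_prob[link]: unmodified link probability for that link.
    is_valid(links, link): True if adding link violates no constraint (hypothetical callback).
    """
    completed = list(partial)
    ranked = sorted(candidates, key=lambda link: link_prob[link], reverse=True)
    for link in ranked:
        if link not in completed and is_valid(completed, link):
            completed.append(link)
    return completed
```

The state's score would then be the model probability of the returned alignment, and the alignment itself can be kept for later sampling.</Paragraph>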
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Training
</SectionTitle>
    <Paragraph position="0"> As was stated in Section 2, our probability model needs an initial alignment in order to create its probability tables. Furthermore, to avoid having our model learn mistakes and noise, it helps to train on a set of possible alignments for each sentence, rather than one Viterbi alignment. In the following sub-sections we describe the creation of the initial alignments used for our experiments, as well as our sampling method used in training.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Initial Alignment
</SectionTitle>
      <Paragraph position="0"> We produce an initial alignment using the same algorithm described in Section 3, except we maximize summed φ² link scores (Gale and Church, 1991), rather than alignment probability. This produces a reasonable one-to-one word alignment that we can refine using our probability model.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Alignment Sampling
</SectionTitle>
      <Paragraph position="0"> Our use of the one-to-one constraint and the cohesion constraint precludes sampling directly from all possible alignments. These constraints tie words in such a way that the space of alignments cannot be enumerated as in IBM models 1 and 2 (Brown et al., 1993). Taking our lead from IBM models 3, 4 and 5, we will sample from the space of those high-probability alignments that do not violate our constraints, and then redistribute our probability mass among our sample.</Paragraph>
      <Paragraph position="1"> At each search state in our alignment algorithm, we consider a number of potential links, and select between them using a heuristic completion of the resulting state. Our sample S of possible alignments will be the most probable alignment, plus the greedy completions of the states visited during search. It is important to note that any sampling method that concentrates on complete, valid and high probability alignments will accomplish the same task.</Paragraph>
      <Paragraph position="2"> When collecting the statistics needed to calculate P(A|E,F) from our initial φ² alignment, we give each s ∈ S a uniform weight. This is reasonable, as we have no probability estimates at this point. When training from the alignments produced by our model, we normalize P(s|E,F) so that Σ_{s∈S} P(s|E,F) = 1. We then count links and features in S according to these normalized probabilities.</Paragraph>
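      <Paragraph> The weighting scheme can be sketched in Python as follows. The function name and data layout are assumptions of this sketch, as is the reading of "uniform weight" as 1/|S| per sampled alignment.

```python
from collections import Counter

def weighted_link_counts(sample, scores=None):
    """Accumulate fractional link counts over a sample of alignments.

    sample: list of alignments, each a list of links.
    scores: unnormalized P(s|E,F) per alignment, or None for uniform weights.
    """
    if scores is None:
        scores = [1.0] * len(sample)
    total = sum(scores)
    counts = Counter()
    for alignment, score in zip(sample, scores):
        weight = score / total  # normalized so the weights over the sample sum to 1
        for link in alignment:
            counts[link] += weight
    return counts
```

A link that appears in every sampled alignment receives a count of 1 for the sentence pair, while a link found only in low-probability alignments contributes proportionally less. Feature counts would be accumulated in the same way.</Paragraph>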
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experimental Results
</SectionTitle>
    <Paragraph position="0"> We adopted the same evaluation methodology as in (Och and Ney, 2000), which compared alignment outputs with manually aligned sentences. Och and Ney classify manual alignments into two categories: Sure (S) and Possible (P) (S ⊆ P). They defined the following metrics to evaluate an alignment A: recall = |A ∩ S| / |S|, precision = |A ∩ P| / |A|, and alignment error rate (AER) = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|). We trained our alignment program with the same 50K pairs of sentences as (Och and Ney, 2000) and tested it on the same 500 manually aligned sentences. Both the training and testing sentences are from the Hansard corpus. We parsed the training and testing corpora with Minipar. We then ran the training procedure in Section 4 for three iterations. We conducted three experiments using this methodology. The goal of the first experiment is to compare the algorithm in Section 3 to a state-of-the-art alignment system. The second will determine the contributions of the features. The third experiment aims to keep all factors constant except for the model, in an attempt to determine its performance when compared to an obvious alternative.</Paragraph>
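    <Paragraph> For concreteness, the evaluation metrics can be computed as in the Python sketch below. It assumes the standard definitions of Och and Ney, in which precision is normalized by |A| and AER is one minus the ratio of matched links to |A| + |S|; the function name and the representation of links as hashable pairs are assumptions.

```python
def alignment_metrics(A, S, P):
    """Return (precision, recall, AER) for proposed links A against sure links S and possible links P.

    S is expected to be a subset of P; A and S must be non-empty.
    """
    A, S, P = set(A), set(S), set(P)
    sure_hits = len(A.intersection(S))
    possible_hits = len(A.intersection(P))
    recall = sure_hits / len(S)
    precision = possible_hits / len(A)
    aer = 1 - (sure_hits + possible_hits) / (len(A) + len(S))
    return precision, recall, aer
```

An alignment that contains every sure link and only possible links has precision 1, recall 1 and AER 0.</Paragraph>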
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Comparison to state-of-the-art
</SectionTitle>
      <Paragraph position="0"> Table 2 compares the results of our algorithm with the results in (Och and Ney, 2000), where an HMM model is used to bootstrap IBM Model 4. The rows IBM-4 F-E and IBM-4 E-F are the results obtained by IBM Model 4 when treating French as the source and English as the target or vice versa. The row IBM-4 Intersect shows the results obtained by taking the intersection of the alignments produced by IBM-4 E-F and IBM-4 F-E. The row IBM-4 Refined shows results obtained by refining the intersection of alignments in order to increase recall. Our algorithm achieved over 44% relative error reduction when compared with IBM-4 used in either direction and a 25% relative error rate reduction when compared with IBM-4 Refined. It also achieved a slight relative error reduction when compared with IBM-4 Intersect. This demonstrates that we are competitive with the methods described in (Och and Ney, 2000). In Table 2, one can see that our algorithm is high precision, low recall. This was expected as our algorithm uses the one-to-one constraint, which rules out many of the possible alignments present in the evaluation data.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Contributions of Features
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the contributions of features to our algorithm's performance. The initial (φ²) row is the score for the algorithm (described in Section 4.1) that generates our initial alignment. The without features row shows the score after 3 iterations of refinement with an empty feature set. Here we can see that our model in its simplest form is capable of producing a significant improvement in alignment quality.</Paragraph>
      <Paragraph position="1"> The rows with ftd only and with fta only describe the scores after 3 iterations of training using only dependency and adjacency features respectively. The two features provide significant contributions, with the adjacency feature being slightly more important.</Paragraph>
      <Paragraph position="2"> The final row shows that both features can work together to create a greater improvement, despite the independence assumptions made in Section 2.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Model Evaluation
</SectionTitle>
      <Paragraph position="0"> Even though we have compared our algorithm to alignments created using IBM statistical models, it is not clear if our model is essential to our performance. This experiment aims to determine if we could have achieved similar results using the same initial alignment and search algorithm with an alternative model.</Paragraph>
      <Paragraph position="1"> Without using any features, our model is similar to IBM's Model 1, in that they both take into account only the word types that participate in a given link. IBM Model 1 uses P(f|e), the probability of f being generated by e, while our model uses P(l|e,f), the probability of a link existing between e and f.</Paragraph>
      <Paragraph position="2"> In this experiment, we set Model 1 translation probabilities according to our initial φ² alignment, sampling as we described in Section 4.2. We then use the product ∏_{j=1}^{n} P(fj|e_{aj}) to evaluate candidate alignments in a search that is otherwise identical to our algorithm. We ran Model 1 refinement for three iterations and recorded the best results that it achieved.</Paragraph>
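      <Paragraph> The Model 1 style score used in this comparison can be sketched in Python as below. The function name, the table layout t[(f, e)] and the convention that a[j] gives the index in E of the word aligned to F[j] are assumptions of this sketch; the log of the product is returned to avoid underflow on long sentences.

```python
import math

def model1_log_score(F, E, a, t):
    """Return the log of the product over j of P(f_j | e_{a_j}).

    F, E: lists of French and English words; a[j]: index into E for F[j].
    t[(f, e)]: translation probability P(f | e).
    """
    return sum(math.log(t[(f, E[a[j]])]) for j, f in enumerate(F))
```

Unlike P(l|e,f), each factor here is conditioned on the English word alone.</Paragraph>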
      <Paragraph position="3"> It is clear from Table 4 that refining our initial φ² alignment using IBM's Model 1 is less effective than using our model in the same manner. In fact, the Model 1 refinement receives a lower score than our initial alignment.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Related Work
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Probability models
</SectionTitle>
      <Paragraph position="0"> When viewed with no features, our probability model is most similar to the explicit noise model defined in (Melamed, 2000). In fact, Melamed defines a probability distribution P(links(u,v) | cooc(u,v), λ+, λ−) which appears to make our work redundant. However, this distribution refers to the probability that two word types u and v are linked links(u,v) times in the entire corpus. Our distribution P(l|e,f) refers to the probability of linking a specific co-occurrence of the word tokens e and f. In Melamed's work, these probabilities are used to compute a score based on a probability ratio. In our work, we use the probabilities directly.</Paragraph>
      <Paragraph position="1"> By far the most prominent probability models in machine translation are the IBM models and their extensions. When trying to determine whether two words are aligned, the IBM models ask, "What is the probability that this English word generated this French word?" Our model asks instead, "If we are given this English word and this French word, what is the probability that they are linked?" The distinction is subtle, yet important, introducing many differences. For example, in our model, E and F are symmetrical. Furthermore, we model P(l|e,f′) and P(l|e,f″) as unrelated values, whereas the IBM model would associate them in the translation probabilities t(f′|e) and t(f″|e) through the constraint Σ_f t(f|e) = 1. Unfortunately, by conditioning on both words, we eliminate a large inductive bias.</Paragraph>
      <Paragraph position="2"> This prevents us from starting with uniform probabilities and estimating parameters with EM. This is why we must supply the model with a noisy initial alignment, while IBM can start from an unaligned corpus.</Paragraph>
      <Paragraph position="3"> In the IBM framework, when one needs the model to take new information into account, one must create an extended model which can base its parameters on the previous model. In our model, new information can be incorporated modularly by adding features. This makes our work similar to maximum entropy-based machine translation methods, which also employ modular features. Maximum entropy can be used to improve IBM-style translation probabilities by using features, such as improvements to P(f|e) in (Berger et al., 1996). By the same token we can use maximum entropy to improve our estimates of P(lk|eik,fjk,Ck). We are currently investigating maximum entropy as an alternative to our current feature model which assumes conditional independence among features.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Grammatical Constraints
</SectionTitle>
      <Paragraph position="0"> There have been many recent proposals to leverage syntactic data in word alignment. Methods such as (Wu, 1997), (Alshawi et al., 2000) and (Lopez et al., 2002) employ a synchronous parsing procedure to constrain a statistical alignment. The work done in (Yamada and Knight, 2001) measures statistics on operations that transform a parse tree from one language into another.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Future Work
</SectionTitle>
    <Paragraph position="0"> The alignment algorithm described here is incapable of creating alignments that are not one-to-one. The model we describe, however, is not limited in the same manner. The model is currently capable of creating many-to-one alignments so long as the null probabilities of the words added on the "many" side are less than the probabilities of the links that would be created. Under the current implementation, the training corpus is one-to-one, which gives our model no opportunity to learn many-to-one alignments.</Paragraph>
    <Paragraph position="1"> We are pursuing methods to create an extended algorithm that can handle many-to-one alignments.</Paragraph>
    <Paragraph position="2"> This would involve training from an initial alignment that allows for many-to-one links, such as one of the IBM models. Features that are related to multiple links should be added to our set of feature types, to guide intelligent placement of such links.</Paragraph>
  </Section>
</Paper>