<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0707">
  <Title>Incorporating Position Information into a Maximum Entropy/Minimum Divergence Translation Model</Title>
  <Section position="3" start_page="37" end_page="39" type="intro">
    <SectionTitle>
2 Models
2.1 Linear Model
</SectionTitle>
    <Paragraph position="0"> As a baseline for comparison I used a linear combination as in (2) of a standard interpolated tri-gram language model and the IBM translation model 2 (IBM2), with the combining weight A optimized using the EM algorithm. IBM2 is derived as follows: 2</Paragraph>
    <Paragraph position="2"> where I = \[s\[, and the hidden variable j gives the position in s of the (single) source token sj assumed to give rise to w, or 0 if there is none.</Paragraph>
    <Paragraph position="3"> The model consists of a set of word-pair parameters p(t\[s) and position parameters p(j\[i,/); in model 1 (IBM1) the latter are fixed at 1/(1 + 1), as each position, including the empty position 0, is considered equally likely to contain a translation for w. Maximum likelihood estimates for these parameters can be obtained with the EM algorithm over a bilingual training corpus, as described in (Brown et al., 1993).</Paragraph>
    <Section position="1" start_page="37" end_page="38" type="sub_section">
      <SectionTitle>
2.2 MEMD Model 1
</SectionTitle>
      <Paragraph position="0"> A MEMD model for p(w\[hi, s) has the general form:</Paragraph>
      <Paragraph position="2"> since target words are predicted independently it can also be used for p(wlhi , s). The only necessary modification in this case is that the position parameters can no longer be conditioned on It\[.</Paragraph>
      <Paragraph position="3"> where q(w\[hi,s) is a reference distribution, f(w, hi, s) maps (w, hi, s) into an n-dimensional feature vector, (~ is a corresponding vector of feature weights (the parameters of the model), and Z(hi, s) = ~w q(w\[hi, s) exp((~-f(w, hi)) is a normalizing factor. For a given choice of q and f, the IIS algorithm (Berger et al., 1996) can be used to find maximum likelihood values for the parameters ~. It can be shown (Della Pietra et al., 1995) that these are the also the values which minimize the Kullback-Liebler divergence D(p\[\[q) between the model and the reference distribution under the constraint that the expectations of the features (ie, the components of f) with respect to the model must equal their expectations with respect to the empirical distribution derived from the training corpus.</Paragraph>
      <Paragraph position="4"> Thus the reference distribution serves as a kind of prior, and should reflect some initial knowledge about the true distribution; and the use of any feature is justified to the extent that its empirical expectation is accurate.</Paragraph>
      <Paragraph position="5"> In the present context, the natural choice for the reference distribution q is a trigram language model. To create a MEMD analog to IBM model 1 (MEMD1), I used boolean features corresponding to bilingual word pairs: 1, sEsandt----w fst(W,S) = 0, else where (s, t) is a (source,target) word pair. Using the notational convention that ast is 0 whenever the corresponding feature fst does not exist in the model, MEMD1 can be written compactly as:</Paragraph>
      <Paragraph position="7"> Due to the theoretical properties of MEMD outlined above, it is necessary to select a sub-set of all possible features fst to avoid overfitting the training corpus. Using a reduced feature set is also computationally advantageous, since the time taken to calculate the normalization constant Z(hi, s) grows linearly with the expected number of features which are active per source word s E s. This is in contrast to IBM1, where use of all available word-pair parameters p(tls ) is standard, and engenders only a very slight overfitting effect. In (Foster, 2000) I describe an  effective technique for selecting MEMD word-pair features.</Paragraph>
    </Section>
    <Section position="2" start_page="38" end_page="38" type="sub_section">
      <SectionTitle>
2.3 MEMD Model 2
</SectionTitle>
      <Paragraph position="0"> IBM2 incorporates position information by introducing a hidden position variable and making independence hypotheses. This approach is not applicable to MEMD models, whose features must capture events which are directly observable in the training corpus. 3 It would be possible to use pure position features of the form fi#, which capture the presence of any word pair at position (i, j, l) and are superficially similar to IBM2's position parameters, but these would add almost no information to MEMD1.</Paragraph>
      <Paragraph position="1"> On the other hand, features like fstijl, indicating the presence of a specific pair (s, t) at position (i, j,/), would cause severe data sparseness problems.</Paragraph>
      <Paragraph position="2"> Encoding Positions as Feature Values A simple solution to this dilemma is to let the value of a word-pair feature reflect the current position of the pair rather just its presence or absence. A reasonable choice for this is the value of the corresponding IBM2 position parameter p(jli, /): fst(W, i, s) = { P(Jsli'o, l), elseS E s and t = w where js is the position of s in s, or the most likely position according to IBM2 if it occurs more than once: 5s = argmaxj:sj=s P(jli, l). Using the same convention as in the previous section, the resulting model (MEMD2R) can be written: q(wlhi) exP(E~es aswP(5~ li, l)) p(wlhi, s ) = Z(hi, s) MEMD2R is simple and compact but poses a technical difficulty due to its use of real-valued features, in that the IIS training algorithm requires integer or boolean features for efficient implemention. Since likelihood is a concave function of ~, any hillclimbing method such as gradient ascent 4 is guaranteed to find maximum  rithm, in which model parameters are updated after each training example, gave the best performance.</Paragraph>
      <Paragraph position="3"> likelihood parameter values, but convergence is slower than IIS and requires tuning a gradient step parameter. Unfortunately, apart from this problem, MEMD2R also turns out to perform slightly worse than MEMD1, as described below. null</Paragraph>
    </Section>
    <Section position="3" start_page="38" end_page="39" type="sub_section">
      <SectionTitle>
Using Class-based Position Features
</SectionTitle>
      <Paragraph position="0"> Since the basic problem with incorporating position information is one of insufficient data, a natural solution is to try to group word pair and position combinations with similar behaviour into classes such that the frequency of each class in the training corpus is high enough for reliable estimation. To do this, I made two preliminary assumptions: 1) word pairs with similar MEMD1 weights should be grouped together; and 2) position configurations with similar IBM2 probabilities should be grouped together. This converts the problem from one of finding classes in the five-dimensional space (s, t, i, j, l) to one of identifying rectangular areas on a 2-dimensional grid where one axis contains position configurations (i, j, l), ordered by p(jli,/); and the other contains word pairs (s, t), ordered by ast. To simplify further, I partitioned both axes so as to approximately balance the total corpus frequency of all word pairs or position configurations within each partition. Thus the only parameters required to completely specify a classification are the number of position and word-pair partitions. Each combination of a position partition and a word pair partition corresponds to a class, and all classes can be expected to have roughly the same empirical counts.</Paragraph>
      <Paragraph position="1"> The model (MEMD2B) based on this scheme has one feature for each class; if A designates the set of triples (i, j, l) in a position partition and B designates the set of pairs (s, t) in a word-pair partition, then for all A, B there is a feature:</Paragraph>
      <Paragraph position="3"> where 5\[X\] is 1 when X is true and 0 otherwise. For robustness, I used these position features along with pure MEMDl-style word-pair features fst. The weights O~A, s on the position features can thus be interpreted as correction terms for the pure word-pair weights as,t which  segment was used for combining weights for the trigram and the overall linear model; and the held-out 2 segment was used for the MEMD2B partition search. reflect the proximity of the words in the pair. p(TIS) -1~IT\], where p is the model being eval-The model is: uated, and (S, T) is the test corpus, considered .to be a set of statistically independent sentence p(w\[hi,s) = q(wlhi)exp(~ses asw + aA(i,j~,O,B(s,t))pair s (s,t). Perplexity is a good indicator of Z(hi,s) where A(i,Ss,l) gives the partition for the current position, B(s, t) gives the partition for the current word pair, and following the usual convention, aA(i,j~,0,S(s,t) is zero if these are undefined. null To find the optimal number of position partitions m and word-pair partitions n, I performed a greedy search, beginning at a small initial point (m, n) and at each iteration training two MEMD2B models characterized by (km, n) and (m, kn), where k &gt; 1 is a scaling factor (note that both these models contain kmn position features). The model which gives the best performance on a validation corpus is used as the starting point for the next iteration.</Paragraph>
      <Paragraph position="4"> Since training MEMD models is very expensive, to speed up the search I relaxed the convergence criterion from a training corpus perplexity 5 drop of &lt; .1% (requiring 20-30 IIS iterations) to &lt; .6% (requiring approximately 10 IIS iterations). I stopped the search when the best model's performance on the validation corpus did not decrease significantly from that of the model at the previous step, indicating that overtraining was beginning to occur.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>