<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1064">
  <Title>Aligning words using matrix factorisation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Matrix factorisation
</SectionTitle>
    <Paragraph position="0"> Alignments between source and target words may be represented by an $I \times J$ alignment matrix $A = [a_{ij}]$, such that $a_{ij} &gt; 0$ if $f_i$ is aligned with $e_j$ and $a_{ij} = 0$ otherwise. Similarly, given $K$ cepts, word-to-cept alignments may be represented by an $I \times K$ matrix $F$ and a $J \times K$ matrix $E$, with positive elements indicating alignments. It is easy to see that the matrix $A = F \times E^\top$ then represents the resulting word-to-word alignment (fig. 3).</Paragraph>
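    <Paragraph> As a minimal sketch of the relation $A = F \times E^\top$, the following Python snippet builds a word-to-word alignment from word-to-cept matrices. The toy values of F and E and the helper name alignment_matrix are illustrative assumptions and do not come from the paper.
<![CDATA[
```python
# Hypothetical toy example: I = 3 source words, J = 3 target words, K = 2 cepts.
# F[i][k] > 0 means source word f_i belongs to cept k; E[j][k] likewise for e_j.
F = [[1, 0],
     [1, 0],
     [0, 1]]
E = [[1, 0],
     [0, 1],
     [0, 1]]


def alignment_matrix(F, E):
    """Return A = F E^T as nested lists (I rows, J columns)."""
    K = len(F[0])
    return [[sum(f_row[k] * e_row[k] for k in range(K)) for e_row in E]
            for f_row in F]


A = alignment_matrix(F, E)
for row in A:
    print(row)
```
]]>
    In this toy case the first cept yields a 2-1 alignment (f_0 and f_1 with e_0) and the second a 1-2 alignment (f_2 with e_1 and e_2).</Paragraph>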
    <Paragraph position="1"> Let us now assume that we start from an $I \times J$ matrix $M = [m_{ij}]$, which we call the translation matrix, such that $m_{ij} \geq 0$ measures the strength of the association between $f_i$ and $e_j$ (large values mean close association). This may be estimated using a translation table, a count (e.g. from an N-best list), etc. Finding a suitable alignment matrix $A$ corresponds to finding factors $F$ and $E$ such that:</Paragraph>
    <Paragraph position="2"> $M \approx F \times S \times E^\top$ (1)</Paragraph>
    <Paragraph position="3"> where, without loss of generality, we introduce a diagonal $K \times K$ scaling matrix $S$ which may give different weights to the different cepts. As factors $F$ and $E$ contain only positive elements, this is an instance of non-negative matrix factorisation, aka NMF (Lee and Seung, 1999). Because NMF decomposes a matrix into additive, positive components, it naturally yields a sparse representation. In addition, the propriety constraint imposes that words are aligned to exactly one cept, i.e. each row of $E$ and $F$ has exactly one non-zero component, a property we call extreme sparsity. With the notation $F = [F_{ik}]$, this means that:</Paragraph>
    <Paragraph position="4"> $\forall i, \forall k \neq l: \; F_{ik} \, F_{il} = 0$</Paragraph>
    <Paragraph position="5"> As row $i$ contains a single non-zero element, either $F_{ik}$ or $F_{il}$ must be 0. An immediate consequence is that $\sum_i F_{ik} F_{il} = 0$: columns of $F$ are orthogonal, that is, $F$ is an orthogonal matrix (and similarly, $E$ is orthogonal). Finding the best alignment starting from $M$ therefore reduces to performing a decomposition into orthogonal non-negative factors.</Paragraph>
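    <Paragraph> The link between extreme sparsity and orthogonality can be checked numerically. The sketch below is only an illustration: the matrix values and the helper names column_dot and is_extremely_sparse are assumptions introduced here.
<![CDATA[
```python
def column_dot(F, k, l):
    """Dot product of columns k and l of the nested-list matrix F."""
    return sum(row[k] * row[l] for row in F)


def is_extremely_sparse(F):
    """True if every row of F has exactly one non-zero entry."""
    return all(sum(1 for x in row if x != 0) == 1 for row in F)


# Hypothetical factor: 4 words, 3 cepts, one non-zero entry per row.
F = [[0.5, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
K = len(F[0])
off_diagonal = [column_dot(F, k, l)
                for k in range(K) for l in range(K) if k != l]
print(is_extremely_sparse(F), off_diagonal)
```
]]>
    Every product of two distinct columns vanishes because no row contributes to more than one column.</Paragraph>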
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 An algorithm for Orthogonal Non-negative Matrix Factorisation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> Standard NMF algorithms (Lee and Seung, 2001) do not impose orthogonality between factors. We propose to perform the Orthogonal Non-negative Matrix Factorisation (ONMF) in two stages: we first factorise $M$ using Probabilistic Latent Semantic Analysis, aka PLSA (Hofmann, 1999), then we orthogonalise the factors using a Maximum A Posteriori (MAP) assignment of words to cepts. PLSA models a joint probability $P(f, e)$ as a mixture of conditionally independent multinomial distributions:</Paragraph>
      <Paragraph position="1"> $P(f, e) = \sum_c P(c) \, P(f|c) \, P(e|c)$ (2)</Paragraph>
      <Paragraph position="2"> With $F = [P(f|c)]$, $E = [P(e|c)]$ and $S = \mathrm{diag}(P(c))$, this is exactly eq. 1. (Note also that despite the additional matrix $S$, if we set $E = [P(e, c)]$, then $P(f, e)$ would factor as $F \times E^\top$, the original NMF formulation.) All factors in eq. 2 are (conditional) probabilities, and therefore positive, so PLSA also implements NMF.</Paragraph>
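      <Paragraph position="3"> The parameters are learned from a matrix $M = [m_{ij}]$ of $(f_i, e_j)$ counts, by maximising the likelihood using the iterative re-estimation formula of the Expectation-Maximisation algorithm (Dempster et al., 1977), cf. fig. 4. Convergence is guaranteed, leading to a non-negative factorisation of $M$.</Paragraph>
      <Paragraph position="5"> [Figure 4: EM re-estimation formulas, eqs. (3-6); caption fragment: &quot;... M-steps until convergence.&quot;]</Paragraph>
      <Paragraph position="6"> The second step of our algorithm is to orthogonalise the resulting factors. Each source word $f_i$ is assigned the most probable cept, i.e. the cept $c$ for which $P(c|f_i) \propto P(c) \, P(f_i|c)$ is maximal. Factor $F$ is therefore set to:</Paragraph>
      <Paragraph position="7"> $F_{ik} \propto 1$ if $k = \arg\max_c P(c|f_i)$, and $F_{ik} = 0$ otherwise,</Paragraph>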
      <Paragraph position="8"> where proportionality ensures that the columns of $F$ sum to 1, so that $F$ stays a conditional probability matrix. We proceed similarly for target words $e_j$ to orthogonalise $E$. Thanks to the MAP assignment, each row of $F$ and $E$ contains exactly one non-zero element. We saw earlier that this is equivalent to having orthogonal factors. The result is therefore an orthogonal, non-negative factorisation of the original translation matrix $M$.</Paragraph>
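      <Paragraph> The two stages can be sketched in plain Python as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: the EM updates are the standard PLSA ones (posterior proportional to P(c) P(f|c) P(e|c), then renormalised expected counts), and the function names plsa_em and map_cepts, the random initialisation, the iteration count and the toy count matrix are all choices made here.
<![CDATA[
```python
import random


def plsa_em(M, K, n_iter=200, seed=0):
    """Fit P(c), P(f|c) and P(e|c) to the count matrix M with plain EM."""
    rng = random.Random(seed)
    I, J = len(M), len(M[0])
    total = float(sum(sum(row) for row in M))

    def random_dist(n):
        weights = [rng.random() + 0.1 for _ in range(n)]
        norm = sum(weights)
        return [w / norm for w in weights]

    Pc = [1.0 / K] * K
    Pf = [random_dist(I) for _ in range(K)]  # Pf[k][i] = P(f_i | c_k)
    Pe = [random_dist(J) for _ in range(K)]  # Pe[k][j] = P(e_j | c_k)
    for _ in range(n_iter):
        acc_c = [0.0] * K
        acc_f = [[0.0] * I for _ in range(K)]
        acc_e = [[0.0] * J for _ in range(K)]
        for i in range(I):
            for j in range(J):
                # E-step: P(c | f_i, e_j) is proportional to P(c) P(f_i|c) P(e_j|c).
                joint = [Pc[k] * Pf[k][i] * Pe[k][j] for k in range(K)]
                norm = sum(joint)
                if M[i][j] == 0 or norm == 0.0:
                    continue
                for k in range(K):
                    expected = M[i][j] * joint[k] / norm
                    acc_c[k] += expected
                    acc_f[k][i] += expected
                    acc_e[k][j] += expected
        # M-step: renormalise the expected counts.
        for k in range(K):
            if acc_c[k] > 0.0:
                Pf[k] = [x / acc_c[k] for x in acc_f[k]]
                Pe[k] = [x / acc_c[k] for x in acc_e[k]]
            Pc[k] = acc_c[k] / total
    return Pc, Pf, Pe


def map_cepts(Pc, cond):
    """MAP cept per word: argmax over c of P(c) P(word | c)."""
    n_words = len(cond[0])
    return [max(range(len(Pc)), key=lambda k: Pc[k] * cond[k][w])
            for w in range(n_words)]


# Hypothetical 3x3 count matrix: f0 and f1 go with e0, f2 goes with e1 and e2.
M = [[50, 2, 1],
     [45, 1, 2],
     [1, 48, 52]]
Pc, Pf, Pe = plsa_em(M, K=2)
cept_f = map_cepts(Pc, Pf)
cept_e = map_cepts(Pc, Pe)
# Two words are aligned exactly when they share the same MAP cept.
links = [(i, j) for i, cf in enumerate(cept_f)
         for j, ce in enumerate(cept_e) if cf == ce]
print(cept_f, cept_e, links)
```
]]>
      Keeping only the MAP cept for each word gives factors with one non-zero entry per row, so the links listed at the end correspond to the non-zero entries of the orthogonalised product. As with any EM run, the result may depend on the initialisation.</Paragraph>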
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Number of cepts
</SectionTitle>
      <Paragraph position="0"> In general, the number of cepts is unknown and must be estimated. This corresponds to choosing the number of components in PLSA, a classical model selection problem. The likelihood may not be used, as it always increases as components are added. A standard approach to optimise the complexity of a mixture model is to maximise the likelihood, penalised by a term that increases with model complexity, such as AIC (Akaike, 1974) or BIC (Schwarz, 1978). BIC asymptotically chooses the correct model size (for complete models), while AIC always overestimates the number of components, but usually yields good predictive performance. As the largest possible number of cepts is $\min(I, J)$, and the smallest is 1 (all $f_i$ aligned to all $e_j$), we estimate the optimal number of cepts by maximising AIC or BIC between these two extremes.</Paragraph>
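      <Paragraph> A possible way to carry out this selection is sketched below. The exact penalty terms are not given in this section, so the code assumes the usual textbook forms written as penalised log-likelihoods to be maximised: AIC = L - p and BIC = L - (p / 2) ln N, with p = K (I + J - 1) - 1 free parameters for a K-component PLSA model and N the total count. The log-likelihood values are made up for illustration, and all function names are hypothetical.
<![CDATA[
```python
import math


def n_free_parameters(I, J, K):
    """Free parameters of a K-component PLSA model over I x J word pairs."""
    return (K - 1) + K * (I - 1) + K * (J - 1)


def aic(loglik, n_params):
    """Penalised log-likelihood, AIC style (assumed form, to be maximised)."""
    return loglik - n_params


def bic(loglik, n_params, n_obs):
    """Penalised log-likelihood, BIC style (assumed form, to be maximised)."""
    return loglik - 0.5 * n_params * math.log(n_obs)


def select_k(logliks, I, J, n_obs, criterion):
    """Return the K with the highest penalised log-likelihood."""
    def score(K):
        p = n_free_parameters(I, J, K)
        if criterion == "aic":
            return aic(logliks[K], p)
        return bic(logliks[K], p, n_obs)
    return max(sorted(logliks), key=score)


# Made-up maximised log-likelihoods for K = 1..4 (illustration only).
logliks = {1: -3200.0, 2: -2900.0, 3: -2880.0, 4: -2875.0}
print(select_k(logliks, I=4, J=5, n_obs=1000, criterion="aic"))
print(select_k(logliks, I=4, J=5, n_obs=1000, criterion="bic"))
```
]]>
      With these invented numbers the AIC-style score prefers 3 components and the BIC-style score prefers 2, which is in line with the remark that AIC leans towards larger models.</Paragraph>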
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Dealing with null alignments
</SectionTitle>
      <Paragraph position="0"> Alignment to a &quot;null&quot; word may be a feature of the underlying statistical model (e.g. IBM models), or it may be introduced to accommodate words which have a low association measure with all other words.</Paragraph>
      <Paragraph position="1"> Using PLSA, we can deal with null alignments in a principled way by introducing a null word on each side ($f_0$ and $e_0$), and two null cepts (&quot;f-null&quot; and &quot;e-null&quot;) with a 1-1 alignment to the corresponding null word, to ensure that null alignments will only be 1-N or M-1. This constraint is easily implemented using proper initial conditions in EM.</Paragraph>
      <Paragraph position="2"> Denoting the null cepts as $c_{f\emptyset}$ and $c_{e\emptyset}$, 1-1 alignments between null cepts and the corresponding null words impose the conditions:  1. $P(f_0 | c_{f\emptyset}) = 1$ and $\forall i \neq 0$, $P(f_i | c_{f\emptyset}) = 0$; 2. $P(e_0 | c_{e\emptyset}) = 1$ and $\forall j \neq 0$, $P(e_j | c_{e\emptyset}) = 0$.</Paragraph>
      <Paragraph position="3">  Stepping through the E-step and M-step equations (3-6), we see that these conditions are preserved by each EM iteration. In order to deal with null alignments, the model is therefore augmented with two null cepts, for which the probabilities are initialised according to the above conditions. As these are preserved through EM, we maintain proper 1-N and M-1 alignments to the null words. The main difference between null cepts and the other cepts is that we relax the propriety constraint and do not force null cepts to be aligned to at least one word on either side. This is because in many cases, all words from a sentence can be aligned to non-null words, and do not require any null alignments.</Paragraph>
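      <Paragraph> That EM preserves these initial conditions can be checked on a toy case. The sketch below assumes the standard PLSA updates; the function name em_step, the count matrix (with index 0 standing for the null word on each side) and the initial values are illustrative assumptions.
<![CDATA[
```python
def em_step(M, Pc, Pf, Pe):
    """One EM iteration for PLSA; Pf[k][i] = P(f_i|c_k), Pe[k][j] = P(e_j|c_k)."""
    K, I, J = len(Pc), len(M), len(M[0])
    total = float(sum(sum(row) for row in M))
    acc_c = [0.0] * K
    acc_f = [[0.0] * I for _ in range(K)]
    acc_e = [[0.0] * J for _ in range(K)]
    for i in range(I):
        for j in range(J):
            joint = [Pc[k] * Pf[k][i] * Pe[k][j] for k in range(K)]
            norm = sum(joint)
            if M[i][j] == 0 or norm == 0.0:
                continue
            for k in range(K):
                expected = M[i][j] * joint[k] / norm
                acc_c[k] += expected
                acc_f[k][i] += expected
                acc_e[k][j] += expected
    new_Pc = [a / total for a in acc_c]
    new_Pf = [[x / acc_c[k] for x in acc_f[k]] if acc_c[k] > 0.0 else Pf[k]
              for k in range(K)]
    new_Pe = [[x / acc_c[k] for x in acc_e[k]] if acc_c[k] > 0.0 else Pe[k]
              for k in range(K)]
    return new_Pc, new_Pf, new_Pe


# Hypothetical counts; row 0 is the null source word, column 0 the null target word.
M = [[0, 3, 40],
     [5, 60, 2],
     [2, 1, 55]]
third = 1.0 / 3.0
# Cept 0 is "f-null", cept 1 is "e-null", cepts 2 and 3 are regular cepts.
Pc = [0.25, 0.25, 0.25, 0.25]
Pf = [[1.0, 0.0, 0.0], [third, third, third], [0.1, 0.6, 0.3], [0.1, 0.3, 0.6]]
Pe = [[third, third, third], [1.0, 0.0, 0.0], [0.1, 0.6, 0.3], [0.1, 0.3, 0.6]]
for _ in range(50):
    Pc, Pf, Pe = em_step(M, Pc, Pf, Pe)
print(Pf[0], Pe[1])
```
]]>
      A probability that starts at exactly 0 receives zero expected counts in every E-step, so the null cepts stay tied to their null words throughout.</Paragraph>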
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Modelling noise
</SectionTitle>
      <Paragraph position="0"> Most elements of $M$ usually have a non-zero association measure. This means that for proper alignments, which give zero probability to alignments outside identified blocks, actual observations have exactly 0 probability, i.e. the log-likelihood of parameters corresponding to proper alignments is undefined. We therefore refine the model, adding a noise component indexed by $c = 0$:</Paragraph>
      <Paragraph position="1"> $P(f, e) = \sum_{c &gt; 0} P(c) \, P(f|c) \, P(e|c) + P(c = 0) \, P(f, e | c = 0)$</Paragraph>
      <Paragraph position="2"> The simplest choice for the noise component is a uniform distribution, $P(f, e | c = 0) \propto 1$. E-step and M-steps in eqs. (3-6) are unchanged for $c &gt; 0$, and the E-step equation for $c = 0$ is easily adapted:</Paragraph>
      <Paragraph position="3"> $P(c = 0 | f, e) \propto P(c = 0) \, P(f, e | c = 0)$</Paragraph>
      <Paragraph position="4"/>
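      <Paragraph> The effect of the noise component on the E-step can be illustrated as follows. This is a sketch under the assumption of a uniform noise distribution, P(f, e | c = 0) = 1 / (I J); the function name posteriors_with_noise and all numerical values are hypothetical.
<![CDATA[
```python
def posteriors_with_noise(i, j, Pc, Pf, Pe, p_noise):
    """Return [P(c=0|f_i,e_j), P(c=1|f_i,e_j), ...]; index 0 is the noise component."""
    I, J = len(Pf[0]), len(Pe[0])
    joint = [p_noise / (I * J)]  # uniform noise: P(f, e | c=0) = 1 / (I J)
    joint += [Pc[k] * Pf[k][i] * Pe[k][j] for k in range(len(Pc))]
    norm = sum(joint)
    return [x / norm for x in joint]


# Hypothetical proper alignment with two cepts: f0-e0 and f1-e1.
Pc = [0.495, 0.495]          # P(c) for the regular cepts; P(c=0) = 0.01
Pf = [[1.0, 0.0], [0.0, 1.0]]
Pe = [[1.0, 0.0], [0.0, 1.0]]
inside = posteriors_with_noise(0, 0, Pc, Pf, Pe, p_noise=0.01)
outside = posteriors_with_noise(0, 1, Pc, Pf, Pe, p_noise=0.01)
print(inside)
print(outside)
```
]]>
      A pair outside every block is explained entirely by the noise component, so its probability is small but no longer zero and the log-likelihood stays finite.</Paragraph>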
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Example
</SectionTitle>
    <Paragraph position="0"> We first illustrate the factorisation process on a simple example. We use the data provided for the French-English shared task of the 2003 HLT-NAACL Workshop on Building and Using Parallel Texts (Mihalcea and Pedersen, 2003). The data is from the Canadian Hansard, and reference alignments were originally produced by Franz Och and Hermann Ney (Och and Ney, 2000). Using the entire corpus (20 million words), we trained English-to-French and French-to-English IBM4 models using GIZA++. For all sentences from the trial and test set (37 and 447 sentences), we generated up to 100 best alignments for each sentence and in each direction. For each pair of source and target words $(f_i, e_j)$, the association measure $m_{ij}$ is simply the number of times these words were aligned together in the two N-best lists, leading to a count between 0 (never aligned) and 200 (always aligned).</Paragraph>
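    <Paragraph> Building the count matrix from the two N-best lists might look like the sketch below. The data layout (each alignment as a list of (i, j) index pairs), the function name count_matrix and the toy lists are assumptions made for illustration; the actual N-best output format of GIZA++ is not modelled here.
<![CDATA[
```python
def count_matrix(nbest_lists, I, J):
    """m_ij = number of N-best alignments (over all lists) that link f_i and e_j."""
    M = [[0] * J for _ in range(I)]
    for nbest in nbest_lists:          # one N-best list per translation direction
        for alignment in nbest:        # an alignment is a collection of (i, j) links
            for i, j in set(alignment):
                M[i][j] += 1
    return M


# Hypothetical toy input: 2 source words, 2 target words, 2-best lists per direction.
fr_to_en = [[(0, 0), (1, 1)], [(0, 0), (1, 0)]]
en_to_fr = [[(0, 0), (1, 1)], [(0, 1), (1, 1)]]
M = count_matrix([fr_to_en, en_to_fr], I=2, J=2)
print(M)
```
]]>
    With 2-best lists in each direction a count can be at most 4, in the same way that 100-best lists in two directions bound the counts at 200.</Paragraph>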
    <Paragraph position="1"> We focus on sentence 1023, from the trial set.</Paragraph>
    <Paragraph position="2"> Figure 5 shows the reference alignments together with the generated counts. There is a background &quot;noise&quot; count of 3 to 5 (small dots) and the largest counts are around 145-150. The N-best counts seem to give a good idea of the alignments, although clearly there is no chance that our factorisation algorithm will recover the alignment of the two instances of 'de' to 'need', as there is no evidence for it in the data. The ambiguity that the factorisation will have to address, and that is not easily resolved using, e.g., thresholding, is whether 'ont' should be aligned to 'They' or to 'need'.</Paragraph>
    <Paragraph position="3"> The N-best count matrix serves as the translation matrix. We estimate PLSA parameters for</Paragraph>
    <Paragraph position="5"> their maximum for $K = 6$. We therefore select 6 cepts for this sentence, and produce the alignment matrices shown in figure 6. Note that the order of the cepts is arbitrary (here the first cept corresponds to 'et' -- 'and'), except for the null cepts which are fixed. There is a fixed 1-1 correspondence between these null cepts and the corresponding null words on each side, and only the French words 'de' are mapped to a null cept. Finally, the estimated noise level is $P(c = 0) = 0.00053$. The ambiguity associated with aligning 'ont' has been resolved through cepts 4 and 5. In our resulting model,</Paragraph>
    <Paragraph position="7"> The MAP assignment forces 'ont' to be aligned to cept 5 only, and therefore to 'need'.</Paragraph>
    <Paragraph position="8"> Note that although the count for (need, ont) is slightly larger than the count for (they, ont) (cf. fig. 5), this is not a trivial result. The model was able to resolve the fact that 'they' and 'need' had to be aligned to 2 different cepts, rather than e.g. a larger cept corresponding to a 2-4 alignment, and to produce proper alignments through the use of these cepts.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"> In order to perform a more systematic evaluation of the use of matrix factorisation for aligning words, we tested this technique on the full trial and test data from the 2003 HLT-NAACL Workshop. Note that the reference data has both &quot;Sure&quot; and &quot;Probable&quot; alignments, with about 77% of all alignments in the latter category. On the other hand, our system proposes only one type of alignment. The evaluation is done using the performance measures described in (Mihalcea and Pedersen, 2003): precision, recall and F-score on the probable and sure alignments, as well as the Alignment Error Rate (AER), which in our case is a weighted average of the recall on the sure alignments and the precision on the probable.</Paragraph>
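    <Paragraph position="1"> Given an alignment $A$ and gold standards $G_S$ and $G_P$ (for sure and probable alignments, respectively):</Paragraph>
    <Paragraph position="2"> $P_T = \frac{|A \cap G_T|}{|A|}$, $R_T = \frac{|A \cap G_T|}{|G_T|}$, $F_T = \frac{2 P_T R_T}{P_T + R_T}$, where $T$ is either $S$ or $P$, and $\mathrm{AER} = 1 - \frac{|A \cap G_S| + |A \cap G_P|}{|A| + |G_S|}$</Paragraph>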
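    <Paragraph> These measures can be computed on sets of links as in the sketch below, which assumes the usual definitions from the shared task (precision, recall and F-score against either gold set, and AER combining sure recall and probable precision). The toy alignment and gold sets are invented, and the sure links are assumed to be a subset of the probable ones.
<![CDATA[
```python
def precision(A, G):
    """|A and G| / |A|."""
    return len(A & G) / len(A) if A else 0.0


def recall(A, G):
    """|A and G| / |G|."""
    return len(A & G) / len(G) if G else 0.0


def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def aer(A, G_S, G_P):
    """Alignment Error Rate: 1 - (|A and G_S| + |A and G_P|) / (|A| + |G_S|)."""
    return 1.0 - (len(A & G_S) + len(A & G_P)) / (len(A) + len(G_S))


# Hypothetical gold standards and system output, as sets of (i, j) links.
G_S = {(0, 0), (1, 1)}
G_P = G_S | {(1, 2), (2, 2)}
A = {(0, 0), (1, 2), (2, 1)}

print(precision(A, G_S), recall(A, G_S))
print(precision(A, G_P), recall(A, G_P))
print(aer(A, G_S, G_P))
print(aer(G_S, G_S, G_P))  # outputting exactly the sure links
```
]]>
    The last line shows that a system returning exactly the sure links gets an AER of 0, whatever its coverage of the words.</Paragraph>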
    <Paragraph position="3"> Using these measures, we first evaluate the performance on the trial set (37 sentences): as we produce only one type of alignment and evaluate against &quot;Sure&quot; and &quot;Probable&quot;, we observe, as expected, that the recall is very good on sure alignments, but precision relatively poor, with the reverse situation on the probable alignments (table 1).</Paragraph>
    <Paragraph position="4"> This is because we generate an intermediate number of alignments. There are 338 sure and 1446 probable alignments (for 721 French and 661 English words) in the reference trial data, and we produce 707 (AIC) or 766 (BIC) alignments with ONMF.</Paragraph>
    <Paragraph position="5"> Most of them are at least probably correct, as attested by $P_P$, but only about half of them are in the &quot;Sure&quot; subset, yielding a low value of $P_S$. Similarly, because &quot;Probable&quot; alignments were generated as the union of alignments produced by two annotators, they sometimes lead to very large M-N alignments, which produce on average 2.5 to 2.7 alignments per word. By contrast, ONMF produces fewer than 1.2 alignments per word, hence the low value of $R_P$. As the AER is a weighted average of $R_S$ and $P_P$, the resulting AER values are relatively low for our method.</Paragraph>
    <Paragraph position="6">  We also compared the performance on the 447 test sentences to 1/ the intersection of the alignments produced by the top IBM4 alignments in either direction, and 2/ the best systems from (Mihalcea and Pedersen, 2003). On limited resources, Ralign.EF.1 (Simard and Langlais, 2003) produced the best F-score, as well as the best AER when NULL alignments were taken into account, while XRCE.Nolem.EF.3 (Dejean et al., 2003) produced the best AER when NULL alignments were discounted. Tables 2 and 3 show that ONMF improves on several of these results. In particular, we get better recall and F-score on the probable alignments (and even a better precision than Ralign.EF.1 in table 2). On the other hand, the performance, and in particular the precision, on sure alignments is dismal. We attribute this at least partly to a key difference between our model and the reference data: our model enforces coverage and makes sure that all words are aligned, while the &quot;Sure&quot; reference alignments have no such constraint and actually have very poor coverage. Indeed, less than half the words in the test set have a &quot;Sure&quot; alignment, which means that a method which ensures that all words are aligned will at best have a sub-50% precision. In addition, many reference &quot;Probable&quot; alignments are not proper alignments in the sense defined above. [Table caption fragment: ... orthogonal non-negative matrix factorisation (ONMF) using the AIC and BIC criteria for choosing the number of cepts. HLT-03 best F is Ralign.EF.1 and best AER is XRCE.Nolem.EF.3 (Mihalcea and Pedersen, 2003).]</Paragraph>
    <Paragraph position="7"> Note that the IBM4 intersection has a bias similar to the sure reference alignments, and performs very well in $F_S$, $P_P$ and especially in AER, even though it produces very incomplete alignments. This points to a particular problem with the AER in the context of our study. In fact, a system that outputs exactly the set of sure alignments achieves a perfect AER of 0, even though it aligns only about 23% of words, clearly an unacceptable drawback in many applications. We think that this issue may be addressed in two different ways. One time-consuming possibility would be to post-edit the reference alignment to ensure coverage and proper alignments. Another possibility would be to use the probabilistic model to mimic the reference data and generate both &quot;Sure&quot; and &quot;Probable&quot; alignments using e.g. thresholds on the estimated alignment probabilities. This approach may lead to better performance according to our metrics, but it is not obvious that the produced alignments will be more reasonable or even useful in a practical application.</Paragraph>
    <Paragraph position="8"> We also tested our approach on the Romanian-English task of the same workshop, cf. table 4. The 'HLT-03 best' is our earlier work (Dejean et al., 2003), simply based on IBM4 alignment using an additional lexicon extracted from the corpus.</Paragraph>
    <Paragraph position="9"> Slightly better results have been published since (Barbu, 2004), using additional linguistic processing, but those were not presented at the workshop. Note that the reference alignments for Romanian-English contain only &quot;Sure&quot; alignments, and therefore we only report the performance on those. In addition, $\mathrm{AER} = 1 - F_S$ in this setting. Table 4 shows that the matrix factorisation approach does not offer any quantitative improvements over these results. A gain of up to 10 points in recall does not offset a large decrease in precision. As a consequence, the AER for ONMF+AIC is about 10% higher than in our earlier work. This seems mainly due to the fact that the 'HLT-03 best' produces alignments for only about 80% of the words, while our technique ensures coverage and therefore aligns all words. These results suggest that the remaining 20% are particularly problematic. These quantitative results are disappointing given the sophistication of the method.</Paragraph>
    <Paragraph position="10"> It should be noted, however, that ONMF provides the qualitative advantage of producing proper alignments, and in particular ensures coverage. This may be useful in some contexts, e.g. training a phrase-based translation system.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.1 Model selection and stability
</SectionTitle>
      <Paragraph position="0"> Like all mixture models, PLSA is subject to local minima. Although using a few random restarts seems to yield good performance, the results on difficult-to-align sentences may still be sensitive to initial conditions. A standard technique to stabilise the EM solution is to use deterministic annealing or tempered EM (Rose et al., 1990). As a side effect, deterministic annealing actually makes model selection easier. At low temperature, all components are identical, and they differentiate as the temperature increases, until the final temperature, where we recover the standard EM algorithm. By keeping track of the component differentiations, we may consider multiple effective numbers of components in one pass, therefore alleviating the need for costly multiple EM runs with different cept numbers and multiple restarts.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.2 Other association measures
</SectionTitle>
      <Paragraph position="0"> ONMF is only a tool to factor the original translation matrix $M$, containing measures of associations between $f_i$ and $e_j$. The quality of the resulting alignment greatly depends on the way $M$ is filled. In our experiments we used counts from N-best alignments obtained from IBM model 4. This is mainly used as a proof of concept: other strategies, such as weighting the alignments according to their probability or rank in the N-best list, would be natural extensions. In addition, we are currently investigating the use of translation and distortion tables obtained from IBM model 2 to estimate $M$ at a lower cost. Ultimately, it would be interesting to obtain association measures $m_{ij}$ in a fully non-parametric way, using corpus statistics rather than translation models, which themselves perform some kind of alignment. We have investigated the use of co-occurrence counts or mutual information between words, but this has so far not proved successful, mostly because common words, such as function words, tend to dominate these measures. [Table caption fragments: (a) ... orthogonal non-negative matrix factorisation (ONMF) using the AIC and BIC criteria for choosing the number of cepts. HLT-03 best is Ralign.EF.1 (Mihalcea and Pedersen, 2003). Header fragment: no NULL alignments / with NULL alignments. (b) ... non-negative matrix factorisation (ONMF) using the AIC and BIC criteria for choosing the number of cepts. HLT-03 best is XRCE.Nolem (Mihalcea and Pedersen, 2003).]</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.3 M-1-0 alignments
</SectionTitle>
      <Paragraph position="0"> In our model, cepts ensure that resulting alignments are proper. There is however one situation in which improper alignments may be produced: if the MAP assigns f-words but no e-words to a cept (because e-words have more probable cepts), we may produce &quot;orphan&quot; cepts, which are aligned to words only on one side. One way to deal with this situation is simply to remove cepts which display this behaviour. Orphaned words may then be re-assigned to the remaining cepts, either directly or after re-training PLSA on the remaining cepts (this is guaranteed to converge as there is an obvious solution for $K = 1$).</Paragraph>
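      <Paragraph> The direct re-assignment variant could be sketched as follows. This is an assumption-laden illustration rather than the authors' procedure: scores_f[i][k] and scores_e[j][k] stand for P(c_k) P(word | c_k), orphan cepts are removed one at a time (lowest index first, an arbitrary choice), and the function names and score values are hypothetical.
<![CDATA[
```python
def map_over(scores, allowed):
    """MAP cept for each word, restricted to the cepts listed in allowed."""
    return [max(allowed, key=lambda k: row[k]) for row in scores]


def remove_orphans(scores_f, scores_e):
    """Drop orphan cepts one at a time and re-assign words to the remaining cepts."""
    allowed = list(range(len(scores_f[0])))
    while True:
        cept_f = map_over(scores_f, allowed)
        cept_e = map_over(scores_e, allowed)
        # A cept is an orphan if it has words on one side only.
        orphans = set(cept_f) ^ set(cept_e)
        if not orphans:
            return cept_f, cept_e
        allowed.remove(min(orphans))


# Hypothetical scores P(c) P(word|c): 3 source words, 2 target words, 3 cepts.
scores_f = [[0.9, 0.05, 0.05],
            [0.2, 0.5, 0.3],
            [0.1, 0.1, 0.8]]
scores_e = [[0.7, 0.2, 0.1],
            [0.1, 0.3, 0.6]]
cept_f, cept_e = remove_orphans(scores_f, scores_e)
print(cept_f, cept_e)
```
]]>
      The loop always stops, because with a single remaining cept every word on both sides is assigned to it, which is the obvious solution for K = 1 mentioned above.</Paragraph>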
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.4 Independence between sentences
</SectionTitle>
      <Paragraph position="0"> One natural comment on our factorisation scheme is that cepts should not be independent between sentences. However, it is easy to show that the factorisation is optimally done on a sentence-by-sentence basis. Indeed, what we factorise is the association measures $m_{ij}$. For a sentence-aligned corpus, the association measure between source and target words from two different sentence pairs should be exactly 0 because words should not be aligned across sentences. Therefore, the larger translation matrix (calculated on the entire corpus) is block diagonal, with non-zero association measures only in blocks corresponding to aligned sentences. As blocks on the diagonal are mutually orthogonal, the optimal global orthogonal factorisation is identical to the block-based (i.e. sentence-based) factorisation. Any corpus-induced dependency between alignments from different sentences must therefore be built into the association measure $m_{ij}$, and cannot be handled by the factorisation method. Note that this is the case in our experiments, as model 4 alignments rely on parameters obtained on the entire corpus.</Paragraph>
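      <Paragraph> The block structure can be illustrated with a small consistency check: stacking per-sentence factors on a block diagonal gives global factors whose product is the block-diagonal stack of the per-sentence alignments. The sketch below only illustrates this structure and is not a proof of optimality; the helper names and the two toy sentences are assumptions.
<![CDATA[
```python
def matmul_t(F, E):
    """Return F E^T for nested-list matrices."""
    return [[sum(a * b for a, b in zip(f_row, e_row)) for e_row in E]
            for f_row in F]


def stack_block_diagonal(blocks):
    """Place the given matrices on the diagonal of a larger zero matrix."""
    n_cols = sum(len(b[0]) for b in blocks)
    out, offset = [], 0
    for b in blocks:
        width = len(b[0])
        for row in b:
            out.append([0] * offset + list(row) + [0] * (n_cols - offset - width))
        offset += width
    return out


# Hypothetical per-sentence factors (rows are words, columns are cepts).
F1, E1 = [[1], [1]], [[1]]                              # sentence 1: one cept
F2, E2 = [[1, 0], [0, 1]], [[1, 0], [0, 1], [0, 1]]     # sentence 2: two cepts

F_global = stack_block_diagonal([F1, F2])
E_global = stack_block_diagonal([E1, E2])
A_global = matmul_t(F_global, E_global)
A_blocks = stack_block_diagonal([matmul_t(F1, E1), matmul_t(F2, E2)])
print(A_global == A_blocks)
for row in A_global:
    print(row)
```
]]>
      No link crosses a sentence boundary in the global alignment, since cepts from different sentences never share a word.</Paragraph>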
    </Section>
  </Section>
</Paper>