<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0801">
  <Title>Association-Based Bilingual Word Alignment</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 The Log-Likelihood-Ratio Association Measure
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <Paragraph position="0"> We base all our association-based word-alignment methods on the log-likelihood-ratio (LLR) statistic introduced to the NLP community by Dunning (1993). We chose this statistic because it has previously been found to be effective for automatically constructing translation lexicons (e.g., Melamed, 2000). We compute LLR scores using the following formula presented by Moore (2004):</Paragraph>
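      <Paragraph position="1"> [The equation itself was lost in extraction; the following LaTeX is a reconstruction based on the definitions in the next paragraph and the form given by Moore (2004):]
\mathrm{LLR}(f,e) \;=\; \sum_{f? \in \{f,\,!f\}} \; \sum_{e? \in \{e,\,!e\}} C(f?,e?) \, \log \frac{p(f? \mid e?)}{p(f?)}
</Paragraph>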
      <Paragraph position="2"> In this formula f and e mean that the words whose degree of association is being measured occur in the respective target and source sentences of an aligned sentence pair, !f and !e mean that the corresponding words do not occur in the respective sentences, f? and e? are variables ranging over these values, and C(f?,e?) is the observed joint count for the values of f? and e?. The probabilities in the formula refer to maximum likelihood estimates.</Paragraph>
      <Paragraph position="3"> Since the LLR score for a pair of words is high if the words have either a strong positive association or a strong negative association, we discard any negatively associated word pairs by requiring that p(f,e) &gt; p(f) * p(e). Initially, we computed the LLR scores for all positively associated English/French word pairs in our 500K sentence pair corpus. To reduce the memory requirements of our algorithms we discarded any word pairs whose LLR score was less than 1.0. This left us with 12,797,697 word pairs out of a total of 21,451,083 pairs that had at least one co-occurrence.</Paragraph>
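As an illustration (not part of the original paper), the following is a minimal Python sketch of this scoring-and-filtering step, using maximum likelihood estimates over sentence-pair counts as described above; the function and variable names are invented for the sketch.

import math

def llr(c_fe, c_f, c_e, n):
    """Log-likelihood-ratio association score for a word pair.

    c_fe: number of aligned sentence pairs containing both f and e.
    c_f, c_e: number of sentence pairs containing f (respectively e).
    n: total number of aligned sentence pairs.
    """
    # Observed joint counts C(f?, e?) for the four outcomes.
    joint = {
        (True, True): c_fe,
        (True, False): c_f - c_fe,
        (False, True): c_e - c_fe,
        (False, False): n - c_f - c_e + c_fe,
    }
    p_f = {True: c_f / n, False: (n - c_f) / n}  # MLE of p(f?)
    score = 0.0
    for (fv, ev), c in joint.items():
        if c == 0:
            continue                             # a zero count contributes nothing
        c_ev = c_e if ev else n - c_e            # C(e?)
        p_f_given_e = c / c_ev                   # MLE of p(f? | e?)
        score += c * math.log(p_f_given_e / p_f[fv])
    return score

def keep_pair(c_fe, c_f, c_e, n, min_llr=1.0):
    """Keep only positively associated pairs whose LLR score is at least 1.0."""
    return (c_fe / n) > (c_f / n) * (c_e / n) and llr(c_fe, c_f, c_e, n) >= min_llr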
    </Section>
  </Section>
  <Section position="5" start_page="1" end_page="3" type="metho">
    <SectionTitle>
4 One-to-One, Word Type Alignment Methods
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
4.1 Method 1
</SectionTitle>
      <Paragraph position="0"> The first set of association-based word-alignment methods we consider permits only one-to-one alignments and does not take word position into account.</Paragraph>
      <Paragraph position="1"> The simplest method we consider uses the LLR scores to link words according to Melamed's (2000) &quot;competitive linking algorithm&quot; for aligning words in a pair of sentences. Since competitive linking has no way to distinguish one instance of a particular word type from another, we operate with counts of linked and unlinked instances of word types, without trying to designate the particular instances the counts refer to. This version of competitive linking can be described as follows (a code sketch follows the list): * Find the pair consisting of an English word type and a French word type that have the highest association score of any pair of word types that both have remaining unlinked instances.</Paragraph>
      <Paragraph position="2"> * Increase by 1 the count of linked occurrences of this pair of word types, and decrease by 1 the count of unlinked instances of each of these word types.</Paragraph>
      <Paragraph position="3"> * Repeat until no more words can be linked.</Paragraph>
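The following is a minimal Python sketch of this type-level competitive linking pass (not from the paper; the dictionary of association scores and all names are illustrative). The threshold argument anticipates the score cut-off discussed below.

from collections import Counter

def competitive_link_types(e_words, f_words, scores, threshold=0.0):
    """Type-level competitive linking over one aligned sentence pair.

    e_words, f_words: word tokens of the two sentences.
    scores: dict mapping (e_type, f_type) to an association score (here LLR).
    threshold: association scores below this value are ignored.
    Returns a Counter of linked (e_type, f_type) pairs.
    """
    unlinked_e = Counter(e_words)       # remaining unlinked instances per type
    unlinked_f = Counter(f_words)
    links = Counter()
    while True:
        # Highest-scoring pair of word types that both still have
        # unlinked instances and a usable association score.
        candidates = [
            (scores[(e, f)], e, f)
            for e, n_e in unlinked_e.items() if n_e
            for f, n_f in unlinked_f.items() if n_f
            if (e, f) in scores and scores[(e, f)] >= threshold
        ]
        if not candidates:
            break                        # nothing left that can be linked
        _, e, f = max(candidates)
        links[(e, f)] += 1               # link one instance of each word type
        unlinked_e[e] -= 1
        unlinked_f[f] -= 1
    return links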
      <Paragraph position="4"> We will refer to this version of the competitive linking algorithm using LLR scores as Method 1. This is the method that Melamed uses to generate an initial alignment that he refines by re-estimation in his &quot;Method A&quot; (Melamed, 2000).</Paragraph>
      <Paragraph position="5"> Method 1 can terminate either because one or both sentences of the pair have no more unlinked words, or because no association scores exist for the remaining unlinked words. We can use this fact to trade off recall for precision by discarding association scores below a given threshold. Table 1 shows the precision/recall trade-off for Method 1 on our development set. Since Method 1 produces only word type alignments, these recall and precision scores are computed with respect to an oracle that makes the best possible choice among multiple occurrences of the same word type. (The oracle goes through the word type pairs in the same order as the competitive linking algorithm, linking particular instances of the word types. It prefers a pair that has a sure alignment in the annotated test data to a pair that has a possible alignment, and prefers a pair with a possible alignment to one with no alignment.)</Paragraph>
      <Paragraph position="6">  The best (oracular) AER is 0.216, with recall of 0.840 and precision of 0.747, occurring at an LLR threshold of 11.7.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
4.2 Method 2
</SectionTitle>
      <Paragraph position="0"> A disadvantage of Method 1 is that it makes alignment decisions for each sentence pair independently of the decisions for the same words in other sentence pairs. It turns out that we can improve alignment accuracy by biasing the alignment method towards linking words in a given sentence that are also linked in many other sentences.</Paragraph>
      <Paragraph position="1"> A simple way to do this is to perform a second alignment based on the conditional probability of a pair of words being linked according to Method 1, given that they both occur in a given sentence pair. We estimate this link probability as LP(f,e) = links(f,e) / cooc(f,e), where links(f,e) is the number of times f and e are linked according to Method 1, and cooc(f,e) is the number of times f and e co-occur in aligned sentences. We now define alignment Method 2 as follows: * Count the number of links in the training corpus for each pair of words linked in any sentence pair by Method 1.</Paragraph>
      <Paragraph position="2"> * Count the number of co-occurrences in the training corpus for each pair of words linked in any sentence pair by Method 1.</Paragraph>
      <Paragraph position="3">  * Compute LP scores for each pair of words linked in any sentence pair by Method 1.</Paragraph>
      <Paragraph position="4"> * Align sentence pairs by competitive linking using LP scores.</Paragraph>
      <Paragraph position="5"> Melamed (1998) points out that there are at least three ways to count the number of co-occurrences of f and e in a given sentence pair if one or both of f and e have more than one occurrence. Based on preliminary explorations, we chose to count the co-occurrences of f and e as the maximum of the number of occurrences of f and the number of occurrences of e, if both f and e occur; otherwise cooc(f,e)=0.</Paragraph>
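As an illustration (again not from the paper), here is a sketch of the corpus-level counting and LP scoring just described, reusing the co-occurrence convention above; all names are illustrative. A second competitive-linking pass over each sentence pair, using these LP scores (and an LP cut-off threshold) in place of the LLR scores, then gives the Method 2 alignment.

from collections import Counter

def lp_scores(corpus, method1_links):
    """LP(f, e) = links(f, e) / cooc(f, e) over the whole training corpus.

    corpus: iterable of (e_words, f_words) aligned sentence pairs.
    method1_links: the Counter of (e_type, f_type) links that Method 1
        produced for each sentence pair, in the same order as corpus.
    """
    link_counts = Counter()
    for links in method1_links:
        link_counts.update(links)
    cooc_counts = Counter()
    for e_words, f_words in corpus:
        e_counts, f_counts = Counter(e_words), Counter(f_words)
        for e in e_counts:
            for f in f_counts:
                if (e, f) in link_counts:
                    # Count co-occurrences as the maximum of the two
                    # occurrence counts, per the convention above.
                    cooc_counts[(e, f)] += max(e_counts[e], f_counts[f])
    return {pair: n / cooc_counts[pair] for pair, n in link_counts.items()}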
      <Paragraph position="6"> Table 2 shows the precision/recall trade-off for Method 2 on our development set. Again, an oracle is used to choose among multiple occurrences of the same word type. The best (oracular) AER is 0.126, with recall of 0.830 and precision of 0.913, occurring at an LP threshold of 0.215.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.3 Method 3
</SectionTitle>
      <Paragraph position="0"> It is apparent that Method 2 performs much better than Method 1 at any but the lowest recall levels.</Paragraph>
      <Paragraph position="1"> However, it fails to display a monotonic relationship between recall and precision as the score cut-off threshold is tightened or loosened. This seems to be due to the fact that the LP measure, unlike LLR, does not discount estimates made on the basis of little data. Thus a pair of words that has one co-occurrence in the corpus, which is linked by Method 1, gets the same LP score of 1.0 as a pair of words that have 100 co-occurrences in the corpus and are linked by Method 1 every time they co-occur.</Paragraph>
      <Paragraph position="2"> A simple method of compensating for this overconfidence in rare events is to apply absolute discounting. We define the discounted link probability as LP_d(f,e) = (links(f,e) - d) / cooc(f,e), where d is an absolute discount subtracted from the link count.</Paragraph>
      <Paragraph position="4"> We found the optimal value of d for our development set to be approximately 0.9, using the optimal, oracular AER as our objective function.</Paragraph>
      <Paragraph position="5"> Table 3 shows the precision/recall trade-off for Method 3 on our development set, again with the use of an oracle to choose among multiple occurrences of the same word type. The best (oracular) AER is 0.119, with recall of 0.827 and precision of 0.929, occurring at an LP_d threshold of 0.184. This is an improvement of 0.7% absolute in AER over Method 2, but perhaps as importantly, the monotonic trade-off between precision and recall is essentially restored. We can see in Table 3 that we can achieve recall of 60% on this development set with precision of 98.7%, and we can obtain even higher precision by sacrificing recall slightly more. With Method 2, 96.7% was the highest precision that could be obtained at any recall level measured.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="3" end_page="4" type="metho">
    <SectionTitle>
5 Allowing Many-to-One Alignments
</SectionTitle>
    <Paragraph position="0"> It appears from the results for Methods 2 and 3 on the development set that reasonable alignment accuracy may be achievable using association-based techniques (pending a way of selecting the best word token alignments for a given word type alignment).</Paragraph>
    <Paragraph position="1"> However, we can never learn any many-to-one alignments with methods based on competitive linking, as either we or Melamed have used it so far.</Paragraph>
    <Paragraph position="2"> To address this issue, we introduce the notion of bilingual word clusters and show how iterated applications of variations of Method 3 can learn many-to-one mappings by building up clusters incrementally. Consider the abstract data structure to which competitive linking is applied as a tuple of bags (multisets). In Methods 1-3, for each sentence pair, competitive linking is applied to a tuple of a bag of French words and a bag of English words. Suppose we apply Method 3 with a high LP_d cut-off threshold so that we can be confident that almost all the links we produce are correct, but many French and English words remain unlinked. We can regard this as producing for each sentence pair a tuple of three bags: bags of the remaining unlinked English and French words, plus a third bag of word clusters consisting of the linked English and French words. To produce more complex alignments, we can then carry out an iteration of a generalized version of Method 3, in which competitive linking connects remaining unlinked English and French words to each other or to previously derived bilingual clusters.</Paragraph>
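To make the data structure concrete, here is a small illustrative sketch (not from the paper) of the "tuple of bags" state for one sentence pair after the first, high-precision pass; the names and the cluster representation are assumptions made for the sketch.

from collections import Counter
from typing import NamedTuple

class SentenceState(NamedTuple):
    """The 'tuple of bags' for one sentence pair between passes."""
    unlinked_e: Counter     # bag of still-unlinked English words
    unlinked_f: Counter     # bag of still-unlinked French words
    clusters: Counter       # bag of bilingual clusters, keyed as
                            # (english_words_tuple, french_words_tuple)

def after_first_pass(e_words, f_words, confident_links):
    """Build the three bags after the first, high-precision word-to-word pass.

    confident_links: Counter of (e_type, f_type) links kept under the
        high LP_d cut-off for this sentence pair.
    """
    state = SentenceState(Counter(e_words), Counter(f_words), Counter())
    for (e, f), n in confident_links.items():
        state.unlinked_e[e] -= n                 # these instances are now linked
        state.unlinked_f[f] -= n
        state.clusters[((e,), (f,))] += n        # one-word-per-side cluster
    return state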
    <Paragraph position="3"> As just described, the approach does not work very well, because it tends to build clusters too often when it should produce one-to-one alignments. The problem seems to be that translation tends to be nearly one-to-one, especially with closely related languages, and this bias is not reflected in the method so far. To remedy this, we introduce two biases in favor of one-to-one alignments. First, we discount the LLR scores between words and clusters, so the competitive linking pass using these scores must find a substantially stronger association for a given word to a cluster than to any other unlinked word before it will link the word to the cluster. Second, we apply the same high LP_d cut-off on word-to-cluster links that we used in the first iteration of Method 3 to generate word-to-word links. This leaves many unlinked words, so we apply one more iteration of yet another modified version of Method 3 in which competitive linking is allowed to link the remaining unlinked words to other unlinked words, but not to clusters. We refer to this sequence of three iterations of variations of Method 3 as Method 4.</Paragraph>
    <Paragraph position="4"> To evaluate alignments involving clusters according to Och and Ney's method, we translate clusters back into all possible word-to-word alignments consistent with the cluster. We found the optimal value on the development set for the LLR discount for clusters to be about 2000, and the optimal value for the LP_d cut-off for the first two iterations of Method 3 to be about 0.7. With these parameter values, the best (oracular) AER for Method 4 is 0.110, with recall of 0.845 and precision of 0.929, occurring at a final LP_d threshold of 0.188. (In principle, the process can be further iterated to build up clusters of arbitrary size, but at this stage we have not yet found an effective way of deciding when a cluster should be expanded beyond two-to-one or one-to-two.)</Paragraph>
    <Paragraph position="5"> This is an improvement of 0.9% absolute in AER over Method 3, resulting from an improvement of 1.7% absolute in recall, with virtually no change in precision.</Paragraph>
  </Section>
  <Section position="7" start_page="4" end_page="5" type="metho">
    <SectionTitle>
6 Token Alignment Selection Methods
</SectionTitle>
    <Paragraph position="0"> Finally, we turn to the problem of selecting the best word token alignment for a given word type alignment, and more generally to the incorporation of positional information into association-based word alignment. We consider three token alignment selection methods, each of which can be combined with any of the word type alignment methods we have previously described. We will therefore refer to these methods by letter rather than number, with a complete word token alignment method being designated by a number/letter combination.</Paragraph>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
6.1 Method A
</SectionTitle>
      <Paragraph position="0"> The simplest method for choosing a word token alignment for a given word type alignment is to make a random choice (without replacement) for each word type in the alignment from among the tokens of that type. We refer to this as Method A.</Paragraph>
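A minimal sketch of this random selection (not from the paper; the representation of the type alignment and all names are assumptions):

import random

def method_a(type_links, e_positions, f_positions, rng=random):
    """Choose a token alignment for a type alignment by random selection
    without replacement (a sketch of Method A).

    type_links: Counter mapping (e_type, f_type) to its number of links.
    e_positions, f_positions: dict mapping each word type to the list of
        token positions at which it occurs in the sentence.
    """
    e_pool = {w: list(p) for w, p in e_positions.items()}   # copy, so each token
    f_pool = {w: list(p) for w, p in f_positions.items()}   # is used only once
    token_links = []
    for (e, f), n in type_links.items():
        for _ in range(n):
            i = e_pool[e].pop(rng.randrange(len(e_pool[e])))
            j = f_pool[f].pop(rng.randrange(len(f_pool[f])))
            token_links.append((i, j))
    return token_links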
    </Section>
    <Section position="2" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
6.2 Method B
</SectionTitle>
      <Paragraph position="0"> In Method B, we find the word token alignment consistent with a given word type alignment that is the most nearly monotonic. We decide this by defining the degree of nonmonotonicity of an alignment, and minimizing that. If more than one word token alignment has the lowest degree of nonmonotonicity, we pick one of them arbitrarily.</Paragraph>
      <Paragraph position="1"> To compute the nonmonotonicity of a word token alignment, we arbitrarily designate one of the languages as the source and the other as the target.</Paragraph>
      <Paragraph position="2"> We sort the word pairs in the alignment, primarily by source word position, and secondarily by target word position. We then iterate through the sorted alignment, looking only at the target word positions.</Paragraph>
      <Paragraph position="3"> The nonmonotonicity of the alignment is defined as the sum of the absolute values of the backward jumps in this sequence of target word positions.</Paragraph>
      <Paragraph position="4"> For example, suppose we have the sorted alignment ((1,1)(2,4)(2,5)(3,2)). The sequence of target word positions in this sorted alignment is (1,4,5,2).</Paragraph>
      <Paragraph position="5"> This has only one backwards jump, which is of size 3, so that is the nonmonotonicity value for this alignment. For a complete or partial alignment, the  nonmonotonicity is clearly easy to compute, and nonmonotonicity can never be decreased by adding links to a partial alignment. The least nonmonotonic alignment is found by an incremental best-first search over partial alignments kept in a priority queue sorted by nonmonotonicity.</Paragraph>
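A small sketch (not from the paper) of the nonmonotonicity computation just defined, using the worked example from the text; an alignment is taken to be a list of (source_position, target_position) pairs.

def nonmonotonicity(alignment):
    """Sum of the absolute sizes of backward jumps in target positions,
    after sorting by source position and then by target position."""
    ordered = sorted(alignment)                 # (source, target) pairs
    targets = [t for _, t in ordered]
    return sum(prev - cur
               for prev, cur in zip(targets, targets[1:])
               if cur < prev)

# Example from the text: ((1,1), (2,4), (2,5), (3,2)) has one backward
# jump, of size 3, so its nonmonotonicity is 3.
assert nonmonotonicity([(1, 1), (2, 4), (2, 5), (3, 2)]) == 3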
    </Section>
    <Section position="3" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
6.3 Method C
</SectionTitle>
      <Paragraph position="0"> Method C is similar to Method B, but it also uses nonmonotonicity in deciding which word types to align. In Method C, we modify the last pass of competitive linking of the word type alignment method to stop at a relatively high score threshold, and we compute all minimally nonmonotonic word token alignments for the resulting word type alignment.</Paragraph>
      <Paragraph position="1"> We then continue the final competitive linking pass applied to word tokens rather than types, but we select only word token links that can be added to one of the remaining word token alignments without increasing its nonmonotonicity. Specifically, for each remaining word type pair (in order of decreasing score) we make repeated passes through all of the word token alignments under consideration, adding one link between previously unlinked instances of the two word types to each alignment where it is possible to do so without increasing nonmonotonicity, until there are no longer unlinked instances of both word types or no more links between the two word types can be added to any alignment without increasing its nonmonotonicity. At the end of each pass, if some, but not all of the alignments have had a link added, we discard the alignments that have not had a link added; if no alignments have had a link added, we go on to the next word type pair. This final competitive linking pass continues until another, lower score threshold is reached.</Paragraph>
    </Section>
    <Section position="4" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
6.4 Comparison of Token Alignment Selection Methods
</SectionTitle>
      <Paragraph position="0"> Of these three methods, only C has additional free parameters, which we jointly optimized on the development set for each of the word type alignment methods. All other parameters were left at their optimal values for the oracular choice of word token alignment.</Paragraph>
      <Paragraph position="1"> Table 4 shows the optimal AER on the development set, for each combination of word type alignment method and token alignment selection method that we have described. For comparison, the oracle for each of the pure word type alignment methods is added to the table as a token alignment selection method. As we see from the table, Method 4 is the best word type alignment method for every token alignment selection method, and Method C is the best actual token alignment selection method for every word type alignment method. Method C even beats the token alignment selection oracle for every word type alignment method except Method 1. This is possible because Method C incorporates nonmonotonicity information into the selection of linked word types, whereas the oracle is applied after all word type alignments have been chosen.</Paragraph>
      <Paragraph position="2"> The best combined overall method is 4C. For this combination, the optimal value on the development set for the first score threshold of Method C was about 0.65 and the optimal value of the second score threshold of Method C was about 0.075.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>