<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2178">
  <Title>K-vec: A New Approach for Aligning Parallel Texts</Title>
  <Section position="4" start_page="1096" end_page="1097" type="metho">
    <SectionTitle>
4. Mutual Information
</SectionTitle>
    <Paragraph position="0"> Intuitively, these statements seem to be true, but we need to make them more precise. One could have chosen quite a number of similarity metrics for this purpose. We use mutual information:</Paragraph>
    <Paragraph position="2"> That is, we want to compare the probability of seeing fisheries and p~ches in the same piece to chance. The probability of seeing the two words in the same piece is simply: a prob(Vf, Vp)- a+b+c+d The marginal probabilities are: a+b prob(Vf)- a+b+c+d a+c prob(Vp) = a+b+c+d For fisheries --~ p~ches, prob(Vf, Vp) =prob(Vf) =prob(Vp) =0.2. Thus, the mutual information is log25 or 2.32 bits, meaning that the joint probability is 5 times more likely than chance. In contrast, for fisheries ~ lections, prob ( V f, V p ) = O, prob(Vf) =0.5 and prob(Vp) = 0.4. Thus, the mutual information is log 2 0, meaning that the joint is infinitely less likely than chance. We conclude that it is quite likely that fisheries and p~ches are translations of one another, much more so than fisheries and lections.</Paragraph>
  </Section>
  <Section position="5" start_page="1097" end_page="7098" type="metho">
    <SectionTitle>
5. Significance
</SectionTitle>
    <Paragraph position="0"> Unfortunately, mutual information is often unreliable when the counts are small. For example, there are lots of infrequent words. If we pick a pair of these words at random, there is a very large chance that they would receive a large mutual information value by chance. For example, let e be an English word that appeared just once and letfbe a French word that appeared just once. Then, there a non-trivial chance (-~) that e andf will appear is in the same piece, as shown in Table 7. If this should happen, the mutual information estimate would be very large, i.e., logK, and probably misleading.</Paragraph>
    <Paragraph position="2"> In order to avoid this problem, we use a t-score to filter out insignificant mutual information values.</Paragraph>
    <Paragraph position="4"> Using the numbers in Table 7, t=l, which is not significant. (A t of 1.65 or more would be significant at the p &gt; 0.95 confidence level.) Similarly, if e and f appeared in just two pieces 1 each, then there is approximately a ~ chance that they would both appear in the same two pieces, and then the mutual information score would be quite log,, ~--, but we probably wouldn't believe it high, Z.</Paragraph>
    <Paragraph position="5"> because the t-score would be only &amp;quot;~-. By this definition of significance, we need to see the two words in at least 3 different pieces before the result would be considered significant.</Paragraph>
    <Paragraph position="6"> This means, unfortunately, that we would reject fisheries --+ p~ches because we found them in only two pieces. The problem, of course, is that we don't have enough pieces. When K=10, there simply isn't enough resolution to see what's going on. At K=100, we obtain the contingency matrix shown in Table 8, and the t-score is significant  How do we choose K? As we have seen, if we choose too small a K, then the mutual information values will be unreliable. However, we can only increase K up to a point. If we set K to a ridiculously large value, say the size of the English text, then an English word and its translations are likely to fall in slightly different pieces due to random fluctuations and we would miss the signal. For this work, we set K to the square root of the size of the corpus.</Paragraph>
    <Paragraph position="7"> K should be thought of as a scale parameter. If we use too low a resolution, then everything turns into a blur and it is hard to see anything. But if we use too high a resolution, then we can miss the signal if  it isn't just exactly where we are looking.</Paragraph>
    <Paragraph position="8"> Ideally, we would like to apply the K-vec algorithm to all pairs of English and French words, but unfortunately, there are too many such pairs to consider. We therefore limited the search to pairs of words in the frequency range: 3-10. This heuristic makes the search practical, and catches many interesting pairs)</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML