<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2178">
  <Title>&quot;Using Cognates to Align Sentences in Bilingual Corpora,&quot; Fourth International Conference on Theoretical and Methodological Issues in Machine</Title>
  <Section position="3" start_page="0" end_page="1096" type="intro">
    <SectionTitle>
2. The K-vec Algorithm
</SectionTitle>
    <Paragraph position="0"> K-vec starts by estimating the lexicon. Consider the example: fisheries → pêches. The K-vec algorithm will discover this fact by noting that the distribution of fisheries in the English text is similar to the distribution of pêches in the French.</Paragraph>
    <Paragraph position="1"> The concordances for fisheries and pêches are shown in Tables 1 and 2 (at the end of this paper).1 1. These tables were computed from a small fragment of the Canadian Hansards that has been used in a number of other studies: Church (1993) and Simard et al. (1992). The English text has 165,160 words and the French text has 185,615 words.</Paragraph>
    <Paragraph position="2"> There are 19 instances of fisheries and 21 instances of pêches. The numbers along the left-hand edge show where the concordances were found in the texts. We want to know whether the distribution of numbers in Table 1 is similar to those in Table 2, and if so, we will suspect that fisheries and pêches are translations of one another. A quick look at the two tables suggests that the two distributions are probably very similar, though not quite identical.2 We use a simple representation of the distribution of fisheries and pêches. The English text and the French text were each split into K pieces. Then we determine whether or not the word in question appears in each of the K pieces. Thus, we denote the distribution of fisheries in the English text with a K-dimensional binary vector, Vf, and similarly, we denote the distribution of pêches in the French text with a K-dimensional binary vector, Vp. The ith bit of Vf indicates whether or not fisheries occurs in the ith piece of the English text, and similarly, the ith bit of Vp indicates whether or not pêches occurs in the ith piece of the French text.</Paragraph>
    <Paragraph position="3"> If we take K to be 10, the first three instances of fisheries in Table 1 fall into piece 2, and the remaining 16 fall into piece 8. Similarly, the first 4 instances of pêches in Table 2 fall into piece 2, and the remaining 17 fall into piece 8. Thus, Vf = Vp = &lt;0,0,1,0,0,0,0,0,1,0&gt;. Now, we want to know if Vf is similar to Vp, and if we find that it is, then we will suspect that fisheries → pêches. In this example, of course, the vectors are identical, so practically any reasonable similarity statistic ought to produce the desired result.</Paragraph>
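The construction above can be sketched in a few lines. This is a reading of the representation, not the authors' code: the paper does not spell out how the K piece boundaries are drawn, so equal-sized token spans are assumed here, and the token offsets below are hypothetical stand-ins for the Table 1 concordance positions.

```python
def kvec(positions, n_tokens, K=10):
    """Binary K-vector for a word: bit i is 1 iff the word occurs
    somewhere in the i-th of K equal-sized pieces of the text.
    (Assumption: pieces are equal spans of n_tokens // K tokens.)"""
    v = [0] * K
    for p in positions:              # p = token offset of one occurrence
        v[p * K // n_tokens] = 1
    return v

# Hypothetical offsets: three early occurrences and a cluster of 16
# later ones, in an English text of 165,160 words.
vf = kvec([35000, 36000, 37200] + [139000 + 400 * i for i in range(16)],
          n_tokens=165160)
# vf == [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]  (pieces 2 and 8 are set)
```

Note that the vector records only presence or absence per piece, so the 19 English and 21 French occurrences can still yield identical vectors.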
    <Paragraph position="4"> 3. fisheries is not the translation of lections Before describing how we estimate the similarity of Vf and Vp, let us see what would happen if we tried to compare fisheries with a completely unrelated word, e.g., lections. (This word should be the translation of elections, not fisheries.) 2. At most, fisheries can account for only 19 instances of pêches, leaving at least 2 instances of pêches unexplained.</Paragraph>
    <Paragraph position="5"> As can be seen in the concordances in Table 3, for K=10, the vector is &lt;1, 1, 0, 1, 1,0, 1, 0, 0, 0&gt;. By almost any measure of similarity one could imagine, this vector will be found to be quite different from the one for fisheries, and therefore, we will correctly discover that fisheries is not the translation of lections.</Paragraph>
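To make "quite different" concrete, one such measure is sketched below. Hamming distance (the number of pieces where the two vectors disagree) is an illustrative stand-in chosen here, not the statistic the paper actually develops later:

```python
def hamming(v1, v2):
    """Number of pieces where two binary occurrence vectors disagree.
    An illustrative similarity measure, not the paper's own statistic."""
    return sum(b1 != b2 for b1, b2 in zip(v1, v2))

fisheries = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]   # vector from Section 2
lections  = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]   # vector from Table 3

hamming(fisheries, fisheries)   # 0: identical distributions
hamming(fisheries, lections)    # 7: disagreement in 7 of 10 pieces
```

Any measure in this family gives the same qualitative verdict: fisheries matches pêches and mismatches lections.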
    <Paragraph position="6"> To make this argument a little more precise, it might help to compare the contingency matrices in Tables 5 and 6. The contingency matrices show: (a) the number of pieces where both the English and French word were found, (b) the number of pieces where just the English word was found, (c) the number of pieces where just the French word was found, and (d) the number of pieces where neither word was found.</Paragraph>
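The four cells (a, b, c, d) follow directly from the two binary vectors. A minimal sketch, using the vectors quoted above (the paper's Tables 5 and 6 are built from the full concordances, so the exact cell values here are only those implied by these two example vectors):

```python
def contingency(ve, vf):
    """Contingency cells over the K pieces:
    a = both words present, b = English only,
    c = French only,       d = neither present."""
    a = sum(1 for e, f in zip(ve, vf) if e and f)
    b = sum(1 for e, f in zip(ve, vf) if e and not f)
    c = sum(1 for e, f in zip(ve, vf) if not e and f)
    d = len(ve) - a - b - c
    return a, b, c, d

fisheries = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
peches    = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]   # identical distribution
lections  = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]   # vector from Table 3

contingency(fisheries, peches)     # (2, 0, 0, 8): a large, b and c zero
contingency(fisheries, lections)   # (0, 2, 5, 3): a zero, b and c large
```

The two calls illustrate the pattern stated next: good translation pairs concentrate their counts in cell a, while unrelated pairs spread them into b and c.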
    <Paragraph position="7">  In general, if the English and French words are good translations of one another, as in Table 5, then a should be large, and b and c should be small. In contrast, if the two words are not good translations of one another, as in Table 6, then a should be small, and b and c should be large.</Paragraph>
  </Section>
</Paper>