File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/c04-1106_metho.xml
Size: 7,808 bytes
Last Modified: 2025-10-06 14:08:49
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1106"> <Title>Lower and higher estimates of the number of &quot;true analogies&quot; between sentences contained in a large multilingual corpus</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The corpus used </SectionTitle> <Paragraph position="0"> For this study, we used the Basic Traveler's Expression Corpus (BTEC). This is a multilingual corpus of expressions from the travel and tourism domain. It contains 162,318 aligned translations in several languages. Here, we shall use Chinese, English and Japanese. There are 96,234 different sentences in Chinese, 97,769 in English and 103,274 in Japanese. The sentences in BTEC are quite short, as the figures in Table 1 show.</Paragraph> <Paragraph position="1"> 3 Analogies on the level of form</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Method </SectionTitle> <Paragraph position="0"> On the level of form, a possible formalisation of analogy between strings of symbols has been proposed (LEPAGE, 2001), which accounts for some analogies [9].</Paragraph> <Paragraph position="2"> Here, a is a character, whatever the writing system, and A, B, C and D are strings of characters. |A|_a stands for the number of occurrences of the character a in A.</Paragraph> <Paragraph position="3"> dist(A,B) is the edit distance between strings A and B, i.e., the minimal number of insertions and deletions [10] of characters necessary to transform A into B.</Paragraph> <Paragraph position="4"> Obviously, applied to sentences considered as strings of characters (not strings of words), this formalisation can only account for analogies on the level of form.
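The quantities defined above can be illustrated with a short sketch. The formula itself is not reproduced in this copy of the text, so the code assumes (this is an assumption, not a quotation) that an analogy A : B :: C : D requires |A|_a + |D|_a = |B|_a + |C|_a for every character a, together with dist(A,B) = dist(C,D) and dist(A,C) = dist(B,D). The function names indel_distance and is_form_analogy are hypothetical, and the distance is computed by plain dynamic programming rather than by the bit-string algorithm used for the experiments.

```python
from collections import Counter


def indel_distance(s: str, t: str) -> int:
    """Edit distance using insertions and deletions only (no substitutions)."""
    # Length of the longest common subsequence, computed row by row.
    prev = [0] * (len(t) + 1)
    for ch_s in s:
        curr = [0]
        for j, ch_t in enumerate(t, start=1):
            if ch_s == ch_t:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    # Every character outside the common subsequence is deleted or inserted.
    return len(s) + len(t) - 2 * prev[-1]


def is_form_analogy(a: str, b: str, c: str, d: str) -> bool:
    """Check the ASSUMED conditions for a : b :: c : d on the level of form."""
    # Assumed count condition: |A|x + |D|x == |B|x + |C|x for every character x.
    if Counter(a) + Counter(d) != Counter(b) + Counter(c):
        return False
    # Assumed distance conditions on the two pairs of ratios.
    return (indel_distance(a, b) == indel_distance(c, d)
            and indel_distance(a, c) == indel_distance(b, d))
```

Under these assumptions, is_form_analogy("walk", "walked", "talk", "talked") holds, whereas replacing the last string by "talks" already fails the character-count condition. Like the formalisation in the text, the check treats necessary conditions as if they were sufficient.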
Figure 1 shows examples of analogies meeting the above definition.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Results </SectionTitle> <Paragraph position="0"> Gathering all possible analogies of form with the above definition takes about ten days on a 2.8 GHz Pentium 4 computer with 2 GB of memory for a corpus of around 100,000 sentences. Of course, we do not inspect all possible quadruples of sentences. Rather, a hierarchical coding of sentences based on counts of characters allows us to infer the absence of any analogy within large sets of sentences. This cuts the computational load. To compute edit distances, a fast bit-string similarity computation algorithm (ALLISON and DIX, 1986) is used.</Paragraph> <Paragraph position="1"> [9] [...]tion, like reduplication: e.g., I play tennis. : I play tennis. Do you play tennis too? :: I play guitar. : I play guitar. Do you play guitar too?, or mirroring: stressed : desserts :: reward : drawer. Also, in reality, this formalisation is only an implication. But we shall use it as if it were an equivalence.</Paragraph> <Paragraph position="2"> [10] Substitutions and transpositions are not considered as basic edit operations.</Paragraph> <Paragraph position="3"> We counted the number of analogies of form in each of the monolingual Chinese, English and Japanese parts of the corpus using the previous formula. The examples of Figure 1 are actual examples of analogies retrieved. Table 2 shows the counts for each language. The numbers obtained are quite large. For English, we report around 2.4 million analogies of form involving more than 50,000 sentences.
That is to say, more than half of the sentences of the corpus are already in immediate analogy with other sentences of the same corpus.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Discussion </SectionTitle> <Paragraph position="0"> The average number of analogies of form per sentence in each language, taken over all unique sentences, may be estimated as follows: 1,639,068 / 96,234 = 17.03 for Chinese, 2,384,202 / 97,769 = 24.39 for English and 1,910,065 / 103,274 = 18.50 for Japanese. Averaging instead over only the sentences involved in at least one analogy, this becomes: 1,639,068 / 49,675 = 33.00 for Chinese, 2,384,202 / 53,250 = 44.77 for English and 1,910,065 / 53,572 = 35.65 for Japanese, which indicates that, on average, there are dozens of different ways to obtain these sentences by analogy with other sentences.</Paragraph> <Paragraph position="1"> These counts are necessarily upper bounds on the numbers of &quot;true analogies&quot;, as they rely on form only. For instance, the first analogy in Figure 1 is not a &quot;true analogy&quot;. However, such analogies are quite difficult to spot, so the overall impression is that analogies of form which are not analogies of meaning are exceptions. So, our next problem will be to try to retain only those analogies which are also analogies of meaning.</Paragraph> <Paragraph position="2"> 4 A lower estimate: meaning preservation through translation</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Method </SectionTitle> <Paragraph position="0"> Computing analogies between structural representations is possible [11]. Unfortunately, the corpus we have at our disposal does not offer any structural representation.
Nor does it seem that tools are yet available which would deliver semantic (not syntactic) representations for all sentences of our corpus in all three languages we deal with.</Paragraph> <Paragraph position="1"> Fortunately, common sense has it that translation preserves meaning [12], and, by definition, a multilingual corpus like the one we use contains corresponding utterances in different languages. Consequently, we shall assume that if two sentences A1 and A2 in two different languages are translations of one another (noted A1 - A2), then they should be linguistic realisations of the same meaning, and conversely [13].</Paragraph> <Paragraph position="3"> Suppose that at least one analogy of form can be found to hold in every possible language of the world for some possible realisations of four given meanings. Then, for sure, the analogy of meaning can be said to hold.</Paragraph> <Paragraph position="4"> If we suppose that the number of languages is finite and denote it by n, then counting the number of &quot;true analogies&quot; in a set of sentences in a given language, say L1, is tantamount to counting the cases described by the following formula (ii).</Paragraph> <Paragraph position="5"> [11] (ITKONEN and HAUKIOJA, 1997) show how &quot;true analogies&quot; can be computed by relying at the same time on the surface and the structural representation of sentences. [12] See (CARL, 1998) for an attempt at classifying machine translation systems relying on this idea.</Paragraph> <Paragraph position="6"> [13] Note that, in this formula, L1 and L2 need not be different. If the language is the same, then A1 and A2 are paraphrases.</Paragraph> <Paragraph position="8"> Of course, the problem is: how to test against all possible languages? Obviously, relying on more languages should give a higher accuracy to the method. Here, we have only three languages at our disposal.
By relying on typologically different languages like Chinese, English and Japanese, it is reasonable to think that we somewhat counterbalance the small number of languages used.</Paragraph> <Paragraph position="9"> To summarize, by using Equivalence (i), and by considering only sentences attested in our corpus, Formula (ii) can be restated as follows, when restricted to three languages.</Paragraph> <Paragraph position="11"> In practice, then, the number of &quot;true analogies&quot; is just the cardinality of the intersection of the sets of analogies obtained for each of the three languages.</Paragraph> </Section> </Section> </Paper>
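The intersection described in the closing paragraph can be sketched as follows. The restated formula is not reproduced in this copy of the text, so this is only an illustration under a simplifying assumption: each analogy of form found in a language is recorded as a 4-tuple of identifiers of aligned translation units, so that the same tuple in two languages denotes the same four meanings. The function name true_analogies, the language codes and the toy data are all hypothetical.

```python
def true_analogies(analogies_by_language):
    """Intersect per-language sets of analogies given as 4-tuples of alignment ids."""
    sets = [set(quads) for quads in analogies_by_language.values()]
    if not sets:
        return set()
    # An analogy is kept only if it holds on the level of form in every language.
    return set.intersection(*sets)


# Toy data (invented for illustration): ids stand for aligned translation units.
analogies = {
    "zh": {(1, 2, 3, 4), (5, 6, 7, 8)},
    "en": {(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12)},
    "ja": {(1, 2, 3, 4), (9, 10, 11, 12)},
}
common = true_analogies(analogies)
print(len(common))  # 1
```

In the real corpus one sentence may occur in several aligned units, so mapping analogies between sentences onto tuples of identifiers would need more care than this sketch shows.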