<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1050">
<Title>Identifying Word Translations in Non-Parallel Texts</Title>
<Section position="4" start_page="0" end_page="320" type="metho">
<SectionTitle> 3 Simulation </SectionTitle>
<Paragraph position="0"> A simulation experiment was conducted to test whether the above assumptions concerning the similarity of co-occurrence patterns actually hold.</Paragraph>
<Paragraph position="1"> In this experiment, two co-occurrence matrices were computed for an equivalent English and German vocabulary and then compared. The English vocabulary was a list of 100 words that had been suggested by Kent & Rosanoff (1910) for association experiments. The German vocabulary consisted of one-to-one translations of these words as chosen by Russell (1970).</Paragraph>
<Paragraph position="2"> The word co-occurrences were computed on the basis of an English corpus of 33 million words and a German corpus of 46 million words. [Figure caption, truncated in the source: "... the German matrix correspond, the dot patterns of the two matrices are identical."]</Paragraph>
<Paragraph position="3"> The English corpus consists of the Brown Corpus, texts from the Wall Street Journal, Grolier's Electronic Encyclopedia and scientific abstracts from different fields. The German corpus is a compilation of mainly newspaper texts from Frankfurter Rundschau, Die Zeit and Mannheimer Morgen. To the knowledge of the author, the English and German corpora contain no parallel passages.</Paragraph>
<Paragraph position="4"> For each pair of words in the English vocabulary, the frequency of common occurrence in the English corpus was counted. The common occurrence of two words was defined as both words being separated by at most 11 other words. The co-occurrence frequencies obtained in this way were used to build up the English matrix. Analogously, the German co-occurrence matrix was created by counting the co-occurrences of German word pairs in the German corpus.
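The counting procedure just described can be sketched as follows. This is a minimal illustration in Python and not the original implementation: the function names, the tokenized-list input, and the decision to ignore pairs of identical words are assumptions made here. Only the window of at most 11 intervening words is taken from the text.

```python
from collections import Counter

MAX_GAP = 11  # at most 11 other words between the two words, as stated in the text


def cooccurrence_counts(tokens, vocabulary, max_gap=MAX_GAP):
    """Count unordered co-occurrences of vocabulary words in a token list.

    Hypothetical helper: two tokens co-occur when at most `max_gap` other
    tokens stand between them, i.e. their positions differ by at most
    max_gap + 1. Pairs of identical words are skipped (an assumption).
    """
    vocab = set(vocabulary)
    counts = Counter()
    for i, left in enumerate(tokens):
        if left not in vocab:
            continue
        # Look ahead only, so each pair of positions is counted once.
        for right in tokens[i + 1 : i + max_gap + 2]:
            if right in vocab and right != left:
                counts[frozenset((left, right))] += 1
    return counts


def cooccurrence_matrix(tokens, vocabulary, max_gap=MAX_GAP):
    """Build a symmetric matrix whose rows and columns follow `vocabulary`."""
    counts = cooccurrence_counts(tokens, vocabulary, max_gap)
    # A Counter returns 0 for pairs that were never seen, including the diagonal.
    return [[counts[frozenset((a, b))] for b in vocabulary] for a in vocabulary]
```

Calling `cooccurrence_matrix` once with the English tokens and vocabulary and once with the German tokens and the translated vocabulary, listed in the same order, would give two matrices whose rows and columns correspond word by word.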
As a starting point, word order in the two matrices was chosen such that word n in the German matrix was the translation of word n in the English matrix.</Paragraph>
<Paragraph position="5"> Co-occurrence studies such as that conducted by Wettler & Rapp (1993) have shown that for many purposes it is desirable to reduce the influence of word frequency on the co-occurrence counts. For the prediction of word associations, they achieved the best results when modifying each entry in the co-occurrence matrix using the following formula:</Paragraph>
<Paragraph position="7"> Here, f(i&j) is the frequency of common occurrence of the two words i and j, and f(i) is the corpus frequency of word i. However, for comparison, the simulations described below were also conducted using the original co-occurrence matrices (formula 2) and a measure similar to mutual information (formula 3).</Paragraph>
<Paragraph position="9"> Regardless of the formula applied, the English and the German matrices were both normalized. Starting from the normalized English and German matrices, the aim was to determine to what extent the similarity of the two matrices depends on the correspondence of word order. As a measure of matrix similarity, the sum of the absolute differences of the values at corresponding matrix positions was used. Writing E and G for the normalized English and German matrices, this is $s = \sum_{i} \sum_{j} |E_{i,j} - G_{i,j}|$.</Paragraph>
<Paragraph position="11"> This similarity measure yields a value of zero for identical matrices, and a value of 20 000 when every non-zero entry in one of the 100 × 100 matrices corresponds to a zero value in the other.</Paragraph>
</Section>
</Paper>