File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/w96-0107_metho.xml
Size: 5,601 bytes
Last Modified: 2025-10-06 14:14:27
<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0107"> <Title>Automatic Extraction of Word Sequence Correspondences in Parallel Corpora</Title> <Section position="5" start_page="80" end_page="80" type="metho"> <SectionTitle> 3 Overview of the Method </SectionTitle> <Paragraph position="0"> Figure 1 shows the flow of the process for finding correspondences between Japanese and English word sequences. Both the Japanese and the English texts are analyzed morphologically.</Paragraph> <Paragraph position="1"> We make use of two types of co-occurrence: word co-occurrences within each language corpus, and the corresponding co-occurrences of those word sequences in the parallel corpus. In the current setting, all words and word sequences that occur two or more times are taken into account. Since frequent co-occurrence suggests a higher plausibility of correspondence, we define a similarity measure that takes co-occurrence frequencies into consideration. Defining the similarity measure in this way reduces the computational cost of the later steps. If every possible correspondence between word sequences were evaluated, the number of combinations would be very large. Since a high similarity value is supported by a high co-occurrence frequency, a gradual strategy can be taken: a threshold on the similarity is set and then lowered iteratively. Although our method does not assume any bilingual dictionary in advance, once corresponding words or word sequences are identified at an earlier stage, they are regarded as definitive entries of the translation dictionary. Such translation pairs are removed from the co-occurrence data, so that only the remaining word sequences need to be considered in the subsequent iterations.
The next section describes the details of the algorithm.</Paragraph> </Section> <Section position="6" start_page="80" end_page="81" type="metho"> <SectionTitle> 4 The Algorithm </SectionTitle> <Paragraph position="0"> The step numbers in the following procedure correspond to the numbers appearing in Figure 1.</Paragraph> <Paragraph position="1"> In the current implementation, the Translation Dictionary is empty at the beginning. Steps 1 and 2 are performed on each language corpus separately.</Paragraph> <Paragraph position="2"> 1. Japanese and English texts are analyzed morphologically and all content words (nouns, verbs, adjectives and adverbs) are identified.</Paragraph> <Paragraph position="3"> 2. All content words that occur two or more times are extracted. Then, word sequences of length two that are headed by a previously extracted word are extracted, provided they appear at least twice in the corpus. In the same way, a word sequence w of length i + 1 is taken into consideration only when its prefix of length i has been extracted and w appears at least twice in the corpus. This process is repeated until no new word sequences are obtained. The subsequent steps handle only the extracted word sequences. It is natural to set a maximum length for the candidate word sequences; in the experiments we set it between 5 and 10.</Paragraph> <Paragraph position="4"> 3. A threshold for the minimum frequency of occurrence (fmin) is set, and the following process is repeated, lowering the threshold by some amount each time.</Paragraph> <Paragraph position="5"> 4. For each word sequence occurring more than fmin times, the total number of occurrences and the total number of bilingual co-occurrences are counted. This is done for all pairs of such Japanese and English word sequences, except for pairs that already appear in the Translation Dictionary.</Paragraph> <Paragraph position="6"> 5. 
For each pair of bilingual word sequences, the following similarity value sim(w_J, w_E) is calculated, where w_J and w_E are Japanese and English word sequences, and f_J, f_E and f_JE are the total frequency of w_J in the Japanese corpus, that of w_E in the English corpus, and the total co-occurrence frequency of w_J and w_E in corresponding sentences.</Paragraph> <Paragraph position="7"> sim(w_J, w_E) = (\log_2 f_{JE}) \cdot \frac{2 f_{JE}}{f_J + f_E} </Paragraph> <Paragraph position="8"> This formula is a modification of the Dice coefficient, weighting the similarity measure by the logarithm of the pair's co-occurrence frequency. Only the pairs whose sim(w_J, w_E) value is greater than log2 fmin are considered in this step. Because the Dice factor is at most 1 and f_JE cannot exceed either f_J or f_E, no word sequence occurring fewer than fmin times can yield a similarity value greater than log2 fmin. This assures that every pair able to exceed the threshold is found among the word sequences occurring more than fmin times, so none is overlooked.</Paragraph> <Paragraph position="9"> 6. The most plausible correspondences are then identified using the similarity values so calculated: (a) For an English word sequence w_E, let W_J = {w_J1, w_J2, ..., w_Jn} be the set of all Japanese word sequences such that sim(w_Ji, w_E) > log2 fmin. The set is called the candidate set for w_E. For each Japanese word sequence w_J, its candidate set is constructed in the same way.</Paragraph> <Paragraph position="10"> (b) Of the candidate set W_J for w_E, if the candidate with the highest similarity value (w_Ji = arg max_{w_Jk ∈ W_J} sim(w_Jk, w_E)) in turn selects w_E as its candidate with the highest similarity (w_E = arg max_{w_Em ∈ W_E} sim(w_Ji, w_Em)), where W_E is the candidate set for w_Ji, the pair (w_Ji, w_E) is regarded as a translation pair.</Paragraph> <Paragraph position="11"> 7. Approved translation pairs are registered in the Translation Dictionary until no new pair is obtained. The threshold value fmin is then lowered, and steps 4 through 6 are repeated until fmin reaches a predetermined value.</Paragraph> </Section></Paper>
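Steps 4 through 6 can be illustrated with a minimal Python sketch for a single value of fmin. It is not the authors' implementation. It assumes the similarity is the Dice coefficient 2·f_JE / (f_J + f_E) multiplied by log2 f_JE, as the description of step 5 suggests, and that a pair co-occurs at most once per aligned sentence pair. The names `similarity`, `mutual_best_pairs` and the toy data `aligned` are hypothetical, and ties between equally scored candidates are broken by whichever is encountered first.

```python
import math
from collections import Counter
from itertools import product


def similarity(f_j, f_e, f_je):
    """Dice coefficient weighted by log2 of the co-occurrence count (assumed form)."""
    if f_je == 0:
        return 0.0
    return math.log2(f_je) * 2 * f_je / (f_j + f_e)


def mutual_best_pairs(aligned, f_min):
    """Return pairs that select each other as best candidate (steps 4 to 6).

    aligned: list of (japanese_sequences, english_sequences), one entry per
    aligned sentence pair. f_min is assumed to be at least 1.
    """
    freq_j, freq_e, freq_je = Counter(), Counter(), Counter()
    for ja, en in aligned:
        freq_j.update(ja)
        freq_e.update(en)
        # Assumption: count one co-occurrence per aligned sentence pair.
        freq_je.update(product(set(ja), set(en)))

    threshold = math.log2(f_min)
    no_candidate = (-math.inf, None)
    best_for_j, best_for_e = {}, {}  # word sequence -> (score, best partner)
    for (w_j, w_e), f_je in freq_je.items():
        # Step 4: only sequences occurring more than f_min times.
        if freq_j[w_j] <= f_min or freq_e[w_e] <= f_min:
            continue
        score = similarity(freq_j[w_j], freq_e[w_e], f_je)
        # Step 5: only pairs above log2(f_min) enter the candidate sets.
        if score <= threshold:
            continue
        if score > best_for_j.get(w_j, no_candidate)[0]:
            best_for_j[w_j] = (score, w_e)
        if score > best_for_e.get(w_e, no_candidate)[0]:
            best_for_e[w_e] = (score, w_j)

    # Step 6(b): keep a pair only when each side selects the other.
    return {(w_j, w_e) for w_j, (_, w_e) in best_for_j.items()
            if best_for_e[w_e][1] == w_j}


# Toy aligned corpus (hypothetical data, romanized Japanese for readability).
aligned = [
    (["inu", "neko"], ["dog", "cat"]),
    (["inu"], ["dog"]),
    (["inu"], ["dog"]),
    (["neko"], ["cat"]),
    (["neko"], ["cat"]),
    (["inu", "neko"], ["dog", "cat"]),
]
print(sorted(mutual_best_pairs(aligned, f_min=2)))
# prints [('inu', 'dog'), ('neko', 'cat')]
```

In this toy corpus the cross pair (inu, cat) co-occurs twice, giving a similarity of 0.5, which falls below log2 2 = 1 and is therefore never placed in a candidate set. A full run would wrap this function in the loop of step 7, removing accepted pairs from the counts and lowering fmin on each pass.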