<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0107"> <Title>Automatic Extraction of Word Sequence Correspondences in Parallel Corpora</Title> <Section position="4" start_page="79" end_page="80" type="relat"> <SectionTitle> 2 Related Work and Some Results </SectionTitle> <Paragraph position="0"> Brown et al. used mutual information to construct corresponding pairs of French and English words. A French word f is considered to be translated into the English word ej that gives the maximum mutual information:</Paragraph> <Paragraph position="2"> e = argmax_ej log ( P(ej, f) / ( P(ej) P(f) ) ), where the probabilities are estimated from the occurrences of ej and f and the co-occurrences of ej and f.</Paragraph> <Paragraph position="3"> Kay & Röscheisen used the following Dice coefficient for calculating the similarity between an English word we and a French word wf. In the formula, f(we) and f(wf) represent the numbers of occurrences of we and wf, and f(we, wf) is the number of simultaneous occurrences of those words in corresponding sentences.</Paragraph> <Paragraph position="4"> sim(we, wf) = 2 f(we, wf) / ( f(we) + f(wf) ) Kitamura & Matsumoto used the same formula for calculating word similarity from Japanese-English parallel corpora. A comparison between the above two methods on a parallel corpus is reported in [Ohmori 96]. They applied both approaches to a French-English corpus of about one thousand sentence pairs. The results are shown in Table 1, where correctness was checked by human inspection. Since both methods give very inaccurate results for words occurring only once, only words with two or more occurrences were selected for inspection. Table 1 shows the proportion of French words paired with a correct English word among the top one, three, and five candidates.</Paragraph> <Paragraph position="5"> [Table 1 (number of words and correctness ratios of the two methods) lost in extraction.] The results show that though the Dice coefficient gives slightly better correctness, neither method generates satisfactory translation pairs.</Paragraph> <Paragraph position="6"> [Kupiec 93] and [Kumano & Hirakawa 94] broaden the target to correspondences between word sequences such as compound nouns. Kupiec uses an NP recognizer for both English and French and proposes a method to calculate the probabilities of correspondences using an iterative algorithm like the EM algorithm. He reports that of the one hundred highest-ranking correspondences, ninety were correct. Although the NP recognizers detect about 5,000 distinct noun phrases in both languages, the correctness ratio over the total data is not reported.</Paragraph> <Paragraph position="7"> Kumano & Hirakawa's objective is to obtain English translations of Japanese compound nouns (noun sequences) and unknown words, using a statistical method similar to Brown's together with an ordinary Japanese-English dictionary. Japanese compound nouns and unknown words are detected at the morphological analysis stage and are fixed before the later processes. Though they assume unaligned Japanese-English parallel corpora, alignment is performed beforehand. In an experiment with two thousand sentence pairs, 72.9% correctness is achieved by the best correspondences and 83.8% by the top three candidates in the case of compound nouns. The correctness ratios for unknown words are 54.0% and 65.0%, respectively.</Paragraph> <Paragraph position="8"> Smadja proposes a method of finding translation patterns for continuous as well as discontinuous collocations between English and French [Smadja 96]. The method first extracts meaningful collocations in the source language (English) in advance with the XTRACT system. Then, aligned corpora are statistically analyzed to find the corresponding collocation patterns in the target language (French).
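The two word-similarity measures discussed above can be sketched in a few lines of Python. This is a minimal illustration under assumed toy data, not code from any of the cited systems: it counts word occurrences and co-occurrences over a handful of aligned sentence pairs, then scores candidate pairs with the Dice coefficient and, for comparison, the pointwise mutual information used by Brown et al.

```python
import math
from collections import Counter
from itertools import product

# Toy aligned corpus of (English, French) sentence pairs; the cited
# experiments used about one thousand such pairs.
corpus = [
    ("the house is red", "la maison est rouge"),
    ("the house is big", "la maison est grande"),
    ("the car is red", "la voiture est rouge"),
]

# f(we), f(wf): per-sentence occurrence counts; f(we, wf): co-occurrences.
f_e, f_f, f_ef = Counter(), Counter(), Counter()
for en, fr in corpus:
    en_words, fr_words = set(en.split()), set(fr.split())
    f_e.update(en_words)
    f_f.update(fr_words)
    f_ef.update(product(en_words, fr_words))

n = len(corpus)

def dice(we, wf):
    # sim(we, wf) = 2 f(we, wf) / ( f(we) + f(wf) )
    return 2.0 * f_ef[(we, wf)] / (f_e[we] + f_f[wf])

def pmi(we, wf):
    # log ( P(we, wf) / ( P(we) P(wf) ) ), probabilities estimated
    # from occurrence and co-occurrence counts; assumes f(we, wf) > 0.
    return math.log((f_ef[(we, wf)] / n) / ((f_e[we] / n) * (f_f[wf] / n)))

# Pair each French word with the English word maximizing the Dice score.
for wf in f_f:
    best = max(f_e, key=lambda we: dice(we, wf))
    print(wf, "->", best, round(dice(best, wf), 2), round(pmi(best, wf), 2))
```

Even on this tiny corpus the effect noted in the text is visible: frequent function words such as "the" and "la" co-occur with almost everything, so both measures only become reliable for words with enough occurrences.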
To avoid a possible combinatorial explosion, heuristics are introduced to filter out implausible correspondences.</Paragraph> <Paragraph position="9"> Getting translation pairs of complex expressions is of great importance, especially for technical domains, where most domain-specific terminology appears as compound nouns. [Figure (method overview) lost in extraction: the Japanese and English corpora each undergo 1. morphological analysis and content word extraction, 2. word sequence extraction, 3. setting of a minimum occurrence condition, 4. extraction of translation candidates, 5. similarity calculation, and 6. determination of translation pairs, with a translation dictionary and an iteratively decremented similarity threshold.]</Paragraph> <Paragraph position="10"> There are still a number of other interesting and meaningful expressions that should be translated in a specific way. We propose a method of finding corresponding translation pairs of word sequences of arbitrary length appearing in parallel corpora, and an algorithm that gradually produces &quot;good&quot; correspondences earlier so as to reduce noise when extracting less plausible correspondences.</Paragraph> </Section> </Paper>