<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2208"> <Title>Satoshi Sekine ++</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Construction of Aligned Parallel Treebank Corpus Reflecting Contextual Information </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Human Translation of Existing Monolingual Treebank </SectionTitle> <Paragraph position="0"> The Penn Treebank is a tagged corpus of Wall Street Journal material, and it is divided into 24 sections. The Kyoto University text corpus is a tagged corpus of the Mainichi newspaper, which is divided into 16 sections according to the categories of articles, such as the sports section and the economy section. To keep expressions consistent in translation, a small group of translators was assigned to each section, and each translator always worked on the same section. The instructions to translators for Japanese-English translation are basically as follows.</Paragraph> <Paragraph position="1"> 1. One-sentence to one-sentence translation as a rule: Translate a source sentence into a target sentence. If the translated sentence becomes unnatural by pursuing this policy, leave a comment. 2. Natural translation reflecting contextual information: Except when the translated sentence becomes unnatural by pursuing policy 1, translate a source sentence into a target sentence naturally.</Paragraph> <Paragraph position="2"> By deletion, replacement, or supplementation, let the translated sentence be natural in the context.</Paragraph> <Paragraph position="3"> In an entire article, the translated sentences must maintain the same meaning and information as those of the original sentences. 3. Translations of proper nouns: Find the translations of proper nouns by looking them up in a dictionary or by using a web search. 
If a translation cannot be found, use a temporary name and report it.</Paragraph> <Paragraph position="4"> We started the construction of a Japanese-Chinese parallel corpus in 2002. The Japanese sentences of the Kyoto University text corpus were also translated into Chinese by human translators. Each translated Chinese sentence was then revised by a second native Chinese speaker. The instructions to the translators are the same as those given for the Japanese-English human translations.</Paragraph> <Paragraph position="5"> The breakdown of the parallel corpora is shown in Table 1. We are planning to translate the remaining 18,714 sentences of the Kyoto University text corpus and the remaining 30,890 sentences of the Penn Treebank. As for the naturalness of the translated sentences, pursuing policy 1 produced 207 (1%) unnatural English sentences for the Kyoto University text corpus and 462 (2.5%) unnatural Japanese sentences for the Penn Treebank.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Morphological and Syntactic Annotation </SectionTitle> <Paragraph position="0"> In the following sections, we describe the annotated information of the parallel treebank corpus based on the Kyoto University text corpus.</Paragraph> <Paragraph position="1"> Information of Japanese-English corpus: Translated English sentences were analyzed by using the Charniak Parser (Charniak, 1999). Then, the parsed sentences were manually revised. The definitions of part-of-speech (POS) categories and syntactic labels follow those of the Treebank I style (Marcus et al., 1993). We have finished revising the 10,328 parsed sentences that appeared from January 1st to 11th. An example of morphological and syntactic structures is shown in Figure 1. In this figure, &quot;S-ID&quot; means the sentence ID in the Kyoto University text corpus. 
&quot;EOJ&quot; means the boundary between a Japanese parsed sentence and an English parsed sentence. The definition of Japanese morphological and syntactic information follows that of the Kyoto University text corpus (Version 3.0). The syntactic structure is represented by dependencies between Japanese phrasal units called bunsetsus. Bunsetsus are minimal linguistic units obtained by segmenting a sentence naturally in terms of semantics and phonetics, and each of them consists of one or more morphemes.</Paragraph> <Paragraph position="2"> Information of Japanese-Chinese corpus: Chinese sentences are composed of strings of Hanzi, and there are no spaces between words. The morphological annotation, therefore, includes providing tags for word boundaries and the POSs of words. We analyzed the Chinese sentences by using the morphological analyzer developed by Peking University (Zhou and Duan, 1994). There are 39 categories in this POS set. The automatically tagged sentences were then revised by a third native Chinese speaker. In this pass, the Chinese translations were revised again along with the results of word segmentation and POS tagging, so the resulting Chinese translations are of high quality. We have finished revising 12,000 tagged sentences. The revision of the remaining sentences is ongoing. An example of tagged Chinese sentences is shown in Figure 2. The letters shown after '/' indicate POSs. The Chinese sentence is the translation of the Japanese sentence in Figure 1. The Chinese sentences are GB encoded. 
The 38,383 translated Chinese sentences have 1,410,892 Hanzi and 926,838 words.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Phrasal Alignment </SectionTitle> <Paragraph position="0"> This section describes the annotated information of 19,669 sentences of the Kyoto University text corpus.</Paragraph> <Paragraph position="1"> In principle, the minimum alignment unit should be as small as possible, because bigger units can be constructed from units of the minimum size.</Paragraph> <Paragraph position="2"> However, we decided to define a bunsetsu as the minimum alignment unit. One of the main reasons is that the smaller the unit is, the higher the human annotation cost is. Another reason is that if we define a word or a morpheme as the minimum alignment unit, expressions such as post-positional particles in Japanese and articles in English often have no alignments. To absorb those expressions effectively and to align as many parts as possible, we found that a unit bigger than a word or a morpheme is suitable as the minimum alignment unit. We call a minimum alignment based on bunsetsu alignment units a bunsetsu unit translation pair. Pairs bigger than the bunsetsu unit translation pairs can be automatically extracted based on the bunsetsu unit translation pairs. We call all of the pairs, including bunsetsu unit translation pairs, translation pairs. The bunsetsu unit translation pairs for idiomatic expressions often become unnatural. In this case, two or more bunsetsu units are combined and handled as a minimum alignment unit. 
The breakdown of the bunsetsu unit translation pairs is shown in Table 2.</Paragraph> <Paragraph position="3">
(1) total # of translation pairs: 172,255
(2) # of different translation pairs: 146,397
(3) # of Japanese expressions: 110,284
(4) # of English expressions: 111,111
(5) average # of English expressions corresponding to a Japanese expression ((2)/(3)): 1.33
(6) average # of Japanese expressions corresponding to an English expression ((2)/(4)): 1.32
(7) # of ambiguous Japanese expressions: 15,699
(8) # of ambiguous English expressions: 12,442
(9) # of bunsetsu unit translation pairs consisting of two or more bunsetsus: 17,719
An example of phrasal alignment is shown in Figure 3. The Japanese sentence is shown from the line after the S-ID to the EOJ. Each line indicates a bunsetsu. Each rectangular line indicates a dependency between bunsetsus. The leftmost number in each line indicates the bunsetsu ID. The corresponding English sentence is shown from the line after the EOJ (End of Japanese) until the EOE (End of English). The English expressions corresponding to each bunsetsu are tagged with the corresponding bunsetsu ID, such as &lt;P id=&quot;bunsetsu ID&quot;&gt;&lt;/P&gt;. When there are two or more numbers in the tag id, such as id=&quot;1,2&quot;, it means that two or more bunsetsus are combined and handled as a minimum alignment unit.</Paragraph> <Paragraph position="4"> For example, we can extract the following translation pairs from Figure 3.</Paragraph> <Paragraph position="6"> first cargo / of apples imported from the U.S. / was brought to the market.</Paragraph> <Paragraph position="7"> Here, Japanese and English expressions are divided by the symbol &quot;;&quot;, and &quot;/&quot; means a bunsetsu boundary.</Paragraph> <Paragraph position="8"> An overview of the alignment criteria is as follows. Align as many parts as possible, except when a certain part is redundant. More detailed criteria will be distributed with our corpus when it is open to the public.</Paragraph> <Paragraph position="9"> 1. 
Alignment of English grammatical elements that are not expressed in Japanese: English articles, possessive pronouns, infinitive to, and auxiliary verbs are joined with nouns and verbs.</Paragraph> <Paragraph position="10"> 2. Alignment between a noun and its substitute expression: A noun can be aligned with its substitute expression such as a pronoun.</Paragraph> <Paragraph position="11"> 3. Alignment of Japanese ellipses: An English expression is joined with its related elements. For example, the English subject is joined with its related verb.</Paragraph> <Paragraph position="12"> 4. Alignment of supplementary or explanatory expressions in English: Supplementary or explanatory expressions in English are joined with their related words.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Use for Evaluation of Conventional Methods </SectionTitle> <Paragraph position="0"> The corpus described in Section 2 can be used for the evaluation of English-Japanese and Japanese-English machine translation. We can directly compare various methods of machine translation by using this corpus. 
The uses can be summarized as follows in terms of the characteristics of the corpus.</Paragraph> <Paragraph position="1"> One-sentence to one-sentence translation can be used directly for the evaluation of various methods of machine translation.</Paragraph> <Paragraph position="2"> Morphological and syntactic information can be used for the evaluation of methods that actively use morphological and syntactic information, such as methods for example-based machine translation (Nagao, 1981; Watanabe et al., 2003), or transfer-based machine translation (Imamura, 2002).</Paragraph> <Paragraph position="3"> Phrasal alignment can be used for the evaluation of automatically acquired translation knowledge (Yamamoto and Matsumoto, 2003).</Paragraph> <Paragraph position="4"> An actual comparison and evaluation is left for future work.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Analysis of Translation </SectionTitle> <Paragraph position="0"> One-sentence to one-sentence translation reflects contextual information. Therefore, it is suitable for investigating the influence of the context on the translation. For example, we can investigate the difference in the use of demonstratives and pronouns between English and Japanese. 
We can also investigate the difference in the use of anaphora.</Paragraph> <Paragraph position="1"> Morphological and syntactic information and phrasal alignment can be used to investigate the appropriate unit and size of translation rules and the relationship between syntactic structures and phrasal alignment.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Use in Conventional Systems </SectionTitle> <Paragraph position="0"> One-sentence to one-sentence translation can be used for training a statistical translation model such as GIZA++ (Och and Ney, 2000), which could be a strong baseline system for machine translation.</Paragraph> <Paragraph position="1"> Morphological and syntactic information and phrasal alignment can be used to acquire translation knowledge for example-based machine translation and transfer-based machine translation.</Paragraph> <Paragraph position="2"> In order to show what kind of units are helpful for example-based machine translation, we investigated whether the Japanese sentences of newspaper articles appearing on January 17, 1995, which we call test-set sentences, could be translated into English sentences by using translation pairs appearing from January 1st to 16th as a database. First, we found that only one out of 1,234 test-set sentences agreed with one out of 18,435 sentences in the database. Therefore, a simple sentence search will not work well. On the other hand, 6,659 out of 12,632 bunsetsus in the test-set sentences agreed with those in the database. If words in bunsetsus are expanded into their synonyms, combinations of the expanded bunsetsu sets in the database may cover the test-set sentences. Next, therefore, we investigated whether the Japanese test-set sentences could be translated into English sentences by simply combining translation pairs appearing in the database. 
Given a Japanese sentence, words were extracted from it, and translation pairs that include those words or their synonyms, which were manually evaluated, were extracted from the database. Then, the English sentence was manually generated by just combining English expressions in the extracted translation pairs. One hundred two relatively short sentences (the average number of bunsetsus is about 9.8) were selected as inputs.</Paragraph> <Paragraph position="3"> The number of equivalent translations, meaning that the translated sentence is grammatical and has the same meaning as the source sentence, was 9. The number of similar translations, meaning that the translated sentence is ungrammatical or uses different or wrong meanings of words, tenses, or prepositions, was 83. The number of other translations, meaning that some words are missing or the meaning of the translated sentence is completely different from that of the original sentence, was 10. For example, the original parallel translation is as follows:</Paragraph> <Paragraph position="5"> English: New Party Sakigake proposed that towards the ordinary session, both parties found a council to discuss policy and Diet management.</Paragraph> <Paragraph position="6"> Given the Japanese sentence, the translated sentence was: Translation: Sakigake Party suggested to set up an organization between the two parties towards the regular session of the Diet to discuss under the theme of policies and the management of the Diet.</Paragraph> <Paragraph position="7"> This result shows that only 9% of input sentences can be translated into sentences equivalent to the original ones. 
However, we found that approximately 90% of input sentences can be translated into English sentences that are equivalent or similar to the original ones.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Similar Parallel Translation Generation </SectionTitle> <Paragraph position="0"> The original aim of constructing the aligned parallel treebank corpus described in Section 2 is to achieve a new framework for translation aid, as described below.</Paragraph> <Paragraph position="1"> It would be very convenient if multilingual sentences could be generated by just writing sentences in our mother language. Today, this can formally be achieved by using commercial machine translation systems. However, the automatically translated sentences are often incomprehensible. Therefore, we have to revise the original and translated sentences by finding and referring to parallel translations whose source language sentence is similar to the original one. In many cases, however, we cannot find parallel translations similar to the input sentence. It is therefore difficult for users who do not have enough knowledge of the target languages to generate comprehensible sentences in several languages by just searching for similar parallel translations in this way. We therefore propose to generate similar parallel translations whose source language sentence is similar to the input sentence. We call this framework for translation aid similar parallel translation generation. We investigated whether the framework can be achieved by using our aligned parallel treebank corpus. As the first step of this study, we investigated whether an appropriate parallel translation can be generated by simply combining translation pairs extracted from our aligned parallel treebank corpus in the following steps.</Paragraph> <Paragraph position="2"> 1. 
Extract each content word with its adjacent function word in each bunsetsu in a given sentence. 2. Expand the extracted content words and their adjacent function words into their synonyms and class words whose major and minor POS categories are the same. 3. Find translation pairs including the expanded content words with their expanded adjacent function words in the given sentence. 4. For each bunsetsu, select a translation pair that has a dependency relationship similar to those in the given sentence. 5. Generate a parallel translation by combining the selected translation pairs. The input sentences were randomly selected from the 102 sentences described in Section 3.3.</Paragraph> <Paragraph position="3"> The above steps, except the third step, were basically conducted manually. Examples of the input sentences and generated parallel translations are shown in Figure 4.</Paragraph> <Paragraph position="4"> The basic unit of translation pairs in our aligned parallel treebank corpus is a bunsetsu, and the basic unit in the selection of translation pairs is also a bunsetsu. One of the advantages of using a bunsetsu as a basic unit is that a Japanese expression that is represented by one of various expressions in English, or omitted in English, such as a Japanese post-positional particle, is paired with a content word. Therefore, the translation of such an expression is appropriately selected together with the translation of a content word when a certain translation pair is selected. If the translation of such an expression were selected independently of the translation of a content word, the combination of the translations would be ungrammatical or unnatural. Another advantage of using a bunsetsu as the basic unit is that we can easily refer to dependency information between bunsetsus when we select an appropriate translation pair, because the original treebank has the dependency information between bunsetsus. These advantages are utilized in the above generation steps. 
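The five generation steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure used in the experiment, where all steps except the third were basically carried out manually. The classes Bunsetsu and TranslationPair, the EXPANSIONS table, the head word "hiraku", and the toy database are all hypothetical. Expansion is applied to content words only, and dependency similarity is reduced to checking whether the head words match after expansion.

```python
# Illustrative sketch only. All names and data below are assumptions made for
# this example; they are not part of the corpus or its tools.
from dataclasses import dataclass


@dataclass(frozen=True)
class Bunsetsu:
    content: str   # romanized content word
    function: str  # romanized adjacent function word ("" if none)
    head: int      # index of the bunsetsu this one depends on, -1 for the root


@dataclass(frozen=True)
class TranslationPair:
    content: str       # content word on the Japanese side
    function: str      # adjacent function word on the Japanese side
    head_content: str  # content word of the head bunsetsu in the corpus ("" for a root)
    english: str       # aligned English expression


# Hypothetical synonym / class-word table used in step 2.
EXPANSIONS = {
    "tsuujo-kokkai": {"tsuujo-kokkai", "kokuren-kodomo-no-kenri-iinkai"},
    "muke": {"muke", "taishi"},
}


def expand(word):
    """Step 2: expand a word into itself plus its synonyms and class members."""
    return EXPANSIONS.get(word, {word})


def generate(sentence, database):
    """Return one English expression per bunsetsu, or None if a bunsetsu has no pair."""
    selected = []
    for bunsetsu in sentence:  # step 1: one content/function word pair per bunsetsu
        contents = expand(bunsetsu.content)
        # Step 3: pairs whose Japanese side matches the expanded content word.
        candidates = [
            pair for pair in database
            if pair.content in contents and pair.function == bunsetsu.function
        ]
        if not candidates:
            return None
        # Step 4: prefer a pair whose head in the corpus matches the expanded head here.
        if bunsetsu.head == -1:
            head_words = {""}
        else:
            head_words = expand(sentence[bunsetsu.head].content)
        best = max(candidates, key=lambda pair: pair.head_content in head_words)
        selected.append(best.english)
    return selected  # step 5: expressions to combine (still in Japanese order)


# Toy input: "tsuujo-kokkai ni" depends on "muke". The first database entry is a
# made-up distractor with a different head word.
sentence = [Bunsetsu("tsuujo-kokkai", "ni", 1), Bunsetsu("muke", "", -1)]
database = [
    TranslationPair("kokuren-kodomo-no-kenri-iinkai", "ni", "hiraku",
                    "at the UN Committee on the Rights of the Child"),
    TranslationPair("kokuren-kodomo-no-kenri-iinkai", "ni", "taishi",
                    "the UN Committee on the Rights of the Child"),
    TranslationPair("taishi", "", "", "towards"),
]
print(generate(sentence, database))
```

Reordering the selected English expressions into English word order is left out of the sketch; in the experiment it was done manually.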
For example, in the first step, the content word &quot;kokkai&quot; (Diet session) in the second example in Figure 4 was extracted from the bunsetsu &quot;tsuujo-kokkai (the ordinary Diet session) ni (case marker)&quot;, and it was expanded into its class word &quot;kai&quot; (meeting) in the second step. Then, the translation pair &quot;(J) kokuren-kodomo-no-kenri-iinkai ni (case marker); (E) the UN Committee on the Rights of the Child / (J) taishi; (E) towards&quot; was extracted in the third step. Since the dependency between &quot;kokuren-kodomo-no-kenri-iinkai&quot; (the UN Committee on the Rights of the Child) and &quot;taishi&quot; (towards) is similar to that between &quot;tsuujo-kokkai (the ordinary Diet session) ni (case marker)&quot; and &quot;muke&quot; (towards) in the input sentence, this translation pair was selected in the fourth step. Finally, the bunsetsu &quot;kokuren-kodomo-no-kenri-iinkai (the UN Committee on the Rights of the Child) ni (case marker)&quot; and its translation &quot;the UN Committee on the Rights of the Child&quot; were used for generation of a parallel translation in the fifth step.</Paragraph> <Paragraph position="5"> When we use the generated parallel translation for the exact translation of the input sentence, we should replace &quot;kokuren-kodomo-no-kenri-iinkai&quot; and its translation &quot;the UN Committee on the Rights of the Child&quot; with &quot;tsuujo-kokkai&quot; and its translation &quot;the ordinary Diet session&quot; by consulting a bilingual dictionary. In this example, &quot;sono&quot; and &quot;them&quot; should also be replaced with &quot;ryoto&quot; and &quot;both parties&quot;. It is easy to identify words in the generated translation that should be replaced with words in the input sentence because each bunsetsu in translation pairs is already aligned. 
In such cases, templates such as &quot;[kaigi] ni muke&quot; and &quot;towards [council]&quot; can be automatically generated by generalizing content words expanded in the second step and their translation in the generated translation. The average number of English expressions corresponding to a Japanese expression is 1.3, as shown in Table 2. Even when there are two or more possible English expressions, an appropriate English expression can be chosen by selecting a Japanese expression by referring to dependencies in extracted translation pairs.</Paragraph> <Paragraph position="6"> Therefore, in many cases, English sentences can be generated just by reordering the selected expressions. The English word order was estimated manually in this experiment. However, we can automatically estimate English word order by using a language model or an English surface sentence generator such as FERGUS (Bangalore and Rambow, 2000). Unnatural or ungrammatical parallel translations are sometimes generated in the above steps. However, comprehensible translations can be generated, as shown in Figure 4. The biggest advantage of this framework is that comprehensible target sentences can be generated basically by referring only to source sentences. Although it is costly to search for and select appropriate translation pairs, we believe that human labor can be reduced by developing a human interface. For example, when we use a system that generates Japanese text from keywords (Uchimoto et al., 2002), users need only select appropriate keywords. We are investigating whether or not we can generate similar parallel translations to all of the Japanese sentences appearing on January 17, 1995. So far, we found that we can generate similar parallel translations to 691 out of 840 sentences (the average number of bunsetsus is about 10.3) including the 102 sentences described in Section 3.3. 
We found that we could not generate similar parallel translations to the remaining 149 out of 840 sentences.</Paragraph> <Paragraph position="7"> In the proposed framework of similar parallel translation generation, the language appearing in a corpus corresponds to a controlled language, and users are allowed to use only the controlled language to write sentences in the source language. We believe that high-quality bilingual or multilingual documents can be generated if we adapt ourselves to the controlled environment in this way.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4 Conclusion </SectionTitle> <Paragraph position="0"> This paper described aligned parallel treebank corpora of newspaper articles between languages whose syntactic structures are different from each other; they meet the following conditions. 1. It is easy to investigate the influence of the context on the translation.</Paragraph> <Paragraph position="1"> 2. The annotated information in the existing monolingual high-quality treebanks can be utilized. 3. It is open to the public.</Paragraph> <Paragraph position="2"> To construct parallel corpora that satisfy these conditions, each sentence in the existing monolingual high-quality treebanks has been translated by skilled translators into a corresponding natural sentence in a target language that reflects its contextual information, and each parallel translation has been annotated with morphological and syntactic structures and phrasal alignment.</Paragraph> <Paragraph position="3"> This paper also described the possible applications of the parallel corpus and proposed a similar parallel translation generation framework. In this framework, a parallel translation whose source language sentence is similar to a given sentence can be semi-automatically generated. 
In this paper, we demonstrated that the framework could be achieved by using our aligned parallel treebank corpus.</Paragraph> <Paragraph position="4"> In the near future, the aligned parallel treebank corpora will be open to the public and expanded. We are planning to use the corpora actively for machine translation, as a translation aid, and for second language learning. We are also planning to develop an automatic or semi-automatic alignment system and an efficient interface for machine translation aid.</Paragraph> </Section> </Section> </Paper>