<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2015">
  <Title>Building an Annotated Japanese-Chinese Parallel Corpus - A Part of NICT Multilingual Corpora</Title>
  <Section position="2" start_page="0" end_page="86" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> A parallel corpus is a collection of articles, paragraphs, or sentences in two different languages.</Paragraph>
    <Paragraph position="1"> Since a parallel corpus contains translation correspondences between the source text and its translations at different levels of constituents, it is a critical resource for extracting translation knowledge in machine translation (MT). Although some machine translation software has recently become available on the market, translation quality is still a significant problem.</Paragraph>
    <Paragraph position="2"> Therefore, a detailed examination of human translation is still required. This will provide a basis for radically improving machine translation in the near future. In addition, in MT system development, the example-based method and the statistics-based method are widely researched and applied. Parallel corpora are therefore required for both translation studies and practical system development.</Paragraph>
    <Paragraph position="3"> The raw text of a parallel corpus contains implicit knowledge. If we annotate it with some information, we can obtain explicit knowledge from the corpus. The more information that is annotated on a parallel corpus, the more knowledge we can obtain from it. The parallel corpora of European languages are usually raw texts without annotation of syntactic structure, since their syntactic structures are similar and MT does not require such annotation. However, when the two languages of a pair differ in syntactic structure, as with English and Japanese or Japanese and Chinese, transformation between syntactic structures is difficult. A parallel corpus annotated with syntactic structures would thus be helpful to MT. Besides MT, an annotated parallel corpus can be applied to cross-lingual information retrieval, language teaching, machine-aided translation, bilingual lexicography, and word-sense disambiguation.</Paragraph>
    <Paragraph position="4"> Parallel corpora between European languages are well developed and are available through the Linguistic Data Consortium (LDC). However, parallel corpora between European languages and Asian languages are less developed, and parallel corpora between two Asian languages are even less developed.</Paragraph>
    <Paragraph position="5"> The National Institute of Information and Communications Technology therefore started a project in 2002 to build multilingual parallel corpora (Uchimoto et al., 2004). The project focuses on Asian language pairs and on the annotation of detailed information, including syntactic structure and alignment at the word and phrase levels. We call the corpus the NICT Multilingual Corpora. The corpus will be open to the public in the near future.</Paragraph>
    <Paragraph position="6"> 2 Overview of the NICT Multilingual Corpora
At present, a Japanese-English parallel corpus and a Japanese-Chinese parallel corpus are under construction following systematic specifications.</Paragraph>
    <Paragraph position="7"> The parallel texts in each corpus consist of the original text in the source language and its translations in the target language. The original data is from newspaper articles or journals, such as Mainichi Newspaper in Japanese. The original articles were translated by skilled translators. During translation, all articles in one domain were assigned to the same translators to keep terminology consistent in the target language.</Paragraph>
    <Paragraph position="8"> Different translators then revised the translated articles. Each article was translated sentence by sentence, with one source sentence mapped to one target sentence, so the resulting parallel corpora are already sentence-aligned.</Paragraph>
    <Paragraph position="9"> The details of the current version of the NICT  The following is an example of English and Chinese translations of a Japanese sentence from Mainichi Newspaper.</Paragraph>
    <Paragraph position="10"> [Ex. 1] J: izuremoShi Jiu Sui Qian Hou noRuo Zhe de, Zhi Wen niDa eruQi Li moCan tuteinai.</Paragraph>
    <Paragraph position="11"> E: They were all about nineteen years old and had no strength left even to answer questions.</Paragraph>
    <Paragraph position="12"> C: Zhe Xie E Jun Shi Bing Jun Wei Shi Jiu Sui Zuo You De Nian Qing Ren ,Ta Men Shen Zhi Lian Hui Da Wen Ti De Qi Li Ye Mei You .</Paragraph>
    <Paragraph position="13"> In addition to the human translation, another major task is annotation. We carry out this task in two steps: automatic annotation and human revision. For automatic annotation, we applied existing analysis techniques and tag sets. For human revision, we developed support tools that help annotators revise the automatic output. The annotation task for each language included morphological and syntactic structure annotation. The annotation task for each language pair included alignment at the word and phrase levels.</Paragraph>
    <Paragraph position="14"> The NICT Multilingual Corpora constructed in this way have the following characteristics.</Paragraph>
    <Paragraph position="15"> (1) Since the original data comes from newspapers and journals, each corpus covers a rich range of domains. (2) Each corpus consists of original sentences and their translations, so it is already sentence-aligned. In translating each sentence, the context of the article was also considered. Thus, the context of each original article is well maintained in its translation, which can be exploited in the future. (3) The corpora are annotated to a high standard with morphological and syntactic structures and word/phrase alignment.</Paragraph>
    <Paragraph position="16"> In the following section, we describe the construction of the Japanese-Chinese parallel corpus in detail.</Paragraph>
    <Paragraph position="17"> 3 Human Translation from Japanese to Chinese
About 40,000 Japanese sentences from issues of Mainichi Newspaper were translated by skilled translators. The translation guidelines were as follows.</Paragraph>
    <Paragraph position="18"> (1) One Japanese sentence is translated into one Chinese sentence.</Paragraph>
    <Paragraph position="19"> (2) Among several translation candidates, the one closest to the original sentence in syntactic structure is preferred. The aim is to avoid translating a sentence too freely, i.e., paraphrasing.</Paragraph>
    <Paragraph position="20"> (3) To obtain intelligible Chinese translations, information from the preceding sentences in the same article should be added. In particular, a subject should be supplied, because a subject is usually required in Chinese, while in Japanese subjects are often omitted.</Paragraph>
    <Paragraph position="21"> (4) To obtain natural Chinese translations, additions, deletions, replacements, and paraphrases should be made when necessary.</Paragraph>
    <Paragraph position="22"> When a translation is very long, word order can be changed or commas can be inserted. Guidelines (3) and (4) restrict guideline (2); that is, the naturalness of the Chinese translations takes priority.</Paragraph>
    <Paragraph position="23"> One problem in translation is how to translate proper nouns in the newspaper articles. We paid special attention to them in the following way. (1) Proper nouns. When a proper noun did not exist in Japanese-Chinese dictionaries, a new translation was created and then confirmed using the Chinese web. For kanji in proper nouns, if there was a Chinese character having the same orthography as the kanji, that Chinese character was used in the Chinese translation; if there was a traditional Chinese character having the same orthography as the kanji, the simplified form of that traditional character was used in the translation; otherwise, a Chinese character whose orthography is similar to that of the kanji was used in the translation.</Paragraph>
    <Paragraph position="24"> (2) Things specific to Japan. Explanations were added where necessary. For example, &quot;Da Xiang Bu&quot;, translated from &quot;Da Xiang Pu&quot; (grand sumo tournament), is well known in China, while &quot;Chun Dou&quot;, translated from &quot;Chun Dou&quot; (spring labor offensive), is not known in China. In this case, the explanation &quot;Chun Ji Lao Zi Jiu Fen&quot; was added after the unfamiliar term. We attempt to introduce new words about Japanese culture into Chinese through the construction of the corpus.</Paragraph>
    <Paragraph position="25">  crucial to this parallel corpus. We controlled the quality through the following measures.</Paragraph>
    <Paragraph position="26"> (1) The first revision of a translated article was conducted by a different translator after the first translation. The reviewers checked whether the meanings of the Chinese translations corresponded accurately to the meanings of the original sentences and modified the Chinese translations if necessary. (2) The second revision was conducted by native Chinese speakers without referring to the original sentences. The reviewers checked whether the Chinese translations were natural and passed unnatural translations back to the translators for modification. (3) The third revision was conducted by a native Chinese speaker during the annotation of Chinese morphological information. Words that did not exist in the dictionary of contemporary Chinese were checked to determine whether they were new words. If not, the words were judged to be informal or non-written language and were replaced with suitable words. Word sequences that did not match the part-of-speech chains of the Chinese language model were also adjusted.</Paragraph>
    <Paragraph position="27"> So far, 38,383 Japanese sentences have been translated into Chinese. Of those, 22,000 Chinese translations have been revised three times, and we are still working on the remaining 18,000 Chinese translations.</Paragraph>
  </Section>
</Paper>