<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2015"> <Title>Building an Annotated Japanese-Chinese Parallel Corpus - A Part of NICT Multilingual Corpora</Title> <Section position="3" start_page="86" end_page="88" type="metho"> <SectionTitle> 4 Morphological Information Annotation </SectionTitle> <Paragraph position="0"> Annotation consists of automatic analysis followed by manual revision.</Paragraph> <Section position="1" start_page="86" end_page="87" type="sub_section"> <SectionTitle> 4.1 Annotation on Japanese Sentences </SectionTitle> <Paragraph position="0"> Japanese morphological and syntactic analyses follow the definitions of part-of-speech categories and syntactic labels of the Corpus of Spontaneous Japanese (Maekawa, 2000).</Paragraph> <Paragraph position="1"> A morphological analyzer developed in that project was applied to annotate the Japanese sentences automatically, and the automatically tagged sentences were then revised manually. An annotated sentence is illustrated in Figure 1; it is the Japanese sentence of Ex. 1 in Section 2.</Paragraph> <Paragraph position="2"> The data of one sentence begins with the line &quot;# S-ID... &quot; and ends with the mark &quot;EOJ&quot;. A line headed by &quot;*&quot; indicates the beginning of a phrase, and the following lines are the morphemes in that phrase. For example, the line &quot;* 0 2D&quot; indicates the phrase whose number is 0. The following line &quot;izu remo izu remo * Fu Ci * * *&quot; describes the morpheme in that phrase. Each morpheme line has seven fields: token form, phonetic alphabet, dictionary form, part-of-speech, sub-part-of-speech, verbal category, and conjugation form. In the line &quot;* 0 2D&quot;, the numeral 2 in &quot;2D&quot; indicates that phrase 0 &quot;i zuremo &quot; modifies phrase 2 &quot;Ruo Zhe de, &quot;. The syntactic structure analysis adopts dependency-structure analysis, in which modifier-modified relations between phrases are determined. 
The dependency-structure of the example in Figure</Paragraph> </Section> <Section position="2" start_page="87" end_page="87" type="sub_section"> <SectionTitle> 4.2 Annotation on Chinese Sentences </SectionTitle> <Paragraph position="0"> For Chinese morphological analysis, we used the analyzer developed by Peking University, where research on the definition of Chinese words and the criteria for word segmentation has been conducted for over ten years. The achievements include a grammatical knowledge base of contemporary Chinese, an automatic morphological analyzer, and an annotated People's Daily Corpus. Since the definition and tagset are widely used in Chinese language processing, we also adopted these criteria as the basis of our guidelines.</Paragraph> <Paragraph position="1"> A morphological analyzer developed by Peking University (Zhou and Yu, 1994) was applied to annotate the Chinese sentences automatically, and the automatically tagged sentences were then revised by humans. An annotated sentence is illustrated in Figure 3; it is the Chinese sentence of Ex. 1 in Section 2.</Paragraph> </Section> <Section position="3" start_page="87" end_page="88" type="sub_section"> <SectionTitle> 4.3 Tool for Manual Revision </SectionTitle> <Paragraph position="0"> We developed a tool to assist annotators in revision.</Paragraph> <Paragraph position="1"> The tool has both Japanese and Chinese versions.</Paragraph> <Paragraph position="2"> Here, we introduce the Chinese version. The input to the tool is the automatically segmented and part-of-speech tagged sentences, and the output is the revised data. The basic functions include separating a sequence of characters into two words, combining two segmented words into one word, and selecting a part-of-speech for a segmented word from a list of parts-of-speech. 
In addition, the tool has the following functions.</Paragraph> <Paragraph position="3"> (1) Retrieves a word in the grammatical knowledge base of contemporary Chinese of Peking University (Yu et al., 1997).</Paragraph> <Paragraph position="4"> This is convenient when annotators want to confirm whether a segmented word is authorized by the grammatical knowledge base, and when they want to know the parts-of-speech of a word defined by the grammatical knowledge base.</Paragraph> <Paragraph position="5"> (2) Retrieves a word in other annotated corpora or the sentences that have been revised.</Paragraph> <Paragraph position="6"> This is convenient when annotators want to see how the same word has been annotated before. (3) Retrieves a word in the current file.</Paragraph> <Paragraph position="7"> It collects all the sentences in the current file that contain the same word and then sorts them by the left and right contexts of the word. By referring to the sorted contexts, annotators can select the occurrences of the word that have the same syntactic role and change all of their parts-of-speech to a chosen one in a single operation. 
This is convenient when annotators want to process the same word in different sentences, aiming for consistency in annotation.</Paragraph> <Paragraph position="8"> (4) Adds new words to the grammatical knowledge base dynamically.</Paragraph> <Paragraph position="9"> The updated grammatical knowledge base can be used by the morphological analyser in the next analysis.</Paragraph> <Paragraph position="10"> (5) Indexes sentences through an index file.</Paragraph> <Paragraph position="11"> Automatically discovered erroneous annotations can be stored in one index file, which points to the sentences that are to be revised.</Paragraph> <Paragraph position="12"> The interface of the tool is shown in Figure 4 and Figure 5.</Paragraph> <Paragraph position="13"> (Retrieves a word in the current file) In Figure 4, the small window in the lower left displays the retrieved result of the word &quot; Tou Zi &quot; in the grammatical knowledge base; the lower right window displays the retrieved result of the same word in the annotated People's Daily Corpus. In Figure 5, the small window in the lower left is used to define retrieval conditions in the current file. In this example, the orthography of &quot;Nu Li &quot; is defined. The lower right window displays the sentences containing the word &quot;Nu Li &quot; retrieved from the current file. The left and right contexts of each occurrence are shown alongside the retrieved word. The contents of any column can be sorted by clicking the top line of the column.</Paragraph> </Section> </Section> <Section position="4" start_page="88" end_page="88" type="metho"> <SectionTitle> 5 Annotation of word alignment </SectionTitle> <Paragraph position="0"> Since automatic word alignment techniques cannot reach as high a level of accuracy as morphological analysis, we adopt a practical method of using multiple aligners. One aligner takes a lexical knowledge-based approach; we implemented it based on the work of Ker (Ker and Chang, 1997). 
Another aligner is the well-known GIZA++ toolkit, which takes a statistics-based approach. For GIZA++, two directions were adopted: the Chinese sentences were used as source sentences and the Japanese sentences as target sentences, and vice versa.</Paragraph> <Paragraph position="1"> The results produced by the lexical knowledge-based aligner, C - J of GIZA++, and J - C of GIZA++ were combined by majority decision. If an alignment result was produced by two or three of the aligners, the result was accepted.</Paragraph> <Paragraph position="2"> Otherwise, it was discarded. In this way, we aimed to utilize the results of each aligner and maintain high precision at the same time. Table 2 shows the evaluation results of the multi-aligner on 1,127 test sentence pairs, which were manually annotated with gold-standard alignments, 17,332 alignments in total.</Paragraph> <Paragraph position="3"> The multi-aligner produced satisfactory results.</Paragraph> <Paragraph position="4"> This performance is evidence that the multi-aligner is feasible for use in assisting word alignment annotation.</Paragraph> <Paragraph position="5"> For manual revision, we also developed an assisting tool, which consists of a graphical interface and internal data management. Annotators can correct the output of the automatic aligner and add alignments that it has not identified. In addition to assisting with word alignment, the tool also supports annotation of phrase alignment. Since Japanese sentences have been annotated with phrase structures, annotators can select each phrase on the Japanese side and then align it with words on the Chinese side. For idioms in Japanese sentences, two or more phrases can be selected.</Paragraph> <Paragraph position="6"> The input and output files of the manual annotation are in XML format. 
The data of one sentence pair consists of the Chinese sentence annotated with morphological information, the Japanese sentence annotated with morphological and syntactic structure information, word alignment, and phrase alignment.</Paragraph> <Paragraph position="7"> Alignment annotation at the word and phrase levels is ongoing, the former focusing on lexical translations and the latter on pattern translations. After a certain amount of data is annotated, we plan to exploit the annotated data to improve the performance of automatic word alignment. We will also investigate a method to automatically identify phrase alignments from the annotated word alignments and a method to automatically discover the syntactic structures on the Chinese side from the annotated phrase alignments.</Paragraph> </Section> </Paper>