File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/w00-1314_intro.xml
Size: 3,526 bytes
Last Modified: 2025-10-06 14:01:05
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1314"> <Title>Word Alignment of English-Chinese Bilingual Corpus Based on Chunks</Title> <Section position="2" start_page="0" end_page="110" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> With easier access to bilingual corpora, there is a tendency in the NLP community to process and refine these corpora so that they can serve as a knowledge base for many NLP applications, such as automatic or human-aided translation, multilingual terminology and lexicography, and multilingual information retrieval systems.</Paragraph> <Paragraph position="1"> Different NLP applications need bilingual corpora aligned at different levels. By the nature of the aligned segment, alignments can be divided into the section level, paragraph level, sentence level, phrase level, word level, byte level, etc.</Paragraph> <Paragraph position="2"> For our applications, we choose to align at the chunk level, based on the following considerations. Firstly, our applications, which include an example-based machine translation system, a computer-aided translation system and a multilingual information retrieval system, need alignment below the sentence level, from which we can acquire bilingual word and phrase dictionaries and other useful translation information. Secondly, word-level alignment between English and Chinese is difficult. There are no cognate words. Changes in Chinese word order and word POS always produce many null and mistaken correspondences. Next, we observe that when an English sentence is translated into Chinese, all the words in one English chunk tend to be translated as one block of Chinese words which are coterminous.</Paragraph> <Paragraph position="3"> The word order within these blocks also tends to follow that of the English chunk. So there are stronger boundaries between chunks than between words when we translate texts. 
Finally, as is well known, a chunk is assigned a syntactic structure (Steven Abney, 1991), which comprises a connected sub-graph of the sentence's parse tree. So it is possible to align sentence structures and obtain translation grammars based on chunks by parsing.</Paragraph> <Paragraph position="4"> Many researchers have studied the text alignment problem, and a number of quite encouraging results have been reported for alignment at different levels. With a sentence-aligned corpus in hand, we focus our attention on intra-sentence alignment between sentence pairs. In this paper, a method for the word alignment of an English-Chinese corpus based on chunks is proposed. The chunks of the English sentences are identified first. Then the chunk boundaries of the Chinese sentences are predicted using a bilingual lexicon, a Chinese synonym dictionary and heuristic information. Ambiguities in the Chinese chunk boundaries are resolved by the coterminous words in English chunks. With the chunk-aligned bilingual corpus, a translation relation probability is proposed to align words. Although this paper addresses English-Chinese word alignment, the idea can be applied to bilingual corpora in other languages. In the following sections, we first present a brief review of related work in word alignment. We then discuss our chunk-based alignment algorithm in detail. Following this is an analysis of our experimental results. Finally, we close the paper with a discussion of future work.</Paragraph> </Section> </Paper>