File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/w04-1121_intro.xml
Size: 3,548 bytes
Last Modified: 2025-10-06 14:02:33
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1121"> <Title>Aligning Bilingual Corpora Using Sentences Location Information*</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> A number of papers on aligning parallel texts at the sentence level appeared in the last century, e.g., (Brown et al., 1991; Gale and Church, 1993; Simard et al., 1992; Wu, 1994). On clean inputs, such as the Canadian Hansards and the Hong Kong Hansards, these methods have been very successful.</Paragraph> <Paragraph position="1"> (Church, 1993; Chen, 1993) proposed methods for aligning noisy bilingual texts. These methods use cognate information between Indo-European language pairs to align noisy texts. But they are limited when aligning language pairs that are not in the same genre or share no cognate information. (Fung, 1994) proposed a new algorithm that resolves this problem to some extent. The algorithm uses frequency, position and recency information as features for pattern matching. (W. Bin, 2000) adopted an idea similar to that of (Fung, 1994) to align special-domain bilingual texts. Both algorithms need some high-frequency word pairs as anchor points.</Paragraph> <Paragraph position="2"> When processing texts that contain few high-frequency words, these methods perform poorly and with lower precision because of data sparseness.</Paragraph> <Paragraph position="3"> (Haruno and Yamazaki, 1996) tried to align short texts without enough repeated words in structurally different languages, such as English and Japanese. They applied the POS information of content words and an online dictionary to find matching word pairs. But this approach is only suitable for short texts.</Paragraph> <Paragraph position="4"> Real text always contains some noise. 
It has the following characteristics: 1) Real bilingual text has no strictly aligned paragraph boundaries; 2) Some paragraphs may be merged into a larger paragraph at the translator's discretion; 3) There are many complex translation patterns in real text; 4) Styles and themes vary; 5) Different genres have different inherent characteristics.</Paragraph> <Paragraph position="5"> The traditional approaches to alignment fall into two main classes: lexical and length-based. All these methods have limitations when facing real text with the characteristics mentioned above. We propose a new alignment method based on sentence location information. Its basic idea is that the sentences of an aligned pair, given their lengths, are located at similar positions within their respective texts. The local and global location information of a sentence pair is combined to determine the probability that the sentence pair is a sentence bead. * This research was supported by the National Natural Science Foundation (60203020) and the Science Foundation of Harbin Institute of Technology (hit.2002.73).</Paragraph> <Paragraph position="6"> In the first of the following sections, we describe several concepts. The subsequent section presents the mathematical model of our alignment approach. Section 4 presents the anchor selection process, and Section 5 describes the implementation of the algorithm. The experimental results and discussion are given in Section 6. In the final section, we conclude with a discussion of future work.</Paragraph> </Section> </Paper>