<?xml version="1.0" standalone="yes"?> <Paper uid="P94-1012"> <Title>ALIGNING A PARALLEL ENGLISH-CHINESE CORPUS STATISTICALLY WITH LEXICAL CRITERIA</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> Recently, a number of automatic techniques for aligning sentences in parallel bilingual corpora have been proposed (Kay & Röscheisen 1988; Catizone et al. 1989; Gale & Church 1991; Brown et al. 1991; Chen 1993), and coarser approaches when sentences are difficult to identify have also been advanced (Church 1993; Dagan et al. 1993).</Paragraph> <Paragraph position="1"> Such corpora contain the same material that has been translated by human experts into two languages. The goal of alignment is to identify matching sentences between the languages. Alignment is the first stage in extracting structural information and statistical parameters from bilingual corpora.</Paragraph> <Paragraph position="2"> The problem is made more difficult because a sentence in one language may correspond to multiple sentences in the other; worse yet, sometimes several sentences' content is distributed across multiple translated sentences.</Paragraph> <Paragraph position="3"> Approaches to alignment fall into two main classes: lexical and statistical. Lexically-based techniques use extensive online bilingual lexicons to match sentences. In contrast, statistical techniques require almost no prior knowledge and are based solely on the lengths of sentences. The empirical results to date suggest that statistical methods yield performance superior to that of currently available lexical techniques.</Paragraph> <Paragraph position="4"> However, as far as we know, the literature on automatic alignment has been restricted to alphabetic Indo-European languages.
This methodological flaw weakens the arguments in favor of either approach, since it is unclear to what extent a technique's superiority depends on the similarity between related languages. The work reported herein moves towards addressing this problem. In this paper, we describe our experience with automatic alignment of sentences in parallel English-Chinese texts, which was performed as part of the SILC machine translation project. Our report concerns three related topics. In the first of the following sections, we describe the objectives of the HKUST English-Chinese Parallel Bilingual Corpus, and our progress. The subsequent sections report experiments addressing the applicability of a suitably modified version of Gale & Church's (1991) length-based statistical method to the task of aligning English with Chinese. In the final section, we describe an improved statistical method that also permits domain-specific lexical cues to be incorporated probabilistically.</Paragraph> </Section> </Paper>