<?xml version="1.0" standalone="yes"?> <Paper uid="P00-1055"> <Title>Using Confidence Bands for Parallel Texts Alignment</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper describes a language-independent method for aligning parallel texts that makes use of homograph tokens for each pair of languages. In order to filter out tokens that may cause misalignment, we use confidence bands of linear regression lines instead of heuristics which are not theoretically supported. This method was originally inspired by work done by Pascale Fung and Kathleen McKeown, and by Melamed, and provides the statistical support those authors could not claim.</Paragraph> <Paragraph position="1"> Introduction Human-compiled bilingual dictionaries do not cover every term translation, especially in technical domains. Moreover, we can no longer afford to waste human time and effort manually building these ever-changing and incomplete databases or designing language-specific applications to solve this problem. The need for an automatic, language-independent approach to equivalents extraction becomes clear in multilingual regions like Hong Kong, Macao, Quebec, the European Union, where texts must be translated daily into eleven languages, or even in the U.S.A., where Spanish- and English-speaking communities are intermingled.</Paragraph> <Paragraph position="2"> Parallel texts (texts that are mutual translations) are valuable sources of information for bilingual lexicography. However, they are not of much use unless a computational system can find which piece of text in one language corresponds to which piece of text in the other language. To achieve this, they must first be aligned, i.e. the various pieces of text must be put into correspondence. This makes the translation extraction task easier and more reliable. 
Alignment is usually done by finding correspondence points: sequences of characters with the same form in both texts (homographs, e.g. numbers, proper names, punctuation marks), similar forms (cognates, like Region and Região in English and Portuguese, respectively) or even previously known translations.</Paragraph> <Paragraph position="3"> Pascale Fung and Kathleen McKeown (1997) present an alignment algorithm that uses term translations as correspondence points between English and Chinese. Melamed (1999) aligns texts using correspondence points taken either from orthographic cognates (Michel Simard et al., 1992) or from a seed translation lexicon.</Paragraph> <Paragraph position="4"> However, although the heuristics both approaches use to filter noisy points may be intuitively quite acceptable, they have no theoretical support in statistics.</Paragraph> <Paragraph position="5"> The former approach considers a candidate correspondence point reliable as long as, among some other constraints, &quot;[...] it is not too far away from the diagonal [...]&quot; (Pascale Fung and Kathleen McKeown, 1997, p.72) of a rectangle whose sides are proportional to the lengths of the texts in each language (henceforth, 'the golden translation diagonal'). The latter approach uses other filtering parameters: maximum point ambiguity level, point dispersion and angle deviation (Melamed, 1999, pp. 115-116).</Paragraph> <Paragraph position="6"> Antonio Ribeiro et al. (2000a) propose a method to filter candidate correspondence points generated from homograph words which occur only once in parallel texts (hapaxes) using linear regressions and statistically supported noise filtering methodologies. The method avoids heuristic filters, and the authors claim high-precision alignments.</Paragraph> <Paragraph position="7"> In this paper, we will extend this work by defining a linear regression line with all points generated from homographs with equal frequencies in parallel texts. 
We will filter out those points which lie outside statistically defined confidence bands (Thomas Wonnacott and Ronald Wonnacott, 1990). Our method will repeatedly use a standard linear regression line adjustment technique to filter unreliable points until there is no misalignment. The points that survive this filtering are chosen as correspondence points.</Paragraph> <Paragraph position="8"> The following section discusses related work. The method is described in section 2, and the results are evaluated and compared in section 3. Finally, we present conclusions and future work.</Paragraph> </Section> </Paper>