<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1013"> <Title>From Words to Corpora: Recognizing Translation</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> As in most areas of natural language processing, recent approaches to machine translation have turned increasingly to statistical modeling of the phenomenon (translation models) (Berger et al., 1994). Such models are learned automatically from data, typically parallel corpora: texts in two or more languages that are mutual translations. As computational resources have become more powerful and less expensive, the task of training translation models has become feasible (Al-Onaizan et al., 1999), as has the task of translating (or &quot;decoding&quot;) text using such models (Germann et al., 2001). However, the success of the statistical approach to translation (and also to other multilingual applications that utilize parallel text) hangs crucially on the quality, quantity, and diversity of data used in parameter estimation.</Paragraph> <Paragraph position="1"> If translation is a generative process, then one might consider its reverse process of recognition: Given two documents, might it be determined fully automatically whether they are translations of each other? The ability to detect translations of a document has numerous applications. The most obvious is as a means to build a parallel corpus from a set of multilingual documents that contains some translation pairs. Examples include mining the World-Wide Web for parallel text (Resnik, 1999; Nie et al., 1999; Ma and Liberman, 1999) and building parallel corpora from comparable corpora such as multilingual collections of news reports. Another use of translation detection might be as an aid in alignment tasks at any level. 
For example, consider the task of aligning NP chunks (and perhaps also the extra-NP material) in an NP-bracketed parallel corpus; a chunk-level similarity score (Fluhr et al., 2000) built from a word-level model could be incorporated into a framework that involves bootstrapping more complex models of translation from simpler ones (Berger et al., 1994). Finally, reliable cross-lingual duplicate detection might improve performance in n-best multilingual information retrieval systems; at the same time, detecting an existing translation of a document of interest in a multilingual corpus eliminates the cost of translating it.</Paragraph> <Paragraph position="2"> I present here an algorithm for classifying document pairs as either translationally equivalent or not, which can be built upon any kind of word-to-word translation lexicon (automatically learned or hand-crafted). I propose a score of translational similarity, then describe an evaluation task involving a constrained search, in a noisy space, for texts (of arbitrary size) that are translation pairs, and present precision/recall results. Finally, I show that this algorithm performs competitively with the approach of Resnik (1999), in which only structural information (HTML markup) is used to detect translation pairs, though the new algorithm does not require structural information.</Paragraph> </Section> </Paper>