<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1039">
<Title>A Portable Algorithm for Mapping Bitext Correspondence</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Texts that are available in two languages (bitexts) are immensely valuable for many natural language processing applications¹. Bitexts are the raw material from which translation models are built. In addition to their use in machine translation (Sato &amp; Nagao, 1990; Brown et al., 1993; Melamed, 1997), translation models can be applied to machine-assisted translation (Sato, 1992; Foster et al., 1996), cross-lingual information retrieval (SIGIR, 1996), and gisting of World Wide Web pages (Resnik, 1997). Bitexts also play a role in less automated applications such as concordancing for bilingual lexicography (Catizone et al., 1993; Gale &amp; Church, 1991b), computer-assisted language learning, and tools for translators (e.g., Macklovitch, 1995; Melamed, 1996b). However, bitexts are of little use without an automatic method for constructing bitext maps.</Paragraph>
<Paragraph position="1"> ¹ &quot;Multitexts&quot; in more than two languages are even more valuable, but they are much rarer.</Paragraph>
<Paragraph position="2"> Bitext maps identify corresponding text units between the two halves of a bitext. The ideal bitext mapping algorithm should be fast and accurate, use little memory, and degrade gracefully when faced with translation irregularities like omissions and inversions.</Paragraph>
<Paragraph position="3"> Such an algorithm should also be applicable to any text genre in any pair of languages.</Paragraph>
<Paragraph position="4"> The Smooth Injective Map Recognizer (SIMR) algorithm presented in this paper is a bitext mapping algorithm that advances the state of the art on these criteria.
The evaluation in Section 5 shows that SIMR's error rates are lower than those of other bitext mapping algorithms by an order of magnitude. At the same time, its expected running time and memory requirements are linear in the size of the input, better than any other published algorithm.</Paragraph>
<Paragraph position="5"> The paper begins by laying down SIMR's geometric foundations and describing the algorithm. Then, Section 4 explains how to port SIMR to arbitrary language pairs with minimal effort, without relying on genre-specific information such as sentence boundaries. The last section offers some insights about the optimal level of text analysis for mapping bitext correspondence.</Paragraph>
</Section>
</Paper>