File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/95/e95-1010_intro.xml
Size: 3,489 bytes
Last Modified: 2025-10-06 14:05:52
<?xml version="1.0" standalone="yes"?> <Paper uid="E95-1010"> <Title>Text Alignment in the Real World: Improving Alignments of Noisy Translations Using Common Lexical Features, String Matching Strategies and N-Gram Comparisons</Title> <Section position="2" start_page="0" end_page="67" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Given texts in two languages that are to some degree translations of one another, an alignment of the texts associates sentences, paragraphs or phrases in one document with their translations in the other. Successful approaches to alignment can be divided into two primary types: those that use comparisons of lexical elements between the documents (Wu, 1994; Chen, 1993; Catizone, Russell and Warwick, 1989), and those that use a statistical decision process derived from byte-length ratios between alignment blocks (Wu, 1994; Church, 1993; Gale and Church, 1991). (This research was funded under DoD Contract #MDA 904-94-C-E086.)</Paragraph> <Paragraph position="1"> Methods vary for the former approach, but in the latter approach a dynamic programming framework is used to align blocks sequentially. Under this model, blocks are compared only with nearby blocks as the alignment proceeds, substantially reducing the computational overhead of the alignment process [O(n^2) versus O(n!)].</Paragraph> <Paragraph position="2"> In the primary literature on alignment, the texts are typically well-behaved. In byte-length ratio approaches, the presence of long stretches of blocks that have roughly similar lengths can be problematic, and some improvement can be achieved by augmenting the byte-length measure with scores derived from lexical feature matching (Wu, 1994). 
The difficulty of producing good alignments grows when these problems are combined with the radical formatting departures that often arise between translated documents, and it is exacerbated further by untranslated segments, textual rearrangements and other problematic text features. The dynamic programming framework makes long runs of segments that have no translation in their parallel text difficult to ignore, because the limited window size prevents passing over those segments to reach appropriate areas of the document further downstream.</Paragraph> <Paragraph position="3"> Taken together, these difficulties can be catastrophic to the alignment process. Our experience shows that the fraction of correct alignments can drop to less than 5%.</Paragraph> <Paragraph position="4"> Noisy translations of this sort reflect human error and the preferences of translators, and they are probably much more prevalent than alignment work on legislative transcriptions has indicated. The purpose of this research was to ascertain what types of information contained in a document could be used to improve the alignment process, while not making gross assumptions about the source text format conventions and peculiarities. The Pan American Health Organization (PAHO) corpus was used as a test corpus for evaluating the performance of the modified alignment algorithm. The PAHO texts are a series of documents on Latin American health issues, and our test segment consisted of 180 documents that ranged from 20 to 3825 lines in length. From these documents, several of the more problematic texts were hand-aligned for analysis and comparison with the results of automatic alignment methods.</Paragraph> </Section></Paper>
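The windowed dynamic programming idea described in the introduction can be illustrated with a minimal sketch. It is not the authors' algorithm and not the probabilistic model of Gale and Church (1991): the function name `banded_length_align`, the `window` and `skip_penalty` parameters, and the relative-length mismatch cost are all assumptions chosen for illustration. The sketch only shows how restricting comparisons to nearby blocks keeps the search small, and why every unmatched block has to be paid for explicitly.

```python
from math import inf


def banded_length_align(src_lens, tgt_lens, window=3, skip_penalty=1.0):
    """Align blocks by length with dynamic programming inside a diagonal band.

    Hypothetical illustration: the cost is the relative length difference of
    two blocks, and skipping a block on either side costs `skip_penalty`.
    Returns (i, j) index pairs; None marks a block left without a partner.
    """
    n, m = len(src_lens), len(tgt_lens)
    # Widen the band if needed so that the final cell (n, m) is reachable.
    w = max(window, abs(n - m))
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        # Only cells within w of the main diagonal are ever evaluated.
        for j in range(max(0, i - w), min(m, i + w) + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                a, b = src_lens[i - 1], tgt_lens[j - 1]
                mismatch = abs(a - b) / max(a, b, 1)
                options.append((cost[i - 1][j - 1] + mismatch, (i - 1, j - 1)))
            if i > 0:  # source block i-1 has no translation
                options.append((cost[i - 1][j] + skip_penalty, (i - 1, j)))
            if j > 0:  # target block j-1 has no translation
                options.append((cost[i][j - 1] + skip_penalty, (i, j - 1)))
            cost[i][j], back[i][j] = min(options)

    # Walk the back-pointers from the final cell to recover the alignment.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((i - 1, j - 1))
        elif pi == i - 1:
            pairs.append((i - 1, None))
        else:
            pairs.append((None, j - 1))
        i, j = pi, pj
    pairs.reverse()
    return pairs


# Well-behaved text: lengths track each other block by block.
print(banded_length_align([100, 40, 250], [105, 38, 260]))
# [(0, 0), (1, 1), (2, 2)]

# One untranslated source block (length 500) is skipped.
print(banded_length_align([100, 500, 40], [100, 40], window=1))
# [(0, 0), (1, None), (2, 1)]
```

In this sketch a single untranslated block is recovered cleanly, but a long run of such blocks would accumulate skip penalties and push the correct path toward the edge of the band, which is the failure mode the introduction attributes to the limited window size.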