<?xml version="1.0" standalone="yes"?> <Paper uid="P04-3006"> <Title>An Automatic Filter for Non-Parallel Texts</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In June 2003, the U.S. government organized a &quot;Surprise Language Exercise&quot; for the NLP community. The goal was to build the best possible language technologies for a &quot;surprise&quot; language in just one month (Oard, 2003). One of the main technologies pursued was machine translation (MT). Statistical MT (SMT) systems were the most successful in this scenario, because their construction typically requires less time than other approaches. On the other hand, SMT systems require large quantities of parallel text as training data. A significant collection of parallel text was obtained for this purpose from multiple sources. SMT systems were built and tested; results were reported.</Paragraph> <Paragraph position="1"> Much later we were surprised to discover that a significant portion of the training data was not parallel text! Some of the document pairs were on the same topic but not translations of each other. For today's sentence-based SMT systems, this kind of data is noise. How much better would the results have been if the noisy training data were automatically filtered out? This question is becoming more important as SMT systems increase their reliance on automatically collected parallel texts.</Paragraph> <Paragraph position="2"> There is abundant literature on aligning parallel texts at the sentence level. To the best of our knowledge, all published methods happily misalign non-parallel inputs, without so much as a warning. There is also some recent work on distinguishing parallel texts from pairs of unrelated texts (Resnik and Smith, 2003). 
In this paper, we propose a solution to the more difficult problem of distinguishing parallel texts from texts that are comparable but not parallel.</Paragraph> <Paragraph position="3"> Definitions of &quot;comparable texts&quot; vary in the literature. Here we adopt a definition that is most suitable for filtering SMT training data: Two texts are &quot;comparable&quot; if they are not alignable at approximately the sentence level. This definition is also suitable for other applications of parallel texts, such as machine-assisted translation and computer-assisted foreign language learning.</Paragraph> <Paragraph position="4"> Resnik and Smith (2003) suggested three approaches to filtering non-parallel texts: STRAND, tsim, and a combination of the two. STRAND relies on mark-up within a document to reveal the document's structure. STRAND then predicts that documents with the same structure are parallel. Tsim uses a machine-readable bilingual dictionary to find word-to-word matches between the two halves of a bitext. It then computes a similarity score based on the maximum-cardinality bipartite matching between the two halves. We chose to compare our method with tsim because we were interested in an approach that works with both marked-up and plain-text documents.</Paragraph> </Section> </Paper>