<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0314">
  <Title>Learning Sequence-to-Sequence Correspondences from Parallel Corpora via Sequential Pattern Mining</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper addresses the problem of identifying &quot;multiword&quot; (sequence-to-sequence) translation correspondences from parallel corpora. It is well known that translation does not always proceed word for word, which highlights the need to find multi-word translation correspondences. Previous work on multi-word translation correspondences from parallel corpora has addressed noun phrase correspondences (Kupiec, 1993), fixed/flexible collocations (Smadja et al., 1996), n-gram word sequences of arbitrary length (Kitamura and Matsumoto, 1996), non-compositional compounds (Melamed, 2001), captoids (Moore, 2001), and named entities.</Paragraph>
    <Paragraph position="1"> In all of these approaches, a common problem seems to be the identification of meaningful multi-word translation units. Several factors make the handling of multi-word units more complicated than it appears. First, the mapping is many-to-many, which can lead to a combinatorial explosion. Second, multi-word translation units are not necessarily contiguous, so an algorithm should not be restricted by a word adjacency constraint. Third, word segmentation itself is ambiguous for non-segmented languages such as Chinese or Japanese, and this ambiguity must be resolved as well.</Paragraph>
    <Paragraph position="2"> In this paper, we apply sequential pattern mining to solve the problem. First, the method effectively avoids the inherent combinatorial explosion by concatenating each pair of parallel sentences into a single bilingual sequence and applying a pattern mining algorithm to those sequences.</Paragraph>
    <Paragraph position="3"> Second, it covers both rigid (gap-less) and gapped sequences. Third, it provides a systematic way of enumerating all possible translation pair candidates, whether single- or multi-word. Note that some candidates overlap, to account for word segmentation ambiguity. The method is balanced by a conservative discovery of translation correspondences, on the rationale that direct associations will win over indirect ones, thereby resolving the ambiguity.</Paragraph>
  </Section>
</Paper>