<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0107">
  <Title>Automatic Extraction of Word Sequence Correspondences in Parallel Corpora</Title>
  <Section position="3" start_page="0" end_page="79" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> A high-quality translation dictionary is indispensable for machine translation systems that perform well, especially in specialized domains. Because such dictionaries are effectively usable only within their own domains, much human labour can be saved if such a dictionary is obtained automatically from a set of translation examples.</Paragraph>
    <Paragraph position="1"> This paper proposes a method to construct a translation dictionary that consists not only of word pairs but also of pairs of arbitrary-length word sequences in the two languages. All of the pairs are extracted from a parallel corpus of a specific domain. The method is evaluated on Japanese-English parallel corpora of three distinct domains.</Paragraph>
    <Paragraph position="2"> Several attempts have been made for similar purposes, but with different settings (see [Kupiec 93][Kumano &amp; Hirakawa 94][Smadja 96]). Kupiec and Kumano &amp; Hirakawa propose methods of obtaining translation patterns of noun compounds from bilingual corpora. Kumano &amp; Hirakawa differ from the other works in that they assume an ordinary bilingual dictionary and use non-parallel (non-aligned) corpora. Their target is to find correspondences not only at the word level but also for noun phrases and unknown words. However, the target noun phrases and unknown words are decided in the preprocessing stage.</Paragraph>
    <Paragraph position="3"> Brown et al. use a probabilistic measure for estimating word similarity between two languages in their statistical approach to language translation [Brown 88]. In their work on aligning parallel texts, Kay &amp; Röscheisen used the Dice coefficient as the word similarity measure for ensuring sentence-level correspondence [Kay &amp; Röscheisen 93].</Paragraph>
    <Paragraph position="4"> Kitamura &amp; Matsumoto use the same measure to calculate word similarity in their work on extracting translation patterns. The similarity measure serves as the basis of their structural matching of parallel sentences, from which structural translation patterns are extracted. In specialized texts, correspondences between word sequences, rather than between single words, are abundant, especially in the form of noun compounds or fixed phrases, and they are key to better performance. Although the method proposed in this paper deals only with consecutive sequences of words and is intended to provide a better base for the structural matching that follows, the results themselves yield very useful and informative translation patterns for the domain.</Paragraph>
    <Paragraph position="5"> Our method extends the usage of the Dice coefficient in two ways: it deals not only with correspondences between words but also with correspondences between word sequences, and it modifies the formula of the measure so that more plausible corresponding pairs are identified earlier.</Paragraph>
  </Section>
</Paper>