<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1037">
  <Title>j schang@cs.nthu.edu.tw</Title>
  <Section position="2" start_page="0" end_page="210" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Brown et al. (1990) initiated much of the recent interest in bilingual corpora. They advocated applying a statistical approach to machine translation (SMT).</Paragraph>
    <Paragraph position="1"> The SMT approach can be understood as a word-by-word model consisting of two submodels: a language model for generating a source text segment ST and a translation model for translating ST to a target text segment TT. They recommended using an aligned bilingual corpus to estimate the parameters of the translation probability, Pr(ST | TT), in the translation model. The resolution of alignment can vary from low to high: section, paragraph, sentence, phrase, and word (Gale and Church 1993; Matsumoto et al. 1993).</Paragraph>
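The two-submodel decomposition above can be sketched in a few lines. The probability tables and segment strings below are invented toy values, not figures from Brown et al.; the point is only the ranking of candidate source segments by the product of the two submodel scores.

```python
# Toy language model Pr(ST) and translation model scores; all numbers
# are invented for illustration only.
lm = {"the house": 0.02, "house the": 0.001}
tm = {("la maison", "the house"): 0.30,
      ("la maison", "house the"): 0.05}

def best_source(tt, candidates):
    """Rank candidate source segments ST for a target segment TT by the
    product of the language-model and translation-model scores."""
    return max(candidates, key=lambda st: lm.get(st, 0.0) * tm.get((tt, st), 0.0))

print(best_source("la maison", ["the house", "house the"]))  # -> the house
```

Here the fluent word order wins because both submodels prefer it; a decoder searches this product over a far larger candidate space.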
    <Paragraph position="2"> In addition to machine translation, many applications for aligned corpora have been proposed, including bilingual lexicography (Gale and Church 1991, Smadja 1992, Daille, Gaussier and Langé 1994), and word-sense disambiguation (Gale, Church and Yarowsky 1992, Chen and Chang 1994).</Paragraph>
    <Paragraph position="3"> In the context of statistical machine translation, Brown et al. (1993) presented a series of five models for Pr(ST | TT). The first two models have been used in research on word alignment. Model 1 assumes that Pr(ST | TT) depends only on the lexical translation probability t(s | t), i.e., the probability of the i-th word s in ST producing the j-th word t in TT as its translation. The pair of words (s, t) is called a connection. Model 2 enhances Model 1 by considering the dependence of Pr(ST | TT) on the distortion probability d(i | j, l, m), where l and m are the numbers of words in ST and TT, respectively.</Paragraph>
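Under the Model 1 assumption, the likelihood of ST given TT factors into an independent sum over candidate generating words for each source word. The sketch below follows the t(s | t) direction used in the text; the probability table and the NULL-word convention are standard for Model 1, but the numbers are invented.

```python
import math

def model1_loglik(st, tt, t):
    """Model 1-style log-likelihood of source words st given target words tt:
    each source word is generated independently by some target word (or NULL),
    weighted by the lexical probability t(s | t).  Sketch only; t is a dict
    {(s, tw): prob} with invented values, and unseen pairs get a tiny floor."""
    tt = ["NULL"] + tt          # allow source words to align to nothing
    ll = 0.0
    for s in st:
        ll += math.log(sum(t.get((s, tw), 1e-12) for tw in tt) / len(tt))
    return ll

t = {("the", "la"): 0.9, ("house", "maison"): 0.8}
good = model1_loglik(["the", "house"], ["la", "maison"], t)
bad = model1_loglik(["the", "house"], ["le", "livre"], t)
assert good > bad   # the matching pair scores higher
```

Model 2 would further multiply each term inside the sum by the distortion weight d(i | j, l, m), favouring connections at plausible positions.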
    <Paragraph position="4"> Using an EM algorithm for Model 2, Brown et al. (1990) reported that the model produced seventeen acceptable translations for twenty-six test sentences. However, the degree of success in word alignment was not reported.</Paragraph>
    <Paragraph position="6"> Dagan, Church and Gale (1992) proposed directly aligning words without the preprocessing phase of sentence alignment. Under this proposal, a rough character-by-character alignment is first performed.</Paragraph>
    <Paragraph position="7"> From this rough character alignment, words are aligned using an EM algorithm for Model 2 in a fashion quite similar to the method presented by Brown et al. Instead of d(i | j, l, m), a smaller set of offset probabilities, o(i - i'), was used, where the i'-th word of ST was connected to the j-th word of TT in the rough alignment. This algorithm was evaluated on a noisy English-French technical document. The authors claimed that 60.5% of the 65,000 words in the document were correctly aligned. For 84% of the words, the offset from the correct alignment was at most 3.</Paragraph>
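The appeal of the offset table is its size: instead of a parameter for every (i, j, l, m) combination, only one per offset value is needed. A schematic version, with invented probability values, might look like this:

```python
# A small table of offset probabilities o(i - i'), replacing the full
# distortion table d(i | j, l, m).  Here i' is the ST position that the
# rough character-level alignment connects to TT position j.
# The values below are invented for illustration.
OFFSET = {-2: 0.05, -1: 0.15, 0: 0.50, 1: 0.20, 2: 0.08}

def offset_prob(i, i_rough):
    """Probability that the true connection falls at offset i - i_rough
    from the position predicted by the rough alignment."""
    return OFFSET.get(i - i_rough, 0.01)   # small floor for unseen offsets
```

The sharp peak at offset 0 encodes the assumption that the rough alignment is usually close to correct, consistent with the 84%-within-3 figure quoted above.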
    <Paragraph position="8"> Motivated by the need to reduce the memory requirement and to ensure robustness in the estimation of probabilities, Gale and Church (1991) proposed an alternative algorithm in which probabilities are not estimated and stored for all word pairs. Instead, only strongly associated word pairs are found and stored.</Paragraph>
    <Paragraph position="9"> This is achieved by applying the φ² test, a χ²-like statistic. The extracted word pairs are used to match words in ST and TT. The algorithm works from left to right in ST, using a dynamic programming procedure to maximize Pr(ST | TT). The probability t(s | t) is approximated as a function of fan-in, the number of matches (s', t) for all s' ∈ ST, while the distortion d(i | j, l, m) is approximated as a probability function, Pr(match | j' - j), of the slope j' - j, where (i', j') is the position of the nearest connection to the left of s. The authors claim that when a relevant threshold is set, the algorithm can recommend connections for 61% of the words in 800 sentence pairs. Approximately 95% of the suggested connections are correct.</Paragraph>
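The φ² statistic is computed from a 2×2 contingency table of co-occurrence counts over aligned regions; the cell naming below is mine, and the example counts are invented.

```python
def phi_squared(a, b, c, d):
    """phi^2 association statistic for a 2x2 contingency table:
    a = regions where both s and t occur, b = s only,
    c = t only, d = neither.  Ranges from 0 (independent) to 1
    (perfect association); chi-squared-like but bounded."""
    num = (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

print(phi_squared(10, 0, 0, 10))   # perfectly associated pair -> 1.0
print(phi_squared(5, 5, 5, 5))     # independent pair -> 0.0
```

Word pairs whose φ² exceeds a threshold are kept as candidate connections; everything else is discarded, which is what keeps the stored table small.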
    <Paragraph position="10"> In this paper, we propose a word-alignment algorithm based on classes derived from sense-related categories in existing thesauri. We refer to this algorithm as SenseAlign. The proposed algorithm relies on an automatic procedure to acquire class-based rules for alignment. It does not employ word-by-word translation probabilities; nor does it use a lengthy iterative EM algorithm for converging to such probabilities. Results obtained from the algorithm demonstrate that classification based on existing thesauri is very effective in broadening coverage while maintaining high precision. When trained with a corpus only one-tenth the size of the corpus used in Gale and Church (1991), the algorithm aligns over 80% of word pairs with comparable precision (93%).</Paragraph>
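The coverage gain from class-based rules can be illustrated schematically: words are mapped to thesaurus categories and rules are stated over category pairs rather than word pairs, so a rule learned from one word pair covers unseen members of the same categories. The class names, romanized Chinese placeholders, and rule set below are invented for illustration; they are not the thesauri or rules used by SenseAlign.

```python
# Hypothetical word-to-category maps and class-pair alignment rules.
en_class = {"house": "DWELLING", "home": "DWELLING", "book": "TEXT"}
zh_class = {"fangzi": "DWELLING", "shu": "TEXT"}
rules = {("DWELLING", "DWELLING"), ("TEXT", "TEXT")}

def class_match(en_word, zh_word):
    """Propose a connection when the words' categories form a known rule."""
    return (en_class.get(en_word), zh_class.get(zh_word)) in rules

print(class_match("home", "fangzi"))   # True: rule covers an unseen word pair
print(class_match("book", "fangzi"))   # False: categories do not match
```

Because "home" and "house" share a category, a rule acquired from one transfers to the other with no per-word probability estimation.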
    <Paragraph position="11"> Moreover, since the rules are based on sense distinctions, word-sense ambiguity can be resolved in favor of the corresponding senses of the rules applied in the alignment process.</Paragraph>
    <Paragraph position="12"> The rest of this paper is organized as follows. In the next section, we describe SenseAlign and discuss its main components. Examples of its output are provided in Section 3. All examples and their translations are taken from the Longman English-Chinese Dictionary of Contemporary English (Procter 1988; LecDOCE, henceforth). Section 4 summarizes the results of inside and outside tests. In Section 5, we compare SenseAlign to several other approaches that have been proposed in the computational linguistics literature. Finally, Section 6 summarizes the paper.</Paragraph>
  </Section>
</Paper>