<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3002">
  <Title>Improving Domain-Specific Word Alignment for Computer Assisted Translation</Title>
  <Section position="2" start_page="0" end_page="21" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Bilingual word alignment was first introduced as an intermediate result in statistical machine translation (SMT) (Brown et al., 1993). Among previous alignment methods, some researchers modeled the alignments with different statistical models (Wu, 1997; Och and Ney, 2000; Cherry and Lin, 2003), while others used similarity and association measures to build alignment links (Ahrenberg et al., 1998; Tufis and Barbu, 2002). However, all of these methods require a large-scale bilingual corpus for training.</Paragraph>
    <Paragraph position="1"> When a large-scale bilingual corpus is not available, some researchers use existing dictionaries to improve word alignment (Ker and Chang, 1997).</Paragraph>
    <Paragraph position="2"> However, few studies address the problem of domain-specific word alignment when neither a large-scale domain-specific bilingual corpus nor a domain-specific translation dictionary is available.</Paragraph>
    <Paragraph position="3"> This paper addresses the problem of word alignment in a specific domain, where only a small domain-specific corpus is available. A domain-specific corpus contains two kinds of words. Some are general words, which are also frequently used in the general domain. Others are domain-specific words, which occur only in the specific domain. In general, it is not difficult to obtain a large-scale general bilingual corpus, whereas the available domain-specific bilingual corpus is usually quite small. Thus, we use the bilingual corpus in the general domain to improve word alignments for general words and the corpus in the specific domain for domain-specific words. In other words, we adapt the word alignment information in the general domain to the specific domain.</Paragraph>
    <Paragraph position="4"> In this paper, we perform word alignment adaptation from the general domain to a specific domain (in this study, a user manual for a medical system) in four steps: (1) we train a word alignment model using the large-scale bilingual corpus in the general domain; (2) we train another word alignment model using the small-scale bilingual corpus in the specific domain; (3) we build two translation dictionaries according to the alignment results in (1) and (2), respectively; (4) for each sentence pair in the specific domain, we use the two models to obtain different word alignment results and improve the results according to the translation dictionaries. Experimental results show that our method improves domain-specific word alignment in terms of both precision and recall, achieving a 21.96% relative error rate reduction.</Paragraph>
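Step (3) can be sketched roughly as follows. This is a minimal illustration and not the authors' procedure: it assumes the alignment results are available as lists of (English word, Chinese word) pairs and that a pair enters the dictionary once it has been seen at least `min_count` times. The function name `build_dictionary`, the threshold, and the sample links are all hypothetical.

```python
from collections import Counter


def build_dictionary(aligned_pairs, min_count=2):
    """Keep (source, target) word pairs seen at least min_count times.

    The frequency threshold is an assumed stand-in for whatever selection
    criterion is actually applied to the alignment results.
    """
    counts = Counter(aligned_pairs)
    dictionary = {}
    for (source, target), count in counts.items():
        if count >= min_count:
            dictionary.setdefault(source, set()).add(target)
    return dictionary


# Hypothetical alignment links from the general and the specific corpus.
general_links = [
    ("system", "系统"), ("system", "系统"), ("system", "体系"),
    ("use", "使用"), ("use", "使用"),
]
specific_links = [("scan", "扫描"), ("scan", "扫描"), ("probe", "探头")]

# One dictionary per domain, mirroring the two dictionaries of step (3).
general_dict = build_dictionary(general_links)
specific_dict = build_dictionary(specific_links)
```

Keeping the two dictionaries separate reflects the idea that general words are best covered by the large general corpus and domain-specific words by the small specific one.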
    <Paragraph position="5"> The acquired alignment results are used in a generalized translation memory system (GTMS, a kind of computer-assisted translation system) (Simard and Langlais, 2001). This kind of system facilitates the re-use of existing translation pairs to translate documents. When translating a new sentence, the system retrieves pre-translated examples that match the input and recommends a translation to the human translator, who then edits the suggestion to produce the final translation. A conventional TMS can only recommend translation examples at the sentential level, whereas a GTMS can work at both the sentential and sub-sentential levels by using word alignment results. Such systems are usually employed to translate documents such as user manuals, computer operation guides, and mechanical operation manuals.</Paragraph>
    <Section position="1" start_page="0" end_page="21" type="sub_section">
      <SectionTitle>
Word Alignment Adaptation
Bi-directional Word Alignment
</SectionTitle>
      <Paragraph position="0"> In statistical translation models (Brown et al., 1993), only one-to-one and many-to-one word alignment links can be found. Thus, some multi-word units cannot be correctly aligned. To deal with this problem, we perform translation in two directions (English to Chinese, and Chinese to English) as described in Och and Ney (2000). The GIZA++ toolkit is used to perform statistical word alignment.</Paragraph>
      <Paragraph position="1"> For the general domain, we use and to represent the alignment sets obtained with English as the source language and Chinese as the target language or vice versa. For alignment links in both sets, we use i for English words and j for Chinese words.</Paragraph>
      <Paragraph position="2">  Where, is the position of the source word aligned to the target word in position k. The set indicates the words aligned to the same source word k. For example, if a Chinese word in position j is connected to an English word in position i, then . And if a Chinese word in position j is connected to English words in positions i and k, then</Paragraph>
      <Paragraph position="4"> Based on the above two alignment sets, we obtain their intersection set, union set, and subtraction set.</Paragraph>
      <Paragraph position="5"> Intersection:</Paragraph>
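The three set operations named above can be illustrated with a small sketch. It assumes each alignment set is a Python set of (i, j) pairs, with i an English word position and j a Chinese word position. The variable names and sample links are hypothetical, and "subtraction" is assumed here to mean the union minus the intersection, which this section does not spell out.

```python
# Hypothetical alignment links as (english_position, chinese_position) pairs.
# One set per alignment direction; the values are made up for illustration.
links_en_to_zh = {(1, 1), (2, 3), (3, 2), (4, 4)}
links_zh_to_en = {(1, 1), (2, 3), (3, 4), (4, 4)}

# Links proposed by both directions.
intersection = links_en_to_zh.intersection(links_zh_to_en)

# Links proposed by at least one direction.
union = links_en_to_zh.union(links_zh_to_en)

# Assumed reading of "subtraction": links found in only one direction.
subtraction = union.difference(intersection)

print(sorted(intersection))
print(sorted(union))
print(sorted(subtraction))
```

Under this reading, links in the intersection are supported by both directions, while links in the subtraction set are the ones on which the two directions disagree.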
    </Section>
  </Section>
</Paper>