File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/c04-1105_intro.xml

Size: 6,255 bytes

Last Modified: 2025-10-06 14:02:10

<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1105">
  <Title>Bilingual-Dictionary Adaptation to Domains</Title>
  <Section position="2" start_page="0" end_page="2" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> It is well known that appropriate translations for a word vary with domains, and bilingual-dictionary adaptation to domains is an effective way to improve the performance of, for example, machine translation and cross-language information retrieval. However, bilingual dictionaries have commonly been adapted to domains on the basis of lexicographers' intuition.</Paragraph>
    <Paragraph position="1"> It is thus desirable to develop an automated method for bilingual-dictionary adaptation.</Paragraph>
    <Paragraph position="2"> Technologies for extracting pairs of translation equivalents from parallel corpora have been established (Gale and Church 1991; Dagan et al. 1993; Fung 1995; Kitamura and Matsumoto 1996; Melamed 1997). They can, naturally, be used to adapt a bilingual dictionary to domains, that is, to select corpus-relevant translation equivalents from among those provided by an existing bilingual dictionary. However, their applicability is limited because large parallel corpora are scarce. Methods of bilingual-dictionary adaptation using weakly comparable corpora, i.e., a pair of corpora in two languages covering the same domain, are therefore required.</Paragraph>
    <Paragraph position="3"> There are a number of previous works related to bilingual-dictionary adaptation using comparable corpora. Tanaka and Iwasaki's (1996) optimization method for a translation-probability matrix mainly aims at adapting a bilingual dictionary to domains.</Paragraph>
    <Paragraph position="4"> However, it is hampered by a huge amount of computation, and was only demonstrated in a small-scale experiment. Several researchers have developed a contextual-similarity-based method for extracting pairs of translation equivalents (Kaji and Aizono 1996; Fung and McKeown 1997; Fung and Yee 1998; Rapp 1999). It is computationally efficient compared to Tanaka and Iwasaki's method, but the precision of extracted translation equivalents is still not acceptable.</Paragraph>
    <Paragraph position="5"> In light of these works, the author proposes two methods for bilingual-dictionary adaptation. The first is a variant of the contextual-similarity-based method for extracting pairs of translation equivalents; it focuses on selecting corpus-relevant translation equivalents from among those provided by a bilingual dictionary. This selection task may be easier than finding new pairs of translation equivalents. The second is a newly devised method using the ratio of associated words that suggest each translation equivalent; it was inspired by research on word-sense disambiguation using bilingual comparable corpora (Kaji and Morimoto 2002). The two methods were evaluated and compared by using the EDR (Japan Electronic Dictionary Research Institute) bilingual dictionary together with Wall Street Journal and Nihon Keizai Shimbun corpora.
2 Method based on contextual similarity
This method is based on the assumption that a word in one language and its translation equivalent in another language occur in similar contexts, although their contexts are represented by words in their respective languages. In the case of the present task (i.e., bilingual-dictionary adaptation), a bilingual dictionary provides a set of candidate translation equivalents for each target word. The contextual similarity of each of the candidate translation equivalents to the target word is thus evaluated with the assistance of the bilingual dictionary, and a pre-determined number of translation equivalents are selected in descending order of contextual similarity. Note that it is difficult to preset a threshold for contextual similarity since the distribution of contextual similarity values varies with target words.</Paragraph>
    <Paragraph position="6"> In this paper, "target word" indicates the word for which translation equivalents are to be selected.</Paragraph>
    <Paragraph position="7"> A flow diagram of the proposed method is shown in Figure 1. The essential issues regarding this method are described in the following.</Paragraph>
    <Paragraph position="8"> Word associations are extracted by setting a threshold for mutual information between words in the same language. The mutual information of a pair of words is defined in terms of their co-occurrence frequency and respective occurrence frequencies (Church and Hanks 1990). A medium-sized window, i.e., a window including a few-dozen words, is used to count co-occurrence frequencies. Only word associations consisting of content words are extracted. This is because function words neither have domain-dependent translation equivalents nor represent contexts.</Paragraph>
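The association-extraction step can be sketched as follows. This is a minimal illustration rather than the author's implementation: the function name `extract_associations`, the window size of 25, the threshold value, and the stopword-based filtering of function words are all assumptions made here. Function words are dropped before windowing, so the window spans content words only, and mutual information is estimated as log2 of f(w,v)·N / (f(w)·f(v)) in the style of Church and Hanks (1990).

```python
import math
from collections import Counter


def extract_associations(tokens, window=25, threshold=1.0, stopwords=frozenset()):
    """Return {(w, v): mutual_information} for content-word pairs at or above threshold.

    Assumptions (not from the paper): `stopwords` stands in for function-word
    filtering, and `window` counts following content words.
    """
    # Keep content words only; function words carry no domain context.
    content = [t for t in tokens if t not in stopwords]
    n = len(content)
    word_freq = Counter(content)

    # Count each unordered pair once per co-occurrence inside the window.
    pair_freq = Counter()
    for i, w in enumerate(content):
        for v in content[i + 1 : i + 1 + window]:
            if v != w:
                pair_freq[tuple(sorted((w, v)))] += 1

    # Keep only pairs whose mutual information reaches the threshold.
    associations = {}
    for (w, v), f_wv in pair_freq.items():
        mi = math.log2(f_wv * n / (word_freq[w] * word_freq[v]))
        if mi >= threshold:
            associations[(w, v)] = mi
    return associations
```

In practice the threshold and window size would be tuned on the corpus at hand; the values above are placeholders.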
    <Paragraph position="9"> Both a target word and each of its candidate translation equivalents are characterized by context vectors. A context vector consists of associated words weighted with mutual information.</Paragraph>
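A context vector of this kind might be represented as a sparse mapping from associated words to their mutual-information weights. The sketch below assumes that word associations are stored as a dictionary keyed by unordered word pairs; the function name `context_vector` and this storage layout are illustrative choices, not part of the original method description.

```python
def context_vector(word, associations):
    """Map each word associated with `word` to its mutual-information weight.

    `associations` is assumed to be {(w, v): mutual_information} with each
    unordered pair stored once.
    """
    vector = {}
    for (w, v), mi in associations.items():
        if w == word:
            vector[v] = mi
        elif v == word:
            vector[w] = mi
    return vector


# Toy associations with made-up weights.
associations = {("bank", "interest"): 2.5, ("bank", "river"): 1.2, ("loan", "rate"): 3.0}
bank_vector = context_vector("bank", associations)
```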
    <Paragraph position="10"> The similarity of a candidate translation equivalent to a target word is defined as the cosine coefficient between the context vector characterizing the target word and the translated context vector characterizing the candidate translation equivalent, as follows. Under the assumption that target word x and candidate translation equivalent y are characterized by first-language context vector $a(x) = (a_1(x), \ldots, a_m(x))$ and second-language context vector $b(y) = (b_1(y), \ldots, b_n(y))$, respectively, b(y) is translated into a first-language vector denoted as $a'(y) = (a'_1(y), \ldots, a'_m(y))$.</Paragraph>
    <Paragraph position="12"> Each element $a'_i(y)$ is defined in terms of the elements $b_j(y)$ and an indicator $\delta_{i,j}$ [defining equation not recoverable from the source], where $\delta_{i,j} = 1$ if the j-th element of b(y) is a translation of the i-th element of a(x); otherwise, $\delta_{i,j} = 0$. Elements of b(y) that cannot be translated into elements of a'(y) constitute a residual second-language vector, denoted as b'(y).</Paragraph>
    <Paragraph position="14"> The similarity of candidate translation equivalent y to target word x is then defined as $\mathrm{Sim}(x, y) = \cos(a(x), a'(y) + b'(y))$.</Paragraph>
    <Paragraph position="15"> Note that a'(y)+b'(y) is a concatenation of a'(y) and b'(y) since they have no elements in common.</Paragraph>
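The similarity computation and the subsequent selection of translation equivalents can be sketched as follows. Several points are assumptions made for illustration: vectors are sparse dictionaries, the bilingual dictionary is a mapping from second-language words to sets of first-language words, and when several elements of b(y) translate to the same element of a(x) the largest weight is kept (the source does not preserve how such elements are combined). The function names and the romanized example words are hypothetical.

```python
import math


def similarity(a_x, b_y, translations):
    """Cosine between a(x) and the concatenation of a'(y) and b'(y).

    a_x: first-language context vector {word: weight}
    b_y: second-language context vector {word: weight}
    translations: {second-language word: set of first-language words}
    """
    a_prime = {}     # b(y) translated onto the elements of a(x)
    b_residual = {}  # elements of b(y) with no translation among a(x)
    for word_j, weight in b_y.items():
        targets = [w for w in translations.get(word_j, ()) if w in a_x]
        if not targets:
            b_residual[word_j] = weight
        for w in targets:
            # Assumption: keep the largest weight among translating elements.
            a_prime[w] = max(a_prime.get(w, 0.0), weight)

    # The residual shares no elements with a(x), so it only enlarges the norm.
    dot = sum(a_x[w] * a_prime.get(w, 0.0) for w in a_x)
    norm_a = math.sqrt(sum(v * v for v in a_x.values()))
    norm_y = math.sqrt(
        sum(v * v for v in a_prime.values())
        + sum(v * v for v in b_residual.values())
    )
    if norm_a == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_a * norm_y)


def select_translations(a_x, candidates, translations, k=1):
    """Return the k candidates with the highest similarity to the target word."""
    ranked = sorted(
        candidates,
        key=lambda y: similarity(a_x, candidates[y], translations),
        reverse=True,
    )
    return ranked[:k]
```

Selecting a fixed number k of candidates, rather than applying a similarity threshold, mirrors the observation that the distribution of similarity values varies with target words.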
  </Section>
</Paper>