<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1022">
  <Title>Collocation Translation Acquisition Using Monolingual Corpora</Title>
  <Section position="3" start_page="0" end_page="22" type="intro">
    <SectionTitle>
2 Related work
</SectionTitle>
    <Paragraph position="0"> Much previous work has addressed monolingual collocation extraction. The methods can generally be classified into two types: window-based and syntax-based. The former extract collocations within a fixed window (Church and Hanks, 1990; Smadja, 1993). The latter extract collocations that have a syntactic relationship (Lin, 1998; Seretan et al., 2003). Syntax-based methods have become more favorable with recent significant increases in parsing efficiency and accuracy. Several metrics have been adopted to measure association strength in collocation extraction; Thanopoulos et al. (2002) give a comparative evaluation of these metrics.</Paragraph>
    <Paragraph position="1"> Most previous research on translation knowledge acquisition is based on parallel corpora (Brown et al., 1993). As for collocation translation, Smadja et al. (1996) implement a system to extract collocation translations from a parallel English-French corpus. English collocations are first extracted using the Xtract system, and the corresponding French translations are then sought based on the Dice coefficient. Echizen-ya et al. (2003) propose a method to extract bilingual collocations using recursive chain-link-type learning. In addition to collocation translation, there is also related work on acquiring phrase or term translations from parallel corpora (Kupiec, 1993; Yamamoto and Matsumoto, 2000).</Paragraph>
    <Paragraph position="2"> Since large aligned bilingual corpora are hard to obtain, some research has sought to acquire translation knowledge from non-parallel corpora. This work is mainly at the word level.</Paragraph>
    <Paragraph position="3"> Koehn and Knight (2000) present an approach to estimating word translation probabilities from unrelated monolingual corpora with the EM algorithm. The method shows promising results in selecting the right translation among several options provided by a bilingual dictionary. Zhou et al. (2001) propose a method to simulate translation probability with a cross-language similarity score, which is estimated from monolingual corpora based on mutual information. The method achieves good results in word translation selection. In addition, Dagan and Itai (1994) and Li (2002) propose using two monolingual corpora for word sense disambiguation. Fung (1998) uses an IR approach to induce new word translations from comparable corpora. Rapp (1999) and Koehn and Knight (2002) extract new word translations from non-parallel corpora. Cao and Li (2002) acquire noun phrase translations by making use of web data. Wu and Zhou (2003) also make full use of large-scale monolingual corpora and limited bilingual corpora for synonymous collocation extraction.</Paragraph>
    <Paragraph position="4"> 3 Training a triple translation model from monolingual corpora. In this section, we first describe the dependency correspondence assumption underlying our approach. We then propose a dependency triple translation model and a training algorithm based on monolingual corpora. The resulting triple translation model will be used for collocation translation extraction in the next section.</Paragraph>
    <Section position="1" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Dependency correspondence between
Chinese and English
</SectionTitle>
      <Paragraph position="0"> A dependency triple consists of a head, a dependant, and a dependency relation. Using a dependency parser, a sentence can be analyzed into dependency triples. We represent a triple as $(w_1, r, w_2)$, where $w_1$ and $w_2$ are words and $r$ is the dependency relation. It means that $w_2$ has a dependency relation $r$ with $w_1$. For example, the triple (overcome, verb-object, difficulty) means that "difficulty" is the object of the verb "overcome". Among all the dependency relations, we consider only the following three key types, which we think are the most important in text analysis and machine translation: verb-object (VO), noun-adjective (AN), and verb-adverb (AV).</Paragraph>
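One way to picture the triple representation described above is as a small record type with a filter for the three relation types. This is only an illustrative Python sketch: the names Triple, KEY_RELATIONS and keep_key_triples are hypothetical, and the SUBJ label in the toy data is an invented placeholder for any relation outside the three considered.

```python
from typing import NamedTuple

# The three relation types considered in the text.
KEY_RELATIONS = {"VO", "AN", "AV"}


class Triple(NamedTuple):
    head: str       # w1
    relation: str   # r
    dependant: str  # w2


def keep_key_triples(triples):
    """Keep only triples whose relation is one of VO, AN, AV."""
    return [t for t in triples if t.relation in KEY_RELATIONS]


# Toy parser output, invented for illustration.
parsed = [
    Triple("overcome", "VO", "difficulty"),
    Triple("difficulty", "AN", "great"),
    Triple("overcome", "SUBJ", "team"),  # hypothetical label, filtered out
]
kept = keep_key_triples(parsed)
```

Here kept retains the VO and AN triples and drops the third one.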
      <Paragraph position="1"> It is our observation that there is a strong correspondence in major dependency relations in translation between English and Chinese. For example, a verb-object relation in Chinese (e.g., (Ke Fu, VO, Kun Nan)) is usually translated into the same verb-object relation in English (e.g., (overcome, VO, difficulty)).</Paragraph>
      <Paragraph position="2"> This assumption has been experimentally justified on a large and balanced bilingual corpus in our previous work (Zhou et al., 2001), which found that more than 80% of the above dependency relations have a one-to-one mapping between Chinese and English. There is thus a very strong correspondence between Chinese and English in the three dependency relations considered. This fact will be used to estimate the triple translation model from two monolingual corpora.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="21" type="sub_section">
      <SectionTitle>
3.2 Triple translation model
</SectionTitle>
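      <Paragraph position="0"> According to Bayes's theorem, given a Chinese triple $c_{tri} = (c_1, r_c, c_2)$ and its candidate English triple translations $e_{tri} = (e_1, r_e, e_2)$, the best English triple $\hat{e}_{tri}$ is the one that maximizes Equation (1): $$\hat{e}_{tri} = \arg\max_{e_{tri}} p(e_{tri} \mid c_{tri}) = \arg\max_{e_{tri}} \frac{p(e_{tri})\, p(c_{tri} \mid e_{tri})}{p(c_{tri})} = \arg\max_{e_{tri}} p(e_{tri})\, p(c_{tri} \mid e_{tri}) \qquad (1)$$</Paragraph>
      <Paragraph position="2"> where $p(e_{tri})$ is usually called the language model and $p(c_{tri} \mid e_{tri})$ is usually called the translation model.</Paragraph>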
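The selection step that Bayes's theorem leads to can be sketched in a few lines of Python: score each candidate English triple by the language model times the translation model and keep the highest-scoring one. The function name best_translation and all probabilities below are invented for illustration; they are not values from the paper.

```python
def best_translation(chinese_triple, candidates, language_model, translation_model):
    """Return the candidate English triple maximizing p(e) * p(c | e)."""
    return max(
        candidates,
        key=lambda e: language_model(e) * translation_model(chinese_triple, e),
    )


# Toy numbers, invented for illustration only.
toy_lm = {
    ("overcome", "VO", "difficulty"): 0.004,
    ("conquer", "VO", "difficulty"): 0.001,
}
toy_tm = {
    ("overcome", "VO", "difficulty"): 0.5,
    ("conquer", "VO", "difficulty"): 0.5,
}

c_tri = ("Ke Fu", "VO", "Kun Nan")
best = best_translation(
    c_tri,
    list(toy_lm),
    lambda e: toy_lm[e],
    lambda c, e: toy_tm[e],
)
```

With equal translation scores, the language model decides, so the sketch picks the triple with "overcome".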
    </Section>
    <Section position="3" start_page="21" end_page="22" type="sub_section">
      <SectionTitle>
Language Model
</SectionTitle>
      <Paragraph position="0"> The language model $p(e_{tri})$ is calculated from the English triple database. To tackle the data sparseness problem, we smooth the language model with an interpolation method, as described below.</Paragraph>
      <Paragraph position="1"> When the given English triple occurs in the corpus, we can calculate it as in Equation (2).</Paragraph>
      <Paragraph position="3"/>
      <Paragraph position="5"> The wildcard symbol * stands for any word or relation. With Equations (2) and (3), we get the interpolated language model shown in (4).</Paragraph>
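The idea of smoothing a triple language model by interpolation can be illustrated with a short Python sketch. The equations themselves are not reproduced in this text, so everything specific here is an assumption: the direct estimate is taken to be the relative frequency of the triple, the back-off estimate treats the two words as independent given the relation, and the weight lam is a fixed hypothetical value. The function name build_language_model is likewise hypothetical.

```python
from collections import Counter


def build_language_model(triples, lam=0.7):
    """Return a function p(e_tri) that interpolates two estimates.

    Assumptions (not taken from the source text): the direct estimate is the
    relative frequency of the triple, the back-off estimate treats the two
    words as independent given the relation, and lam is a fixed weight.
    """
    full = Counter(triples)                          # freq(e1, r, e2)
    rel = Counter(r for _, r, _ in triples)          # freq(*, r, *)
    head = Counter((e1, r) for e1, r, _ in triples)  # freq(e1, r, *)
    dep = Counter((r, e2) for _, r, e2 in triples)   # freq(*, r, e2)
    total = len(triples)                             # freq(*, *, *)

    def prob(triple):
        e1, r, e2 = triple
        if rel[r] == 0:
            return 0.0  # relation never observed
        direct = full[triple] / total
        backoff = (rel[r] / total) * (head[(e1, r)] / rel[r]) * (dep[(r, e2)] / rel[r])
        return lam * direct + (1 - lam) * backoff

    return prob


# Toy English triple database, invented for illustration.
corpus = [
    ("overcome", "VO", "difficulty"),
    ("overcome", "VO", "difficulty"),
    ("overcome", "VO", "obstacle"),
    ("face", "VO", "difficulty"),
]
p = build_language_model(corpus)
seen = p(("overcome", "VO", "difficulty"))
unseen = p(("face", "VO", "obstacle"))
```

The point of the interpolation is visible in the toy data: the unseen triple still receives a small non-zero probability from the back-off term, while the seen triple scores much higher.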
      <Paragraph position="7"> Translation Model. We simplify the translation model according to the following two assumptions.</Paragraph>
      <Paragraph position="8"> Assumption 1: Given an English triple $e_{tri}$, and the corresponding Chinese dependency relation</Paragraph>
      <Paragraph position="10"> $p(c \mid e)$ are translation probabilities within triples; they differ from unrestricted probabilities such as those in the IBM models (Brown et al., 1993). We distinguish the translation probability of the head</Paragraph>
      <Paragraph position="12"> from that of the dependant. In the rest of the paper, we use $p_{head}(c \mid e)$ and $p_{dep}(c \mid e)$ to denote the head translation probability and the dependant translation probability, respectively.</Paragraph>
      <Paragraph position="13"> As the correspondence between the same dependency relation across English and Chinese is strong, we simply assume $p(r_c \mid r_e) = 1$ for corresponding Chinese and English dependency relations $r_c$ and $r_e$. The head and dependant translation probabilities cannot be estimated directly because there is no triple-aligned corpus available. Here, we present an approach to estimating these probabilities from two monolingual corpora based on the EM algorithm.</Paragraph>
    </Section>
    <Section position="4" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
3.3 Estimation of word translation probability using the EM algorithm
</SectionTitle>
      <Paragraph position="0"> Chinese and English corpora are first parsed using a dependency parser, and two dependency triple databases are generated. The candidate English translation set of each Chinese triple is generated through a bilingual dictionary and the assumption of strong correspondence of dependency relations. There is a risk that unrelated triples in Chinese and English are connected by this method. However, as the conditions used to make the connection are quite strong (i.e., possible word translations in the same triple structure), we believe that this risk is not very severe. Then, the expectation maximization (EM) algorithm is introduced to iteratively strengthen the correct connections and weaken the incorrect ones.</Paragraph>
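Candidate generation as described above can be sketched in Python as follows: each word of the Chinese triple is replaced by its dictionary translations while the relation is kept unchanged. The function name candidate_triples and the toy dictionary entries are hypothetical, and the sketch assumes every word pair licensed by the dictionary is kept as a candidate.

```python
from itertools import product


def candidate_triples(chinese_triple, dictionary):
    """Generate candidate English triples for one Chinese triple.

    Each word is replaced by its dictionary translations and the dependency
    relation is carried over unchanged (the correspondence assumption).
    """
    c1, rel, c2 = chinese_triple
    heads = dictionary.get(c1, [])
    dependants = dictionary.get(c2, [])
    return [(e1, rel, e2) for e1, e2 in product(heads, dependants)]


# Toy bilingual dictionary, invented for illustration.
toy_dictionary = {
    "Ke Fu": ["overcome", "conquer"],
    "Kun Nan": ["difficulty", "hardship"],
}
candidates = candidate_triples(("Ke Fu", "VO", "Kun Nan"), toy_dictionary)
```

A word missing from the dictionary yields an empty candidate list, so such triples simply contribute nothing downstream.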
      <Paragraph position="1">  are initially set to a uniform distribution as follows:  The basic idea is that under the restriction of the English triple language model )( tri ep and translation dictionary, we wish to estimate the translation probabilities )|( ecp</Paragraph>
      <Paragraph position="3"> that best explain the Chinese triple database as a translation from the English triple database. In each iteration, the normalized triple translation probabilities are used to update the word translation probabilities. Intuitively, after finding the most probable translation of the Chinese triple, we can collect counts for the word translation it contains. Since the English triple language model provides context information for the disambiguation of the Chinese words, only the appropriate occurrences are counted.</Paragraph>
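The iteration described above can be sketched as a small EM loop in Python. This is a simplified illustration under stated assumptions, not the authors' exact procedure: the uniform initialization spreads p(c | e) evenly over the Chinese words that list e as a translation, the relation is carried over unchanged, fractional counts are used, and all names (uniform_init, normalize, em) and toy data are hypothetical.

```python
from collections import defaultdict
from itertools import product


def uniform_init(dictionary):
    """p(c | e) = 1 / (number of Chinese words that list e as a translation)."""
    sources = defaultdict(set)
    for c, translations in dictionary.items():
        for e in translations:
            sources[e].add(c)
    return {(c, e): 1.0 / len(cs) for e, cs in sources.items() for c in cs}


def normalize(counts):
    """Turn expected counts into p(c | e) by normalizing over c for each e."""
    totals = defaultdict(float)
    for (c, e), n in counts.items():
        totals[e] += n
    return {(c, e): n / totals[e] for (c, e), n in counts.items() if totals[e] > 0}


def em(chinese_triples, dictionary, lm, iterations=5):
    p_head = uniform_init(dictionary)
    p_dep = dict(p_head)
    for _ in range(iterations):
        head_counts = defaultdict(float)
        dep_counts = defaultdict(float)
        for c1, rel, c2 in chinese_triples:
            # E-step: score every dictionary-licensed English triple.
            scores = {}
            for e1, e2 in product(dictionary.get(c1, []), dictionary.get(c2, [])):
                scores[(e1, e2)] = (
                    lm((e1, rel, e2))
                    * p_head.get((c1, e1), 0.0)
                    * p_dep.get((c2, e2), 0.0)
                )
            total = sum(scores.values())
            if total == 0:
                continue
            # Collect fractional counts weighted by the normalized triple score.
            for (e1, e2), s in scores.items():
                head_counts[(c1, e1)] += s / total
                dep_counts[(c2, e2)] += s / total
        # M-step: re-estimate the word translation probabilities.
        p_head = normalize(head_counts)
        p_dep = normalize(dep_counts)
    return p_head, p_dep


# Toy data, invented for illustration only.
toy_dictionary = {
    "Ke Fu": ["overcome", "conquer"],
    "Kun Nan": ["difficulty"],
    "Zheng Fu": ["conquer"],
    "Cheng Shi": ["city"],
}
toy_lm = {
    ("overcome", "VO", "difficulty"): 0.6,
    ("conquer", "VO", "difficulty"): 0.1,
    ("conquer", "VO", "city"): 0.3,
}
chinese_db = [
    ("Ke Fu", "VO", "Kun Nan"),
    ("Ke Fu", "VO", "Kun Nan"),
    ("Zheng Fu", "VO", "Cheng Shi"),
]
p_head, p_dep = em(chinese_db, toy_dictionary, lambda t: toy_lm.get(t, 0.0))
```

In this toy run the English language model prefers "overcome difficulty", so the counts for "conquer" shift toward the Chinese head that appears with "city", which is the strengthening and weakening of connections that the text describes.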
      <Paragraph position="4"> Now, with the language model estimated using Equation (4) and the translation probabilities estimated using the EM algorithm, we can compute the best triple translation for a given Chinese triple using Equations (1) and (7).</Paragraph>
    </Section>
  </Section>
</Paper>