<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1005">
  <Title>Improving Statistical Word Alignment with a Rule-Based Machine Translation System</Title>
  <Section position="3" start_page="21" end_page="21" type="metho">
    <SectionTitle>
Subtraction: S_P = S_F ∪ S_R - S
</SectionTitle>
    <Paragraph position="0"> Thus, the subtraction set contains two different alignment links for each English word.</Paragraph>
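The construction of the intersection and subtraction sets described above can be sketched as follows. This is a minimal illustration, not the paper's code; the names `s_ec` and `s_ce` for the two directional alignment sets are assumptions.

```python
# Sketch: deriving the intersection and subtraction sets from two
# directional word alignments. Each alignment is a set of
# (english_index, chinese_index) links.

def intersection_and_subtraction(s_ec, s_ce):
    """s_ec: English-to-Chinese links; s_ce: Chinese-to-English links."""
    s = s_ec & s_ce           # reliable links found in both directions
    s_p = (s_ec | s_ce) - s   # links proposed by only one direction
    return s, s_p

# Toy example: the two models disagree on English word 1, so the
# subtraction set holds two different links for that word.
s_ec = {(0, 0), (1, 1)}
s_ce = {(0, 0), (1, 2)}
s, s_p = intersection_and_subtraction(s_ec, s_ce)
# s == {(0, 0)}; s_p == {(1, 1), (1, 2)}
```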
  </Section>
  <Section position="4" start_page="21" end_page="321" type="metho">
    <SectionTitle>
3 Rule-Based Translation System
</SectionTitle>
    <Paragraph position="0"> We use the translation information in a rule-based English-Chinese translation system to improve the statistical word alignment result. This translation system includes three modules: a source language parser, a source-to-target language transfer module, and a target language generator. From the transfer phase, we get Chinese translation candidates for each English word. This information can be considered as another word alignment result.</Paragraph>
    <Paragraph position="2"> Each entry of this result lists the translation candidates for the k-th English word or phrase. The difference between this translation set and the common alignment set is that each English word or phrase in the translation set has one or more translation candidates. A translation example for the English sentence &amp;quot;He is used to pipe smoking.&amp;quot; is shown in Table 1.</Paragraph>
    <Paragraph position="4"> From Table 1, it can be seen that (1) the translation system can recognize English phrases (e.g.</Paragraph>
    <Paragraph position="5"> is used to); (2) the system can provide one or more translations for each source word or phrase; (3) the translation system can perform word selection or word sense disambiguation. For example, the word &amp;quot;pipe&amp;quot; has several meanings such as &amp;quot;tube&amp;quot;, &amp;quot;tube used for smoking&amp;quot; and &amp;quot;wind instrument&amp;quot;. The system selects &amp;quot;tube used for smoking&amp;quot; and translates it into the Chinese words &amp;quot;Yan Dou&amp;quot; and &amp;quot;Yan Tong&amp;quot;. (This system is developed on the basis of the Toshiba English-Japanese translation system (Amano et al. 1989), and achieves above-average performance compared with the English-Chinese translation systems available in the market.) The recognized translation candidates will be used to improve statistical word alignment in the next section.</Paragraph>
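The transfer output in Table 1 can be pictured as a mapping from recognized English units to lists of Chinese candidates. This is a hypothetical sketch: the entries for "he" and "smoking" are illustrative (not taken from the paper), and pinyin stands in for Chinese characters.

```python
# Hypothetical representation of the transfer module's output for
# "He is used to pipe smoking.": each recognized English word or
# phrase maps to one or more Chinese translation candidates.
candidates = {
    ("he",): ["Ta"],                     # illustrative entry
    ("is", "used", "to"): ["Xi Guan"],   # recognized as one phrase
    ("pipe",): ["Yan Dou", "Yan Tong"],  # sense: "tube used for smoking"
    ("smoking",): ["Xi Yan"],            # illustrative entry
}

# Unlike an ordinary alignment, an entry may carry several candidates.
multi = [p for p, cs in candidates.items() if len(cs) > 1]
# multi == [("pipe",)]
```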
    <Paragraph position="6">  As described in Section 2, we have two alignment sets for each sentence pair, from which we obtain the intersection set S and the subtraction set. We will improve the word alignments in both sets with the translation candidates produced by the rule-based machine translation system. In the following sections, we will first describe how to calculate the monolingual word similarity used in our algorithm. Then we will describe the algorithm used to improve the word alignment results.</Paragraph>
    <Paragraph position="8"> This section describes the method for monolingual word similarity calculation. This method calculates word similarity by using a bilingual dictionary, and was first introduced by Wu and Zhou (2003). The basic assumptions of this method are that the translations of a word can express its meaning, and that two words are similar in meaning if they have mutual translations. Given a Chinese word, we get its translations with a Chinese-English bilingual dictionary. The translations of a word are used to construct its feature vector. The similarity of two words is estimated from their feature vectors with the cosine measure, as shown in (Wu and Zhou 2003).</Paragraph>
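A minimal sketch of this dictionary-based similarity, assuming binary translation-feature vectors and a toy stand-in dictionary (not a real resource):

```python
from math import sqrt

# Sketch of the dictionary-based similarity of Wu and Zhou (2003) as
# described above: a Chinese word's feature vector is the set of its
# English translations; two words are compared with the cosine measure.

def cosine_sim(word1, word2, dictionary):
    f1 = set(dictionary.get(word1, []))
    f2 = set(dictionary.get(word2, []))
    if not f1 or not f2:
        return 0.0
    # With binary feature vectors, cosine = |overlap| / sqrt(|f1| * |f2|)
    return len(f1 & f2) / sqrt(len(f1) * len(f2))

toy_dict = {
    "Yan Dou": ["pipe"],
    "Yan Tong": ["pipe", "tube"],
}
sim = cosine_sim("Yan Dou", "Yan Tong", toy_dict)   # 1 / sqrt(2)
```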
    <Paragraph position="9"> Given a Chinese word or phrase w and a Chinese word set Z, the word similarity between them is calculated as shown in Equation (1).</Paragraph>
    <Paragraph position="11"/>
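Equation (1) itself did not survive extraction. A plausible form consistent with the surrounding description (an assumption, not the paper's verbatim formula) takes the similarity between w and the set Z as the maximum pairwise similarity over the members of Z:

```latex
% Plausible reconstruction of the lost Equation (1): similarity between
% a word or phrase w and a word set Z as the best pairwise match.
\mathit{sim}(w, Z) = \max_{z \in Z} \mathit{sim}(w, z) \quad (1)
```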
    <Section position="1" start_page="21" end_page="321" type="sub_section">
      <SectionTitle>
4.2 Alignment Improvement Algorithm
</SectionTitle>
      <Paragraph position="0"> As the word alignment links in the intersection set are more reliable than those in the subtraction set, we adopt different strategies for the two sets. For alignments in the intersection set S, we modify them when they are inconsistent with the translation information produced by the rule-based system. For alignments in the subtraction set, we classify them into two cases and either select between the two different alignment links or modify them into a new link.</Paragraph>
      <Paragraph position="2"> In the intersection set S, there are only word-to-word alignment links, which include no multi-word units. The main alignment error type in this set is that some words should be combined into one phrase and aligned to the same word(s) in the target sentence. For example, for the sentence pair in Figure 1, &amp;quot;used&amp;quot; is aligned to the Chinese word &amp;quot;Xi Guan&amp;quot;, and &amp;quot;is&amp;quot; and &amp;quot;to&amp;quot; have null links in S. But in the translation set, &amp;quot;is used to&amp;quot; is a phrase. Thus, we combine the three alignment links into a new link: the words &amp;quot;is&amp;quot;, &amp;quot;used&amp;quot; and &amp;quot;to&amp;quot; are all aligned to the Chinese word &amp;quot;Xi Guan&amp;quot;, denoted as (is used to, Xi Guan). Figure 2 describes the algorithm employed to improve the word alignment in the intersection set S.</Paragraph>
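The merging step just described can be sketched as follows. This is hypothetical code, not the paper's algorithm from Figure 2; the extra link for "he" is illustrative.

```python
# Sketch: if several English words are covered by one phrase in the
# translation set and their existing links (including null links) are
# consistent with a single Chinese target, merge them into one
# phrase-to-word link.

def merge_phrase_links(links, phrase):
    """links: dict english_word -> chinese_word or None (null link);
    phrase: tuple of English words recognized by the translator."""
    targets = {links.get(w) for w in phrase} - {None}
    if len(targets) <= 1:                  # consistent with one target
        target = targets.pop() if targets else None
        merged = {w: c for w, c in links.items() if w not in phrase}
        merged[phrase] = target
        return merged
    return links                           # conflicting targets: keep as is

links = {"is": None, "used": "Xi Guan", "to": None, "he": "Ta"}
new = merge_phrase_links(links, ("is", "used", "to"))
# new[("is", "used", "to")] == "Xi Guan"; "is"/"used"/"to" are removed
```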
      <Paragraph position="3">  In the subtraction set, there are two different links for each English word. Thus, we need to select one of the links or modify them according to the translation information in the translation set.</Paragraph>
      <Paragraph position="4"> For each English word i in the subtraction set, there are two cases. We define an operation &amp;quot;combine&amp;quot; on a set consisting of position numbers of words: we first sort the position numbers in the set in ascending order and then regard them as a phrase. For example, given the set {{2,3}, 1, 4}, the result of applying the combine operation is (1, 2, 3, 4).</Paragraph>
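The combine operation can be implemented directly. A sketch (Python sets cannot contain plain sets, so a list models the paper's example {{2,3}, 1, 4}):

```python
# The "combine" operation defined above: flatten a collection of
# position numbers (possibly containing nested groups), sort in
# ascending order, and treat the result as one phrase.

def combine(positions):
    flat = []
    for p in positions:
        if isinstance(p, (set, frozenset, list, tuple)):
            flat.extend(p)
        else:
            flat.append(p)
    return tuple(sorted(flat))

result = combine([{2, 3}, 1, 4])   # the paper's example
# result == (1, 2, 3, 4)
```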
      <Paragraph position="5"> Case 1: In one of the directional alignment sets, there is a word-to-word alignment link for i. In the other, there is a word-to-word or word-to-multi-word alignment link for i.</Paragraph>
      <Paragraph position="7"> For Case 1, we first examine the translation set. If it contains an entry for the English word, we calculate the Chinese word similarity between the Chinese word of the first link and the translation candidates with Equation (1) shown in Section 4.1. We also combine the Chinese words of the second link into a phrase and get the word similarity between this new phrase and the translation candidates. The alignment link with the higher similarity score is selected and added to the final alignment set WA.</Paragraph>
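The Case 1 decision can be sketched as below. This is a hypothetical helper, not the paper's code: the competing translation "Guan Zi" and the exact-match toy similarity are illustrative stand-ins for the dictionary-based measure of Section 4.1.

```python
# Sketch of Case 1: two competing links for an English word, one from
# each alignment direction; keep the one whose Chinese side is more
# similar to the rule-based system's translation candidates.

def choose_link(link1, link2, candidates, similarity):
    """link1/link2: (english_word, chinese_side) pairs;
    candidates: Chinese translation candidates for the English word;
    similarity: monolingual Chinese similarity function."""
    score1 = max(similarity(link1[1], c) for c in candidates)
    score2 = max(similarity(link2[1], c) for c in candidates)
    return link1 if score1 >= score2 else link2

# Toy similarity: exact match scores 1, anything else 0.
sim = lambda a, b: 1.0 if a == b else 0.0
best = choose_link(("pipe", "Yan Dou"), ("pipe", "Guan Zi"),
                   ["Yan Dou", "Yan Tong"], sim)
# best == ("pipe", "Yan Dou")
```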
      <Paragraph position="8">  If a phrase consists of three words w1 w2 w3, the sub-sequences of this phrase are w1, w2, w3, w1 w2 and w2 w3.</Paragraph>
      <Paragraph position="10"> For example, given the sentence pair in Figure 4, in one directional alignment the word &amp;quot;whipped&amp;quot; is aligned to &amp;quot;Tu Ran&amp;quot; and &amp;quot;out&amp;quot; is aligned to &amp;quot;Chou Chu&amp;quot;. In the other, the word &amp;quot;whipped&amp;quot; is aligned to both &amp;quot;Tu Ran&amp;quot; and &amp;quot;Chou Chu&amp;quot;, and &amp;quot;out&amp;quot; has a null link. In the translation set, &amp;quot;whipped out&amp;quot; is a phrase and translated into &amp;quot;Xun Su Chou Chu&amp;quot;. And the word similarity between &amp;quot;Tu Ran Chou Chu&amp;quot; and &amp;quot;Xun Su Chou</Paragraph>
    </Section>
    <Section position="2" start_page="321" end_page="321" type="sub_section">
      <SectionTitle>
Example
</SectionTitle>
      <Paragraph position="0"> For Case 2, we first examine the translation set to see whether it contains an entry for the English word. If so, we combine the Chinese words aligned to it into a word or phrase and calculate the similarity between this new word or phrase and the translation candidates in the same way as in Case 1, selecting the link with the higher similarity. If instead the translation set contains an entry whose English phrase includes the word i as a constituent, we combine the English words aligned to the same Chinese unit into a phrase. If it is the same as the phrase in the translation set,</Paragraph>
      <Paragraph position="2"> we add the corresponding alignment link into WA. Otherwise, we use the multi-word to multi-word alignment algorithm in Figure 3 to modify the links.</Paragraph>
      <Paragraph position="4"> After applying the above two strategies, there are still some words not aligned. For each sentence pair, we use E and C to denote the sets of the source words and the target words that are not aligned, respectively. For each source word in E, we construct a link with each target word in C.</Paragraph>
      <Paragraph position="5"> We use L = {(i, j) | i ∈ E, j ∈ C} to denote the alignment candidates. For each candidate in L, we look it up in the translation set; if the candidate is consistent with an element of the translation set, it is selected as an alignment link.</Paragraph>
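Constructing the candidate set L for the remaining unaligned words is a simple cross-product, sketched here with illustrative words:

```python
from itertools import product

# The candidate set L described above: one link for every pair of an
# unaligned source word (from E) and an unaligned target word (from C).

def candidate_links(E, C):
    return set(product(E, C))

L = candidate_links({"pipe"}, {"Yan Dou", "Yan Tong"})
# Each candidate is then looked up in the translation set; a pair that
# is consistent with it is accepted as an alignment link.
```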
    </Section>
    <Section position="3" start_page="321" end_page="321" type="sub_section">
      <SectionTitle>
5.2 Training and Testing Set
</SectionTitle>
      <Paragraph position="0"> We conducted experiments on a sentence-aligned English-Chinese bilingual corpus in general domains.</Paragraph>
      <Paragraph position="1"> There are about 320,000 bilingual sentence pairs in the corpus, from which we randomly select 1,000 sentence pairs as testing data. The remainder is used as training data.</Paragraph>
      <Paragraph position="2"> The Chinese sentences in both the training set and the testing set are automatically segmented into words. The segmentation errors in the testing set are post-corrected. The testing set is manually annotated. It contains 8,651 alignment links in total, including 2,149 null links. Among them, 866 alignment links include multi-word units, which accounts for about 10% of the total links.</Paragraph>
    </Section>
    <Section position="4" start_page="321" end_page="321" type="sub_section">
      <SectionTitle>
Experimental Results
</SectionTitle>
      <Paragraph position="0"> There are several different evaluation methods for word alignment (Ahrenberg et al. 2000). In our evaluation, we use evaluation metrics similar to those in Och and Ney (2000). However, we do not classify alignment links into sure links and possible links. We consider each alignment as a sure link.</Paragraph>
      <Paragraph position="1"> Using the set of alignments identified by the proposed methods and the set of reference alignments, the precision, recall and f-measure are calculated as described in Equations (2), (3) and (4). According to the definition of the alignment error rate (AER) in Och and Ney (2000), AER can be calculated with Equation (5).</Paragraph>
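With every reference link treated as a sure link, the metrics reduce to the following sketch (consistent with Och and Ney (2000); in this setting AER equals one minus the f-measure; the toy link sets are illustrative):

```python
# Precision, recall, f-measure and AER for word alignment, with the
# reference containing only sure links.

def alignment_metrics(found, reference):
    correct = len(found & reference)
    precision = correct / len(found)
    recall = correct / len(reference)
    f_measure = 2 * precision * recall / (precision + recall)
    aer = 1.0 - f_measure      # holds when sure links = possible links
    return precision, recall, f_measure, aer

found = {(0, 0), (1, 1), (2, 2), (3, 3)}
reference = {(0, 0), (1, 1), (2, 2), (4, 4), (5, 5)}
p, r, f, aer = alignment_metrics(found, reference)
# p == 0.75, r == 0.6, f ~= 0.667, aer ~= 0.333
```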
      <Paragraph position="2">  In this paper, we give two different alignment results in Table 2 and Table 3. Table 2 presents alignment results that include null links, and Table 3 presents alignment results that exclude null links. The precision and recall in the tables are those obtained at the smallest AER for each method.</Paragraph>
      <Paragraph position="3">  In the above tables, the row &amp;quot;Ours&amp;quot; presents the result of our approach, obtained by setting the word similarity thresholds to fixed values. The row &amp;quot;Dic&amp;quot; presents the result of the approach that uses a bilingual dictionary instead of the rule-based machine translation system to improve statistical word alignment. The dictionary used in this method is the same translation dictionary used in the rule-based machine translation system. It includes 57,684 English words, and each English word has about two Chinese translations on average. The rows &amp;quot;IBM E-C&amp;quot; and &amp;quot;IBM C-E&amp;quot; show the results obtained by IBM Model-4 when treating English as the source and Chinese as the target, or vice versa. The row &amp;quot;IBM Inter&amp;quot; shows the results obtained by taking the intersection of the alignments produced by &amp;quot;IBM E-C&amp;quot; and &amp;quot;IBM C-E&amp;quot;. The row &amp;quot;IBM Refined&amp;quot; shows the results obtained by refining the results of &amp;quot;IBM Inter&amp;quot; as described in Och and Ney (2000).</Paragraph>
      <Paragraph position="4"> Generally, the results excluding null links are better than those including null links. This indicates that it is difficult to judge whether a word has counterparts in the other language, because the translations of some source words can be omitted. Neither the rule-based translation system nor the bilingual dictionary provides such information.</Paragraph>
      <Paragraph position="5"> It can also be seen that our approach performs best in both cases. Our approach achieves a relative error rate reduction of 26% and 25% when compared with &amp;quot;IBM E-C&amp;quot; and &amp;quot;IBM C-E&amp;quot; respectively. Although the precision of our method is lower than that of the &amp;quot;IBM Inter&amp;quot; method, it achieves much higher recall, resulting in a 30% relative error rate reduction. Compared with the &amp;quot;IBM Refined&amp;quot; method, our method also achieves a relative error rate reduction of 30%. In addition, our method is better than the &amp;quot;Dic&amp;quot; method, achieving a relative error rate reduction of 8.8%.</Paragraph>
      <Paragraph position="6"> In order to provide detailed word alignment information, we classify the word alignment results in Table 3 into two classes. The first class includes the alignment links that have no multi-word units. The second class includes at least one multi-word unit in each alignment link. The detailed information is shown in Table 4 and Table 5. In Table 5, we do not include the method &amp;quot;IBM Inter&amp;quot; because it produces no multi-word alignment links. All of the methods perform better on single-word alignment than on multi-word alignment. In Table 4, the precision of our method is close to that of the &amp;quot;IBM Inter&amp;quot; approach, and the recall of our method is much higher, achieving a 47% relative error rate reduction. Our method also achieves a 37% relative error rate reduction over the &amp;quot;IBM Refined&amp;quot; method. Compared with the &amp;quot;Dic&amp;quot; method, our approach achieves much higher precision without loss of recall, resulting in a 12% relative error rate reduction.</Paragraph>
    </Section>
  </Section>
</Paper>