<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3002">
  <Title>Improving Domain-Specific Word Alignment for Computer Assisted Translation</Title>
  <Section position="3" start_page="21" end_page="21" type="metho">
    <SectionTitle>
SGSGPG [?]=
</SectionTitle>
    <Paragraph position="0"> Subtraction: SGMG [?]= PG For the specific domain, we use and to represent the word alignment sets in the two directions. The symbols ,</Paragraph>
  </Section>
  <Section position="4" start_page="21" end_page="21" type="metho">
    <SectionTitle>
SF PF and MF
</SectionTitle>
    <Paragraph position="0"> represents the intersection set, union set and the subtraction set, respectively.</Paragraph>
  </Section>
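To make the set operations above concrete, here is a minimal sketch (ours, not the authors' code), assuming alignment links are represented as (source position, target position) index pairs and using the paper's notation for the general domain:

    # Minimal sketch of the set definitions above. Alignment links are
    # assumed to be (source_position, target_position) index pairs; SG1 and
    # SG2 stand for the two directional alignments of the general domain.
    SG1 = {(0, 0), (1, 2), (2, 1)}   # e.g. English-to-Chinese links
    SG2 = {(0, 0), (1, 2), (3, 3)}   # e.g. Chinese-to-English links

    SG = SG1 & SG2   # intersection: links found in both directions
    PG = SG1 | SG2   # union: links found in either direction
    MG = PG - SG     # subtraction: the one-direction-only links

    print(sorted(SG))  # [(0, 0), (1, 2)]
    print(sorted(MG))  # [(2, 1), (3, 3)]

The same operations over SF1 and SF2 yield SF, PF and MF for the specific domain.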
  <Section position="5" start_page="21" end_page="21" type="metho">
    <SectionTitle>
Translation Dictionary Acquisition
</SectionTitle>
    <Paragraph position="0"> When we train the statistical word alignment model with a large-scale bilingual corpus in the general domain, we can get two word alignment results for the training data. By taking the intersection of the two word alignment results, we build a new alignment set. The alignment links in this intersection set are extended by iteratively adding word alignment links into it as described in (Och and Ney, 2000).</Paragraph>
    <Paragraph position="1">  In this paper, the union operation does not remove the replicated elements. For example, if set one includes two elements {1, 2} and set two includes two elements {1, 3}, then the union of these two sets becomes {1, 1, 2, 3}. Based on the extended alignment links, we build an English to Chinese translation dictionary with translation probabilities. In order to filter some noise caused by the error alignment links, we only retain those translation pairs whose translation probabilities are above a threshold  When we train the IBM statistical word alignment model with a limited bilingual corpus in the specific domain, we build another translation dictionary with the same method as for the dictionary . But we adopt a different filtering strategy for the translation dictionary . We use log-likelihood ratio to estimate the association strength of each translation pair because Dunning (1993) proved that log-likelihood ratio performed very well on small-scale data. Thus, we get the translation dictionary by keeping those entries whose log-likelihood ratio scores are greater than a</Paragraph>
    <Section position="1" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
2.3 Word Alignment Adaptation Algorithm
</SectionTitle>
      <Paragraph position="0"> Based on the bi-directional word alignment, we define as SI SFSGSI [?]= and as UG SIPFPGUG [?][?]= . The word alignment links in the set SI are very reliable. Thus, we directly accept them as correct links and add them into the final alignment set . WA Input: Alignment set and SI UG  same target words for the English word i, we add this link into . WA d) Otherwise, if there are two different links for this word: one target is a single word, and the other target is a multi-word unit and the words in the multi-word unit have no link in , add this multi-word alignment link to  For each source word in the set , there are two to four different alignment links. We first use translation dictionaries to select one link among them. We first examine the dictionary and then to see whether there is at least an alignment link of this word included in these two dictionaries. If it is successful, we add the link with the largest probability or the largest log-likelihood ratio score to the final set . Otherwise, we use two heuristic rules to select word alignment links. The detailed algorithm is described in Figure 1.</Paragraph>
      <Paragraph position="1">  example, for the English word &amp;quot;x-ray&amp;quot;, we have two different links in UG . One is (x-ray, X) and the other is (x-ray, XShe Xian ). And the single Chinese words &amp;quot;She &amp;quot; and &amp;quot;Xian &amp;quot; have no alignment links in the set . According to the rule d), we select the link (x-ray, XShe Xian ).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="21" end_page="21" type="metho">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"> We compare our method with three other methods.</Paragraph>
    <Paragraph position="1"> The first method &amp;quot;Gen+Spec&amp;quot; directly combines the corpus in the general domain and in the specific domain as training data. The second method &amp;quot;Gen&amp;quot; only uses the corpus in the general domain as training data. The third method &amp;quot;Spec&amp;quot; only uses the domain-specific corpus as training data. With these training data, the three methods can get their own translation dictionaries. However, each of them can only get one translation dictionary. Thus, only one of the two steps a) and b) in Figure 1 can be applied to these methods. The difference between these three methods and our method is that, for each word, our method has four candidate alignment links while the other three methods only has two candidate alignment links. Thus, the steps c) and d) in Figure 1 should not be applied to these three methods.</Paragraph>
    <Section position="1" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
Training and Testing Data
</SectionTitle>
      <Paragraph position="0"> We have a sentence aligned English-Chinese bilingual corpus in the general domain, which includes 320,000 bilingual sentence pairs, and a sentence aligned English-Chinese bilingual corpus in the specific domain (a medical system manual), which includes 546 bilingual sentence pairs. From this domain-specific corpus, we randomly select 180 pairs as testing data. The remained 366 pairs are used as domain-specific training data.</Paragraph>
      <Paragraph position="1"> The Chinese sentences in both the training set and the testing set are automatically segmented into words. In order to exclude the effect of the segmentation errors on our alignment results, we correct the segmentation errors in our testing set.</Paragraph>
      <Paragraph position="2"> The alignments in the testing set are manually annotated, which includes 1,478 alignment links.</Paragraph>
      <Paragraph position="3">  We use evaluation metrics similar to those in (Och and Ney, 2000). However, we do not classify alignment links into sure links and possible links. We consider each alignment as a sure link. If we use to represent the alignments identified by the proposed methods and to denote the reference alignments, the methods to calculate the precision, recall, and f-measure are shown in Equation (1), (2) and (3). According to the definition of the alignment error rate (AER) in (Och and Ney, 2000), AER can be calculated with Equation (4). Thus, the higher the f-measure is, the lower the alignment error rate is. Thus, we will only give precision, recall and AER values in the experimental results.</Paragraph>
      <Paragraph position="4">  We get the alignment results shown in Table 1 by setting the translation probability threshold to  From the results, it can be seen that our approach performs the best among others, achieving much higher recall and comparable precision. It also achieves a 21.96% relative error rate reduction compared to the method &amp;quot;Gen+Spec&amp;quot;. This indicates that separately modeling the general words and domain-specific words can effectively improve the word alignment in a specific domain.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="21" end_page="21" type="metho">
    <SectionTitle>
4 Computer Assisted Translation System
</SectionTitle>
    <Paragraph position="0"> A direct application of the word alignment result to the GTMS is to get translations for sub-sequences in the input sentence using the pre-translated examples.</Paragraph>
    <Paragraph position="1"> For each sentence, there are many sub-sequences.</Paragraph>
    <Paragraph position="2"> GTMS tries to find translation examples that match the longest sub-sequences so as to cover as much of the input sentence as possible without overlapping.</Paragraph>
    <Paragraph position="3"> Figure 3 shows a sentence translated on the sub-sentential level. The three panels display the input sentence, the example translations and the translation suggestion provided by the system, respectively. The input sentence is segmented to three parts. For each part, the GTMS finds one example to get a translation fragment according to the word alignment result. By combining the three translation fragments, the GTMS produces a correct translation suggestion &amp;quot;Xi Tong Bei Ren Wei You CTSao Miao Ji . &amp;quot; Without the word alignment information, the conventional TMS cannot find translations for the input sentence because there are no examples closely matched with it. Thus, word alignment information can improve the translation accuracy of the GTMS, which in turn reduces editing time of the translators and improves translation efficiency.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML