<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1118"> <Title>Do We Need Chinese Word Segmentation for Statistical Machine Translation?</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Review of the Baseline System for Statistical Machine Translation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Principle </SectionTitle> <Paragraph position="0"> In statistical machine translation, we are given a source language ('French') sentence fJ1 = f1 :::fj :::fJ, which is to be translated into a target language ('English') sentence eI1 = e1 :::ei :::eI: Among all possible target language sentences, we will choose the sentence with the highest probability:</Paragraph> <Paragraph position="2"> The decomposition into two knowledge sources in Equation 2 is known as the source-channel approach to statistical machine translation (Brown et al., 1990). It allows an independent modeling of target language model Pr(eI1) and translation model Pr(fJ1 jeI1)1. The target language model describes the well-formedness of the target language sentence. The translation model links the source language sentence to the target language sentence. The argmax operation denotes the search problem, i.e. the generation of the output sentence in the target language. We have to maximize over all possible target language sentences.</Paragraph> <Paragraph position="3"> The resulting architecture for the statistical machine translation approach is shown in Figure 1 with the translation model further decomposed into lexicon and alignment model.</Paragraph> <Paragraph position="4"> proach based on Bayes decision rule.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Alignment Models </SectionTitle> <Paragraph position="0"> The alignment model Pr(fJ1 ;aJ1jeI1) introduces a 'hidden' alignment a = aJ1, which describes 1The notational convention will be as follows: we use the symbol Pr(C/) to denote general probability distributions with (nearly) no specific assumptions. In contrast, for model-based probability distributions, we use the generic symbol p(C/).</Paragraph> <Paragraph position="1"> a mapping from a source position j to a target position aj. The relationship between the translation model and the alignment model is given by:</Paragraph> <Paragraph position="3"> In this paper, we use the models IBM-1, IBM4 from (Brown et al., 1993) and the HiddenMarkovalignmentmodel(HMM)from(Vogelet null al., 1996). All these models provide different decompositions of the probability Pr(fJ1 ;aJ1jeI1).</Paragraph> <Paragraph position="4"> A detailed description of these models can be found in (Och and Ney, 2003).</Paragraph> <Paragraph position="5"> A Viterbi alignment ^aJ1 of a specific model is an alignment for which the following equation holds:</Paragraph> <Paragraph position="7"> The alignment models are trained on a bilingual corpus using GIZA++(Och et al., 1999; Och and Ney, 2003). The training is done iteratively in succession on the same data, where the final parameter estimates of a simpler model serve as starting point for a more complex model. 
2.3 Alignment Template Approach

In the translation approach from Section 2.1, one disadvantage is that the contextual information is only taken into account by the language model. The single-word based lexicon model does not consider the surrounding words. One way to incorporate the context into the translation model is to learn translations for whole word groups instead of single words. The key elements of this translation approach (Och et al., 1999) are the alignment templates. These are pairs of source and target language phrases with an alignment within the phrases.

The alignment templates are extracted from the bilingual training corpus. The extraction algorithm (Och et al., 1999) uses the word alignment information obtained from the models in Section 2.2; a sketch of the underlying consistency criterion is given at the end of this section. Figure 2 shows an example of a word-aligned sentence pair. The word alignment is represented by the black boxes. The figure also includes some of the possible alignment templates, represented as the larger, unfilled rectangles. Note that the extraction algorithm would extract many more alignment templates from this sentence pair. In this example, the system input was the sequence of Chinese characters without any word segmentation. As can be seen, a translation approach that is based on phrases circumvents the problem of word segmentation to a certain degree. This method will be referred to as "translation with no segmentation" (see Section 5.2).

[Figure 2: Example of a word-aligned sentence pair and some possible alignment templates.]

In the Chinese-English DARPA TIDES evaluations in June 2002 and May 2003, carried out by NIST (NIST, 2003), the alignment template approach performed very well and was ranked among the best translation systems. Further details on the alignment template approach are described in (Och et al., 1999; Och and Ney, 2002).
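The criterion behind the extraction can be stated compactly: a source span and a target span form a template only if no alignment link connects a position inside one span to a position outside the other. The sketch below implements this standard consistency check; it is a simplification of the Och et al. (1999) algorithm, and the data layout is an assumption.

```python
def extract_templates(alignment, J, I, max_len=5):
    """Extract phrase pairs consistent with a word alignment.

    `alignment` is a set of (j, i) links between source position j and
    target position i; J and I are the source and target sentence lengths.
    """
    templates = []
    for j1 in range(J):
        for j2 in range(j1, min(j1 + max_len, J)):
            # Target positions linked to the source span [j1, j2].
            linked = [i for (j, i) in alignment if j1 <= j <= j2]
            if not linked:
                continue
            i1, i2 = min(linked), max(linked)
            # Consistency: no link may leave the rectangle [j1,j2] x [i1,i2].
            if all(j1 <= j <= j2 for (j, i) in alignment if i1 <= i <= i2):
                templates.append(((j1, j2), (i1, i2)))
    return templates
```

Extending such spans by unaligned boundary words yields additional phrase pairs, which is one reason the full algorithm extracts many more templates than the figure shows.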
3 Task and Corpus Statistics

In Section 5.3, we will present results for a Chinese-English translation task. The domain of this task is news articles. As bilingual training data, we use a corpus composed of the English translations of a Chinese Treebank. This corpus is provided by the Linguistic Data Consortium (LDC), catalog number LDC2002E17. In addition, we use a bilingual dictionary with 10K Chinese word entries provided by Stephan Vogel (LDC, 2003b).

Table 1 shows the corpus statistics of this task. We have calculated both the number of words and the number of characters in the corpus. On average, a Chinese word is composed of 1.49 characters. For each of the two languages, there is a set of 20 special characters, such as digits, punctuation marks and symbols like "()%$...".

The training corpus will be used to train a word alignment and then to extract the alignment templates and the word-based lexicon. The resulting translation system will be evaluated on the test corpus.

4.1 Conventional Method

The commonly used segmentation method is based on a segmentation tool and a monolingual Chinese dictionary. Typically, this dictionary has been produced beforehand and is independent of the Chinese text to be segmented. The dictionary contains Chinese words and their frequencies. This information is used by the segmentation tool to find the word boundaries. In the LDC method (see Section 5.2), we have used the dictionary and segmenter provided by the LDC. More details can be found on the LDC web pages (LDC, 2003a). This segmenter is based on two ideas: it prefers long words over short words, and it prefers high-frequency words over low-frequency words.
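Both preferences can be realized with a simple best-path search over the dictionary: segmentations with fewer (hence longer) words win, and ties are broken in favor of more frequent words. This is a hypothetical sketch of such a segmenter, not the actual LDC tool.

```python
import math

def segment(text, dictionary, max_word_len=7):
    """Segment `text` using `dictionary`, a mapping from Chinese words
    to corpus frequencies. Prefers (1) fewer, hence longer, words and
    (2) higher word frequencies as a tie-break; unknown single
    characters are allowed as a fallback."""
    # best[i] = (num_words, cost, segmentation) for the prefix text[:i]
    best = [(0, 0.0, [])] + [None] * len(text)
    for i in range(len(text)):
        if best[i] is None:
            continue
        n_words, cost, seg = best[i]
        for n in range(1, min(max_word_len, len(text) - i) + 1):
            word = text[i:i + n]
            if n > 1 and word not in dictionary:
                continue  # multi-character candidates must be dictionary words
            freq = dictionary.get(word, 1)
            cand = (n_words + 1, cost - math.log(freq), seg + [word])
            if best[i + n] is None or cand[:2] < best[i + n][:2]:
                best[i + n] = cand
    return best[len(text)][2]
```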
4.2 Dictionary Learning from Alignments

In this section, we describe our method of learning a dictionary from a bilingual corpus. As mentioned before, the bilingual training corpus listed in Section 3 is the only input to the system. We first separate every Chinese character in the corpus by white space and then train the statistical translation models with this unsegmented Chinese text and its English translation; details of the training method are described in Section 2.2.

To extract Chinese words instead of phrases as in Figure 2, we configure the training parameters in GIZA++ so that the alignment is restricted to a multi-source-single-target relationship, i.e. one or more Chinese characters are translated to one English word.

The result of this training procedure is an alignment for each sentence pair. Such an alignment is represented as a binary matrix with J × I elements. An example is shown in Figure 3. The unsegmented Chinese training sentence is plotted along the horizontal axis and the corresponding English sentence along the vertical axis. The black boxes show the Viterbi alignment for this sentence pair. Here, for example, the first two Chinese characters are aligned to "industry" and the next four characters are aligned to "restructuring".

The central idea of our dictionary learning method is: a contiguous sequence of Chinese characters constitutes a Chinese word if the characters are aligned to the same English word. Using this idea and the bilingual corpus, we can automatically generate a Chinese word dictionary. Table 2 shows the Chinese words that are extracted from the alignment in Figure 3.

[Table 2: Chinese words learned from the alignment in Figure 3.]

We extract Chinese words from all sentence pairs in the training corpus. Therefore, it is straightforward to collect the word frequency statistics that are needed for the segmentation tool. Once we have generated the dictionary, we can produce a segmented Chinese corpus using the method described in Section 4.1. Then, we retrain the translation system using the segmented Chinese text.
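Because each Chinese character is linked to at most one English word under the multi-source-single-target restriction, the extraction reduces to collecting maximal runs of characters that share the same link. A minimal sketch, assuming the GIZA++ output has already been parsed into the representation described in the docstring:

```python
from collections import Counter

def extract_words(alignments):
    """Build a Chinese word dictionary from character-level alignments.

    `alignments` is a list of (chars, links) pairs per sentence, where
    `chars` is the unsegmented character sequence and `links[j]` is the
    index of the English word aligned to character j (or None).
    """
    vocab = Counter()
    for chars, links in alignments:
        j = 0
        while j < len(chars):
            if links[j] is None:  # unaligned character: skip
                j += 1
                continue
            k = j  # extend the run while the English link stays the same
            while k + 1 < len(chars) and links[k + 1] == links[j]:
                k += 1
            vocab[''.join(chars[j:k + 1])] += 1
            j = k + 1
    return vocab
```

The resulting frequency counts are exactly the statistics that the segmentation tool of Section 4.1 requires.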
4.3 Word Length Statistics

In this section, we present statistics of the word lengths in the LDC dictionary as well as in the self-learned dictionary extracted from the alignment.

Table 3 shows the statistics of the word lengths in the LDC dictionary as well as in the learned dictionary. For example, there are 2368 words consisting of a single character in the learned dictionary and 2511 such words in the LDC dictionary. These single-character words represent 16.9% of the total number of entries in the learned dictionary and 18.6% in the LDC dictionary.

We see that in the LDC dictionary more than 65% of the words consist of two characters, and about 30% of the words consist of a single character or of three or four characters. Longer words with more than four characters constitute less than 1% of the dictionary. In the learned dictionary, there are many more long words, about 15%. A subjective analysis showed that many of these entries are either named entities or idiomatic expressions. Often, these idiomatic expressions should be segmented into shorter words. Therefore, we will investigate methods to overcome this problem in the future. Some suggestions will be discussed in Section 6.

5.1 Evaluation Criteria

So far, in machine translation research, a single generally accepted criterion for the evaluation of experimental results does not exist. We have used three automatic criteria. For the test corpus, we have four references available. Hence, we compute all the following criteria with respect to multiple references; a compact sketch of the two error rates follows the list.

+ WER (word error rate): The WER is computed as the minimum number of substitution, insertion and deletion operations that have to be performed to convert the generated sentence into the reference sentence.

+ PER (position-independent word error rate): A shortcoming of the WER is that it requires a perfect word order. The word order of an acceptable sentence can be different from that of the target sentence, so that the WER measure alone could be misleading. The PER compares the words in the two sentences while ignoring the word order.

+ BLEU score: This score measures the precision of unigrams, bigrams, trigrams and four-grams with respect to a reference translation, with a penalty for too short sentences (Papineni et al., 2001). The BLEU score measures accuracy, i.e. larger BLEU scores are better.
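The two error rates can be stated compactly: WER is a Levenshtein distance over words, and PER drops the order constraint by comparing bags of words. The sketch below normalizes by the reference length and, for multiple references, scores against the closest one; the PER variant shown is one common formulation, and BLEU is omitted here (see Papineni et al., 2001).

```python
from collections import Counter

def wer(hyp, ref):
    """Word error rate: word-level edit distance divided by |ref|."""
    d = list(range(len(ref) + 1))  # d[i] = distance(hyp[:j], ref[:i])
    for j in range(1, len(hyp) + 1):
        prev_diag, d[0] = d[0], j
        for i in range(1, len(ref) + 1):
            cur = min(d[i] + 1,      # delete a hypothesis word
                      d[i - 1] + 1,  # insert a missing reference word
                      prev_diag + (hyp[j - 1] != ref[i - 1]))  # substitute
            prev_diag, d[i] = d[i], cur
    return d[len(ref)] / len(ref)

def per(hyp, ref):
    """Position-independent error rate: mismatches between bags of words."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - matches) / len(ref)

def multi_ref(metric, hyp, refs):
    """With multiple references, score against the closest reference."""
    return min(metric(hyp, r) for r in refs)
```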
5.2 Summary: Three Translation Methods

In the experiments, we compare the following three translation methods:

+ Translation with no segmentation: Each Chinese character is interpreted as a single word.

+ Translation with learned segmentation: It uses the self-learned dictionary.

+ Translation with LDC segmentation: The predefined LDC dictionary is used.

The core contribution of this paper is the method we call "translation with learned segmentation", which consists of three steps:

+ The input is a sequence of Chinese characters without segmentation. After the training using GIZA++, we extract a monolingual Chinese dictionary from the alignment. This is discussed in Section 4.2, and an example is given in Figure 3 and Table 2.

+ Using this learned dictionary, we segment the sequence of Chinese characters into words. In other words, the LDC method is used, but the LDC dictionary is replaced by the learned dictionary (see Section 4.1).

+ Based on this word segmentation, we perform another training using GIZA++. Then, after training the models IBM-1, HMM and IBM-4, we extract bilingual word groups, which are referred to as alignment templates.

5.3 Evaluation Results

The evaluation is performed on the LDC corpus described in Section 3. The translation performance of the three systems is summarized in Table 4 for the three evaluation criteria WER, PER and BLEU. We observe that the translation quality with the learned segmentation is similar to that with the LDC segmentation. The WER of the system with the learned segmentation is somewhat better, but PER and BLEU are slightly worse. We conclude that it is possible to learn a domain-specific dictionary for Chinese word segmentation from a bilingual corpus. Therefore, the translation system is independent of a predefined dictionary, which may be unsuitable for a certain task.

The translation system using no segmentation performs slightly worse. For example, for the WER there is a loss of about 2% relative compared to the system with the LDC segmentation.

Table 4: Translation performance of the three segmentation methods (all numbers in percent). WER and PER are error rates (lower is better); BLEU measures accuracy (higher is better).

method                  WER    PER    BLEU
no segmentation         73.3   56.5   27.6
learned segmentation    70.4   54.6   29.1
LDC segmentation        71.9   54.4   29.2

5.4 Effect of Segmentation on Translation Results

In this section, we present three examples of the effect that segmentation may have on translation quality. For each of the three examples in Figure 4, we show the segmented Chinese source sentence using either the LDC dictionary or the self-learned dictionary, the corresponding translation and the human reference translation.

In the first example, the LDC dictionary leads to a correct segmentation, whereas with the learned dictionary the segmentation is erroneous. The second and third token should be combined ("Hong Kong"), whereas the fifth token should be separated ("stabilize in the long term"). In this case, the wrong segmentation of the Chinese source sentence does not result in a wrong translation. A possible reason is that the translation system is based on word groups and can recover from these segmentation errors.

In the second example, the segmentation with the LDC dictionary produces at least one error. The second and third token should be combined ("this"). It would also be possible to combine the seventh and eighth token into a single word, because the eighth token indicates only the tense. The segmentation with the learned dictionary is correct. Here, the two segmentations result in different translations.

In the third example, both segmentations are incorrect, and these segmentation errors affect the translation results. In the segmentation with the LDC dictionary, the first Chinese character should be segmented as a separate word. The second and third character, and maybe even the fourth character, should be combined into one word. The fifth and sixth character should be combined into a single word. In the segmentation with the learned dictionary, the fifth and sixth token (seventh and eighth character) should be combined ("isolated"). We see that this term is missing in the translation. Here, the segmentation errors result in translation errors.