<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1204">
  <Title>Training Data Modification for SMT Considering Groups of Synonymous Sentences</Title>
  <Section position="3" start_page="19" end_page="19" type="metho">
    <SectionTitle>
2 Target Corpus
</SectionTitle>
    <Paragraph position="0"> In this paper, we use a multilingual parallel corpus called BTEC (Takezawa et al., 2002) for our experiments. BTEC was used in IWSLT (Akiba et al., 2004). This parallel corpus is a collection of Japanese sentences and their translations into English, Korean and Chinese that are often found in phrase books for foreign tourists. These parallel sentences cover a number of situations (e.g., hotel reservations, troubleshooting) for Japanese going abroad, and most of the sentences are rather short. Since the scope of its topics is quite limited, some very similar sentences can be found in the corpus, making BTEC appropriate for modification with compression or replacement of sentences. We use only a part of BTEC for training data in our experiments.</Paragraph>
    <Paragraph position="1"> The training data we employ contain 152,170 Japanese sentences, with each sentence combined with English and Chinese translations. In Japanese, each sentence has 8.1 words on average, and the maximum sentence length is 150 words. In English, each sentence contains an average of 7.4 words, with a maximum sentence length of 117 words. In Chinese, each sentence has an average of 6.7 words and maximum length of 122 words. Some sentences appear twice or more in the training corpus. In total, our data include 94,268 different Japanese sentences, 87,061 different Chinese sentences, and 91,750 different English sentences.</Paragraph>
    <Paragraph position="2"> Therefore, there are some sentence pairs that consist of exactly the same sentence in one language but a different sentence in another language, as Fig.</Paragraph>
    <Paragraph position="3"> 1 shows. This relationship can help in finding the synonymous sentence group.</Paragraph>
    <Paragraph position="4"> The test data contain 510 sentences from different training sets in the BTEC. Each source sentence in the test data has 15 target sentences for evaluations. For the evaluation, we do not use any special process for the grouping process. Consequently, our results can be compared with those of other MT systems.</Paragraph>
  </Section>
  <Section position="4" start_page="19" end_page="21" type="metho">
    <SectionTitle>
3 Modification Method
</SectionTitle>
    <Paragraph position="0"> When an SMT system learns the translation model, variations in the translated sentences of the pair are critical for determining whether the system obtains a good model. If the same sentence appears twice in the input-side language and these sentences form pairs with two different target sentences in the output-side language, then broadly speaking the translation model defines almost the same probability for these two target sentences.</Paragraph>
    <Paragraph position="1"> In our model, the translation system features the ability to generate an output sentence with some variations; however, for the system to generate the most appropriate output sentence, sufficient information is required. Thus, it is difficult to prepare a sufficiently large training corpus.</Paragraph>
    <Section position="1" start_page="19" end_page="20" type="sub_section">
      <SectionTitle>
3.1 Synonymous Sentence Group
</SectionTitle>
      <Paragraph position="0"> Kashioka (2004) reported two steps for making a synonymous sentence group. The first is a concatenation step, and the second is a decomposition step. In this paper, to form a synonymous sentence group, we performed only the concatenation step, which has a very simple idea. When the expression  &amp;quot; in language B form a synonymous group. If other language information is available, we can extend this synonymous group using information on translation pairs for other languages.</Paragraph>
      <Paragraph position="1"> In this paper, we evaluate an EJ/JE system and a CJ/JC system, and our target data include three languages, i.e., Japanese, English, and Chinese. We make synonymous sentence groups in two different environments. One is a group using Japanese and English data, and other is a group that uses Japanese and Chinese data.</Paragraph>
      <Paragraph position="3"> The JE group contained 72,808 synonymous sentence groups, and the JC group contained 83,910 synonymous sentence groups as shown in Table 1.</Paragraph>
    </Section>
    <Section position="2" start_page="20" end_page="21" type="sub_section">
      <SectionTitle>
3.2 Modification
</SectionTitle>
      <Paragraph position="0"> We prepared the three types of modifications for training data.</Paragraph>
      <Paragraph position="1">  1. Compress the training corpus based on the synonymous sentence group (Fig. 2). 2. Replace the input and output sides' sentences with the selected sentence, considering the synonymous sentence group (Fig. 3). 3. Replace one side's sentences with a se null lected sentence, considering the synonymous sentence group (Figs. 4, 5). We describe these modifications in more detail in the following subsections.</Paragraph>
      <Paragraph position="2">  Here, a training corpus is constructed with several groups of synonymous sentences. Then, each group keeps only one pair of sentences and the other pairs are removed from each group, thereby decreasing the total number of sentences and narrowing the variation of expressions. Figure 2 shows an example of modification in this way. In the figure, S1, S2, and S3 indicate the input-side sentences while T1 and T2 indicate the output-side sentences. The left-hand side box shows a synonymous sentence group in the original training corpus, where four sentence pairs construct one synonymous sentence group. The right-hand side box shows a part of the modified training corpus. In this case, we keep the S1 and T1 sentences, and this resulting pair comprises a modified training corpus.</Paragraph>
      <Paragraph position="3"> The selection of what sentences to keep is an important issue. In our current experiment, we select the most frequent sentence in each side's language from within each group. In Fig. 2, S1 appeared twice, while S2 and S3 appeared only once in the input-side language. As for the output-side language, T1 appeared three times and T2 appeared once. Thus, we keep the pair consisting of S1 and T1. When attempting to separately select the most frequent sentence in each language, we may not find suitable pairs in the original training corpus; however, we can make a new pair with the extracted sentences for the modified training corpus.  In the compression stage, the total number of sentences in the modified training corpus is decreased, and it is clear that fewer sentences in the training corpus leads to diminished accuracy. In order to make a comparison between the original training corpus and a modified training corpus with the same number of sentences, we extract one pair of sentences from each group, and each pair appears in the modified training corpus in the same number of sentences. Figure 3 shows an example of this modification. The original training data are the same as in Fig. 2. Then we extract S1 and T1 by the same process from each side with this group, and replacing all of the input-side sentences with S1 in this group. The output side follows the same process. In this case, the modified training corpus consists of four pairs of S1 and T1.</Paragraph>
      <Paragraph position="4">  sentence With the previous two modifications, the language variations in both sides decrease. Next, we propose the third modification, which narrows the range of one side's variations.</Paragraph>
      <Paragraph position="5"> The sentences of one side are replaced with the selected sentence from that group. The sentence for replacement is selected by following the same process used in the previous modifications. As a result, two modified training corpora are available  as shown in Figs. 4 and 5. Figure 4 illustrates the output side's decreasing variation, while Fig. 5 shows the input side's decreasing variation.</Paragraph>
      <Paragraph position="6">  In this section, we describe the SMT systems used in these experiments. The SMT systems' decoder is a graph-based decoder (Ueffing et al., 2002; Zhang et al., 2004). The first pass of the decoder generates a word-graph, a compact representation of alternative translation candidates, using a beam search based on the scores of the lexicon and language models. In the second pass, an A* search traverses the graph. The edges of the word-graph, or the phrase translation candidates, are generated by the list of word translations obtained from the inverted lexicon model. The phrase translations extracted from the Viterbi alignments of the training corpus also constitute the edges. Similarly, the edges are also created from dynamically extracted phrase translations from the bilingual sentences (Watanabe and Sumita, 2003). The decoder used the IBM Model 4 with a trigram language model and a five-gram part-of-speech language model.</Paragraph>
      <Paragraph position="7"> Training of the IBM model 4 was implemented by the GIZA++ package (Och and Ney, 2003). All parameters in training and decoding were the same for all experiments. Most systems with this training can be expected to achieve better accuracy when we run the parameter tuning processes. However, our purpose is to compare the difference in results caused by modifying the training corpus.</Paragraph>
      <Paragraph position="8"> We performed experiments for JE/EJ and JC/CJ systems and four types of training corpora:  1) Original BTEC corpus; 2) Compressed BTEC corpus (see 3.2.1); 3) Replace both languages (see 3.2.2); 4) Replace one side language (see 3.2.3) 4-1) replacement on the input side 4-2) replacement on the output side.</Paragraph>
      <Paragraph position="9"> For the evaluation, we use BLEU, NIST, WER, and PER as follows:  NIST: An arithmetic mean of the n-gram matches between test and reference sentences multiplied by a length factor, which again penalizes short translation sentences.</Paragraph>
      <Paragraph position="10"> mWER (Niessen et al., 2000): Multiple reference word-error rate, which computes the edit distance (minimum number of insertions, deletions, and substitutions) between test and reference sentences.</Paragraph>
      <Paragraph position="11"> mPER: Multiple reference position-independent word-error rate, which computes the edit distance without considering the word order.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="21" end_page="22" type="metho">
    <SectionTitle>
5 Experimental Results
</SectionTitle>
    <Paragraph position="0"> In this section, we show the experimental results for the JE/EJ and JC/CJ systems.</Paragraph>
    <Paragraph position="1">  Modification of the training data is based on the synonymous sentence group with the JE pair. The EJ system performed at 0.55 in mWER with the original data set, and the system replacing the Japanese side achieved the best performance of 0.44 in mWER. The system then gained 0.11 in mWER. On the other hand, the system replacing the English side lost 0.05 in mWER. The mPER score also indicates a similar result. For the BLEU and NIST scores, the system replacing the Japanese side also attained the best performance. The JE system attained a score of 0.52 in mWER with the original data set, while the system with English on the replacement side gave the best performance of 0.42 in mWER, a gain of 0.10. On the other hand, the system with Japanese on the replacement side showed no change in mWER, and the case of compression achieved good performance. The ratios of mWER and mPER are nearly the same for replacing Japanese. Thus, in both directions replacement of the input-side language derives a positive effect for translation modeling.</Paragraph>
    <Section position="1" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
5.2 CJ/JC system-based JC group
</SectionTitle>
      <Paragraph position="0"> Tables 4 and 5 show the evaluation results for the EJ/JE system based on the group with a JC language pair.</Paragraph>
      <Paragraph position="1">  The CJ system achieved a score of 0.41 in mWER with the original data set, with the other cases similar to the original; we could not find a large difference among the training corpus modifications. Furthermore, the JC system performed at 0.38 in mWER with the original data, although the other cases' results were not as good. These results seem unusual considering the EJ/JE system, indicating that they derive from the features of the Chinese part of the BTEC corpus.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="22" end_page="23" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> Our EJ/JE experiment indicated that the system with input-side language replacement achieved better performance than that with output-side language replacement. This is a reasonable result because the system learns the translation model with fewer variations for input-side language.</Paragraph>
    <Paragraph position="1"> In the experiment on the CJ/JC system based on the JC group, we did not provide an outline of the EJ/JE system due to the features of BTEC. Initially, BTEC data were created from pairs of Japanese and English sentences in the travel domain. Japanese-English translation pairs have variation as shown in Fig. 1. However, when Chinese data was translated, BTEC was controlled so that the same Japanese sentence has only one Chinese sentence.</Paragraph>
    <Paragraph position="2"> Accordingly, there is no variation in Chinese sentences for the pair with the same Japanese sentence. Therefore, the original training data would be similar to the situation of replacing Chinese. Moreover, replacing the Japanese data was almost to the same as replacing both sets of data. Considering this feature of the training corpus, i.e. the results for the CJ/JC system based on the group with JC language pairs, there are few differences between keeping the original data and replacing the Chinese data, or between replacing both side's data and replacing only the Japanese data. These results demonstrate the correctness of the hypothesis that reducing the input side's language variation makes learning models more effective.</Paragraph>
    <Paragraph position="3"> Currently, our modifications only roughly process sentence pairs, though the process of making groups is very simple. Sometimes a group may include sentences or words that have slightly different meanings, such as. fukuro (bag), kamibukuro (paper bag), shoppingu baggu (shopping bag), tesagebukuro (tote bag), and biniiru bukuro (plastic bag). In this case if we select tesagebukuro from the Japanese side and &amp;quot;paper bag&amp;quot; from the English side, we have an incorrect word pair in the translation model. To handle such a problem, we would have to arrange a method to select the sen- null tences from a group. This problem is discussed in Imamura et al. (2003). As one solution to this problem, we borrowed the measures of literalness, context freedom, and word translation stability in the sentence-selection process.</Paragraph>
    <Paragraph position="4"> In some cases, the group includes sentences with different meanings, and this problem was mentioned in Kashioka (2004). In an attempt to solve the problem, he performed a secondary decomposition step to produce a synonymous group. However, in the current training corpus, each synonymous group before the decomposition step is small, so there would not be enough difference for modifications after the decomposition step.</Paragraph>
    <Paragraph position="5"> The replacement of a sentence could be called paraphrasing. Shimohata et al. (2004) reported a paraphrasing effect in MT systems, where if each group would have the same meaning, the variation in the phrases that appeared in the other groups would reduce the probability. Therefore, considering our results in light of their discussion, if the training corpus could be modified with the module for paraphrasing in order to control phrases, we could achieve better performance.</Paragraph>
  </Section>
class="xml-element"></Paper>