<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1057"> <Title>Feedback Cleaning of Machine Translation Rules Using Automatic Evaluation</Title>
<Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 Automatic Evaluation of MT Quality </SectionTitle>
<Paragraph position="0"> We utilize BLEU (Papineni et al., 2002) for the automatic evaluation of MT quality in this paper.</Paragraph>
<Paragraph position="1"> BLEU measures the similarity between MT results and translation results made by humans (called references). This similarity is measured by N-gram precision scores. Several kinds of N-grams can be used in BLEU. We use 1-grams to 4-grams in this paper, where the 1-gram precision score indicates the adequacy of word translation and longer N-gram (e.g., 4-gram) precision scores indicate the fluency of sentence translation. The BLEU score is calculated from the product of the N-gram precision scores, so this measure combines adequacy and fluency.</Paragraph>
<Paragraph position="2"> (Footnote: In this paper, the number of rules denotes the number of unique pairs of source patterns and target patterns.) [Table of example translation rules: Rule No., Syn. Cat., Source Pattern, Target Pattern, Source Example]</Paragraph>
<Paragraph position="3"> Note that a sizeable set of MT results is necessary in order to calculate an accurate BLEU score. Although it is possible to calculate the BLEU score of a single MT result, such a score contains errors with respect to subjective evaluation. BLEU cancels out these individual errors by summing the similarities of many MT results. Therefore, we need all of the MT results from the evaluation corpus in order to calculate an accurate BLEU score.</Paragraph>
<Paragraph position="4"> One feature of BLEU is its use of multiple references for a single source sentence. However, one reference per sentence is used in this paper because an already existing bilingual corpus is applied to the cleaning.</Paragraph>
</Section>
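To make the scoring concrete, the following is a minimal Python sketch of a BLEU-style corpus score with one reference per sentence, matching the setting used in this paper. The function names (`ngrams`, `bleu`) are illustrative only, and the sketch omits refinements of the official definition (e.g., multiple references); see Papineni et al. (2002) for the exact formulation.

```python
# Minimal sketch of a BLEU-style corpus score (single reference per sentence).
# Function names are illustrative; see Papineni et al. (2002) for the exact definition.
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidates, references, max_n=4):
    """candidates, references: lists of token lists (one reference per sentence)."""
    precisions = []
    for n in range(1, max_n + 1):
        matched = total = 0
        for cand, ref in zip(candidates, references):
            cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
            # Clipped counts: a candidate n-gram is credited at most as often
            # as it appears in the reference.
            matched += sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
            total += max(len(cand) - n + 1, 0)
        precisions.append(matched / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0
    # Brevity penalty discourages overly short translations.
    cand_len = sum(len(c) for c in candidates)
    ref_len = sum(len(r) for r in references)
    bp = 1.0 if cand_len > ref_len else exp(1 - ref_len / cand_len)
    # Geometric mean of the N-gram precisions (the "product" mentioned above).
    return bp * exp(sum(log(p) for p in precisions) / max_n)
```

Because the n-gram counts are pooled over the whole corpus before the precisions are computed, a single sentence contributes only a small, noisy part of the score, which is why the cleaning method below always evaluates the entire evaluation corpus.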
<Section position="5" start_page="1" end_page="1" type="metho"> <SectionTitle> 4 Feedback Cleaning </SectionTitle>
<Paragraph position="0"> In this section, we introduce the proposed method, called feedback cleaning. This method is carried out by selecting or removing translation rules to increase the BLEU score of the evaluation corpus (Figure 1).</Paragraph>
<Paragraph position="1"> Thus, this task is regarded as a combinatorial optimization problem of translation rules. The hill-climbing algorithm, which exploits the features of this task, is applied to the optimization. The following sections describe the reasons for using this method and its procedure. The hill-climbing algorithm often falls into locally optimal solutions.</Paragraph>
<Paragraph position="2"> However, we believe that a locally optimal solution is more effective in improving MT quality than the previous methods.</Paragraph>
<Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.1 Costs of Combinatorial Optimization </SectionTitle>
<Paragraph position="0"> Most combinatorial optimization methods iterate changes in the combination and the evaluation. In the machine translation task, the evaluation process requires the longest time. For example, in order to calculate the BLEU score of a combination (solution), we have to translate C times, where C denotes the size of the evaluation corpus. Furthermore, in order to find the nearest neighbor solution, we have to calculate all BLEU scores of the neighborhood.</Paragraph>
<Paragraph position="1"> If the number of rules is R and the neighborhood is regarded as consisting of combinations made by changing only one rule, we have to translate C x R times to find the nearest neighbor solution. Assuming that C = 10,000 and R = 100,000, the number of sentence translations (sentences to be translated) becomes one billion. It is infeasible to search for the optimal solution without reducing the number of sentence translations.</Paragraph>
<Paragraph position="2"> A feature of this task is that removing rules is easier than adding rules. The rules used for translating a sentence can be identified during the translation.</Paragraph>
<Paragraph position="3"> Conversely, the source sentence set S[r], where a rule r is used for the translation, is determined once the evaluation corpus is translated. When r is removed, only the MT results of S[r] will change, so we do not need to re-translate the other sentences.</Paragraph>
<Paragraph position="4"> Assuming that five rules on average are applied to translate a sentence, the number of sentence translations becomes 5 x C + C = 60,000 for testing all rules. On the contrary, to add a rule, the entire corpus must be re-translated because it is unknown which MT results will change by adding a rule.</Paragraph>
</Section>
<Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.2 Cleaning Procedure </SectionTitle>
<Paragraph position="0"> Based on the above discussion, we utilize the hill-climbing algorithm, in which the initial solution contains all rules (called the base rule set) and the search for a combination is done only by removing rules. The algorithm is shown in Figure 3. This algorithm can be summarized as follows.</Paragraph>
<Paragraph position="1"> * Translate the evaluation corpus first and then obtain the rules used for the translation and the BLEU score before removing rules.</Paragraph>
<Paragraph position="2"> * For each rule one-by-one, calculate the BLEU score after removing the rule and obtain the difference between the score before the removal and this score. This difference is called the rule contribution.</Paragraph>
<Paragraph position="3"> * If the rule contribution is negative (i.e., the BLEU score increases after removing the rule), remove the rule.</Paragraph>
<Paragraph position="4"> In order to achieve faster convergence, this algorithm removes all rules whose rule contribution is negative in one iteration. This assumes that the removed rules are independent from one another.</Paragraph>
</Section> </Section>
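To summarize the procedure in code, here is a minimal sketch of the cleaning loop described in Section 4.2. The interfaces `translate(source, rules)` (assumed to return the MT output together with the rules it used) and `bleu(hypotheses, references)` are placeholders for the actual MT system and evaluator, not part of the original system; only the bookkeeping that exploits S[r] is spelled out.

```python
# Sketch of the hill-climbing feedback cleaning loop.
# `translate(sentence, rules)` and `bleu(hypotheses, references)` are assumed
# interfaces to the MT system and the automatic evaluator; they are not shown here.

def feedback_cleaning(rules, eval_corpus, translate, bleu):
    """rules: set of rule ids; eval_corpus: list of (source, reference) pairs."""
    rules = set(rules)
    while True:
        # Translate the whole corpus once; record which rules each sentence used.
        results = [translate(src, rules) for src, _ in eval_corpus]
        hyps = [r.output for r in results]
        refs = [ref for _, ref in eval_corpus]
        base_score = bleu(hyps, refs)

        # S[r]: indices of sentences whose translation used rule r.
        used_in = {}
        for i, r in enumerate(results):
            for rule in r.rules_used:
                used_in.setdefault(rule, []).append(i)

        to_remove = []
        for rule, affected in used_in.items():
            # Re-translate only the affected sentences S[r] without this rule.
            trial = list(hyps)
            for i in affected:
                trial[i] = translate(eval_corpus[i][0], rules - {rule}).output
            contribution = base_score - bleu(trial, refs)  # negative => rule hurts
            if contribution < 0:
                to_remove.append(rule)

        if not to_remove:          # converged: no removal improves the score
            return rules
        rules -= set(to_remove)    # remove all harmful rules in one iteration
```

Only the sentences in S[r] are re-translated when rule r is tested, which is exactly the cost reduction argued for in Section 4.1.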
<Section position="6" start_page="1" end_page="1" type="metho"> <SectionTitle> 5 N-fold Cross-cleaning </SectionTitle>
<Paragraph position="0"> In general, most evaluation corpora are smaller than training corpora. Therefore, omissions of cleaning will remain because not all rules can be tested by the evaluation corpus. In order to avoid this problem, we propose an advanced method called cross-cleaning, which is similar to cross-validation (Figure 4 illustrates the case of three-fold cross-cleaning). The procedure of cross-cleaning is as follows.</Paragraph>
<Paragraph position="1"> 1. First, create the base rule set from the entire training corpus.</Paragraph>
<Paragraph position="2"> 2. Next, divide the training corpus into N pieces uniformly.</Paragraph>
<Paragraph position="3"> 3. Leave one piece for the evaluation, acquire rules from the rest (N - 1) of the pieces, and repeat this N times. Thus, we obtain N pairs of a rule set and an evaluation sub-corpus. Each rule set is a subset of the base rule set.</Paragraph>
<Paragraph position="4"> 4. Apply the feedback cleaning algorithm to each of the N pairs and record the rule contributions even if the rules are removed. The purpose of this step is to obtain the rule contributions.</Paragraph>
<Paragraph position="5"> 5. For each rule in the base rule set, sum up the rule contributions obtained from the rule subsets. If the sum is negative, remove the rule from the base rule set.</Paragraph>
<Paragraph position="6"> The major difference between this method and cross-validation is Step 5. In the case of cross-cleaning, the rule subsets cannot be directly merged because some rules have already been removed in Step 4. Therefore, we only obtain the rule contributions from the rule subsets and sum them up. The summed contribution is an approximate value of the rule contribution to the entire training corpus. Cross-cleaning removes rules from the base rule set based on this approximate contribution.</Paragraph>
<Paragraph position="7"> Cross-cleaning uses all sentences in the training corpus, so it is nearly equivalent to applying a large evaluation corpus to feedback cleaning, even though it does not require specific evaluation corpora.</Paragraph>
</Section>
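The five-step procedure above can be sketched as follows. Here `acquire_rules(corpus)` and `rule_contributions(rules, sub_corpus)` (a variant of the feedback cleaning loop that returns per-rule contributions instead of removing rules, cf. Step 4) are assumed interfaces, not part of the original system.

```python
# Sketch of N-fold cross-cleaning.  `acquire_rules(corpus)` and
# `rule_contributions(rules, eval_corpus)` are assumed helper interfaces.
from collections import defaultdict

def cross_cleaning(training_corpus, acquire_rules, rule_contributions, n_folds=3):
    base_rules = acquire_rules(training_corpus)                    # Step 1
    folds = [training_corpus[i::n_folds] for i in range(n_folds)]  # Step 2

    total_contribution = defaultdict(float)
    for k in range(n_folds):                                       # Steps 3-4
        held_out = folds[k]
        rest = [pair for j, fold in enumerate(folds) if j != k for pair in fold]
        sub_rules = acquire_rules(rest)                            # subset of base_rules
        for rule, c in rule_contributions(sub_rules, held_out).items():
            total_contribution[rule] += c

    # Step 5: remove rules whose summed (approximate) contribution is negative.
    return {r for r in base_rules if total_contribution.get(r, 0.0) >= 0.0}
```

Rules that are never exercised by any fold keep a summed contribution of zero and are therefore retained, consistent with Step 5.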
<Section position="7" start_page="1" end_page="1" type="metho"> <SectionTitle> 6 Evaluation </SectionTitle>
<Paragraph position="0"> In this section, the effects of feedback cleaning are evaluated by using English-to-Japanese translation.</Paragraph>
<Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 6.1 Experimental Settings </SectionTitle>
<Paragraph position="0"> Bilingual Corpora: The corpus used in the following experiments is the Basic Travel Expression Corpus (Takezawa et al., 2002). This is a collection of Japanese sentences and their English translations based on expressions that are usually found in phrasebooks for foreign tourists. We divided it into sub-corpora for training, evaluation, and test as shown in Table 1. The number of rules acquired from the training corpus (the base rule set size) was 105,588.</Paragraph>
<Paragraph position="1"> Evaluation Methods of MT Quality: We used the following two methods to evaluate MT quality.</Paragraph>
</Section> </Section>
<Section position="8" start_page="1" end_page="1" type="metho"> <SectionTitle> 1. Test Corpus BLEU Score </SectionTitle>
<Paragraph position="0"> The BLEU score was calculated with the test corpus. The number of references was one for each sentence, in the same way as for the feedback cleaning.</Paragraph>
</Section>
<Section position="9" start_page="1" end_page="2" type="metho"> <SectionTitle> 2. Subjective Quality </SectionTitle>
<Paragraph position="0"> A total of 510 sentences from the test corpus were evaluated by paired comparison. Specifically, the source sentences were translated using the base rule set, and the same sources were translated using the rules after the cleaning.</Paragraph>
<Paragraph position="1"> For each pair, a Japanese native speaker judged which MT result was better or whether the two were of the same quality. Subjective quality is represented by the following equation, where I denotes the number of improved sentences and D denotes the number of degraded sentences.</Paragraph>
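The equation itself is not reproduced here; a plausible reconstruction, under the assumption that subjective quality is reported as the net improvement normalized by the number of evaluated sentences (510 in this experiment), is:

\[ \text{Subjective quality} = \frac{I - D}{\text{number of evaluated sentences}} \times 100\ [\%] \]

Under this reading, a positive value means that more sentences improved than degraded relative to the baseline.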
<Section position="1" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 6.2 Feedback Cleaning Using Evaluation Corpus </SectionTitle>
<Paragraph position="0"> In order to observe the characteristics of feedback cleaning, cleaning of the base rule set was carried out by using the evaluation corpus. The results are shown in Figure 5. This graph shows changes in the test corpus BLEU score, the evaluation corpus BLEU score, and the number of rules along with the number of iterations.</Paragraph>
<Paragraph position="1"> Consequently, the removal of rules converged after nine iterations, and 6,220 rules were removed. The evaluation corpus BLEU score improved as the number of iterations increased, demonstrating that the combinatorial optimization by the hill-climbing algorithm worked effectively. The test corpus BLEU score reached a peak score of 0.245 at the second iteration and slightly decreased after the third iteration due to overfitting. However, the final score was 0.244, which is almost the same as the peak score.</Paragraph>
<Paragraph position="2"> The test corpus BLEU score was lower than the evaluation corpus BLEU score because the rules used in the test corpus were not exhaustively checked by the evaluation corpus. If the evaluation corpus size could be expanded, the test corpus score would improve.</Paragraph>
<Paragraph position="3"> About 37,000 sentences were translated on average in each iteration. This means that the time for an iteration is estimated at about ten hours if the translation speed is one second per sentence. This is short enough for our purposes because our method does not require real-time processing.</Paragraph>
</Section>
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 6.3 MT Quality vs. Cleaning Methods </SectionTitle>
<Paragraph position="0"> Next, in order to compare the proposed methods with the previous methods, the MT quality achieved by each of the following five methods was measured.</Paragraph>
</Section> </Section>
<Section position="10" start_page="2" end_page="2" type="metho"> <SectionTitle> 1. Baseline </SectionTitle>
<Paragraph position="0"> The MT results using the base rule set.</Paragraph>
<Paragraph position="1"> 2. Cutoff by Frequency: Low-frequency rules that appeared in the training corpus less often than twice were removed from the base rule set. This threshold was experimentally determined by the test corpus.</Paragraph>
<Paragraph position="2"> 3. Cutoff by Statistical Test: A statistical test was performed in the same manner as in Imamura (2002)'s experiment, and rules with more than 95 percent confidence were introduced.</Paragraph>
<Paragraph position="3"> 4. Simple Feedback Cleaning: Feedback cleaning of the base rule set using the evaluation corpus, as described in Section 4.</Paragraph>
<Paragraph position="4"> 5. Cross-cleaning: N-fold cross-cleaning using the training corpus, as described in Section 5.</Paragraph>
<Paragraph position="5"> The results are shown in Table 2. This table shows that the test corpus BLEU score and the subjective quality of the proposed methods (simple feedback cleaning and cross-cleaning) are considerably improved over those of the previous methods. (In this experiment, it took about 80 hours until convergence.) Focusing on the subjective quality of the proposed methods, some MT results were degraded from the baseline due to the removal of rules. However, the subjective quality improved overall because our methods aim to increase the proportion of correct MT results.</Paragraph>
<Paragraph position="6"> Focusing on the number of rules, the rule set obtained by simple feedback cleaning is clearly a locally optimal solution, since it contains more rules than that of cross-cleaning although its BLEU score is lower. Comparing the number of rules in cross-cleaning with that of the cutoff by frequency, the former is three times larger than the latter. We assume that the solution of cross-cleaning is also a locally optimal solution. If we could find the globally optimal solution, the MT quality would certainly improve further.</Paragraph>
</Section>
<Section position="11" start_page="2" end_page="2" type="metho"> <SectionTitle> 7 Discussion </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 7.1 Other Automatic Evaluation Methods </SectionTitle>
<Paragraph position="0"> The idea of feedback cleaning is independent of BLEU. Some automatic evaluation methods of MT quality other than BLEU have been proposed. For example, Su et al. (1992), Yasuda et al. (2001), and Akiba et al. (2001) measure the similarity between MT results and the references by DP matching (edit distances) and then output evaluation scores. These automatic evaluation methods, which output scores, are applicable to feedback cleaning.</Paragraph>
<Paragraph position="1"> The characteristic common to these methods, including BLEU, is that the similarity to references is measured for each sentence, and the evaluation score of an MT system is calculated by aggregating the similarities. Therefore, the MT results of the evaluation corpus are necessary to evaluate the system, and reducing the number of sentence translations is an important technique for all of these methods.</Paragraph>
<Paragraph position="2"> The effects of feedback cleaning depend on the characteristics of the objective measure. DP-based measures and BLEU have different characteristics (Yasuda et al., 2003). The exploration of several measures for feedback cleaning remains interesting future work.</Paragraph>
</Section>
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 7.2 Domain Adaptation </SectionTitle>
<Paragraph position="0"> When applying corpus-based machine translation to a different domain, bilingual corpora of the new domain are necessary. However, the sizes of the new corpora are generally smaller than that of the original corpus because collecting bilingual sentences is costly.</Paragraph>
<Paragraph position="1"> The feedback cleaning proposed in this paper can be interpreted as adapting the translation rules so that the MT results become similar to the evaluation corpus. Therefore, if we regard the bilingual corpus of the new domain as the evaluation corpus and carry out feedback cleaning, the rule set will be adapted to the new domain. In other words, our method can be applied to the adaptation of an MT system by using a smaller corpus of the new domain.</Paragraph>
</Section> </Section> </Paper>