<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1612"> <Title>Paraphrasing Rules for Automatic Evaluation of Translation into Japanese</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Background: Overview of BLEU </SectionTitle> <Paragraph position="0"> This section briefly describes the original BLEU (Papineni et al., 2002b)1, which was designed for English translation evaluation, so English sentences are used as examples in this section.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 N-gram precision </SectionTitle> <Paragraph position="0"> BLEU evaluation uses a parallel corpus which consists of sentences in the source language and their translations to the target language by professional translators. We call the professional translations reference sentences. It is preferable if the corpus has multiple reference sentences translated by multiple translators for each source sentence.</Paragraph> <Paragraph position="1"> Sentences in the source language are also translated by the translation systems to be evaluated. The 1See the cited paper for more detailed definitions.</Paragraph> <Paragraph position="2"> translations are called candidate sentences. Below is an example.</Paragraph> <Paragraph position="3"> I had the person of an office correct a clock.</Paragraph> <Paragraph position="4"> The BLEU score is based on n-gram precision shown in Equation (1). It is the ratio of n-grams which appear both in the candidate sentence and in at least one of the reference sentences, among all n-grams in the candidate sentence.</Paragraph> <Paragraph position="6"> Candidate 1 in Example 1 contains 11 unigrams including punctuation. 8 unigrams out of these also appear in Reference 1 or Reference 2: 'I', 'had', 'a', 'in', 'the', 'office', 'watch' and '.', therefore, the unigram precision of Candidate 1 is 8/11. The bigram precision is 4/10 since 'I had', 'in the', 'the office' and 'watch .' are found. The only matched trigram is 'in the office', so the trigram precision is 1/9.</Paragraph> <Paragraph position="7"> On the other hand, the unigram, bigram, and tri-gram precisions of Candidate 2 are 8/11, 2/10, 0/9, respectively, which are lower than those of Candidate 1. Indeed Candidate 1 is a better English translation than Candidate 2.</Paragraph> <Paragraph position="8"> In practice the n-gram precision is calculated not for each sentence but for all of the sentences in the corpus.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Brevity Penalty </SectionTitle> <Paragraph position="0"> The n-gram precision is calculated by dividing the number of matched n-grams by the number of n-grams in the candidate sentence. Therefore, a short candidate sentence which consists only of frequently used words can score a high n-gram precision. 
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.2 Brevity Penalty </SectionTitle>
<Paragraph position="0"> The n-gram precision is calculated by dividing the number of matched n-grams by the number of n-grams in the candidate sentence. Therefore, a short candidate sentence which consists only of frequently used words can score a high n-gram precision. For example, if the candidate sentence is just 'The', its unigram precision is 1.0 if one of the reference sentences contains at least one 'the', which is usually true.</Paragraph>
<Paragraph position="1"> To penalize such a meaningless translation, the BLEU score is multiplied by the brevity penalty shown in (2).</Paragraph>
<Paragraph position="2"> BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases} \qquad (2)</Paragraph>
<Paragraph position="3"> where c is the total number of words in the candidate sentences and r is the total number of words in the reference sentences whose lengths are closest to those of the corresponding candidate sentences.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.3 BLEU score </SectionTitle>
<Paragraph position="0"> The BLEU score is calculated by Equation (3) below. It is the geometric average of the n-gram precisions multiplied by the brevity penalty. The geometric average is used because p_n decreases exponentially as n increases. The BLEU score ranges between 0 and 1.</Paragraph>
<Paragraph position="1"> \mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \frac{1}{N} \sum_{n=1}^{N} \log p_n \right) \qquad (3)</Paragraph>
<Paragraph position="2"> The evaluation uses n-grams from unigrams up to N-grams. If a large N is used, the fluency of the sentences becomes a more important factor than the correctness of the words. Empirically, the BLEU score has a high correlation with human evaluation when N = 4 for English translation evaluations (Papineni et al., 2002b).</Paragraph>
</Section>
</Section>
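Putting Equations (1)-(3) together, the sketch below computes a corpus-level score: pooled n-gram precisions, the brevity penalty, and their combination by a geometric mean. It assumes the ngram_precision helper from the sketch in Section 2.1 is in scope, and it is an illustrative reimplementation rather than the authors' code.

```python
import math

def bleu(candidates, reference_sets, max_n=4):
    """Corpus-level BLEU: geometric mean of p_1..p_N times the brevity penalty.

    `candidates` is a list of token lists; `reference_sets[i]` holds the
    reference token lists for candidates[i]. Assumes ngram_precision() from
    the Section 2.1 sketch is available.
    """
    # Pooled n-gram precisions p_1 .. p_N (Equation 1).
    log_p_sum = 0.0
    for n in range(1, max_n + 1):
        matched = total = 0
        for cand, refs in zip(candidates, reference_sets):
            m, t = ngram_precision(cand, refs, n)
            matched += m
            total += t
        if matched == 0:
            return 0.0  # a single zero precision makes the geometric mean zero
        log_p_sum += math.log(matched / total)

    # Brevity penalty (Equation 2): c is the total candidate length, r sums the
    # length of the reference closest in length to each candidate.
    c = sum(len(cand) for cand in candidates)
    r = sum(min((len(ref) for ref in refs), key=lambda l: abs(l - len(cand)))
            for cand, refs in zip(candidates, reference_sets))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)

    # Equation (3): geometric average of the precisions, scaled by the penalty.
    return bp * math.exp(log_p_sum / max_n)
```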
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Japanese Version of BLEU and Its Extension </SectionTitle>
<Paragraph position="0"> This section describes how to adapt BLEU for Japanese translation evaluation. The adaptation consists of three steps.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.1 Use of Morphological Analyzer </SectionTitle>
<Paragraph position="0"> The first modification is mandatory for using the n-gram metric as in the original BLEU implementation. Since Japanese has no spaces between words, the words have to be separated by a morphological analyzer.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.2 Distinguish between Different Parts-of-speech </SectionTitle>
<Paragraph position="0"> Many English words can be used as various parts-of-speech (POSs), but BLEU does not distinguish between words with the same surface form in terms of their POSs, since the sentences are not processed by a tagger and the system therefore cannot handle POSs. This does not cause a problem, because most multi-POS words have conceptually similar meanings; for example, the adverb 'fast' and the adjective 'fast' share the same basic concept, so matching them between the candidate and the references reasonably reflects the quality of the translation.</Paragraph>
<Paragraph position="1"> On the other hand, Japanese homonyms tend to be completely different if their POSs are different.</Paragraph>
<Paragraph position="2"> For example, the postpositional phrasal particle 'ga' and the connective particle 'ga' should be distinguished from one another, since the former acts as a subject case marker, while the latter connects two clauses that normally contradict each other. Fortunately, the morphological analyzer outputs POS information when the sentence is separated into words, and therefore the words are also distinguished by their POSs in the described method.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.3 Paraphrasing Rules </SectionTitle>
<Paragraph position="0"> Example 3 is another possible translation of the source sentence of Example 2.</Paragraph>
<Paragraph position="1"> Example 3
Kare ga hon wo yo n da .
He SUBJ book ACC read INF-EUPH PAST .
'He read a book.'</Paragraph>
<Paragraph position="2"> The only difference here is that the ending of the sentence has a less polite form. However, when Example 2 is the only reference sentence, the BLEU evaluation of Example 3 does not score high: 6/8 for unigrams, 4/7 for bigrams, 3/6 for trigrams, and 2/5 for 4-grams, while its meaning is the same as that of the reference sentence.</Paragraph>
<Paragraph position="3"> Basically, BLEU copes with this problem of variation in writing styles by relying on the number of reference sentences available for each source sentence and on the total size of the corpus. That is, if the corpus has multiple reference sentences translated by different translators, multiple writing styles will tend to be included, and if the corpus is very large, such inconsistencies of writing style are statistically not a problem.</Paragraph>
<Paragraph position="4"> In Japanese translation evaluation, however, this problem cannot be resolved by such a quantitative solution, because the influence of the differences in writing styles is too large. For example, whether or not the translation is given in the polite form depends on the translation system (some translation systems allow us to specify such writing styles, but some do not), so the evaluation score is strongly affected by how well the writing style of the translation system matches that of the reference sentences.</Paragraph>
<Paragraph position="5"> To cancel out the differences in writing styles, we apply paraphrasing rules to the reference sentences to generate new sentences with different writing styles. The generated sentences are added to the reference sentences, and therefore n-grams in the candidate sentences can match the reference sentences regardless of their writing styles. Table 1 shows examples of paraphrasing rules.</Paragraph>
[Table 1: Examples of paraphrasing rules (not reproduced here). In the rules, the wild-card symbol is shared by both sides, ':' is a boundary of morphemes, and '(verb-c)' means a consonant verb such as 'yomu'. The rules also have conditions, not shown, so that they are not overused.]
<Paragraph position="6"> These rules are applied to the reference sentences. If a reference sentence matches a paraphrasing rule, the sentence is replicated and the replica is rewritten using the matched rule. For example, the Japanese sentence in Example 2 matches Rule 1 in Table 1, so the Japanese sentence in Example 3 is produced. In this case, the evaluation is done as if there were two reference sentences, and therefore a candidate sentence gets the same score regardless of its politeness.</Paragraph>
<Paragraph position="7"> To avoid applying the same rules repeatedly, the rules are applied in a specific order. How to generate the rules is described in Section 4.1.</Paragraph>
</Section>
</Section>
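A minimal sketch of how the rule application in Section 3.3 could be realized, assuming sentences are represented as sequences of (surface, POS) morpheme pairs as motivated in Section 3.2. The single rule shown (turning a polite past ending into the plain past, mirroring the effect of Rule 1 on Example 2), the POS tag names, and the sample reference sentence are hypothetical illustrations; the 88 rules actually used, their wildcards, and their extra conditions are not reproduced in this excerpt.

```python
# A morpheme is a (surface, POS) pair, so homographs with different POSs
# (e.g. the two particles 'ga') never match each other.

# Hypothetical stand-in for Rule 1: polite past of a consonant verb stem in
# 'mi' ('mi mashi ta') -> plain past ('n da'), e.g. 'yo mi mashi ta' -> 'yo n da'.
RULES = [
    ([("mi", "inflection"), ("mashi", "aux-polite"), ("ta", "aux-past")],
     [("n", "inflection"), ("da", "aux-past")]),
]

def apply_rule(sentence, pattern, replacement):
    """Return the rewritten sentence if `pattern` occurs in it, else None."""
    for i in range(len(sentence) - len(pattern) + 1):
        if sentence[i:i + len(pattern)] == pattern:
            return sentence[:i] + replacement + sentence[i + len(pattern):]
    return None

def expand_references(references, rules=RULES):
    """For every reference that matches a rule, add a rewritten replica.

    The originals are kept, so a candidate can match either writing style.
    """
    expanded = list(references)
    for ref in references:
        for pattern, replacement in rules:
            rewritten = apply_rule(ref, pattern, replacement)
            if rewritten is not None:
                expanded.append(rewritten)
    return expanded

# Hypothetical polite-form reference sentence ('He read a book.').
reference = [("kare", "noun"), ("ga", "particle-case"), ("hon", "noun"),
             ("wo", "particle-case"), ("yo", "verb-c"), ("mi", "inflection"),
             ("mashi", "aux-polite"), ("ta", "aux-past"), (".", "punct")]
for sent in expand_references([reference]):
    print(" ".join(surface for surface, _ in sent))
```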
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Experiments </SectionTitle>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.1 Environments </SectionTitle>
<Paragraph position="0"> To see how much the three extensions above contribute to the evaluation of translation, the correlation between the automatic evaluation and the human evaluation is calculated. We used a bilingual corpus which consists of 6,871 English sentences in a technical domain and their translations into Japanese.</Paragraph>
<Paragraph position="1"> 100 sentences were randomly selected and translated by 5 machine translation systems S1-S5 and a human H1, who is a native Japanese speaker but does not have strong knowledge of the technical domain. These 6 translations were evaluated by five methods: B1 to B4 are Japanese versions of BLEU with the extensions described in Section 3, and M1 is a manual evaluation.</Paragraph>
<Paragraph position="2"> B1: Morphological analysis is applied to the translated Japanese sentences. Only the technique described in Section 3.1 is used.</Paragraph>
<Paragraph position="3"> B2: Functional words are distinguished by their POSs. This corresponds to the techniques in Sections 3.1 and 3.2.</Paragraph>
<Paragraph position="4"> B3: Paraphrasing rules are applied to the reference sentences as described in Section 3.3. Here the applied rules are limited to the 51 rules which rewrite polite forms (e.g., Rules 1 and 2 in Table 1).</Paragraph>
<Paragraph position="5"> B4: All 88 paraphrasing rules, including other types (e.g., Rules 3 and 4 in Table 1), are applied.</Paragraph>
<Paragraph position="6"> M1: Average score of the manual evaluation of all translations in the corpus. The sentences were scored using a 5-level evaluation: 1 (poor) to 5 (good). The evaluator was different from the translator H1.</Paragraph>
<Paragraph position="7"> The paraphrasing rules used in B3 and B4 were prepared manually by comparing the candidate sentences and the reference sentences in the remainder of the corpus, which was not used for the evaluation. The application of the rules is unlikely to produce incorrect sentences, because the rules are adjusted by adding applicability conditions, and rules that may have side effects are not adopted. This was confirmed by applying the rules to 200 sentences in another corpus. A total of 189 of the 200 sentences were paraphrased at least in part, and all of the newly created sentences were grammatically correct and had the same meaning as the original sentences.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.2 Experimental Results </SectionTitle>
<Paragraph position="0"> Table 2 shows the results of evaluation using the five methods. Comparing the correlation with M1, B2 slightly outperformed B1, thus the POS information improves the evaluation. B3 was better than B2 in correlation by 0.06. This is because the scores of the B3 evaluation were much higher than those of the B2 evaluation except for S5, since only S5 tends to output sentences in the polite form while most of the reference sentences are written in the polite form. Further improvement was observed in B4, by applying the other types of paraphrasing rules.</Paragraph>
<Paragraph position="1"> Figure 1 graphically illustrates the correlation between the BLEU evaluations and the human evaluations, with the results normalized so that S1 is 0, H1 is 1, and the rest of the scores are linearly interpolated. (B1 is omitted from the figure since it is close to B2.) We can see that only B4 ranks all six systems in the same order as the manual evaluation.</Paragraph>
</Section>
</Section>
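The system-level comparison in this section can be reproduced mechanically once per-system scores are available. The sketch below computes a correlation between automatic and manual scores and the normalization used for Figure 1 (S1 mapped to 0, H1 to 1). This excerpt does not state which correlation coefficient the paper uses, so Pearson's r is assumed here, and all score values in the example are hypothetical placeholders, not the paper's results.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def normalize(scores, low_key="S1", high_key="H1"):
    """Rescale system scores so that S1 maps to 0 and H1 maps to 1 (Figure 1)."""
    lo, hi = scores[low_key], scores[high_key]
    return {sys: (v - lo) / (hi - lo) for sys, v in scores.items()}

# Hypothetical system-level scores for the six outputs S1-S5 and H1.
bleu_scores   = {"S1": 0.18, "S2": 0.21, "S3": 0.24, "S4": 0.26, "S5": 0.29, "H1": 0.41}
manual_scores = {"S1": 2.1,  "S2": 2.4,  "S3": 2.6,  "S4": 2.9,  "S5": 3.2,  "H1": 4.5}
systems = sorted(bleu_scores)
print(pearson([bleu_scores[s] for s in systems], [manual_scores[s] for s in systems]))
print(normalize(bleu_scores))
```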
<Section position="6" start_page="0" end_page="0" type="metho">
<SectionTitle> 5 Discussion </SectionTitle>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.1 Lexical or Structural Paraphrasing Rules </SectionTitle>
<Paragraph position="0"> The paraphrasing rules used here include no lexical rules, i.e., rules that rewrite content words into other expressions as in Example 4.</Paragraph>
<Paragraph position="1"> Example 4
dokusho : suru → hon : wo : yo : mu
'read' → 'read a book'</Paragraph>
<Paragraph position="2"> The main reason why we do not use such rules is that this type of rule may produce incorrect sentences. For instance, (a) in Example 5 is rewritten into (b) by the rule in Example 4, but (b) is not correct.</Paragraph>
<Paragraph position="3"> This error can be decreased if the paraphrasing rules have stricter conditions on the surrounding words; however, using such lexical rules contradicts the original BLEU's strategy that differences in expressions should be covered by the number of reference sentences. This strategy is reasonable because complicated rules tend to make the evaluation arbitrary, that is, the evaluation score comes to depend strongly on the particular lexical rules. To verify that the lexical rules are unnecessary, we added 17,478 word-replacing rules to B4. The rules mainly replace Chinese characters or Kana characters with canonical ones. With these rules, the correlation with M1 was 0.886, which is much lower than that of B4.</Paragraph>
<Paragraph position="4"> This result implies that the differences in content words do not affect the evaluations. More specifically, BLEU's misjudgments because of differences in content words occur with almost equal probability for each translation system. Thus it is enough to use the structural (i.e., non-lexical) paraphrasing rules, which rewrite only functional words.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.2 Evaluation of Each Paraphrasing Rule </SectionTitle>
<Paragraph position="0"> The contribution of the paraphrasing was measured by the increase in reliability of the translation evaluation, as described in Section 4.2. In the same way, the effect of each single paraphrasing rule can also be evaluated quantitatively. Table 3 shows the three paraphrasing rules which contributed most to the translation evaluation. Here the contribution of a rule to the automatic evaluation is measured by the increase in correlation with the human evaluation when the rule is used.</Paragraph>
[Table 3: The three best paraphrasing rules (not reproduced here). The column 'Δcorrel' gives the decrease of the correlation in the translation evaluation when the rule is removed; '(verb-v)' denotes a vowel verb.]
</Section>
</Section>
</Paper>