Stochastic Iterative Alignment for Machine Translation Evaluation

3 Stochastic Iterative Alignment (SIA) for Machine Translation Evaluation

We introduce three techniques that allow more sensitive scores to be computed.

3.1 Modified String Alignment

This section describes how the string alignment is computed based on word gaps. Given a pair of strings, the task of string alignment is to find the longest monotonic common sequence, where gaps are allowed. SIA uses a different, more flexible weighting strategy than ROUGE-W. In SIA, alignments are evaluated based on the geometric mean of the gaps on the reference side and the MT output side. Thus, in the dynamic programming, the state includes not only the current covered lengths of the MT output and the reference, but also the last aligned positions in each. The algorithm for computing the alignment score in SIA is described in Figure 3. The subroutine COMPUTE_SCORE, which computes the score gained from the current aligned positions, is shown in Figure 4.

[Figure 3: function GET_ALIGN_SCORE(mt, M, ref, N) — compute the alignment score of the MT output mt (length M) and the reference ref (length N) with nested loops over i = 1..M, j = 1..N, k = 1..i, m = 1..j.]

[Figure 4: function COMPUTE_SCORE(mt, ref, i, j, n, p) — if mt[i] == ref[j], return 1/sqrt((i - n) * (j - p)).]

From the algorithm we can see that not only do strict n-grams get higher scores than non-consecutive sequences, but non-consecutive sequences with smaller gaps also get higher scores than those with larger gaps. This weighting method helps SIA capture more subtle differences between MT outputs than ROUGE-W does. For example, if SIA is used to align mt1 and ref in Figure 1, it will choose "life is like box" instead of "life is like chocolate", because the average distance of the "box-box" mapping to its previous mapping "like-like" is smaller than that of "chocolate-chocolate". The score SIA assigns to mt1 follows from the gap products of this alignment; for mt2, there is only one possible alignment, and its score is computed in the same way. Comparing the two, mt1 is considered better than mt2 by SIA, which is reasonable. As mentioned in Section 1, though loose-sequence-based metrics better reflect the sentence-wide similarity of the MT output and the reference, they cannot make full use of word-level information. This defect could lead to poor performance in adequacy evaluation when the ignored words are crucial to the evaluation. Later in this section, we describe an iterative alignment scheme meant to compensate for this defect.
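As a rough illustration of this gap-based weighting, here is a minimal Python sketch that scores the best gap-weighted monotonic alignment between two token lists. It is a stand-in for the dynamic program of Figures 3 and 4, not a reproduction of it: the names are ours, the first aligned pair is assumed to be penalized by its distance from the sentence start, and any length normalization is omitted.

```python
from functools import lru_cache
from math import sqrt

def pair_score(i, j, n, p):
    """Credit for aligning mt position i to ref position j (1-based) when the
    previously aligned pair was (n, p); (0, 0) means no previous pair, so the
    first pair is penalized by its distance from the sentence start (an
    assumption, not something stated in the text)."""
    return 1.0 / sqrt((i - n) * (j - p))

def get_align_score(mt, ref):
    """Best gap-weighted monotonic alignment score between two token lists."""
    M, N = len(mt), len(ref)

    @lru_cache(maxsize=None)
    def best(i, j, n, p):
        # Best extra score obtainable from mt positions >= i and ref
        # positions >= j, given the last aligned positions (n, p).
        if i > M or j > N:
            return 0.0
        score = best(i + 1, j, n, p)              # leave mt[i] unaligned
        for jj in range(j, N + 1):                # or align it to some ref[jj]
            if mt[i - 1] == ref[jj - 1]:
                score = max(score,
                            pair_score(i, jj, n, p) + best(i + 1, jj + 1, i, jj))
        return score

    return best(1, 1, 0, 0)

# Consecutive matches contribute 1 each, while matches separated by gaps
# contribute less, e.g.:
# get_align_score("life is like box".split(), "life is like a box".split())
```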
3.2 Stochastic Word Mapping

In ROUGE and METEOR, PORTER-STEM and WORD-NET are used to increase the chance of MT output words matching the references. We use a different, stochastic approach in SIA to achieve the same purpose. The string alignment has a dynamic-programming framework into which stochastic word matching can easily be incorporated: the stochastic string alignment is implemented simply by replacing the function COMPUTE_SCORE with the function in Figure 5. The function similarity(word1, word2) returns a ratio which reflects how similar the two words are.

[Figure 5: function STO_COMPUTE_SCORE(mt, ref, i, j, n, p) — the stochastic variant of COMPUTE_SCORE.]

Now we consider how to compute the similarity ratio of two words. Our method is motivated by the phrase extraction method of Bannard and Callison-Burch (2005), which computes the similarity ratio of two words by looking at their relationship with words in another language. Given a bilingual parallel corpus with aligned sentences, say English and French, the probability of an English word given a French word can be computed by training word alignment models such as IBM Model 4. Then for every English word e, we have a set of conditional probabilities given each French word: p(e|f1), p(e|f2), ..., p(e|fN). If we consider these probabilities as a vector, the similarity of two English words can be obtained by computing the dot product of their corresponding vectors:2

sim(e_i, e_j) = sum_{k=1..N} p(e_i|f_k) * p(e_j|f_k)

Paraphrasing methods based on monolingual parallel corpora, such as (Pang et al., 2003; Barzilay and Lee, 2003), can also be used to compute the similarity ratio of two words, but they do not have training resources as rich as those of the bilingual methods.

2 Although the marginalized probability (over all French words) of an English word given the other English word, sum_{k=1..N} p(e_i|f_k) p(f_k|e_j), is a more intuitive way of measuring the similarity, the dot product of the vectors p(e|f) described above performed slightly better in our experiments.
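The similarity lookup and the stochastic match score are easy to sketch. The snippet below is a guess at the shape of STO_COMPUTE_SCORE rather than a reproduction of Figure 5: it assumes the p(e|f) tables are available as dictionaries (e.g., from Model 4 training) and that the similarity ratio simply scales the gap-weighted credit for non-identical words.

```python
from math import sqrt

def word_similarity(p_e1, p_e2):
    """Dot product of the translation-probability vectors of two English words:
    sim(e1, e2) = sum_k p(e1|f_k) * p(e2|f_k).
    Each argument is a dict mapping a French word f_k to a probability."""
    common = p_e1.keys() & p_e2.keys()
    return sum(p_e1[f] * p_e2[f] for f in common)

def sto_pair_score(mt_word, ref_word, i, j, n, p, sim):
    """Stochastic variant of the gap-weighted pair score: identical words keep
    full credit, otherwise the credit is scaled by sim(mt_word, ref_word)."""
    weight = 1.0 if mt_word == ref_word else sim(mt_word, ref_word)
    return weight / sqrt((i - n) * (j - p))
```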
3.3 Iterative Alignment Scheme

ROUGE-W, METEOR, and WER all score MT output by first computing a score based on each available reference and then taking the highest score as the final score for the MT output. This scheme cannot use multiple references simultaneously. The iterative alignment scheme proposed here alleviates this problem by repeatedly aligning the MT output against one of the available references until no more words in the MT output can be found in the references. In each alignment round, the score based on each reference is computed and the highest one is taken as the score for the round. The words that were aligned in the best alignment are then not considered in the next round. With the same number of aligned words, an MT output requiring fewer alignment rounds should be considered better than one requiring more rounds; for this reason, a decay factor a is multiplied into the score of each round. The final score of the MT output is then computed by summing the weighted scores of all alignment rounds. The scheme is described in Figure 6.

[Figure 6: function GET_ALIGN_SCORE_IN_MULTIPLE_REFS(mt, ref_1, ..., ref_N, a) — iteratively compute the alignment score based on multiple references and the decay factor a.]

The function GET_ALIGN_SCORE_1 used in GET_ALIGN_SCORE_IN_MULTIPLE_REFS is slightly different from the GET_ALIGN_SCORE described in the previous subsection. The dynamic programming algorithm for finding the best alignment is the same, except that it takes two more tables as input, which record the unavailable positions in the MT output and the reference. These positions have already been used in earlier best alignments and should not be considered in the ongoing alignment. The function also returns the aligned positions of the best alignment. The pseudocode for GET_ALIGN_SCORE_1 is shown in Figure 7.

[Figure 7: function GET_ALIGN_SCORE_1(mt, ref, mttable, reftable) — compute the alignment score of the MT output mt and the reference ref without considering the aligned positions recorded in mttable and reftable.]

The computation of the length penalty is similar to BLEU: it is set to 1 if the MT output is longer than the arithmetic mean of the reference lengths, and otherwise is set to the ratio of the two. Figure 8 shows how the iterative alignment scheme works on an evaluation set containing one MT output and two references. The selected alignment in each round is shown, as well as the positions that become unavailable in the MT output and the references.

[Figure 8: m: "England with France discussed this crisis in London"; r1: "Britain and France consulted about this crisis in London with each other"; r2: "England and France discussed the crisis in London".]

With the iterative scheme, every common word between the MT output and the reference set can contribute to the metric, and by this means SIA makes full use of the word-level information. Furthermore, the order (alignment round) in which the words are aligned provides a way to weight them. In BLEU, multiple references can also be used simultaneously, but the common n-grams are all treated equally.
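A minimal sketch of the iterative scheme follows. It assumes a helper get_align_score_1(mt, ref, mt_used, ref_used) that returns the round's score together with the aligned positions while avoiding already-used positions, as described above; how the decay factor and the length penalty are combined is our reading of the text, not the pseudocode of Figure 6.

```python
def iterative_align_score(mt, refs, alpha, get_align_score_1):
    """Iteratively align the MT output against multiple references, decaying
    the contribution of later rounds by alpha and finishing with a BLEU-like
    length penalty."""
    mt_used = set()                          # MT positions already aligned
    ref_used = [set() for _ in refs]         # per-reference aligned positions
    total, weight = 0.0, 1.0                 # first round unscaled (assumption)
    while True:
        rounds = []
        for k, ref in enumerate(refs):
            score, mt_pos, ref_pos = get_align_score_1(mt, ref,
                                                       mt_used, ref_used[k])
            rounds.append((score, mt_pos, ref_pos, k))
        score, mt_pos, ref_pos, k = max(rounds, key=lambda r: r[0])
        if score == 0.0:                     # no more words can be aligned
            break
        total += weight * score
        weight *= alpha
        mt_used.update(mt_pos)               # freeze this round's positions
        ref_used[k].update(ref_pos)
    # Length penalty: 1 if the MT output is at least as long as the arithmetic
    # mean of the reference lengths, otherwise the ratio of the two.
    mean_ref_len = sum(len(r) for r in refs) / len(refs)
    return min(1.0, len(mt) / mean_ref_len) * total
```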
4 Experiments

Evaluation experiments were conducted to compare the performance of different metrics, including BLEU, ROUGE, METEOR, and SIA. The test data for the experiments come from the MT evaluation workshop at ACL05. There are seven sets of MT outputs (E09, E11, E12, E14, E15, E17, E22), each containing 919 English sentences. These sentences are translations of the same Chinese input generated by seven different MT systems. The fluency and adequacy of each sentence are manually ranked from 1 to 5. For each MT output, two sets of human scores are available, and we randomly choose one as the score used in the experiments. The human overall scores are calculated as the arithmetic means of the human fluency and adequacy scores. Four sets of human translations (E01, E02, E03, E04) serve as references for the MT outputs. The MT outputs and reference sentences are transformed to lower case. Our experiments are carried out as follows: the automatic metrics are used to evaluate the MT outputs based on the four sets of references, and Pearson's correlation coefficient between the automatic scores and the human scores is computed to see how well they agree.

4.1 N-gram vs. Loose Sequence

One of the problems addressed in this paper is the different performance of n-gram-based metrics and loose-sequence-based metrics in sentence-level evaluation. To see how they differ in practice, we choose BLEU and ROUGE-W as the representative metrics of the two types and use them to evaluate the 6433 sentences in the 7 MT outputs. The Pearson correlation coefficients are then computed over the 6433 samples. The experimental results are shown in Table 1. BLEU-n denotes the BLEU metric with a longest n-gram length of n; F denotes fluency, A denotes adequacy, and O denotes overall.

We see that BLEU's performance does not increase monotonically with n-gram length. The best result in adequacy evaluation is achieved with 2-grams and the best result in fluency with 4-grams. Using n-grams longer than 2 does not buy much improvement for BLEU in fluency evaluation, and it does not compensate for the loss in adequacy evaluation. This confirms Liu and Gildea (2005)'s finding that long n-grams in BLEU are not beneficial in sentence-level evaluation. The loose-sequence-based ROUGE-W does much better than BLEU in fluency evaluation, but it does poorly in adequacy evaluation and does not achieve a significant improvement in overall evaluation. We speculate that the reason is that ROUGE-W does not make full use of the available word-level information.
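All of these sentence-level comparisons reduce to computing Pearson's correlation coefficient between a metric's scores and the human scores over the same sentences. For reference, a minimal implementation (ours, not the workshop's scoring code):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists,
    e.g. sentence-level metric scores and human fluency/adequacy scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```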
4.2 METEOR vs. SIA

SIA is designed to take advantage of loose-sequence-based metrics without losing word-level information. To see how well it works, we choose E09 as the development set and the sentences in the other 6 sets as the test data. The decay factor in SIA is determined by optimizing the overall evaluation on E09, and SIA is then used to evaluate the other 5514 sentences based on the four sets of references. The similarity of English words is computed by training IBM Model 4 on an English-French parallel corpus which contains seven hundred thousand sentence pairs. For every English word, only the entries of the top 100 most similar English words are kept, and their similarity ratios are then re-normalized. Words outside the training corpus are considered to have only themselves as similar words. To compare the performance of SIA with BLEU, ROUGE, and METEOR, the evaluation results on the same test data are given in Table 2. B3 denotes BLEU-3; R1 denotes the skipped-bigram-based ROUGE metric which considers all skip distances and uses PORTER-STEM; R2 denotes ROUGE-W with PORTER-STEM; M denotes the METEOR metric using PORTER-STEM and WordNet synonyms; S denotes SIA.

We see that METEOR, as the other metric sitting between n-gram-based metrics and loose-sequence metrics, achieves an improvement over BLEU in both adequacy and fluency evaluation. Though METEOR gets the best results in adequacy evaluation, in fluency evaluation it is worse than the loose-sequence-based ROUGE-W-STEM. SIA is the only one of the 5 metrics which does well in both fluency and adequacy evaluation. It achieves the best results in fluency evaluation and results comparable to METEOR in adequacy evaluation, and this balanced performance leads to the best overall evaluation results in the experiment.

To estimate the significance of the correlations, bootstrap resampling (Koehn, 2004) is used to randomly select 5514 sentences with replacement out of the whole test set of 5514 sentences, and the correlation coefficients are then computed on the selected sentence set. The resampling is repeated 5000 times, and the 95% confidence intervals are shown in Tables 3, 4, and 5. We can see that it is very difficult for one metric to significantly outperform another in sentence-level evaluation. The results show that the mean of the correlation coefficients converges to the value computed on the whole test set, and the confidence intervals correlate with the means.
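A sketch of this resampling procedure, reusing the pearson helper above; reading the interval off the sorted resampled values is one common choice and may differ in detail from the original setup.

```python
import random

def bootstrap_interval(metric_scores, human_scores, rounds=5000, level=0.95):
    """Bootstrap confidence interval for the sentence-level correlation
    (Koehn, 2004): resample sentences with replacement, recompute Pearson's r
    for each resample, and take the central `level` portion of the values."""
    n = len(metric_scores)
    values = []
    for _ in range(rounds):
        idx = [random.randrange(n) for _ in range(n)]
        values.append(pearson([metric_scores[i] for i in idx],
                              [human_scores[i] for i in idx]))
    values.sort()
    lo = values[int((1 - level) / 2 * rounds)]
    hi = values[int((1 + level) / 2 * rounds) - 1]
    return lo, hi
```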
While sentence-level evaluation is useful if we are interested in a confidence measure on MT outputs, system-level evaluation is more useful for comparing MT systems and guiding their development. Thus we also present the evaluation results based on the 7 MT output sets in Table 6. SIA uses the same decay factor as in the sentence-level evaluation. Its system-level score is computed as the arithmetic mean of its sentence-level scores, and so are the system-level scores of ROUGE, METEOR, and the human judgments. We can see that SIA achieves the best performance in both fluency and adequacy evaluation of the 7 systems. Though results based on only 7 samples are not reliable, they give a sense of how well SIA works in system-level evaluation.

4.3 Components in SIA

To see how the three components in SIA contribute to its final performance, we conduct experiments in which one or two components are removed from SIA; the results are shown in Table 7. The three components are denoted WLS (weighted loose sequence alignment), PROB (stochastic word matching), and INCS (iterative alignment scheme). WLS without INCS does only one round of alignment and takes the best alignment score as the final score; this scheme is similar to ROUGE-W and METEOR. We can see that INCS, as expected, improves the adequacy evaluation without hurting the fluency evaluation. PROB improves both adequacy and fluency evaluation. The results of SIA with PORTER-STEM and WordNet are also shown in Table 7; when both are used, PORTER-STEM is applied first. We can see that they are not as good as using the stochastic word matching. Since INCS and PROB are independent of WLS, we believe they can also be used to improve other metrics such as ROUGE-W and METEOR.

5 Conclusion

This paper describes SIA, a new metric for MT evaluation, which achieves good performance by combining the advantages of n-gram-based metrics and loose-sequence-based metrics. SIA uses stochastic word mapping to allow soft or partial matches between the MT hypotheses and the references. This stochastic component is shown to be better than PORTER-STEM and WordNet in our experiments. We also analyzed the effect of the other components in SIA and speculate that they can also be used in other metrics to improve their performance.

Acknowledgments

This work was supported by NSF ITR IIS-09325646 and NSF ITR IIS-0428020.