<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3114"> <Title>Out-of-domain test set</Title>
<Section position="4" start_page="103" end_page="105" type="metho"> <SectionTitle> 2 Automatic Evaluation </SectionTitle>
<Paragraph position="0"> For the automatic evaluation, we used BLEU, since it is the most established metric in the field. The BLEU metric, like all currently proposed automatic metrics, is occasionally suspected to be biased towards statistical systems, especially the phrase-based systems currently in use. It rewards matches of n-gram sequences, but measures overall grammatical coherence only indirectly, at best.</Paragraph>
<Paragraph position="1"> The BLEU score has been shown to correlate well with human judgement when statistical machine translation systems are compared (Doddington, 2002; Przybocki, 2004; Li, 2005). However, a recent study (Callison-Burch et al., 2006) pointed out that this correlation may not always be strong.</Paragraph>
<Paragraph position="2"> They demonstrated this with a comparison of statistical systems against (a) manually post-edited MT output, and (b) a rule-based commercial system.</Paragraph>
<Paragraph position="3"> The development of automatic scoring methods is an open field of research. It was our hope that this competition, which included the manual and automatic evaluation of statistical systems and one rule-based commercial system, would give further insight into the relation between automatic and manual evaluation. At the very least, we are creating a data resource (the manual annotations) that may serve as the basis of future research in evaluation metrics.</Paragraph>
<Section position="1" start_page="104" end_page="104" type="sub_section"> <SectionTitle> 2.1 Computing BLEU Scores </SectionTitle>
<Paragraph position="0"> We computed BLEU scores for each submission with a single reference translation. For each sentence, we counted how many n-grams in the system output also occurred in the reference translation. By taking the ratio of matching n-grams to the total number of n-grams in the system output, we obtain the precision $p_n$ for each n-gram order $n$. These values for n-gram precision are combined into a BLEU score:</Paragraph>
<Paragraph position="1"> $$\text{BLEU} = \text{BP} \cdot \exp\Big( \sum_{n=1}^{4} \tfrac{1}{4} \log p_n \Big), \qquad \text{BP} = \min\big(1,\; e^{\,1 - r/c}\big)$$ </Paragraph>
<Paragraph position="2"> The formula for the BLEU metric also includes a brevity penalty BP for output that is too short, which is based on the total number of words in the system output $c$ and in the reference $r$.</Paragraph>
<Paragraph position="3"> BLEU is sensitive to tokenization. Because of this, we retokenized and lowercased the submitted output with our own tokenizer, which was also used to prepare the training and test data.</Paragraph> </Section>
<Section position="2" start_page="104" end_page="105" type="sub_section"> <SectionTitle> 2.2 Statistical Significance </SectionTitle>
<Paragraph position="0"> Confidence Interval: Since BLEU scores are not computed on the sentence level, traditional methods for computing statistical significance and confidence intervals do not apply. Hence, we use the bootstrap resampling method described by Koehn (2004).</Paragraph>
<Paragraph position="1"> Following this method, we repeatedly -- say, 1000 times -- sample sets of sentences from the output of each system, measure their BLEU score, and use these 1000 BLEU scores as the basis for estimating a confidence interval. When dropping the top and bottom 2.5%, the remaining BLEU scores define the range of the confidence interval.</Paragraph>
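As an illustration of Sections 2.1 and 2.2, the sketch below computes a corpus-level BLEU score from n-gram precisions and a brevity penalty, and derives a bootstrap-resampled confidence interval. It is a minimal Python sketch under our own assumptions (clipped n-gram counts, uniform weights, a single reference per sentence), not the scoring code actually used in the shared task.

import math
import random
from collections import Counter

def ngrams(tokens, n):
    # All n-grams of order n in a tokenized sentence, with counts.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(outputs, references, max_n=4):
    # Corpus-level BLEU with a single reference translation per sentence.
    match = [0] * max_n   # clipped n-gram matches, per order
    total = [0] * max_n   # n-grams in the system output, per order
    out_len = ref_len = 0
    for out, ref in zip(outputs, references):
        out_len += len(out)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            out_counts, ref_counts = ngrams(out, n), ngrams(ref, n)
            match[n - 1] += sum(min(c, ref_counts[g]) for g, c in out_counts.items())
            total[n - 1] += sum(out_counts.values())
    if out_len == 0 or 0 in match:
        return 0.0
    # Geometric mean of the n-gram precisions p_n ...
    log_precision = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    # ... times the brevity penalty based on output length c and reference length r.
    brevity_penalty = min(1.0, math.exp(1 - ref_len / out_len))
    return brevity_penalty * math.exp(log_precision)

def bootstrap_confidence_interval(outputs, references, samples=1000, drop=0.025):
    # Bootstrap resampling (Koehn, 2004): resample sentences with replacement,
    # recompute BLEU each time, and drop the top and bottom 2.5% of the scores.
    n = len(outputs)
    scores = []
    for _ in range(samples):
        idx = [random.randrange(n) for _ in range(n)]
        scores.append(corpus_bleu([outputs[i] for i in idx],
                                  [references[i] for i in idx]))
    scores.sort()
    cut = int(samples * drop)
    return scores[cut], scores[-cut - 1]

The pairwise comparison described next follows the same pattern, except that each resampled set of sentence indices is scored for both systems and we count how often one of them comes out ahead.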
<Paragraph position="2"> Pairwise comparison: We can use the same method to assess the statistical significance of one system outperforming another. If two systems' scores are close, this may simply be a random effect of the test data. To check for this, we do pairwise bootstrap resampling: again, we repeatedly sample sets of sentences, this time from both systems, and compare their BLEU scores on these sets. If one system is better in 95% of the sample sets, we conclude that its higher BLEU score is statistically significantly better. The bootstrap method has been criticized by Riezler and Maxwell (2005) and Collins et al. (2005) as being too optimistic in deciding on statistically significant differences between systems. We therefore also apply a different method, which was used at the 2005 DARPA/NIST evaluation.</Paragraph>
<Paragraph position="3"> We divide each test set into blocks of 20 sentences (100 blocks for the in-domain test set, 53 blocks for the out-of-domain test set), check for each block whether one system has a higher BLEU score than the other, and then use the sign test.</Paragraph>
<Paragraph position="4"> The sign test checks how likely a given sample of better and worse BLEU scores would have been generated by two systems of equal performance.</Paragraph>
<Paragraph position="5"> Say we find one system doing better on 20 of the blocks and worse on 80 of the blocks: is it significantly worse? We check how likely as few as k = 20 better scores out of n = 100 would have been generated by two equal systems, using the cumulative binomial distribution:</Paragraph>
<Paragraph position="6"> $$p(0..k; n, p) = \sum_{i=0}^{k} \binom{n}{i} \, p^i (1-p)^{n-i}, \qquad p = 0.5$$ </Paragraph>
<Paragraph position="7"> If $p(0..k; n, p) < 0.05$ or $p(0..k; n, p) > 0.95$, then we have a statistically significant difference between the systems.</Paragraph>
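For illustration, the block-level sign test just described can be written down directly. This is a minimal sketch under the assumption that the per-block winners have already been determined; it is not the evaluation script used for the shared task.

from math import comb

def sign_test_cumulative(better, worse, p=0.5):
    # Cumulative binomial probability p(0..k; n, p): how likely two systems of
    # equal performance would produce at most `better` wins out of
    # n = better + worse blocks.
    n = better + worse
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(better + 1))

# Example from the text: one system wins 20 of 100 blocks and loses 80.
prob = sign_test_cumulative(20, 80)
# Significant difference if the cumulative probability is below 0.05
# (significantly worse) or above 0.95 (significantly better).
print(prob, prob < 0.05 or prob > 0.95)

For 20 wins out of 100 blocks the cumulative probability is far below 0.05, so the first system would be judged significantly worse.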
</Section> </Section>
<Section position="5" start_page="105" end_page="107" type="metho"> <SectionTitle> 3 Manual Evaluation </SectionTitle>
<Paragraph position="0"> While automatic measures are an invaluable tool for the day-to-day development of machine translation systems, they are only an imperfect substitute for human assessment of translation quality, or, as the acronym BLEU puts it, a bilingual evaluation understudy.</Paragraph>
<Paragraph position="1"> Many human evaluation metrics have been proposed. The argument has also been made that machine translation performance should be evaluated with task-based evaluation metrics, i.e. by how much it assists in performing a useful task, such as supporting human translators or aiding the analysis of texts.</Paragraph>
<Paragraph position="2"> The main disadvantage of manual evaluation is that it is time-consuming and thus too expensive to do frequently. In this shared task we were confronted with this problem as well, and since we had no funding to pay for human judgements, we asked the participants in the evaluation to share the burden. Participants and other volunteers contributed about 180 hours of labor to the manual evaluation.</Paragraph>
<Section position="1" start_page="105" end_page="106" type="sub_section"> <SectionTitle> 3.1 Collecting Human Judgements </SectionTitle>
<Paragraph position="0"> We asked participants to each judge 200-300 sentences in terms of fluency and adequacy, the most commonly used manual evaluation metrics. We settled on contrastive evaluations of 5 system outputs for a single test sentence. See Figure 3 for a screenshot of the evaluation tool. [Figure 3 caption (fragment): the output from 5 randomly selected systems for a randomly selected sentence is presented; no additional information beyond the instructions on this page is given to the judges; the tool tracks and reports annotation speed.]</Paragraph>
<Paragraph position="1"> Presenting the output of several systems allows the human judge to make more informed judgements by contrasting the quality of the different systems. The judgements then tend to take the form of a ranking of the different systems. We assumed that such a contrastive assessment would be beneficial for an evaluation that essentially pits different systems against each other.</Paragraph>
<Paragraph position="2"> While we had up to 11 submissions for a translation direction, we decided against presenting all 11 system outputs to the human judge. Our initial experimentation with the evaluation tool showed that this is often overwhelming.</Paragraph>
<Paragraph position="3"> Making the ten judgements (2 types for 5 systems) takes on average 2 minutes. Typically, judges initially spend about 3 minutes per sentence, but then accelerate with experience. Judges were excluded from assessing the quality of MT systems that were submitted by their own institution. Sentences and systems were randomly selected and randomly shuffled for presentation.</Paragraph>
<Paragraph position="4"> We collected around 300-400 judgements per judgement type (adequacy or fluency), per system, per language pair. This is less than the 694 judgements in the 2004 DARPA/NIST evaluation, or the 532 judgements in the 2005 DARPA/NIST evaluation.</Paragraph>
<Paragraph position="5"> This decreases the statistical significance of our results compared to those studies. The number of judgements is additionally fragmented by our breakup of sentences into in-domain and out-of-domain.</Paragraph> </Section>
<Section position="2" start_page="106" end_page="106" type="sub_section"> <SectionTitle> 3.2 Normalizing the judgements </SectionTitle>
<Paragraph position="0"> The human judges were presented with definitions of adequacy and fluency, but no additional instructions. Judges varied in the average score they handed out. The average fluency judgement per judge ranged from 2.33 to 3.67, the average adequacy judgement from 2.56 to 4.13. Since different judges judged different systems (recall that judges were excluded from judging system output from their own institution), we normalized the scores.</Paragraph>
<Paragraph position="1"> The normalized judgement per judge is the raw judgement plus (3 minus the average raw judgement of this judge). In other words, the judgements are normalized so that the average normalized judgement per judge is 3.</Paragraph>
<Paragraph position="2"> Another way to view the judgements is that they are not so much quality judgements of machine translation systems per se as rankings of machine translation systems. In fact, it is very difficult to maintain consistent standards for what (say) an adequacy judgement of 3 means, even for a specific language pair. Given the way the judgements are collected, human judges tend to use the scores to rank systems against each other. If one system is perfect, another has slight flaws and the third has more flaws, a judge is inclined to hand out judgements of 5, 4, and 3. On the other hand, when all systems produce muddled output, but one is better and one is worse, though not completely wrong, a judge is inclined to hand out judgements of 4, 3, and 2. The judgement of 4 in the first case will go to a vastly better system output than in the second case.</Paragraph>
<Paragraph position="3"> We therefore also normalized judgements on a per-sentence basis. The normalized judgement per sentence is the raw judgement plus (0 minus the average raw judgement for this sentence).</Paragraph>
<Paragraph position="4"> Systems that generally do better than others will receive a positive average normalized judgement per sentence; systems that generally do worse than others will receive a negative one.</Paragraph>
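A short Python sketch of the two normalizations (per judge and per sentence) follows. The flat list of (judge, sentence_id, system, raw_score) records is our own assumed data layout, chosen only to make the arithmetic explicit.

from collections import defaultdict

def normalize(judgements):
    # judgements: list of (judge, sentence_id, system, raw_score) tuples.
    # Returns per-judge and per-sentence normalized scores, keyed by
    # (judge, sentence_id, system), as described in Section 3.2.
    by_judge = defaultdict(list)
    by_sentence = defaultdict(list)
    for judge, sent_id, system, score in judgements:
        by_judge[judge].append(score)
        by_sentence[sent_id].append(score)

    judge_avg = {j: sum(s) / len(s) for j, s in by_judge.items()}
    sent_avg = {i: sum(s) / len(s) for i, s in by_sentence.items()}

    per_judge = {}     # raw + (3 - judge average): every judge averages 3
    per_sentence = {}  # raw + (0 - sentence average): better-than-average > 0
    for judge, sent_id, system, score in judgements:
        key = (judge, sent_id, system)
        per_judge[key] = score + (3 - judge_avg[judge])
        per_sentence[key] = score + (0 - sent_avg[sent_id])
    return per_judge, per_sentence

With the per-sentence variant, a system's average normalized judgement is positive exactly when it tends to beat the other systems shown for the same sentences, which matches the ranking interpretation above.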
<Paragraph position="5"> One may argue with these efforts at normalization, and ultimately their value should be assessed by measuring their impact on inter-annotator agreement. Given the limited number of judgements we received, we did not try to evaluate this.</Paragraph> </Section>
<Section position="3" start_page="106" end_page="107" type="sub_section"> <SectionTitle> 3.3 Statistical Significance </SectionTitle>
<Paragraph position="0"> Confidence Interval: To estimate confidence intervals for the average mean scores of the systems, we use standard significance testing.</Paragraph>
<Paragraph position="1"> Given a set of $n$ sentences, we can compute the sample mean $\bar{x}$ and sample variance $s^2$ of the individual sentence judgements $x_i$:</Paragraph>
<Paragraph position="2"> $$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$ </Paragraph>
<Paragraph position="3"> The extent of the confidence interval $[\bar{x} - d, \bar{x} + d]$ can then be computed by $$d = 1.96 \cdot \frac{s}{\sqrt{n}} \qquad (6)$$ </Paragraph>
<Paragraph position="4"> Pairwise Comparison: As with the automatic evaluation metric, we want to be able to rank different systems against each other, for which we need assessments of the statistical significance of the differences between pairs of systems.</Paragraph>
<Paragraph position="5"> Unfortunately, we have much less data to work with than for the automatic scores, which makes it harder to establish a significant distinction between system performance: automatic scores are computed on a larger test set than the manual scores (3064 sentences vs. 300-400 sentences). Moreover, the way we collected the manual judgements, we do not necessarily have the same sentence judged for both systems (judges evaluate 5 systems out of the 8-10 participating systems).</Paragraph>
<Paragraph position="6"> Still, for a good number of sentences we do have this direct comparison, which allows us to apply the sign test, as described in Section 2.2.</Paragraph> </Section> </Section> </Paper>