<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3250"> <Title>Statistical Significance Tests for Machine Translation Evaluation</Title> <Section position="9" start_page="40" end_page="40" type="concl"> <SectionTitle> 8 Summary and Outlook </SectionTitle> <Paragraph position="0"> A trusted experimental framework is essential for drawing conclusions about the effects of system changes. For instance: do not test on training data, do not use the same test set repeatedly, etc. We stressed the importance of assembling test sets from different parts of a larger pool of sentences (Figure 2 vs. Figure 3).</Paragraph> <Paragraph position="1"> We discussed some properties of the widely used BLEU score, especially the effect of the brevity penalty and the role of larger n-grams.</Paragraph> <Paragraph position="2"> [Caption fragment, beginning missing] ... system comparisons and different sample sizes. 12%/1% means 12% correct and 1% wrong conclusions. The 30,000 test sentences are divided into 300, 100, 50, and 10 samples of 100, 300, 600, and 3000 sentences each, respectively.</Paragraph> <Paragraph position="3"> One important element of a solid experimental framework is a statistical significance test that allows us to judge whether a change in score that comes from a change in the system truly reflects a change in overall translation quality.</Paragraph> <Paragraph position="4"> We applied bootstrap resampling to machine translation evaluation and described methods to compute statistical significance intervals and levels for machine translation evaluation metrics such as BLEU, including intervals for small test sets.
We provided empirical evidence that the computed intervals are accurate.</Paragraph> <Paragraph position="5"> We hope that, aided by the proposed methods, it becomes common practice in published machine translation research to report the statistical significance of test results.</Paragraph> </Section> </Paper>