File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/00/c00-2137_concl.xml
Size: 1,287 bytes
Last Modified: 2025-10-06 13:52:46
<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2137"> <Title>More accurate tests for the statistical significance of result differences *</Title> <Section position="5" start_page="952" end_page="952" type="concl"> <SectionTitle> 5 Conclusions </SectionTitle> <Paragraph position="0"> In empirical natural language processing, one often compares differences in the values of metrics like recall, precision and balanced F-score.</Paragraph> <Paragraph position="1"> Many of the statistical tests commonly used to make such comparisons assume independence between the results being compared. We ran a set of natural language processing experiments and found that this assumption is often violated in such a way as to understate the statistical significance of the differences between the results. We point out some analytical statistical tests, like the matched-pair t, sign and Wilcoxon tests, which do not make this assumption, and show that they can be used on a metric like recall. For more complicated metrics like precision and balanced F-score, we use a compute-intensive randomization test, which also avoids this assumption. A next topic to address is that of possible dependencies between test set samples.</Paragraph> </Section> </Paper>