<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2137">
  <Title>More accurate tests for the statistical significance of result differences *</Title>
  <Section position="5" start_page="952" end_page="952" type="concl">
    <SectionTitle>
5 Conclusions
</SectionTitle>
    <Paragraph position="0"> In empirical natural language processing, one is often comparing differences in values of metrics like recall, precision and balanced F-score.</Paragraph>
    <Paragraph position="1"> Many of the statistics tests commonly used to make such comparisons assume independence between the results being compared. We ran a set of natural language processing experiments and found that this assumption is often violated in such a way as to understate the statistical significance of the differences between the results. We point out some analytical statistics tests, like the matched-pair t, sign and Wilcoxon tests, which do not make this assumption, and show that they can be used on a metric like recall. For more complicated metrics like precision and balanced F-score, we use a compute-intensive randomization test, which also avoids this assumption. A next topic to address is that of possible dependencies between test set samples.</Paragraph>
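    <Paragraph position="2"> The randomization test mentioned above can be illustrated with a minimal sketch. It is not the procedure used in the experiments, whose details are not given in this section. It assumes each test sample contributes a (true positive, false positive, false negative) count tuple for each of two systems, and that the shuffle swaps the two systems' outputs on a sample with probability 0.5. The names f_score, randomization_test, sys_a and sys_b are hypothetical, and the (count + 1) / (trials + 1) estimate is one common convention, assumed here.

```python
import random


def f_score(counts):
    """Balanced F-score from a list of per-sample (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    # F = 2PR / (P + R) simplifies to 2*tp / (2*tp + fp + fn).
    return 2 * tp / denom if denom else 0.0


def randomization_test(sys_a, sys_b, trials=2000, seed=0):
    """Estimate a two-sided p-value for the F-score difference.

    sys_a and sys_b hold the two systems' (tp, fp, fn) tuples, aligned
    so that index i refers to the same test sample in both lists.
    """
    rng = random.Random(seed)
    observed = abs(f_score(sys_a) - f_score(sys_b))
    at_least = 0
    for _ in range(trials):
        shuf_a, shuf_b = [], []
        for a, b in zip(sys_a, sys_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so swap the paired outputs with probability 0.5 (assumption).
            if rng.random() >= 0.5:
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        if abs(f_score(shuf_a) - f_score(shuf_b)) >= observed:
            at_least += 1
    # Add-one estimate keeps the p-value strictly positive (assumed convention).
    return (at_least + 1) / (trials + 1)
```

Because whole paired samples are swapped rather than individual scores, the sketch makes no independence assumption between the two systems' results, and the F-score is recomputed from pooled counts on every shuffle.</Paragraph>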
  </Section>
</Paper>