<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1136">
  <Title>Significance tests for the evaluation of ranking methods</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Empirical validation
</SectionTitle>
    <Paragraph position="0"> In order to validate the statistical model and the significance tests proposed in Section 3, it is necessary to simulate the repetition of an evaluation experiment. Following the arguments of Section 3.1, the conditions should be the same for all repetitions so that the amount of purely random variation can be measured. To achieve this, I divided the Frankfurter Rundschau Corpus into 80 contiguous, non-overlapping parts, each one containing approx. 500k words. Candidates for PP-verb collocations were extracted as described in Section 1, with a frequency threshold of f ≥ 4. The 80 samples of candidate sets were ranked using the association measures G², X² and t as scoring functions, and true positives were manually identified according to the criteria of (Krenn, 2000) [footnote 14]. The true average precision π_A of an acceptance set A was estimated by averaging over all 80 samples.</Paragraph>
    <Paragraph position="1"> Both the confidence intervals of Section 3.2 and the significance tests of Section 3.3 are based on the assumption that P(T_A|N_A) follows a binomial distribution as given by Eq. (2). Unfortunately, it is impossible to test the conditional distribution directly, which would require that N_A is the same for all samples. [Footnote 14: I would like to thank Brigitte Krenn for making her annotation database of PP-verb collocations (Krenn, 2000) available, and for the manual annotation of 1913 candidates that were not covered by the existing database.]</Paragraph>
    <Paragraph position="2"> Therefore, I use the following approach based on the unconditional distribution P(P_A). If N_A is sufficiently large, P(P_A|N_A) can be approximated by a normal distribution with mean μ = π_A and variance σ² = π_A(1 − π_A)/N_A (from Eq. (2)).</Paragraph>
    <Paragraph position="3"> Since μ does not depend on N_A and the standard deviation σ is proportional to (N_A)^(−1/2), it is valid to make the approximation</Paragraph>
    <Paragraph position="5"> as long as N_A is relatively stable. Eq. (3) allows us to pool the data from all samples, predicting that</Paragraph>
    <Paragraph position="7"> with μ = π_A and σ² = π_A(1 − π_A)/N. Here, N stands for the average number of candidates in A.</Paragraph>
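The displayed equations (3) and (4) are not present in this version of the text. One reading that is consistent with the surrounding definitions is sketched below; it is an assumed reconstruction, not the original typesetting, and the exact form used in the paper may differ.

```latex
% Assumed reconstruction of the missing displays (requires amsmath for \tag).
% Eq. (3): the conditional distribution is replaced by the unconditional one,
% which is reasonable when N_A varies little between samples.
\begin{equation}
  P(P_A \mid N_A) \approx P(P_A) \tag{3}
\end{equation}
% Eq. (4): the pooled precision values are approximately normal.
\begin{equation}
  P_A \sim \mathcal{N}(\mu, \sigma^2), \qquad
  \mu = \pi_A, \quad
  \sigma^2 = \frac{\pi_A (1 - \pi_A)}{N} \tag{4}
\end{equation}
```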
    <Paragraph position="8"> These predictions were tested for the measures g₁ = G² and g₂ = t, with cutoff thresholds γ₁ = 32.5 and γ₂ = 2.09 (chosen so that N = 100 candidates are accepted on average). Figure 4 compares the empirical distribution of P_A with the expected distribution according to Eq. (4). These histograms show that the theoretical model agrees quite well with the empirical results, although there is a little more variation than expected [footnote 15]. The empirical standard deviation is between 20% and 40% larger than expected, with an empirical value of 0.057 vs. an expected σ = 0.044 for G², and 0.066 vs. σ = 0.047 for t. These findings suggest that the model proposed in Section 3.1 may indeed represent a lower bound on the true amount of random variation.</Paragraph>
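As an illustration of the comparison above, the following minimal sketch recomputes by how much the empirical standard deviations exceed the expected ones, using only the four rounded values quoted in the text. The helper expected_sd implements σ = sqrt(π_A(1 − π_A)/N); its name and the example value pi_a = 0.25 are assumptions made for illustration and do not come from the paper.

```python
import math


def expected_sd(pi_a: float, n: float) -> float:
    """Standard deviation of the precision P_A under the normal approximation."""
    return math.sqrt(pi_a * (1.0 - pi_a) / n)


# Standard deviations quoted in the text, as (empirical, expected) pairs.
reported = {"G2": (0.057, 0.044), "t": (0.066, 0.047)}

for measure, (empirical, expected) in reported.items():
    # Relative excess of the empirical over the expected standard deviation.
    excess = empirical / expected - 1.0
    print(f"{measure}: empirical sd exceeds expected sd by {excess:.0%}")

# Hypothetical illustration only: pi_a = 0.25 is an assumed precision,
# combined with the N = 100 accepted candidates mentioned in the text.
print(round(expected_sd(0.25, 100), 4))
```

Because the quoted values are rounded, the percentages obtained this way are only approximate.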
    <Paragraph position="9"> Further evidence for this conclusion comes from a validation of the confidence intervals defined in Section 3.2. For a 95% confidence interval, the true proportion π_A should fall within the confidence interval in all but 4 of the 80 samples. For G² (with γ = 32.5) and X² (with γ = 239.0), π_A was outside the confidence interval in 9 cases each (three of them very close to the boundary), while the confidence interval for t (with γ = 2.09) failed in 12 cases, which is significantly more than can be explained by chance (p &lt; .001, binomial test).</Paragraph>
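The binomial test mentioned above can be sketched as follows, assuming a one-sided test of the number of misses against a Binomial(80, 0.05) distribution. The paper does not state which variant of the test was used, and the function name binomial_upper_tail is hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


n_samples = 80    # number of corpus parts
miss_rate = 0.05  # a 95% interval should miss the true value 5% of the time

print("expected misses:", n_samples * miss_rate)
for misses in (9, 12):
    tail = binomial_upper_tail(misses, n_samples, miss_rate)
    print(f"{misses} misses: one-sided p = {tail:.4f}")
```

Under these assumptions the tail probability for 12 misses falls below .001, in line with the reported result, while 9 misses give a noticeably larger value.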
  </Section>
</Paper>