<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2137">
  <Title>More accurate tests for the statistical significance of result differences *</Title>
  <Section position="2" start_page="0" end_page="947" type="metho">
    <SectionTitle>
©2000 The MITRE Corporation. All rights reserved.
</SectionTitle>
    <Paragraph position="0"> null hypothesis), what is the probability that the results on the test set would be at least this skewed in the new technique's favor (Box et al., 1978, Sec. 2.3)? That is, what is P(test set results at least this skewed in the new technique's favor | new technique is no different than the old)? If the probability is small enough (5% often is used as the threshold), then one will reject the null hypothesis and say that the differences in the results are "statistically significant" at that threshold level.</Paragraph>
    <Paragraph position="1"> This paper examines some of the possible methods for trying to detect statistically significant differences in three commonly used metrics: recall, precision and balanced F-score. Many of these methods are found to be problematic in a set of experiments that are performed. These methods have a tendency to underestimate the significance of the results, which tends to make one believe that some new technique is no better than the current technique even when it is.</Paragraph>
    <Paragraph position="2"> This underestimate comes from these methods assuming that the techniques being compared produce independent results when, in our experiments, the techniques being compared tend to produce positively correlated results.</Paragraph>
    <Paragraph position="3"> To handle this problem, we point out some statistical tests, like the matched-pair t, sign and Wilcoxon tests (Harnett, 1982, Sec. 8.7 and 15.5), which do not make this assumption. One can use these tests on the recall metric, but the precision and balanced F-score metrics have too complex a form for these tests. For such complex metrics, we use a compute-intensive randomization test (Cohen, 1995, Sec. 5.3), which also avoids this independence assumption.</Paragraph>
    <Paragraph position="4">  The next section describes many of the standard tests used and their problem of assuming certain forms of independence. The first subsection describes tests where this assumption appears in estimating the standard deviation of the difference between the techniques' results.</Paragraph>
    <Paragraph position="5"> The second subsection describes using contingency tables and the χ² test. Following this is a section on methods that do not make this independence assumption. Subsections in turn describe some analytical tests, how they can apply to recall but not precision or the F-score, and how to use randomization tests to test precision and F-score. We conclude with a discussion of dependencies within a test set's instances, a topic that we have yet to deal with.</Paragraph>
  </Section>
  <Section position="3" start_page="947" end_page="952" type="metho">
    <SectionTitle>
2 Tests that assume independence
</SectionTitle>
    <Paragraph position="0"> between compared results</Paragraph>
    <Section position="1" start_page="947" end_page="949" type="sub_section">
      <SectionTitle>
2.1 Finding and using the variance of a
</SectionTitle>
      <Paragraph position="0"> result difference For each metric, after determining how well a new and current technique performs on some test set according to that metric, one takes the difference between those results and asks "is that difference significant?" A way to test this is to expect no difference in the results (the null hypothesis) and to ask, assuming this expectation, how unusual are these results? One way to answer this question is to assume that the difference has a normal or t distribution (Box et al., 1978, Sec. 2.4). Then one calculates the following:</Paragraph>
      <Paragraph position="2"> (d − E[d]) / sd    (1) where d = x1 − x2 is the difference found between x1 and x2, the results for the new and current techniques, respectively. E[d] is the expected difference (which is 0 under the null hypothesis) and sd is an estimate of the standard deviation of d. Standard deviation is the square root of the variance, a measure of how much a random variable is expected to vary. The results of equation 1 are compared to tables (cf. Box et al. (1978, Appendix)) to find out what the chances are of equaling or exceeding the equation 1 results if the null hypothesis were true.</Paragraph>
      <Paragraph position="3"> The larger the equation 1 result, the more unusual it would be under the null hypothesis.</Paragraph>
      <Paragraph position="4"> A complication of using equation 1 is that one usually does not have sd, but only s1 and s2, where s1 is the estimate for x1's standard deviation and similarly for s2. How does one get the former from the latter? It turns out that (Box et al., 1978, Ch. 3) σd² = σ1² + σ2² − 2ρ12σ1σ2, where σi is the true standard deviation (instead of the estimate si) and ρ12 is the correlation coefficient between x1 and x2. Analogously, it turns out that</Paragraph>
      <Paragraph position="6"> sd² = s1² + s2² − 2r12s1s2    (2) where r12 is an estimate for ρ12. So not only does σd (and sd) depend on the properties of x1 and x2 in isolation, it also depends on how x1 and x2 interact, as measured by ρ12 (and r12). When x1 and x2 are independent, ρ12 = 0, and then σd = √(σ1² + σ2²) and analogously, sd = √(s1² + s2²). When ρ12 is positive, x1 and x2 are positively correlated: a rise in x1 or x2 tends to be accompanied by a rise in the other result. When ρ12 is negative, x1 and x2 are negatively correlated: a rise in x1 or x2 tends to be accompanied by a decline in the other result. −1 ≤ ρ12 ≤ 1 (Larsen and Marx, 1986, Sec. 10.2).</Paragraph>
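The effect of the correlation term in equation 2 can be sketched numerically; this is an illustrative computation, and the standard deviation values below are made up, not taken from the paper:

```python
import math

def sd_of_difference(s1, s2, r12):
    """Standard deviation estimate of d = x1 - x2 from equation 2:
    sd^2 = s1^2 + s2^2 - 2 * r12 * s1 * s2."""
    return math.sqrt(s1 ** 2 + s2 ** 2 - 2.0 * r12 * s1 * s2)

# Hypothetical standard deviation estimates (not from the paper):
s1 = s2 = 0.04
# Independence (r12 = 0) recovers the familiar sqrt(s1^2 + s2^2):
print(sd_of_difference(s1, s2, 0.0))
# Positive correlation shrinks sd, so assuming independence when the
# results are actually correlated overestimates it:
print(sd_of_difference(s1, s2, 0.5))
```

With r12 = 0.5 and equal s values, sd drops from s√2 back down to s, which is the overestimation the text goes on to quantify.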
      <Paragraph position="7"> The assumption of independence is often used in formulas to determine the statistical significance of the difference d = x1 − x2. But how accurate is this assumption? One might expect some positive correlation from both results coming from the same test set. One may also expect some positive correlation when either both techniques are just variations of each other¹ or both techniques are trained on the same set of training data (and so are missing the same examples relative to the test set).</Paragraph>
      <Paragraph position="8"> This assumption was tested during some experiments for finding grammatical relations (subject, object, various types of modifiers, etc.). The metric used was the fraction of the relations of interest in the test set that were recalled (found) by some technique. The relations of interest were various subsets of the 748 relation instances in that test set. An example subset is all the modifier relations. Another subset is just that of all the time modifier relations.</Paragraph>
      <Paragraph position="9">  First, two different techniques, one memory-based and the other transformation-rule based, were trained on the same training set, and then both tested on that test set. Recall comparisons were made for ten subsets of the relations and the r12 was found for each comparison. From Box et al. (1978, Ch. 3)</Paragraph>
      <Paragraph position="11"> r12 = Σk (y1k − ȳ1)(y2k − ȳ2) / ((n − 1) s1 s2) where yik = 1 if the ith technique recalls the kth relation and = 0 if not. n is the number of relations in the subset. ȳi and si are mean and standard deviation estimates (based on the yik's), respectively, for the ith technique.</Paragraph>
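The r12 estimate can be computed directly from the two techniques' 0/1 recall indicators; a minimal sketch, where the indicator vectors are hypothetical rather than the paper's data:

```python
import math

def r12(y1, y2):
    """Sample correlation between two techniques' recall indicators:
    y[k] is 1 if the technique recalled relation k, else 0."""
    n = len(y1)
    m1 = sum(y1) / n
    m2 = sum(y2) / n
    s1 = math.sqrt(sum((y - m1) ** 2 for y in y1) / (n - 1))
    s2 = math.sqrt(sum((y - m2) ** 2 for y in y2) / (n - 1))
    cov = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / (n - 1)
    return cov / (s1 * s2)

# Hypothetical indicator vectors for two techniques on 8 relations:
a = [1, 1, 0, 1, 0, 1, 1, 0]
b = [1, 1, 0, 0, 0, 1, 1, 1]
print(r12(a, b))  # roughly 0.47: the techniques' hits overlap heavily
```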
      <Paragraph position="12"> For the ten subsets, only one comparison had an r12 close to 0 (it was −0.05). The other nine comparisons had r12's between 0.29 and 0.53.</Paragraph>
      <Paragraph position="13"> The median value over the ten comparisons was 0.38.</Paragraph>
      <Paragraph position="14"> Next, the transformation-rule based technique was run with different sets of starting conditions and/or different, but overlapping, subsets of the training set. Recall comparisons were made on the same test data set between the different variations. Many of the comparisons were of how well two variations recalled a particular subset of the relations. A total of 40 comparisons were made. The r12's on all 40 were positive. 3 of the r12's were in the 0.20-0.30 range. 24 of the r12's were in the 0.50-0.79 range. 13 of the r12's were in the 0.80-1.00 range.</Paragraph>
      <Paragraph position="15"> So in our experiments, we were usually comparing positively correlated results. How much error is introduced by assuming independence? An easy-to-analyze case is when the standard deviations for the results being compared are the same.² Then equation 2 reduces to sd = s√(2(1 − r12)), where s = s1 = s2. If one assumes the results are independent (assume r12 = 0), then sd = s√2. Call this value sd-ind. As r12 increases in value, sd decreases:
r12    sd        sd-ind/sd
0.00   1.41 s    1.00
0.25   1.22 s    1.15
0.50   1.00 s    1.41
0.75   0.71 s    2.00</Paragraph>
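Under the equal-standard-deviation simplification in the text, the inflation factor sd-ind/sd depends only on r12 and can be tabulated directly:

```python
import math

def inflation(r12):
    """Ratio sd-ind / sd = sqrt(2) / sqrt(2 * (1 - r12)): the factor by
    which wrongly assuming independence inflates the sd estimate when
    s1 = s2."""
    return 1.0 / math.sqrt(1.0 - r12)

for r in (0.0, 0.25, 0.50, 0.75):
    print(f"r12 = {r:.2f}: sd-ind is {inflation(r):.2f} times sd")
```

At r12 = 0.50 the factor is about 1.41, matching the "about 41% larger" figure quoted later in this subsection.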
      <Paragraph position="17"> The rightmost column above indicates the magnitude by which erroneously assuming independence (using sd-ind in place of sd) will increase the standard deviation estimate. [² This is actually roughly true in the comparisons made, and is assumed to be true in many of the standard tests for statistical significance.]</Paragraph>
      <Paragraph position="18"> In equation 1, sd forms the denominator of the ratio d/sd. So erroneously assuming independence will mean that the numerator d, the difference between the two results, will need to increase by that same factor in order for equation 1 to have the same value as without the independence assumption.</Paragraph>
      <Paragraph position="19"> Since the value of that equation indicates the statistical significance of d, assuming independence will mean that d will have to be larger than without the assumption to achieve the same apparent level of statistical significance.</Paragraph>
      <Paragraph position="20"> From the table above, when r12 = 0.50, d will need to be about 41% larger. Another way to look at this is that assuming independence will make the same value of d appear less statistically significant.</Paragraph>
      <Paragraph position="21"> The common tests of statistical significance use this assumption. The test known as the t (Box et al., 1978, Sec. 4.1) or two-sample t (Harnett, 1982, Sec. 8.7) test does. This test uses equation 1 and then compares the resulting value against the t distribution tables. This test has a complicated form for sd because:  1. x1 and x2 can be based on differing numbers of samples. Call these numbers n1 and n2 respectively.</Paragraph>
      <Paragraph position="22"> 2. In this test, the xi's are each an ni sample average of another variable (call it yi). This is important because the si's in this test are standard deviation estimates for the yi's, not the xi's. The relationship between them is that the si for yi is √ni times the si for xi.</Paragraph>
      <Paragraph position="23"> 3. The test itself assumes that y1 and y2 have  the same standard deviation (call this common value s). The denominator estimates s using a weighted average of s1 and s2.</Paragraph>
      <Paragraph position="24"> The weighting is based on n1 and n2.</Paragraph>
      <Paragraph position="25"> From Harnett (1982, Sec. 8.7), the denominator is sd = √(((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)) · √(1/n1 + 1/n2).</Paragraph>
      <Paragraph position="27"> When n1 = n2 (call this common value n), s1 and s2 will be given equal weight, and sd simplifies to √((s1² + s2²)/n). Making the substitution described above of si√n for si leads to an sd of the form √(s1² + s2²), the form we had earlier when using the independence assumption.</Paragraph>
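The simplification for n1 = n2 can be checked numerically; a minimal sketch with made-up s and n values (not the paper's data):

```python
import math

def sd_two_sample(s1, s2, n1, n2):
    """Two-sample t denominator: pooled estimate of the common standard
    deviation of y1 and y2, scaled for a difference of two means."""
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var) * math.sqrt(1.0 / n1 + 1.0 / n2)

# With n1 = n2 = n this collapses to sqrt((s1^2 + s2^2) / n).
# Hypothetical values:
n, s1, s2 = 50, 0.3, 0.5
print(sd_two_sample(s1, s2, n, n))
print(math.sqrt((s1 ** 2 + s2 ** 2) / n))  # agrees up to rounding
```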
      <Paragraph position="28"> Another test that both makes this assumption and uses a form of equation 1 is a test for binomial data (Harnett, 1982, Sec. 8.1.1) which uses the "fact" that binomial distributions tend to approximate normal distributions. In this test, the xi's being compared are the fraction of the items of interest that are recovered by the ith technique. In this test, the denominator sd of equation 1 also has a complicated form, both due to the reasons mentioned for the t test above and to the fact that with a binomial distribution, the standard deviation is a function of the number of samples and the mean value.</Paragraph>
    </Section>
    <Section position="2" start_page="949" end_page="950" type="sub_section">
      <SectionTitle>
2.2 Using contingency tables and χ² to
</SectionTitle>
      <Paragraph position="0"> test precision A test that does not use equation 1 but still makes an assumption of independence between x1 and x2 is that of using contingency tables with the chi-squared (χ²) distribution (Box et al., 1978, Sec. 5.7). When the assumption is valid, this test is good for comparing differences in the precision metric. Precision is the fraction of the items "found" by some technique that are actually of interest. Precision = R/(R + S), where R is the number of items that are of interest and are Recalled (found) by the technique, and S is the number of items that are found by the technique that turn out to be Spurious (not of interest). One can test whether the precision results from two techniques are different by using a 2 × 2 contingency table to test whether the ratio R/S is different for the two techniques.</Paragraph>
      <Paragraph position="1"> One makes the latter test by seeing if the assumption that the ratios for the two techniques are the same (the null hypothesis) leads to a statistically significant result when using a χ² distribution with one degree of freedom. A 2 × 2 table has 4 cells. The top 2 cells are filled with the R and S of one technique and the bottom 2 cells get the R and S of the other technique. In this test, the value in each cell is assumed to have a Poisson distribution. When the cell values are not too small, these Poisson distributions are approximately Normal (Gaussian). As a result, when the cell values are independent, summing the normalized squares of the difference between each cell and its expected value leads to a χ² distribution (Box et al., 1978, Sec. 2.5-2.6).</Paragraph>
      <Paragraph position="2"> How well does this test work in our experiments? Precision is a non-linear function of two random variables R and S, so we did not try to estimate the correlation coefficient for precision. However, we can easily estimate the correlation coefficients for the R's. They are the r12's found in section 2.1. As that section mentions, the r12's found are just about always positive. So at least in our experiments, the R's are not independent, but are positively correlated, which violates the assumptions of the test.</Paragraph>
      <Paragraph position="3"> An example of how this test behaves is the following comparison of the precision of two different methods at finding the modifier relations using the same training and test set. The correlation coefficient estimate for R is 0.35, and the R and S counts are those of the example in section 3.3: R = 47 and S = 48 for one method, R = 25 and S = 14 for the other.</Paragraph>
      <Paragraph position="5"> Placing the R and S values into a 2 × 2 table leads to a χ² value of 2.38.³ At 1 degree of freedom, the χ² tables indicate that if the null hypothesis were true, there would be a 10% to 20% chance of producing a χ² value at least this large. So according to this test, this much of an observed difference in precision would not be unusual if no actual difference in the precision exists between the two methods.</Paragraph>
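The χ² statistic for such a 2 × 2 table can be computed as follows; the R and S counts used are those implied by the example in section 3.3, and no Yates adjustment is applied, matching the footnote:

```python
def chi_squared_2x2(r1, s1, r2, s2):
    """Chi-squared statistic (no Yates correction) for a 2x2 table whose
    rows are the (R, S) counts of the two techniques."""
    table = [[r1, s1], [r2, s2]]
    total = r1 + s1 + r2 + s2
    row_sums = [r1 + s1, r2 + s2]
    col_sums = [r1 + r2, s1 + s2]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected cell value under the null hypothesis of equal R/S ratios.
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# R and S counts implied by the section 3.3 example:
print(round(chi_squared_2x2(47, 48, 25, 14), 2))  # 2.38
```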
      <Paragraph position="6"> This test assumes independence between the R values. When we use a 2²⁰ (=1048576) trial approximate randomization test (section 3.3), which makes no such assumptions, then we find that this latter test indicates that under the null hypothesis, there is less than a 4% chance of producing a difference in precision results as large as the one observed. So this latter test indicates that this much of an observed difference in precision would be unusual if no actual difference in the precision exists between the two methods.</Paragraph>
      <Paragraph position="7"> It should be mentioned that the manner of testing here is slightly different than the manner in the rest of this paper. The χ² test looks at the square of the difference of two results, and rejects the null hypothesis (the compared techniques are the same) when this square is [³ We do not use Yates' adjustment to compensate for the numbers in the table being integers. Doing so would have made the results even worse.]</Paragraph>
      <Paragraph position="8">  large, whether the largeness is caused by the new technique producing a much better result than the current technique or vice-versa. So to be fair, we compared the χ² results with a two-sided version of the randomization test: estimate the likelihood that the observed magnitude of the result difference would be matched or exceeded (regardless of which technique produced the better result) under the null hypothesis. A one-sided version of the test, which is comparable to what we use in the rest of the paper, estimates the likelihood of a different outcome under the null hypothesis: that of matching or exceeding the difference of how much better the new (possibly better) technique's observed result is than the current technique's observed result. In the above scenario, a one-sided test produces a 3% figure instead of a 4% figure.  In the matched-pair t test, the differences between the two techniques' results on each test sample and a form of equation 1 are used. The null hypothesis is still that the numerator d has a 0 mean, but d is now the sum of these difference values (divided by the number of samples), instead of being x1 − x2. Similarly, the denominator sd is now estimating the standard deviation of these difference values, instead of being a function of s1 and s2. This means, for example, that even if the values from techniques 1 and 2 vary on different test samples, sd will now be 0 if on each test sample, technique 1 produces a value that is the same constant amount more than the value from technique 2.</Paragraph>
      <Paragraph position="9"> Two other tests for comparing how two techniques perform by comparing how well they perform on each test sample are the sign and Wilcoxon tests (Harnett, 1982, Sec. 15.5). Unlike the matched-pair t test, neither of these two tests assumes that the sum of the differences has a normal (Gaussian) distribution. The two tests are so-called nonparametric tests, which do not make assumptions about how the results are distributed (Harnett, 1982, Ch. 15).</Paragraph>
      <Paragraph position="10"> The sign test is the simpler of the two. It uses a binomial distribution to examine the number of test samples where technique 1 performs better than technique 2 versus the number where</Paragraph>
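A one-sided sign test of this kind can be sketched with exact binomial arithmetic; the counts used below (28 versus 6) are the per-relation recall counts from the example in section 3.3:

```python
from math import comb

def sign_test_one_sided(n_better, n_worse):
    """One-sided sign test.  Under the null hypothesis, each sample on
    which the two techniques differ is a fair coin flip, so this returns
    the binomial probability of technique 1 winning at least n_better of
    the n_better + n_worse differing samples (ties are dropped first)."""
    n = n_better + n_worse
    return sum(comb(n, k) for k in range(n_better, n + 1)) / 2 ** n

# Recall counts from the section 3.3 example: method I alone recalls 28
# relations, method II alone recalls 6; relations recalled by both (or
# neither) are ties and carry no information for this test.
print(sign_test_one_sided(28, 6))  # roughly 0.0001
```

The result is in line with the roughly 0.00009 significance level the randomization test reports for recall in section 3.3.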
    </Section>
    <Section position="3" start_page="950" end_page="951" type="sub_section">
      <SectionTitle>
3.2 Using the tests for matched-pairs
</SectionTitle>
      <Paragraph position="0"> All three of the matched-pair t, sign and Wilcoxon tests can be applied to the recall metric, which is the fraction of the items of interest in the test set that a technique recalls (finds).</Paragraph>
      <Paragraph position="1"> Each item of interest in the test data serves as a test sample. We use the sign test because it makes fewer assumptions than the matched-pair t test and is simpler than the Wilcoxon test. In addition, the fact that the sign test ignores the size of the result difference on each test sample does not matter here. With the recall metric, each sample of interest is either found or not by a technique. There are no intermediate values.</Paragraph>
      <Paragraph position="2"> While the three tests described in section 3.1 can be used on the recall metric, they cannot be straightforwardly used on either the precision or balanced F-score metrics. This is because both precision and F-score are more complicated non-linear functions of random variables than recall.</Paragraph>
      <Paragraph position="3"> In fact both can be thought of as non-linear functions involving recall. As described in Section 2.2, precision = R/(R + S), where R is the number of items that are of interest that are recalled by a technique and S is the number of items (found by a technique) that are not of interest. The balanced F-score = 2ab/(a + b), where a is recall and b is precision.</Paragraph>
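These three metrics can be written out directly; the counts used in the example are taken from section 3.3 (103 relations of interest, with method I recalling R = 47 of them and finding S = 48 spurious ones):

```python
def recall(r, n_of_interest):
    """Fraction of the items of interest that were found."""
    return r / n_of_interest

def precision(r, s):
    """Fraction of the items found that are of interest."""
    return r / (r + s)

def f_score(a, b):
    """Balanced F-score: harmonic mean of recall a and precision b."""
    return 2 * a * b / (a + b)

# Counts from the section 3.3 example (method I):
a = recall(47, 103)
b = precision(47, 48)
print(round(100 * a, 1), round(100 * b, 1), round(100 * f_score(a, b), 1))
# 45.6 49.5 47.5
```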
    </Section>
    <Section position="4" start_page="951" end_page="952" type="sub_section">
      <SectionTitle>
3.3 Using randomization for precision
</SectionTitle>
      <Paragraph position="0"> and F-score A class of techniques that can handle all kinds of functions of random variables without the above problems is the computationally-intensive randomization tests (Noreen, 1989, Ch. 2) (Cohen, 1995, Sec. 5.3). These tests have previously been used on such functions during the "message understanding" (MUC) evaluations (Chinchor et al., 1993). The randomization test we use is like a randomization version of the paired sample (matched-pair) t test (Cohen, 1995, Sec. 5.3.2). This is a type of stratified shuffling (Noreen, 1989, Sec. 2.7). When comparing two techniques, we gather up all the responses (whether actually of interest or not) produced by one of the two techniques when examining the test data, but not both techniques. Under the null hypothesis, the two techniques are not really different, so any response produced by one of the techniques could have just as likely come from the other. So we shuffle these responses, reassign each response to one of the two techniques (equally likely to either technique) and see how likely such a shuffle produces a difference (new technique minus old technique) in the metric(s) of interest (in our case, precision and F-score) that is at least as large as the difference observed when using the two techniques on the test data.</Paragraph>
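The stratified-shuffling scheme just described can be sketched as follows for the precision metric; this is an illustrative implementation under stated simplifications, and the counts in the usage example are hypothetical, not the paper's data:

```python
import random

def shuffle_precision_test(only_a, only_b, both, n_trials=10000, seed=1):
    """Approximate randomization (stratified shuffling) for an observed
    precision difference, technique A minus technique B.  Each response
    is a bool: True if the response is of interest (counts toward R),
    False if spurious (counts toward S).  only_a and only_b hold the
    responses produced by exactly one technique; these get reshuffled.
    Responses in `both` were produced by both techniques and stay put."""
    rng = random.Random(seed)

    def precision(responses):
        # R / (R + S): fraction of a technique's responses of interest.
        return sum(responses) / len(responses) if responses else 0.0

    observed = precision(only_a + both) - precision(only_b + both)
    movable = only_a + only_b
    hits = 0
    for _ in range(n_trials):
        a, b = list(both), list(both)
        for resp in movable:
            # Under the null hypothesis, each single-technique response
            # is equally likely to have come from either technique.
            (a if rng.random() >= 0.5 else b).append(resp)
        if precision(a) - precision(b) >= observed:
            hits += 1
    # Conservative bound on the significance level (Noreen, 1989).
    return (hits + 1) / (n_trials + 1)

# Hypothetical counts: A alone finds 7 of-interest and 3 spurious
# responses, B alone finds 3 and 7, and 5 of-interest hits are shared.
print(shuffle_precision_test([True] * 7 + [False] * 3,
                             [True] * 3 + [False] * 7,
                             [True] * 5))
```

The same shuffles can score any metric at once, which is why the paper uses this machinery for both precision and F-score.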
      <Paragraph position="1"> n responses to shuffle and assign⁴ leads to 2ⁿ different ways to shuffle and assign those responses. So when n is small, one can try each of the different shuffles once and produce an exact randomization. When n gets large, the number of different shuffles gets too large to be exhaustively evaluated. Then one performs an approximate randomization where each shuffle is performed with random assignments.</Paragraph>
      <Paragraph position="2"> For us, when n ≤ 20 (2ⁿ ≤ 1048576), we use an exact randomization. For n > 20, we use an approximate randomization with 1048576 shuffles. Because an approximate randomization uses random numbers, which both lead to occasional unusual results and may involve using a not-so-good pseudo-random number generator⁵, we perform the following checks: we run the randomization a second time, and we also use the randomization to calculate the statistical significance for the recall metric, and compare this significance value with the significance value found for recall analytically by the sign test.</Paragraph>
      <Paragraph position="3"> An example of using randomization is to compare two different methods on finding modifier relations in the same test set. The results on the test set are:</Paragraph>
      <Paragraph position="5">
           Recall   Precision   F-score
Method I:   45.6%     49.5%      47.5%
Method II:  24.3%     64.1%      35.2%
Two questions being tested are whether the apparent improvement in recall and F-score from using method I is significant. Also being tested is whether the apparent improvement in precision from using method II is significant. In this example, there are 103 relations that should be found (are of interest). Of these, 19 are recalled by both methods, 28 are recalled by method I but not II, and 6 are recalled by II but not I. The correlation coefficient estimate between the methods' recalls is 0.35. In addition, 5 spurious (not of interest) relations are found by both methods, with method I finding an additional 43 spurious relationships (not found by method II) and method II finding an additional 9 relationships.</Paragraph>
      <Paragraph position="6"> There are a total of 28+6+43+9=86 relations that are found (whether of interest or not) by one method, but not the other. This is too many to perform an exact randomization, so a 1048576 trial approximate randomization is performed.</Paragraph>
      <Paragraph position="7"> In 96 of these trials, method I's recall is greater than method II's recall by at least (45.6% − 24.3%). Similarly, in 14794 of the trials, the F-score difference is at least (47.5% − 35.2%). In 25770 of the trials, method II's precision is greater than method I's precision by at least (64.1% − 49.5%). From (Noreen, 1989, Sec. 3A.3), the significance level (probability under the null hypothesis) is at most (nc + 1)/(nt + 1), where nc is the number of trials that meet the criterion and nt is the number of trials. So for recall, the significance level is at most (96+1)/(1048576+1) = 0.00009.</Paragraph>
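The (nc + 1)/(nt + 1) bound is trivial to compute; plugging in the trial counts from this example reproduces the quoted significance levels:

```python
def significance_bound(n_hits, n_trials):
    """Upper bound on the significance level from an approximate
    randomization: (nc + 1) / (nt + 1)  (Noreen, 1989)."""
    return (n_hits + 1) / (n_trials + 1)

# The three comparisons from the example above:
print(significance_bound(96, 1048576))     # recall: about 0.00009
print(significance_bound(14794, 1048576))  # F-score: about 0.014
print(significance_bound(25770, 1048576))  # precision: about 0.025
```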
      <Paragraph position="8">  Similarly, for F-score, the significance level is at most 0.014 and for precision, the level is at most 0.025. A second 1048576 trial run produces similar results, as does a sign test on recall. Thus, we see that all three differences are statistically significant.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="952" end_page="952" type="metho">
    <SectionTitle>
4 The future: handling inter-sample
</SectionTitle>
    <Paragraph position="0"> dependencies An assumption made by all the methods mentioned in this paper is that the members of the test set are all independent of one another. That is, knowing how a method performs on one test set sample should not give any information on how that method performs on other test set samples. This assumption is not always true.</Paragraph>
    <Paragraph position="1"> Church and Mercer (1993) give some examples of dependence between test set instances in natural language. One type of dependence is that of a lexeme's part of speech on the parts of speech of neighboring lexemes (their section 2.1). Similar is the concept of collocation, where the probability of a lexeme's appearance is influenced by the lexemes appearing in nearby positions (their section 3). A type of dependence that is less local is that often, a content word's appearance in a piece of text greatly increases the chances of that same word appearing later in that piece of text (their section 2.3). What are the effects when some dependency exists? The expected (average) value of the instance results will stay the same. However, the chances of getting an unusual result can change. As an example, take five flips of a fair coin.</Paragraph>
    <Paragraph position="2"> When no dependencies exist between the flips, the chance of the extreme result that all the flips land on a particular side is fairly small ((1/2)⁵ = 1/32). When the flips are positively correlated, these chances increase. When the first flip lands on that side, the chances of the other four flips doing the same are now each greater than 1/2.</Paragraph>
    <Paragraph position="3"> Since statistical significance testing involves finding the chances of getting an unusual (skewed) result under some null hypothesis, one needs to determine those dependencies in order to accurately determine those chances. Determining the effect of these dependencies is something that has yet to be done.</Paragraph>
  </Section>
</Paper>