<?xml version="1.0" standalone="yes"?> <Paper uid="M92-1003"> <Title>THE STATISTICAL SIGNIFICANCE OF THE MUC-4 RESULTS</Title> <Section position="3" start_page="0" end_page="31" type="metho"> <SectionTitle> ELEMENTS OF HYPOTHESIS TESTING </SectionTitle> <Paragraph position="0"> A key element in hypothesis testing is obviously the hypothesis. Statistics can be used to reject a hypothesis.</Paragraph> <Paragraph position="1"> In statistical hypothesis testing, it is important to compose the hypothesis in such a way as to be able to use statistics to reject it and thereby to conclude that something of interest is true. The hypothesis formulated to be rejected is called the null hypothesis. In Table 1, some elementary terms are defined, including the null hypothesis.</Paragraph> <Paragraph position="2"> We are interested in determining whether two systems are significantly different in their performance on a MUC-4 test set. To conclude that two systems are significantly different, we would formulate null hypotheses of the following form to be tested for each test set: The absolute value of the difference between System X's overall recall (precision, F-measure) score for the data extraction task and System Y's overall recall (precision, F-measure) score for the data extraction task is approximately equal to zero.</Paragraph> <Paragraph position="3"> If this null hypothesis can be rejected, then we can conclude that the systems are significantly different.</Paragraph> <Paragraph position="4"> Another key element in hypothesis testing found within the null hypothesis is the test statistic. A test statistic is a function which can be applied to a set of sample data to produce a single numerical value. A simple example of a test statistic is recall, which is a function of the number correct, the number partially correct, and the number of possible fills.
The test statistic we used in our hypothesis testing is the absolute value of the difference in recall, precision, or F-measure of systems. Observations are instances of a set of random variables. An example of an observation is the four-tuple for a MUC-4 message consisting of the number possible, actual, correct, and partially correct. This four-tuple plays an important role in our application of the approximate randomization method.</Paragraph> <Paragraph position="5"> O Null hypothesis The hypothesis that a relationship of interest is not present.</Paragraph> <Paragraph position="6"> Examples (informal): 1) system X and system Y do not differ in recall 2) system X and system Y do not differ in precision 3) system X and system Y do not differ in F-measure for equal weighting of recall and precision O Test statistic A function which can be applied to a set of sample data to produce a single numerical value.</Paragraph> <Paragraph position="7"> Examples: recall, precision, F-measure, difference in recall O Observations Instances of values of a set of random variables.</Paragraph> <Paragraph position="8"> Example: number possible, actual, correct, partially correct O Significance level The probability that a test statistic that is as extreme or more extreme than the actual value could have arisen by chance, given the null hypothesis.</Paragraph> <Paragraph position="9"> The lower the significance level, the less probable it is that the null hypothesis holds. In our case, the lower the significance level, the more likely it is that the two systems are significantly different.</Paragraph> <Paragraph position="10"> The final key element in hypothesis testing is some means of generating the probability distribution of the test statistic under the assumption that the null hypothesis is true. Instead of assuming a distribution, such as the Normal distribution, we empirically generate the distribution as illustrated in Figure 1.
As shown, the significance level is the area under the distribution curve bounded on the lower end by the actual value of the test statistic. The significance level is the probability that a test statistic that is as extreme or more extreme than the actual value could have arisen by chance, given the null hypothesis. Thus, the lower the significance level, the more likely it is that the two systems are significantly different.</Paragraph> </Section> <Section position="4" start_page="31" end_page="31" type="metho"> <SectionTitle> RANDOMIZATION TESTING </SectionTitle> <Paragraph position="0"> Traditional statistical analysis requires knowledge of or an assumption about the distribution of the data in the sample. We do not know the distribution for our sample and prefer not to make an assumption about it. Computationally-intensive methods empirically generate the distribution. Exact randomization testing generates all of the logically possible outcomes and compares the actual outcome to these. This amount of generation is often impractical because of the large number of data points involved, so approximate randomization is used instead. A confidence level is calculated to indicate how close the approximate randomization is to the exact randomization.</Paragraph> <Paragraph position="1"> The method of approximate randomization involves random shuffling of the data to generate the distribution. We used the approximate randomization method described by Noreen in [1] with stratified shuffling to control for categorical variables that are not of primary interest in the hypothesis test.</Paragraph> <Paragraph position="2"> The first step in approximate randomization is to select a test statistic. Our formulation of the null hypothesis indicates that the test statistic is the absolute value of the difference in recall, precision, or F-measure. The next step is to input the data.
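As a minimal illustration of the input step, the data for one system can be held as a list of (possible, actual, correct, partial) four-tuples, one per message. The numbers below are invented for illustration, and the pooled scoring with half credit for partial fills is an assumption based on the usual MUC conventions:

```python
# Invented example data: one (possible, actual, correct, partial)
# four-tuple per message for each of two hypothetical systems.
system_x = [(20, 20, 15, 0), (18, 19, 12, 2), (25, 24, 20, 1)]
system_y = [(20, 21, 14, 1), (18, 18, 13, 0), (25, 25, 18, 2)]

def overall_recall(messages):
    # Pool the counts over all messages, then score once
    # (partial fills assumed to earn half credit).
    possible = sum(m[0] for m in messages)
    correct = sum(m[2] for m in messages)
    partial = sum(m[3] for m in messages)
    return (correct + 0.5 * partial) / possible
```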
The data for each test set consists of the four-tuples of number possible, actual, correct, and partially correct for each message for each system. The actual statistic is calculated for the pair of systems under scrutiny. The desired number of shuffles is set to 9,999 because it takes about eight hours to run the test for each test set and the confidence levels can be easily looked up in a table given in [1]. We have arbitrarily chosen the cutoff confidence level to be a conservative 99%. In a test of how the confidence levels were affected by the number of shuffles, 9,999 shuffles produced slightly higher confidence levels than 999 and were worth the 10-fold increase in computing time. Once the desired number of shuffles is set, the counters for the number of shuffles, ns, and the number of times the pseudostatistic is greater than or equal to the actual statistic, nge, are set to 0. A loop then increments ns until it has exceeded 9,999, the desired number of shuffles. The first step in this loop is to shuffle the data, which is the first major operation that occurs during approximate randomization. Table 2 contains an outline of the major operations involved in approximate randomization. Figure 2 illustrates the stratified shuffling used for the analysis of the MUC-4 results.</Paragraph> </Section> <Section position="5" start_page="31" end_page="32" type="metho"> <SectionTitle> APPROXIMATE RANDOMIZATION </SectionTitle> <Paragraph position="0"> 1. Shuffle ns times (ns is 9,999 in our case).</Paragraph> <Paragraph position="1"> 2. Count the number of times (number greater than or equal, nge) that |stat_pseudoA - stat_pseudoB| &gt;= |stat_A - stat_B| (stat can be recall, precision, or F-measure in our case).</Paragraph> <Paragraph position="2"> 3. The estimate of the significance level is (nge + 1) / (ns + 1) (the 1's are added to ensure the test is valid).</Paragraph> <Paragraph position="3"> 4.
The confidence level is found by calculation or table lookup.</Paragraph> </Section> <Section position="6" start_page="32" end_page="35" type="metho"> <SectionTitle> SHUFFLING </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> [Figure 2: the message four-tuples (possible, actual, correct, partial) of System A and System B are exchanged to form Pseudo System A and Pseudo System B according to the outcome of a coin flip, one flip per message (100 times, heads or tails).] In Figure 2, data is shuffled by exchange of the systems' message scores depending on the outcome of a computer-simulated coin flip. After 100 coin flips, one per message, the absolute value of the difference in the metrics of the resulting pseudosystems can be compared to the corresponding absolute value of the difference in the metrics of the actual systems. The inequality for the comparison is shown in operation 2 of Table 2. The value of nge is incremented every time a randomized pair of pseudosystems satisfies this inequality. The significance level is calculated according to the formula in operation 3 of Table 2. The corresponding confidence level is then found in the appropriate table in [1].</Paragraph> <Paragraph position="3"> According to Noreen, Randomization is used to test the generic null hypothesis that one variable (or group of variables) is unrelated to another variable (or group of variables). Significance is assessed by shuffling one variable (or set of variables) relative to another variable (or set of variables). Shuffling ensures that there is in fact no relationship between the variables. If the variables are related, then the value of the test statistic for the original unshuffled data should be unusual relative to the values of the test statistic that are obtained after shuffling. 1 In our case, the four-tuple of data associated with each message for each system is the dependent set of variables that is shuffled. The explanatory variable is the system.
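The shuffling procedure just described can be sketched in code. This is a hedged reimplementation of the operations in Table 2, not Noreen's own program: one simulated coin flip per message decides whether the two systems' four-tuples are exchanged, nge counts the shuffles whose pseudostatistic is at least the actual statistic, and the significance level is estimated as (nge + 1) / (ns + 1):

```python
import random

def approximate_randomization(sys_a, sys_b, stat, ns=9999, seed=0):
    """Stratified (message-level) approximate randomization.

    sys_a, sys_b: lists of per-message four-tuples, aligned by message.
    stat: a function mapping a list of four-tuples to a score
          (e.g. overall recall, precision, or F-measure).
    Returns the estimated significance level.
    """
    rng = random.Random(seed)
    actual = abs(stat(sys_a) - stat(sys_b))
    nge = 0  # times the pseudostatistic >= the actual statistic
    for _ in range(ns):
        pseudo_a, pseudo_b = [], []
        for ta, tb in zip(sys_a, sys_b):
            if rng.random() < 0.5:   # heads: exchange the message scores
                ta, tb = tb, ta
            pseudo_a.append(ta)
            pseudo_b.append(tb)
        if abs(stat(pseudo_a) - stat(pseudo_b)) >= actual:
            nge += 1
    # The 1's are added to ensure the test is valid.
    return (nge + 1) / (ns + 1)
```

Two identical systems give an actual statistic of zero, so every shuffle satisfies the inequality and the estimated significance level is 1.0, as expected.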
Shuffling ensures that there is no relationship between the differences in the scores and the systems that produced them, i.e., that the differences were achieved by chance. If they were not achieved by chance, then the value of the actual test statistic will be far enough out on the tail of the empirically-generated distribution to be significant. The area under the distribution curve with the lower bound of the actual test statistic will be smaller than the cutoff in this case. We have arbitrarily chosen the cutoff to be 0.1 because we are able to distinguish reasonable groupings of systems at this level. At lower cutoffs, too many systems are not significantly different. The choice of a cutoff has traditionally been based on the practical implications of the choice. We need more data before we will know more of the implications of the cutoff given the sample sizes that we now have.</Paragraph> <Paragraph position="4"> The stratified shuffling that was limited to exchange at the message level was the most conservative approach to shuffling because it eliminated the effect of the varying number of slots filled per message in the key. We did not want to introduce the effect of this nuisance variable. Instead, we wanted the simplest, most straightforward method to be used for the initial introduction of approximate randomization in this application. Further experimentation with the parameters of stratification, the cutoff for significance level, and the cutoff for confidence level may occur in the future.</Paragraph> <Section position="1" start_page="33" end_page="35" type="sub_section"> <SectionTitle> Examples </SectionTitle> <Paragraph position="0"> These examples were obtained from [3] and are meant to give an intuitive sense of the results of the approximate randomization method. In general, if two systems have a large recall difference, they are likely to produce more homogeneous pseudosystems; nge will be small and the significance level will be low.
On the other hand, if two systems have similar recall scores, they are likely to produce a high nge and a larger significance level, giving a lower likelihood that the differences are significant.</Paragraph> <Paragraph position="1"> A pair of systems that shows consistent differences across a range of messages will be more likely to be statistically significantly different in their scores than two systems which have a small number of differences appearing in just a few messages. Consider the following systems reporting precision results for a set of 100 messages. System A has a score of 15/20 on each of 50 relevant messages and no spurious templates, for a total of 750/1000 = 75% precision. System B has the identical score on each of the relevant messages except for one, where its score is 0/20. System B has a precision of 735/1000 = 73.5%. In the random shuffle, one of the pseudosystems will have the &quot;0 precision&quot; template. The pseudostatistic will always be the same as the measured absolute difference between the systems, i.e., 1.5%, so the significance level is 1.0 and the difference is not statistically significant.</Paragraph> <Paragraph position="2"> 1. From page 9 of [1].</Paragraph> <Paragraph position="3"> Let us suppose that there is a System C with a precision score of 18/20 on each of 50 relevant messages and with no spurious templates. Any random shuffle of Systems A and C is likely to produce a smaller difference than the absolute value of the difference between A and C. The significance level will be extremely close to zero, indicating with virtual certainty that the two systems are significantly different.</Paragraph> <Paragraph position="4"> WHAT WE CANNOT CONCLUDE FROM APPROXIMATE RANDOMIZATION It is the systems themselves that set the conditions for the shuffle in the pairwise comparison.
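The System A / System B example above can be checked numerically. Because the two systems differ on only one message, exchanging any message's scores leaves the absolute precision difference at exactly 1.5%, so every shuffle satisfies the inequality and the estimated significance level is 1.0. The pooled scoring function below assumes the half-credit convention for partial fills (irrelevant here, since all partial counts are zero):

```python
# Pooled precision over a set of (possible, actual, correct, partial)
# message four-tuples; partial fills assumed to earn half credit.
def overall_precision(messages):
    actual = sum(m[1] for m in messages)
    correct = sum(m[2] for m in messages)
    partial = sum(m[3] for m in messages)
    return (correct + 0.5 * partial) / actual

# System A: 15/20 on each of 50 relevant messages -> 750/1000 = 75%.
# System B: identical except one message scored 0/20 -> 735/1000 = 73.5%.
system_a = [(20, 20, 15, 0)] * 50
system_b = [(20, 20, 15, 0)] * 49 + [(20, 20, 0, 0)]

base = abs(overall_precision(system_a) - overall_precision(system_b))

# Exchanging any single message never changes the absolute difference;
# since the other 49 messages are identical across systems, the same
# holds for any multi-message exchange, so nge = ns and the
# significance level is 1.0.
for i in range(50):
    pa, pb = list(system_a), list(system_b)
    pa[i], pb[i] = pb[i], pa[i]
    assert abs(overall_precision(pa) - overall_precision(pb)) == base
```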
The same difference in scores between pairs of systems may not equate to the same significance level when different systems are involved in the pairwise comparison. The statistical test measures the difference from what would have occurred randomly. Each pair of systems provides a different set of conditions for the stratified shuffle. Thus, we cannot conclude from the results of the approximate randomization tests that the evaluation scores are generally significant within a certain range of percentage points. This cautionary note has further consequences which are discussed below in the results for TST3.</Paragraph> <Paragraph position="5"> We also cannot conclude from the approximate randomization test whether our sample is sufficiently representative of newswire reports on activities concerning Latin America. In addition, we do not know whether the size of the sample is large enough for all of our purposes. The test sets were carefully collected in such a way as to reduce bias by selecting the number of articles concerning certain countries based on the percentage of articles about those countries in the entire corpus for the associated time period. In addition, the test articles were selected based solely on the order in which they appeared in the larger corpus and not on their contents. The size of the test sets was determined by the practical limits of time allotted for composing answer keys, running the participating systems on the test sets, and scoring the results. Even with these precautions, however, we still do not know whether the random sample of test messages is representative of the population from which they were drawn. Our statistical testing is not aimed at answering this question.</Paragraph> <Paragraph position="6"> WHAT WE CAN CONCLUDE FROM APPROXIMATE RANDOMIZATION TST3 The results for TST3 of MUC-4 are presented in this section. TST3 is the official test set for MUC-4.
The systems officially report scores for recall, precision, and the F-measures for three weightings of recall and precision based on the ALL TEMPLATES row in the summary score report. 2 Approximate randomization was applied to each pair of participating systems. The approximate randomization method applied to the TST3 results randomly shuffles the message-by-message scores for the two systems being compared and checks to see how many times the test statistic is as extreme as or more extreme than the actual test statistic. The lower the number of times the pseudo test statistic is greater than or equal to the actual test statistic, the lower the significance level and the more likely the actual test statistic did not occur by chance.</Paragraph> <Paragraph position="7"> The results of this computer-intensive testing are reported in two formats. The first format shows the significance levels for the pairwise comparison for each of the scores reported (Figures 3 - 6). The second and more informative format shows the significance groups or clusters based on the cutoffs at 0.1 for significance level and 0.99 for confidence level (Figures 7 - 11). The significance groupings represent groups of systems which were not considered to be significantly different from each other according to these cutoff criteria. Please note that the F-measures are calculated using floating point arithmetic throughout and differ slightly from the F-measures in the official score reports, which were calculated based on the integer values of recall and precision. These more accurate F-measures depicted in Figures 9 - 11 appear in tabular format in Appendix G.</Paragraph> <Paragraph position="8"> 2. For further information on the score report, see the paper &quot;MUC-4 Evaluation Metrics&quot; in these proceedings.</Paragraph> <Paragraph position="9"> [Figures 3 - 6: matrices of pairwise significance levels for recall, precision, and the F-measures; the numeric entries are not recoverable from this version of the text.] </Paragraph> </Section> </Section> <Section position="7" start_page="35" end_page="42" type="metho"> <SectionTitle> [Figures 7 - 11: significance groupings; graphic not recoverable] </SectionTitle> <Paragraph position="0"> Although most of the information in the figures is self-explanatory, there are a few items of interest to be pointed out. The first concerns the kinds of point spreads that occur within and between significance groupings. For example, in Figure 7, the UMBC-Conquest team has a recall score of 2.5, which differs from the next higher recall score of 6.6 for USC by 4.1 percentage points, whereas the first large group of four systems, MDC (20.4), NMSU-Brandeis (21.9), LSI (23.0), and SRA (26.9), has a spread of 6.5 percentage points. This difference between what separates two significantly different systems and what separates systems which are not significantly different illustrates the cautionary note that this method is unable to determine that the evaluation scores are generally significant within a certain range of percentage points.</Paragraph> <Paragraph position="1"> Another anomaly in the results is in the precision scores. In Figure 8, the UMBC-Conquest team has a precision score of 20.8. This score is not significantly different from seven other sites, although those seven sites divide into significance groupings of their own.
The reason for this is the low number of actuals generated by the system.</Paragraph> <Paragraph position="2"> Figure 9 illustrates two more examples of how the conditions set up by the systems influence the outcome of the statistical significance testing more than the actual test statistics themselves. The first example involves MDC, Paramax, and SRA and the second involves GE, GE-CMU, and UMass.</Paragraph> <Paragraph position="3"> For the first example, the F-measures are as follows: MDC 24.33, Paramax 29.03, and SRA 29.33. Paramax and SRA are not significantly different at the 0.1 level with at least a confidence of 0.99; neither are MDC and SRA; but MDC and Paramax are significantly different. If the cautionary note did not hold, then these results would be anomalous, because SRA's F-measure is higher than Paramax's. Therefore, we would expect that MDC and Paramax would not be significantly different because MDC and SRA are not. However, the F-measures are not the only factors affecting the results of the significance testing. The conditions set up by the systems during the shuffling have more influence than the raw results. This case illustrates that we cannot conclude from an approximate randomization test that the evaluation scores are generally significant within a certain range of percentage points.</Paragraph> <Paragraph position="4"> In the case of GE, GE-CMU, and UMass, we see a similar illustration of the cautionary note. The F-measures are: GE 56.01, GE-CMU 51.98, and UMass 51.61. The significance levels are GE and GE-CMU: 0.0415, GE and UMass: 0.0994, and GE-CMU and UMass: 0.8918. GE and GE-CMU are significantly different at the 0.05 level, and GE-CMU and UMass are not significantly different even at the 0.1 level. The significance levels show that GE is significantly different from GE-CMU and UMass at the 0.1 level. However, the confidence level for the significance test for GE and UMass does not meet the 0.99 cutoff because it is only 0.635.
So the pair cannot be considered significantly different by our criteria. The cautionary note explains why this situation could arise even though we would expect that the significance level would have been higher for GE and GE-CMU than for GE and UMass.</Paragraph> <Paragraph position="5"> TST4 The results for TST4 of MUC-4 are presented in this section. TST4 is a second test set chosen from an earlier time period. It contains more straightforward messages concerning terrorism and fewer irrelevant messages than TST3. The significance results are presented in the same formats as those for TST3. The first format is a matrix showing the significance levels for the pairwise comparison for each of the scores reported (Figures 12 - 15). The second and more informative format is a scatterplot showing the significance groups or clusters based on the cutoffs at 0.1 for significance level and 0.99 for confidence level (Figures 16 - 20). The significance groupings represent groups of systems which were not considered to be significantly different from each other according to these cutoff criteria.</Paragraph> </Section> </Paper>