<?xml version="1.0" standalone="yes"?>
<Paper uid="M91-1007">
  <Title>Matched Only Matched / Missing All Templates</Title>
  <Section position="3" start_page="0" end_page="60" type="intro">
    <SectionTitle>
RESULTS
</SectionTitle>
    <Paragraph position="0"> Our overall results on TST2 were very good in relation to other systems, but we devoted much of our analysis to explaining why they were much lower than our expectations.</Paragraph>
    <Paragraph position="1"> The GE results on TST2 are unusual in that we experienced a considerable drop in performance between TST1 and TST2, in spite of enhancements to our system that showed substantial improvement in our testing prior to TST2. Part of the drop is attributable to system-level problems introduced directly before the test. To determine the effect of these problems, we produced a revised run with two one-line changes in the system code. However, even this revised run shows a significant difference between the runs on TST1 and the development corpus and the run on TST2.</Paragraph>
    <Paragraph position="2"> The following table summarizes our results on the second test, TST2, both officially and with the system  In addition to these core results, we ran a number of other tests to put the TST2 runs in the context of our other results. Figure 2 illustrates how our two runs, TST2 (official) and TST2* (revised), compare with the historical system performance on training data. Data points that share an X coordinate represent runs using the same system configuration on distinct 100-message samples taken from the development corpus. Although the TST2* (revised) point is clearly more representative of system performance than TST2, we were still surprised by the drop and did some analysis to try to determine its cause. While we cannot definitively explain why the TST2 points are lower, the lower performance on TST2 does not seem to indicate that our system was overly tuned to the development examples. To test this, we restored the system from tape to a configuration as close as possible to the TST1 run. This point, marked on the Figure 2 graph as TST2 in March, is still about 10 points lower in recall than the TST1 run. In addition, note that the range of recall scores on different sets of 100 texts from the development corpus, shown by unlabeled dots at any fixed time on the graph, is about 20 points, a substantial variation.</Paragraph>
    <Paragraph position="4"> Although the much lower performance on TST2 could fall within the normal variation of performance among different message sets, we are still left to explain this variation, which did not seem to hit other systems as hard. The most likely hypothesis is that our program performed substantially lower on TST2 than on other runs because the strategy we chose in the final configuration was overly cautious in producing templates, while the answer key had an unusually large number of templates. This hypothesis is supported by the higher performance of our system in the MATCHED-ONLY row (see Figure 3 below). The fact that our program produced fewer than half as many templates as the system with the highest MATCHED/MISSING recall, combined with the fact that the answer key contained more templates than other sets, adds to the evidence that our program paid a &quot;recall penalty&quot; for generating fewer templates.</Paragraph>
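    The gap between the two recall rows can be illustrated with a small sketch. It assumes a simplified recall of correct fills divided by possible fills (no partial credit), and it assumes that MATCHED-ONLY counts only the key templates the system matched while MATCHED/MISSING also counts the slots of key templates the system never produced. All counts and names below are hypothetical and are not taken from the TST2 results.

```python
def recall(correct, possible):
    """Simplified recall: percentage of possible fills that were correct."""
    return 100.0 * correct / possible


# Hypothetical message set (illustrative numbers only, not from the paper).
key_templates = 10        # templates in the answer key
slots_per_template = 8    # fillable slots per key template
matched_templates = 6     # key templates the cautious system actually produced
correct_per_matched = 5   # correct fills in each matched template

correct = matched_templates * correct_per_matched

# MATCHED-ONLY: only slots of matched templates count as possible.
matched_only_possible = matched_templates * slots_per_template
# MATCHED/MISSING: slots of the missing key templates count as well.
matched_missing_possible = key_templates * slots_per_template

matched_only_recall = recall(correct, matched_only_possible)
matched_missing_recall = recall(correct, matched_missing_possible)

print(f"MATCHED-ONLY recall:    {matched_only_recall:.1f}")
print(f"MATCHED/MISSING recall: {matched_missing_recall:.1f}")
```

    Under these assumptions, every key template the system declines to produce enlarges the MATCHED/MISSING denominator without adding correct fills, which is the kind of penalty the hypothesis describes.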
    <Paragraph position="5"> To test this theory, we conducted a number of experiments, two of which involved using different strategies that we had viewed as being sub-optimal. In one test, we eliminated all portions of code that cut out spurious templates, causing the program to generate about twice as many templates per message set, where most of the additional templates were incorrect (because the code had been specifically designed to eliminate incorrect templates, not correct ones). This change, certainly not one that improved the program, resulted in a 6-point gain in recall on the TST1 set (shown by TST1 ovg in Figure 2) with a 3-point loss in precision.</Paragraph>
    <Paragraph position="6"> We then ran an experiment in which we blindly copied every template in our answer key (without otherwise changing the program or the answers). This resulted in a 6-point gain in recall with a 6-point loss in precision.</Paragraph>
    <Paragraph position="7"> Since these extra templates could not possibly be matching correctly (because no two events should be alike), this experiment also shows that generating incorrect templates tends to result in higher recall than not generating enough templates, and suggests that overgenerating more intelligently tends to improve recall more than it hurts precision.</Paragraph>
    <Paragraph position="8"> The TST2 set contained far more templates, as well as far more optional templates, than TST1 or the average for 100 messages in the development set. The development answer key contained, on average, 8 optional templates per 100 messages, while TST1 had 7 and TST2 had 32. The development answers averaged 83 filled templates per 100 messages, and TST1 and TST2 had 95 and 130, respectively.</Paragraph>
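    The counts above can be put on a common scale by dividing each test set's figures by the development averages. The sketch below uses only the six numbers quoted in the preceding paragraph; the dictionary layout and variable names are illustrative assumptions.

```python
# Template counts per 100 messages, as quoted in the text above.
counts = {
    "development": {"optional": 8, "filled": 83},
    "TST1": {"optional": 7, "filled": 95},
    "TST2": {"optional": 32, "filled": 130},
}

# Use the development-corpus averages as the baseline.
baseline = counts["development"]

# Ratio of each set's counts to the development average.
ratios = {
    name: {kind: c[kind] / baseline[kind] for kind in c}
    for name, c in counts.items()
}

for name, r in ratios.items():
    print(f"{name}: optional x{r['optional']:.2f}, filled x{r['filled']:.2f}")
```

    On these figures, TST2 has four times the development average of optional templates and roughly 1.57 times the filled templates, while TST1 stays close to the development average on both counts.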
    <Paragraph position="9"> Systems that overgenerate at the template level tend to be more impervious to changes in the percentage of OPTIONAL templates because extra templates are more likely to match, perhaps felicitously. In addition, overgenerating at the template level helps to prevent missing non-optional templates, which have the greatest effect on MATCHED/MISSING recall.</Paragraph>
    <Paragraph position="10"> Figure 3 gives a concise summary of the number of templates each system generated with respect to their recall in MATCHED/MISSING (M/M REC) and MATCHED-ONLY (M-O REC). Our system kept its template overgeneration very low. In fact, only 3 sites had lower template overgeneration, none of them within 20 recall points. One system with slightly higher recall produced 148 additional spurious templates. Note that the systems with lower template overgeneration also tend to get a bigger gain in recall in MATCHED-ONLY.</Paragraph>
    <Paragraph position="11"> The results seem to show a surprising variation from one test set to another, as well as an important tradeoff between template overgeneration and recall, especially in the important MATCHED/MISSING column. In retrospect, we believe that our overall TST2 results would have been closer to the expected performance of our system had we been less cautious about avoiding spurious templates. On the other hand, it might have been a good idea to measure template overgeneration (as well as the &quot;accidental&quot; matching of templates) as part of the results, since these incorrect templates are not a good thing. In most systems, overgeneration probably came from trying to maximize MATCHED/MISSING recall, so the MUC-3 score reporting didn't suggest that template overgeneration was a real issue. The test design for MUC-4 should probably show the relationship between template performance and overall scores more clearly.</Paragraph>
  </Section>
</Paper>