<?xml version="1.0" standalone="yes"?>
<Paper uid="M91-1003">
<Title>COMPARING MUCK-II AND MUC-3: ASSESSING THE DIFFICULTY OF DIFFERENT TASKS</Title>
<Section position="5" start_page="28" end_page="29" type="concl">
<SectionTitle>CONCLUSIONS</SectionTitle>
<Paragraph position="0">Table 4 attempts to summarize this discussion by providing a rough order of magnitude for the different dimensions. We see from this that MUC-3 is many times harder than MUCK-II in three of the four dimensions, while performance has only been cut by a factor of two.

Table 4: Rough order of magnitude for each dimension
1. Complexity of Data             10x
2. Corpus Dimensions             100x
3. Nature of Task
4. Difficulty of Template Fill     2x
5. Overall Performance           0.5x [1]

[1] The term &quot;MATCHED-MISSING&quot; refers to the metric which penalized systems for each missing template, but counted spurious templates wrong only in the template ID slot, not in each individual filled slot.

Even though the relation between difficulty and precision/recall figures is certainly not linear (the last 10-20% is always much harder to get than the first 80%), the degree of difficulty has increased much more than the performance has deteriorated.</Paragraph>
<Paragraph position="1">This comparison is reassuring in several respects. First, it means that the field has made very substantial progress in the past two years. MUC-3 shows that current message understanding systems are able to handle a realistic corpus, at a realistic throughput, with a reasonable degree of accuracy - higher precision and recall than many information retrieval systems are likely to get. Secondly, it means that as a test, MUC-3 was well designed. Part of the motivation in changing tasks and domains after MUCK-II was to make the problem realistic and sufficiently challenging that there would be no easy or trick solutions. MUC-3 has served that purpose admirably. It is realistic, yet current systems can achieve a reasonable level of performance; it is hard enough that there is substantial room for improvement. This task can provide a reasonable challenge for message understanding systems over the next several years.</Paragraph>
<Paragraph position="2">Finally, this comparison leads to an important conclusion about evaluation methodology. This paper represents a tentative first step towards defining some ways of measuring the dimensions of an application. But it is clear that we need to do much more work in this area in order to gain insight into which dimensions really affect success and which ones are less critical. We need to run experiments in which we can vary one set of parameters while holding others constant. In short, to gain the maximum benefit from evaluation efforts such as the MUC conferences, we need to make evaluation methodology itself a legitimate topic for research.</Paragraph>
</Section>
</Paper>
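
To make footnote 1 concrete, here is a minimal sketch of how the MATCHED-MISSING convention could be scored. It is a Python illustration under assumed conventions: templates are represented as dictionaries of slot fills, and the function name matched_missing_score and the sample templates are hypothetical, not the official MUC scoring program or its data format.

# Minimal sketch of the MATCHED-MISSING scoring convention described in
# footnote 1. The data layout (templates as dicts of slot fills) and all
# names here are illustrative assumptions, not the official MUC scorer.

def matched_missing_score(key_templates, response_templates, matches):
    """Return (recall, precision) under MATCHED-MISSING.

    key_templates      -- {key_id: {slot: fill}} for the answer key
    response_templates -- {resp_id: {slot: fill}} for the system output
    matches            -- [(key_id, resp_id)] pairs judged to describe
                          the same event

    A missing key template counts every slot against recall; a spurious
    response template is charged a single error (its template ID slot),
    not one error per filled slot.
    """
    matched_resp_ids = {r for _, r in matches}

    # Correct slot fills over all matched template pairs.
    correct = sum(
        1
        for key_id, resp_id in matches
        for slot, fill in key_templates[key_id].items()
        if response_templates[resp_id].get(slot) == fill
    )

    # Possible: every slot in the key, including the slots of templates the
    # system missed entirely -- this is where missing templates hurt.
    possible = sum(len(slots) for slots in key_templates.values())

    # Actual (generated): slots in matched response templates, plus ONE
    # slot (the template ID) for each spurious template.
    spurious = [r for r in response_templates if r not in matched_resp_ids]
    actual = (
        sum(len(response_templates[r]) for r in matched_resp_ids)
        + len(spurious)
    )

    recall = correct / possible if possible else 0.0
    precision = correct / actual if actual else 0.0
    return recall, precision


if __name__ == "__main__":
    key = {"K1": {"type": "BOMBING", "target": "EMBASSY"},
           "K2": {"type": "KIDNAPPING", "target": "MAYOR"}}   # K2 is missed
    resp = {"R1": {"type": "BOMBING", "target": "EMBASSY"},
            "R2": {"type": "ARSON"}}                          # R2 is spurious
    print(matched_missing_score(key, resp, [("K1", "R1")]))   # (0.5, 0.666...)

Under these conventions a missed key template counts all of its slots against recall, while a spurious template costs only one generated slot against precision, which reflects the asymmetry the footnote describes.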