<?xml version="1.0" standalone="yes"?> <Paper uid="M91-1015"> <Title>SRI INTERNATIONAL'S TACITUS SYSTEM: MUC-3 TEST RESULTS AND ANALYSIS</Title> <Section position="3" start_page="0" end_page="106" type="metho"> <SectionTitle> LEVEL OF EFFORT </SectionTitle> <Paragraph position="0"> The only way of even approximating the amount of time spent on this effort is from figures on time charged to the project. All participants in the MUC-3 process will realize that this is not a very reliable way of estimating the level of effort.</Paragraph> <Paragraph position="1"> Since the preliminary MUC-3 workshop in February, approximately 800 person-hours were spent on the project.</Paragraph> <Paragraph position="2"> The only possible way to break that down into subtasks is by personnel. Preprocessor, system development, testing: 180 hours. The amount of the training corpus that was used varied with the component. For the relevance filter, all 1400 available messages were used. For the lexicon, every word in the first 600 and last 200 messages and in the TST1 corpus was entered. For the remaining messages, those words occurring more than once and all non-nouns were entered.</Paragraph> <Paragraph position="3"> For syntax and pragmatics, we were able to focus only on the first 100 messages in the development corpus.</Paragraph> <Paragraph position="4"> Tests were run almost entirely on the first 100 messages because those were the only ones for which a reliable key existed and because concentrating on those would give us a stable measure of progress. The system improved over time. On the February TST1 run, our recall was 14% and our precision was 68% on Matched and Missing Templates. At the end of March, on the first 100 messages in the development set, our recall was 22% and our precision was 63%.
At the time of the TST2 evaluation, on the first 100 messages in the development set, our recall was 37% and our precision was 64%.</Paragraph> <Paragraph position="5"> WHAT WAS AND WAS NOT SUCCESSFUL As described in the System Summary, we felt that the treatment of unknown words was for the most part adequate.</Paragraph> <Paragraph position="6"> The statistical relevance filter was extremely successful. The keyword antifilter, on the other hand, is apparently far too coarse and needs to be refined or eliminated.</Paragraph> <Paragraph position="7"> We felt syntactic analysis was a stunning success. At the beginning of this effort, we despaired of being able to handle sentences of the length and complexity of those in the MUC-3 corpus, and indeed many sites abandoned syntactic analysis altogether. Now, however, we feel that the syntactic analysis of material such as this is very nearly a solved problem. The coverage of our grammar, our scheduling parser, and our heuristic of using the best sequence of fragments for failed parses combined to enable us to get a very high proportion of the propositional content out of every sentence. The mistakes that we found in the first 20 messages of TST2 can, for the most part, be attributed to about five or six causes, which could be remedied with a few days' work.</Paragraph> <Paragraph position="8"> On the other hand, the results for terminal substring parsing, our method for dealing with sentences of more than 60 words, are inconclusive, and we believe this technique could be improved. In pragmatics, much work remains to be done. A large number of fairly simple axioms need to be written, as well as some more complex axioms. In the course of our preparation for MUC-2 and MUC-3, we have made sacrifices in robustness for the sake of efficiency, and we would like to re-examine the trade-offs.
We would like to push more of the problems of syntactic and lexical ambiguity into the pragmatics component, rather than relying on syntactic heuristics. We would also like to further constrain factoring, which now sometimes results in the incorrect identification of distinct events.</Paragraph> <Paragraph position="9"> In template-generation, we feel our basic framework is adequate, but a great many details must be added. The module we would most like to rewrite is in fact not now a module but should be. It consists of the various treatments of subcategorization, selectional constraints, generation of canonical predicate-argument relations, and the sort hierarchy in pragmatics. At the present time, due to various historical accidents and compromises, these are all effectively separate. The new module would give a unified treatment to this whole set of phenomena.</Paragraph> </Section> <Section position="4" start_page="106" end_page="106" type="metho"> <SectionTitle> USABILITY FOR OTHER APPLICATIONS </SectionTitle> <Paragraph position="0"> In the preprocessor, the spelling corrector and the morphological word assignment component are usable in other applications without change.</Paragraph> <Paragraph position="1"> The methods used in the relevance filter are usable in other applications, but, of course, the particular statistical model and set of keywords are not.</Paragraph> <Paragraph position="2"> In the syntactic analysis component, the grammar and parsing programs and the vast majority of the core lexicon are usable without change in another application. Only about five or six grammar rules are particular to this domain, encoding the structure of the heading, interview conventions, &quot;[words indistinct]&quot;, and so on. The logical form produced is application-independent.</Paragraph> <Paragraph position="3"> The theorem prover on which the pragmatics component is based is application-independent.
All of the enhancements we have made in our MUC-3 effort would have benefited our MUC-2 effort as well.</Paragraph> <Paragraph position="4"> In the knowledge base, only about 20 core axioms carried over from the opreps domain to the terrorist domain. Since most of the current set of axioms is geared toward MUC-3's particular task, there would very probably not be much more of a carry-over to a new domain.</Paragraph> <Paragraph position="5"> The extent to which the template-generation component would carry over to a new application depends on the extent to which the same baroque requirements are imposed on the output.</Paragraph> </Section> <Section position="5" start_page="106" end_page="106" type="metho"> <SectionTitle> WHAT WAS LEARNED ABOUT EVALUATION </SectionTitle> <Paragraph position="0"> On the one hand, the mapping from texts to templates is discontinuous in the extreme. One mishandled semicolon can cost 4% in recall in the overall score, for example. Therefore, the numerical results of this evaluation must be taken with a grain of salt. Things can be learned about the various systems only by a deeper analysis of their performance. On the other hand, the task is difficult enough to provide a real challenge, so that pushing recall and precision both into the 70s or 80s will require the system to do virtually everything right.</Paragraph> <Paragraph position="1"> Leading up to MUC-3 there were a great many difficulties to be worked out, diverting the attention of researchers from research to the mechanics of evaluation. It is to be hoped that most of these problems have been settled and that for MUC-4 they will constitute less of a
drain on researchers' time.</Paragraph> <Paragraph position="2"> We feel the task of the MUC-3 evaluation is both feasible and challenging in the relatively short term.</Paragraph> <Paragraph position="3"> How practical it is is for others to judge.</Paragraph> </Section> <Section position="6" start_page="106" end_page="107" type="metho"> <SectionTitle> ACKNOWLEDGEMENTS </SectionTitle> <Paragraph position="0"> This research has been funded by the Defense Advanced Research Projects Agency under Office of Naval Research contracts N00014-85-C-0013 and N00014-90-C-0220.</Paragraph> </Section> </Paper>