File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/92/m92-1017_metho.xml
Size: 10,856 bytes
Last Modified: 2025-10-06 14:13:13
<?xml version="1.0" standalone="yes"?> <Paper uid="M92-1017"> <Title>THE PRC PAKTUS SYSTEM: MUC-4 TEST RESULTS AND ANALYSIS</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> KEY SYSTEM FEATURES </SectionTitle> <Paragraph position="0"> The PRC PAKTUS system used for MUC-4 is essentially the same linguistic system that we used for MUC-3, with the addition of a generic discourse analysis module. PAKTUS applies lexical, syntactic, semantic, and discourse analysis to all text in each document. The linguistic modules for this are nearly independent of the task domain (i.e., MUC-4), but some of the data -- lexical entries and a few grammar rules -- are tuned for the MUC-4 text corpus. Task-specific template filling and filtering operations are performed only after linguistic analysis is completed.</Paragraph> <Paragraph position="1"> The task-specific patterns that determine what to extract from the discourse structures were only minimally defined due to the limited time and effort available. The other task-specific additions to the system were the location set list and functions for better recognizing the time and location of events.</Paragraph> </Section> <Section position="4" start_page="0" end_page="132" type="metho"> <SectionTitle> RESULTS </SectionTitle> <Paragraph position="0"> Figure 1 summarizes PRC's scores for MUC-4. The scoring notation is explained in Appendix G. Overall, we were pleased with the performance improvement since MUC-3, which was obtained with only about 4 person-months of linguistic development effort, little of which was specific to the MUC-4 task. The most significant new development, compared to our MUC-3 system, is the addition of the discourse analysis module. This module is generic for expository discourse such as is found in news reports.
Application-specific extraction requirements are maintained separately from the discourse module, are applied only after it executes, and were minimally specified for MUC-4.</Paragraph> <Paragraph position="1"> Our system generally had much better precision than recall in these tests. We expected this because it uses complete linguistic analysis designed for text understanding, and because it has only a very limited amount of task-specific knowledge. For example, its discourse analysis module was trained on only 8 of the 1500 reports in the MUC-4 pre-test corpus. For these same reasons, we also expected a high degree of corpus independence, and this was supported by the similarity of scores on TST3 and TST4.</Paragraph> <Paragraph position="2"> The main limiting factors for PRC were time and availability of people for development. We directed most of our energies to generic linguistic development, and the linguistic aspects of the task have essentially been completed. Because we had little time remaining to devote to MUC-4-specific issues, however, much of the information that PAKTUS produced through syntactic, semantic, and discourse analysis did not find its way into the template fills.</Paragraph> </Section> <Section position="5" start_page="132" end_page="133" type="metho"> <SectionTitle> DEVELOPMENT EFFORT </SectionTitle> <Paragraph position="0"> Three PRC researchers participated in linguistic development that contributed to MUC-4 performance. Most of this development was generic, however, and will support applications other than MUC-4. Figure 2 shows an estimate of our level of effort broken down by linguistic task.</Paragraph> <Paragraph position="1"> Our total linguistic development effort was about four months, with almost 40% of that on discourse analysis.
Significant effort also went into time and location grammar functions, although this is small compared to the prior effort that went into the overall grammar.</Paragraph> <Paragraph position="2"> Lexicon entry was minimal, consisting primarily of semi-automatic entry of the MUC-4 location set list. Many words from the MUC-4 corpus have never been entered into the PAKTUS lexicon. Instead, heuristics based on word morphology make guesses about these unrecognized words.</Paragraph> <Paragraph position="3"> The specific changes and additions to the PAKTUS knowledge bases for MUC-4 are enumerated in Figure 3. Most of the lexical additions were from the MUC-4 location set list. These were added semi-automatically in batch mode. Other lexical additions were based on short One notable area that would have significantly improved performance was the definition of MUC-4-specific conceptual patterns. These are used to extract information from the discourse structures. Very little was done here, however, due to limited time and resources. Only 88 of these patterns were added. We had intended to define several hundred, but that would have required about another month of effort.</Paragraph> </Section> <Section position="6" start_page="133" end_page="386" type="metho"> <SectionTitle> SYSTEM TRAINING AND PERFORMANCE IMPROVEMENT </SectionTitle> <Paragraph position="0"> As already noted, the most significant system improvement was in discourse analysis. The new discourse module was trained on only 8 documents from the test2 set. These were documents 1, 3, 10, 11, 48, 63, 99, and 100. The time and location grammar and functional changes were based on manual analysis of the 100 test2 documents.
The entire pre-test corpus was scanned automatically to identify words missing from our lexicon, but only a few of these were entered -- those more common words that did not conform to our unrecognized word heuristics.</Paragraph> <Paragraph position="1"> The improvement in PAKTUS's linguistic performance from MUC-3 up to the day of testing for MUC-4 can be seen in Figure 4, derived from the test runs on the test2 corpus, using the F-measure specified for MUC-4. The development was carried out during April and May 1992.</Paragraph> <Paragraph position="2"> The basic functionality of the new discourse module was completed on May 6, and it dramatically improved performance. This module has two main functions: 1) it builds discourse topic structures, and 2) it unifies noun phrases that refer to the same entity. There is a rather intricate interaction between these two functions, and this had to be carefully developed over the next ten days (through May 17), so that improvement in one function did not impair the other.</Paragraph> <Paragraph position="3"> After completion of the two basic discourse functions, enhancements (pronoun reference, etc.) were added to the discourse module, through May 25. This allowed only three days for MUC-4 specific knowledge to be added that could take advantage of the new discourse module.</Paragraph> <Paragraph position="4"> It can be seen from Figure 4 that, once the discourse functions were properly integrated (on May 17), performance improvement averaged one point per day over the last eleven days before official MUC-4 testing. We believe that the system is far from the limit of its extraction capability based on its existing linguistic components.
This belief is supported by the ease with which we improved performance on the MUC-4 conference walkthrough document (test2, document 48) by adding a few MUC-4-specific conceptual patterns.</Paragraph> </Section> <Section position="7" start_page="386" end_page="386" type="metho"> <SectionTitle> REUSABILITY OF THE SYSTEM </SectionTitle> <Paragraph position="0"> Almost all of PAKTUS is generic and can be applied to other applications. All of its processes, including the new discourse analysis module, are generic. They operate on a set of object-oriented knowledge bases, some of which are generic (common English grammar and lexicon) and some of which are domain-specific (conceptual templates).</Paragraph> <Paragraph position="1"> The primary tasks in applying PAKTUS to a new domain or improving its performance in an existing domain are lexicon addition and conceptual template specification, both of which are relatively easy (compared to changing the grammar, for example).</Paragraph> <Paragraph position="2"> Two other tasks that must be done, but only once for each new domain, are specifying the input document formats and the output specifications. These are template-driven in PAKTUS.</Paragraph> <Paragraph position="3"> For MUC-4 we used the template supplied by NRaD, adding a function for each template slot to gather information from our generic discourse data structures.</Paragraph> </Section> <Section position="8" start_page="386" end_page="386" type="metho"> <SectionTitle> WHAT WE LEARNED About PAKTUS </SectionTitle> <Paragraph position="0"> We learned that the current implementation of PAKTUS, including the new discourse module, is robust and adaptable.
The more complex components (syntactic, semantic, and discourse analysis modules) are stable and competent enough to apply the system to different domains and produce useful results, by adding domain-specific knowledge (lexicon and conceptual patterns).</Paragraph> <Paragraph position="1"> We were particularly pleased to learn that it was not necessary to manually analyze much of the corpus in detail. This was done for only eight documents for MUC-4. The full development corpus was used only for lexicon development and testing the system for overall performance and logic errors.</Paragraph> <Paragraph position="2"> About the Task MUC-4 reinforced our appreciation of the importance of clearly defined output specifications, and the utility of having answer keys against which to measure the system's progress. We are already using the MUC-4 task specifications as a model for a new application of our system. We have also come to appreciate the utility of an automated scoring program to the development effort. This quickly eliminates much uncertainty about whether a new development is useful or not, and thereby speeds system development.</Paragraph> <Paragraph position="3"> About Evaluation It is difficult to define evaluation measures for a task of this nature. Although the MUC-4 measures seem better than those of MUC-3, they do not accurately convey the true performance in some situations. For example, the system might correctly fill in 75% of the information for a template, but not report it because it got the wrong date (events over three months old are not reported), or the wrong country. We would prefer to report all incidents, with an extra slot indicating whether they are considered relevant or not. This seems more appropriate for evaluating linguistic competence.
We also suspect that many analysts using such a system would like to be able to identify &quot;irrelevant&quot; incidents, especially since, given the current limits of linguistic technology, they may be relevant after all.</Paragraph> </Section> </Paper>