<?xml version="1.0" standalone="yes"?> <Paper uid="M92-1002"> <Title>MUC-4 EVALUATION METRICS</Title> <Section position="3" start_page="0" end_page="22" type="metho"> <SectionTitle> SCORE REPORT </SectionTitle> <Paragraph position="0"> The MUC-4 Scoring System produces score reports in various formats. These reports show the scores for the templates and messages in the test set. Varying amounts of detail can be reported. The scores of most interest are those that appear in the comprehensive summary report. Figure 1 shows a sample summary score report.</Paragraph> <Paragraph position="1"> The rows and columns of this report are explained below.</Paragraph> <Section position="1" start_page="0" end_page="22" type="sub_section"> <SectionTitle> Scoring Categories </SectionTitle> <Paragraph position="0"> The basic scoring categories are located at the top of the score report. These categories are defined in Table 1. The scoring program determines the scoring category for each system response. Depending on the type of slot being scored, the program can either determine the category automatically or prompt the user to determine the amount of credit the response should be assigned.</Paragraph> <Paragraph position="1"> * If the response and the key are deemed to be equivalent, then the fill is assigned the category of correct (COR).</Paragraph> <Paragraph position="2"> * If partial credit can be given, the category is partial (PAR).</Paragraph> <Paragraph position="3"> * If the key and response simply do not match, the response is assigned the category of incorrect (INC). * If the key has a fill and the response has no corresponding fill, the response is missing (MIS).</Paragraph> <Paragraph position="4"> * If the response has a fill which has no corresponding fill in the key, the response is spurious (SPU).</Paragraph> <Paragraph position="5"> * If the key and response are both left intentionally blank, then the response is scored as noncommittal (NON).</Paragraph> </Section> </Section> <Section position="4" start_page="22" end_page="43" type="metho"> <SectionTitle> SLOT </SectionTitle> <Paragraph position="0"> [Figure 1: Sample summary score report. The rows list the individual template slots (template-id, inc-stage, inc-instr-type, perp-ind-id, perp-org-conf, phys-tgt-type, phys-tgt-nation, phys-tgt-total-num, hum-tgt-desc, hum-tgt-num, hum-tgt-effect, and so on) together with their group and summary totals; the columns give the scoring tallies and metrics (POS, ACT, COR, PAR, ...). The numeric entries are not recoverable from the extracted text.]</Paragraph>
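The category decisions listed above can be summarized in a small sketch. This is illustrative only; the two comparison predicates are hypothetical stand-ins for the scorer's automatic and interactive equivalence judgments, which this section does not specify in detail.

```python
def score_response(key_fill, response_fill, fills_equivalent, fills_partially_match):
    """Assign a MUC-4 scoring category to a single slot response.

    key_fill / response_fill are None when the corresponding slot is
    left blank.  The two predicates are hypothetical stand-ins for the
    scorer's automatic (or interactive) equivalence judgments.
    """
    if key_fill is None and response_fill is None:
        return "NON"   # noncommittal: both intentionally blank
    if key_fill is not None and response_fill is None:
        return "MIS"   # missing: key has a fill, response does not
    if key_fill is None:
        return "SPU"   # spurious: response has a fill, key does not
    if fills_equivalent(key_fill, response_fill):
        return "COR"   # correct: key and response deemed equivalent
    if fills_partially_match(key_fill, response_fill):
        return "PAR"   # partial credit
    return "INC"       # incorrect: the fills simply do not match
```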
<Paragraph position="1"> [Table 1: Scoring Categories. Only fragments of the table survive extraction: "key and response are both blank", "key is blank and response is not", "response is blank and key is not".]</Paragraph> <Paragraph position="2"> In Figure 1, the two columns titled ICR (interactive correct) and IPA (interactive partial) indicate the results of interactive scoring. Interactive scoring occurs when the scoring system finds a mismatch that it cannot automatically resolve. It queries the user for the amount of credit to be assigned. The number of fills that the user assigns to the category of correct appears in the ICR column; the number of fills assigned partial credit by the user appears in the IPA column.</Paragraph> <Paragraph position="3"> In Figure 1, the two columns labelled possible (POS) and actual (ACT) contain the tallies of the numbers of slots filled in the key and response, respectively. Possible and actual are used in the computation of the evaluation metrics. Possible is the sum of the correct, partial, incorrect, and missing. Actual is the sum of the correct, partial, incorrect, and spurious.</Paragraph> <Section position="1" start_page="43" end_page="43" type="sub_section"> <SectionTitle> Evaluation Metrics </SectionTitle> <Paragraph position="0"> The evaluation metrics were adapted from the field of Information Retrieval (IR) and extended for MUC.</Paragraph> <Paragraph position="1"> They measure four different aspects of performance and an overall combined view of performance. The four evaluation metrics of recall, precision, overgeneration, and fallout are calculated for the slots and summary score rows (see ...). Recall (REC) is the percentage of possible answers which were correct. Precision (PRE) is the percentage of actual answers given which were correct. A system has a high recall score if it does well relative to the number of slot fills in the key. A system has a high precision score if it does well relative to the number of slot fills it attempted. In IR, a common way of representing the characteristic performance of systems is a precision-recall graph. Normally, as recall goes up, precision tends to go down, and vice versa [1]. One approach to improving recall is to increase the system's generation of slot fills. To avoid overpopulation of the template database by the message understanding systems, we introduced the measure of overgeneration. Overgeneration (OVG) measures the percentage of the actual attempted fills that were spurious.</Paragraph> <Paragraph position="2"> Fallout (FAL) is a measure of the false positive rate for slots whose fills come from a finite set. Fallout is the tendency for a system to choose incorrect responses as the number of possible responses increases. Fallout is calculated for all of the set fill slots listed in the score report in Figure 1 and is shown in the last column on the right. Fallout can be calculated for the SET FILLS ONLY row because that row contains the summary tallies for the set fill slots. The TEXT FILTERING row discussed later contains a score for fallout because the text filtering problem also has a finite set of possible responses.</Paragraph> <Paragraph position="3"> These four measures of recall, precision, overgeneration, and fallout characterize different aspects of system performance.</Paragraph>
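From these definitions, the tallies in a score-report row determine the metrics directly. The sketch below is illustrative; it assumes the usual MUC convention that a partial fill earns half credit in recall and precision (the exact credit assignment is not restated in this section), and it omits fallout, which additionally depends on the size of the finite fill set.

```python
def summary_metrics(cor, par, inc, mis, spu):
    """Recall, precision, and overgeneration for one score-report row.

    POS and ACT follow the definitions in the text; a partial fill is
    assumed to earn half credit (the usual MUC convention).
    """
    pos = cor + par + inc + mis            # possible: fills in the key
    act = cor + par + inc + spu            # actual: fills attempted
    credit = cor + 0.5 * par               # correct plus half-credit partials
    recall = 100.0 * credit / pos if pos else 0.0
    precision = 100.0 * credit / act if act else 0.0
    overgeneration = 100.0 * spu / act if act else 0.0
    return recall, precision, overgeneration

# Illustrative tallies only (not taken from Figure 1).
print(summary_metrics(cor=80, par=10, inc=15, mis=20, spu=25))
```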
<Paragraph position="3"> The measures of recall and precision have been the central focus for analysis of the results. Overgeneration is a measure which should be kept under a certain value. Fallout was rarely used in the analyses of the results. It is difficult to rank the systems since the measures of recall and precision are often equally important yet negatively correlated. In IR, a method was developed for combining the measures of recall and precision into a single measure. In MUC-4, we use van Rijsbergen's F-measure [1, 2] for this purpose.</Paragraph> <Paragraph position="4"> The F-measure provides a way of combining recall and precision to get a single measure which falls between recall and precision. Recall and precision can have relative weights in the calculation of the F-measure, giving it the flexibility to be used for different applications. The formula for calculating the F-measure is</Paragraph> <Paragraph position="5"> F = ((β² + 1) · P · R) / (β² · P + R)</Paragraph> <Paragraph position="6"> where P is precision, R is recall, and β is the relative importance given to recall over precision. If recall and precision are of equal weight, β = 1.0. For recall half as important as precision, β = 0.5. For recall twice as important as precision, β = 2.0.</Paragraph> <Paragraph position="7"> For a given sum of recall and precision, the F-measure is higher when the two values lie towards the center of the precision-recall graph than when they lie at the extremes. So, for β = 1.0, a system which has recall of 50% and precision of 50% has a higher F-measure than a system which has recall of 20% and precision of 80%. This behavior is exactly what we want from a single measure.</Paragraph> <Paragraph position="8"> The F-measures are reported in the bottom row of the summary score report in Figure 1. The F-measure with recall and precision weighted equally is listed as &quot;P&R.&quot; The F-measure with precision twice as important as recall is listed as &quot;2P&R.&quot; The F-measure with precision half as important as recall is listed as &quot;P&2R.&quot; The F-measure is calculated from the recall and precision values in the ALL TEMPLATES row. Note that the recall and precision values in the ALL TEMPLATES row are rounded integers and that this causes a slight inaccuracy in the F-measures. The values used for calculating the statistical significance of results are floating point values all the way through the calculations. Those more accurate values appear in the paper &quot;The Statistical Significance of the MUC-4 Results&quot; and in Appendix G of these proceedings.</Paragraph> </Section>
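A small worked example (illustrative only) reproduces the behavior described for the F-measure, including the 50/50 versus 20/80 comparison and the three weightings reported in the summary score report.

```python
def f_measure(recall, precision, beta=1.0):
    """van Rijsbergen's F-measure; beta is the relative importance
    given to recall over precision (beta = 1.0 weights them equally)."""
    if recall == 0.0 and precision == 0.0:
        return 0.0
    return ((beta ** 2 + 1) * precision * recall) / (beta ** 2 * precision + recall)

# The comparison from the text: at beta = 1.0 a 50/50 system scores
# higher than a 20/80 system, although the sums are equal.
print(f_measure(50.0, 50.0))    # 50.0
print(f_measure(20.0, 80.0))    # 32.0

# The three weightings reported in the summary score report:
# "P&R" (beta = 1.0), "2P&R" (precision twice as important, beta = 0.5),
# and "P&2R" (recall twice as important, beta = 2.0).
for label, beta in [("P&R", 1.0), ("2P&R", 0.5), ("P&2R", 2.0)]:
    print(label, round(f_measure(45.0, 60.0, beta=beta), 1))  # illustrative R/P values
```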
<Section position="2" start_page="43" end_page="43" type="sub_section"> <SectionTitle> Summary Rows </SectionTitle> <Paragraph position="0"> The four rows labeled &quot;inc-total,&quot; &quot;perp-total,&quot; &quot;phys-tgt-total,&quot; and &quot;hum-tgt-total&quot; in the summary score report in Figure 1 show the subtotals for associated groups of slots referred to as &quot;objects.&quot; These are object-level scores for the incident, perpetrator, physical target, and human target. They are the sums of the scores shown for the individual slots associated with the object, as designated by the first part of the individual slot labels. The template for MUC-4 was designed as a transition from a flat template to an object-oriented template. Although referred to as object-oriented, the template is not strictly object-oriented, but rather serves as a data representation upon which an object-oriented system could be built [3]. However, no object-oriented database system was developed using this template as a basis.</Paragraph> <Paragraph position="1"> The four summary rows in the score report labelled &quot;MATCHED/MISSING,&quot; &quot;MATCHED/SPURIOUS,&quot; &quot;MATCHED ONLY,&quot; and &quot;ALL TEMPLATES&quot; show the accumulated tallies obtained by scoring spurious and missing templates in different manners. Each message can cause multiple templates to be generated, depending on the number of terrorist incidents it reports. The keys and responses may not agree in the number of templates generated, or the content-based mapping restrictions may not allow generated key and response templates to be mapped to each other. These cases lead to spurious and/or missing templates. There are differing views as to how much systems should be penalized for spurious or missing templates, depending upon the requirements of the application. These differing views have led us to provide the four ways of scoring spurious and missing information outlined in Table 3. The MATCHED ONLY manner of scoring penalizes the least for missing and spurious templates by scoring them only in the template-id slot. This template-id score does not impact the overall score because the template-id slot is not included in the summary tallies; the tallies include only the other individual slots. The MATCHED/MISSING method scores the individual slot fills that should have been in the missing template as missing and scores the template as missing in the template-id slot; it does not penalize for slot fills in spurious templates except to score the spurious template in the template-id slot. MATCHED/SPURIOUS, on the other hand, penalizes for the individual slot fills in the spurious templates, but does not penalize for the missing slot fills in the missing templates. ALL TEMPLATES is the strictest manner of scoring because it penalizes for both the slot fills missing in the missing templates and the slots filled in the spurious templates. The metrics calculated from the scores in the ALL TEMPLATES row are the official MUC-4 scores.</Paragraph> <Paragraph position="2"> These four manners of scoring provide four points defining a rectangle on a precision-recall graph which we refer to as the &quot;region of performance&quot; for a system (see Figure 2). At one time, we thought that it would be useful to compare the position of the center of this rectangle across systems, but later realized that two systems could have the same centers but very different size rectangles. Plotting the entire region of performance for each system does provide a useful comparison of systems.</Paragraph> <Paragraph position="3"> In Figure 1, the score report contains two summary rows (SET FILLS ONLY and STRING FILLS ONLY) which give tallies for a subset of the slots based on the type of fill the slot can take. These rows show the system's performance on the two types of slots: set fill slots and string fill slots. Set fill slots take a fill from a finite set specified in a configuration file. String fill slots take a fill that is a string from a potentially infinite set.</Paragraph> <SectionTitle> Text Filtering </SectionTitle> <Paragraph position="4"> The purpose of the TEXT FILTERING row is to report how well systems distinguish relevant and irrelevant messages. The scoring program keeps track of how many times each of the situations in the contingency table arises for a system (see Table 4). It then uses those values to calculate the entries in the TEXT FILTERING row. The evaluation metrics are calculated for the row as indicated by the formulas at the bottom of Table 4. An analysis of the text filtering results appears elsewhere in these proceedings.</Paragraph>
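For the TEXT FILTERING row, the contingency-table counts can be converted into scores as sketched below. The formulas shown are the standard IR formulations of recall, precision, and fallout for a binary relevance decision; the authoritative formulas are those at the bottom of Table 4, which is not reproduced in this extracted text.

```python
def text_filtering_scores(tp, fp, fn, tn):
    """Text-filtering scores from the 2x2 contingency table.

    tp: relevant messages for which the system generated templates
    fp: irrelevant messages for which it generated templates
    fn: relevant messages for which it generated none
    tn: irrelevant messages for which it correctly generated none
    """
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    fallout = 100.0 * fp / (fp + tn) if (fp + tn) else 0.0  # false positive rate
    return recall, precision, fallout
```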
<SectionTitle> IMPROVEMENTS OVER MUC-3 </SectionTitle> <Paragraph position="5"> The major improvements in the scoring of MUC-4 included: * automating the scoring as effectively as possible; * restricting the mapping of templates to cases where particular slots matched in content, as opposed to mapping only according to an optimized score; * the exclusion of template-id scores from the summary score tallies; * the inclusion of more summary information, including object-level scores, string-fills-only scores, text filtering scores, and F-measures.</Paragraph> <Paragraph position="6"> These changes are interdependent; they interact in ways that affect the overall scores of systems and serve to make MUC-4 a more demanding evaluation than MUC-3.</Paragraph> <Paragraph position="7"> The complete automation of the scoring of set fill slots was possible because of the information in a slot configuration file which told the program the hierarchical structure of the set fills. If a response exactly matches the key, it is scored as correct. If a response is a more general set fill element than the key according to the pre-specified hierarchy, it is scored as partially correct. If the response cannot be scored as correct or partially correct by these criteria, then the set fill is scored as incorrect. All set fills can thus be automatically scored. Often, however, the set fill is cross-referenced to another slot which is a string fill. The scoring of string fills cannot be totally automated. Instead, the scoring program refers to the history of the interactive scoring of the cross-referenced slot, and with that information it then determines the score for the set fill slot which cross-references the string fill slot.</Paragraph> <Paragraph position="8"> The scoring of the string fill slots was partially automated by using two methods. In the first method, used for mapping purposes, strings were considered correct if there was a one-word overlap and the word was not from a short list of premodifiers. In the second method, used for scoring purposes, some mismatching string fills could be matched automatically by stripping these premodifiers from the key and response and seeing whether the remaining material matched. Other mismatching string fills caused the user to be queried for the score. The automation of the set fill and string fill scoring was critical to the functioning of the content-based mapping.</Paragraph> <Paragraph position="9"> The content-based mapping restrictions were added to MUC-4 to prevent the fortuitous mappings which occurred in MUC-3. In MUC-3, templates were mapped to each other based on a simple optimization of scores. Sometimes the optimal score was the result of a lucky mapping which was not really the most appropriate mapping.</Paragraph> <Paragraph position="10"> Certain slots such as incident type were considered essential for the mapping to occur in MUC-4. The mapping restrictions can be specified in the scorer's configuration file using a primitive logic. For the MUC-4 testing, the templates must have at least a partial match on the incident type and at least one of the following slots: [...]</Paragraph>
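The hierarchy-based automation of set fill scoring described above can be sketched as follows. The hierarchy fragment is hypothetical, standing in for the slot configuration file; it is not the actual MUC-4 set fill hierarchy.

```python
# Hypothetical fragment of a set-fill hierarchy: each element maps to
# its more general parent, as the slot configuration file would specify.
HIERARCHY = {
    "CAR BOMB": "BOMB",   # a car bomb is a more specific kind of bomb
    "BOMB": "WEAPON",
    "GUN": "WEAPON",
}

def ancestors(fill, hierarchy):
    """Yield the chain of more general elements above a set fill."""
    parent = hierarchy.get(fill)
    while parent is not None:
        yield parent
        parent = hierarchy.get(parent)

def score_set_fill(key_fill, response_fill, hierarchy=HIERARCHY):
    """Correct on an exact match, partial when the response is a more
    general element than the key in the pre-specified hierarchy,
    otherwise incorrect."""
    if response_fill == key_fill:
        return "COR"
    if response_fill in ancestors(key_fill, hierarchy):
        return "PAR"
    return "INC"

print(score_set_fill("CAR BOMB", "CAR BOMB"))  # COR
print(score_set_fill("CAR BOMB", "WEAPON"))    # PAR (more general)
print(score_set_fill("CAR BOMB", "GUN"))       # INC
```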
<Paragraph position="11"> The content-based mapping restrictions could result in systems with sparse templates having few or no templates mapped. When a template does not map, the result is one missing and one spurious template. This kind of penalty is severe when the ALL TEMPLATES row is the official score, because the slots in the unmapped templates all count against the system as either missing or spurious. This aspect of the scoring was one of the main reasons that MUC-4 was more demanding than MUC-3.</Paragraph> <Paragraph position="12"> The focus on the ALL TEMPLATES score, as opposed to the MATCHED/MISSING score used in MUC-3, meant that the strictest scores for a system were its official scores. So even if a system's official scores were the same for MUC-3 and MUC-4, the system had improved in MUC-4. Additionally, the scores for the template-id row were not included in the summary row tallies in MUC-4 as they had been in MUC-3. Previously, systems were getting extra credit for the optimal mapping. This bonus was taken away by the exclusion of the template-id scores from the score tallies in MUC-4.</Paragraph> <Paragraph position="13"> In addition to the more demanding scoring, MUC-4 also measured more information about system performance. Object-level scores were added to see how well the systems did on the different groupings of slots concerning the incident, perpetrator, physical target, and human target. Also, the score for the string fill slots was tallied as a comparison with the score for set fill slots that was already there in MUC-3. The text filtering scores gave additional information on the capabilities of systems to determine relevancy. The F-measures combined recall and precision to give a single measure of performance for the systems.</Paragraph> </Section> </Section> </Paper>