<?xml version="1.0" standalone="yes"?>
<Paper uid="M91-1002">
  <Title>MUC-3 EVALUATION METRICS</Title>
  <Section position="3" start_page="0" end_page="145" type="metho">
    <SectionTitle>
SCORE REPORT
</SectionTitle>
    <Paragraph position="0"> A semi-automated scoring system was developed for MUC-3 .</Paragraph>
    <Paragraph position="1"> The scoring system displayed the answer key templates, the response templates, and the message s using a flexibly customized emacs interface . During scoring, the user was asked to enter the score for displayed mismatches between the key and the respons e templates. Fills could generally be scored as matches, partial matches, or mismatches . Depending on the type of slot fill, the scoring system may or may not have allowe d full credit to be given .</Paragraph>
    <Paragraph position="2"> The interactive scoring was carried out following well defined scoring guidelines .</Paragraph>
    <Paragraph position="3"> Depending on the scoring guidelines, full, partial, or n o credit may have been allowed for each mismatch .</Paragraph>
    <Paragraph position="4"> After the interactive scoring wa s complete, the scoring system produced an official score report containing template by template score reports and a summary score report for the official record . A sample summary score report produced for human comparison against the ke y appears in Figure 1 . The following sections discuss the contents of the score report .</Paragraph>
    <Section position="1" start_page="17" end_page="145" type="sub_section">
      <SectionTitle>
Scoring Categories
</SectionTitle>
      <Paragraph position="0"> Individual slot fills in the response were scored as correct, partially correct , incorrect, noncommittal, spurious, or missing. A response was correct if it was th e same as the key, partially correct if it partially approximated the key, and incorrec t if it was not the same as the key . If the key and response were both blank, th e response was scored as noncommittal . If the key was blank but the slot was filled, th e response was scored as spurious . If the response was blank and the key was not, th e response was scored as missing. Figure 2 summarizes the scoring categories relating them to the corresponding columns in the score report .</Paragraph>
      <Paragraph position="1">  The summary score report rows show the totals for each of the categories ove r all templates. The slots are listed on the left hand side and the totals for each slot ove r all templates are given in the labeled columns .</Paragraph>
      <Paragraph position="2"> For example, the total number o f physical targets correctly identified was 54 . The number appears in the phys-targetids row and the COR column of the summary score report . Note that the bottom fou r rows of the score report are not slot scores but rather global summary rows describe d in a later section .</Paragraph>
      <Paragraph position="3"> During scoring, the scoring system automatically scored matches as correc t and some partially matching hierarchically organized items as partially correct . However, many of the mismatches were interactively scored by the user . To reflect the number of items interactively scored as correct or partially correct, two column s labeled ICR and IPA were provided .</Paragraph>
      <Paragraph position="4"> The first two columns in the score report contain the number of possible slo t fills (POS) and the actual number of slot fills (ACT) . The number of possible slot fill s is the number of slots fills in the key plus the number of optional slot fills in the ke y that were matched in the response . The number of possible slot fills for each system differs depending on the optional fills given by the system . The number of actua l fills given is the number of slot fillers in the response . The numbers in the possible and actual columns are used to calculate the metrics .</Paragraph>
      <Paragraph position="5"> Calculation of Metric s The metrics were calculated for each slot and for the summary rows . The calculations were based on information in the columns of the score report as well a s on some tallies kept internally by the scoring system . The first three metrics show n in the score report are recall, precision, and overgeneration .</Paragraph>
      <Paragraph position="6"> These were calculate d for each slot and were based on information contained in the score report .</Paragraph>
      <Paragraph position="7"> Recall is a measure of completeness and was calculated as follows .</Paragraph>
      <Paragraph position="8"> correct + (partial x 0.5) possible For example, recall for the human-target-ids slot was calculated as follows .</Paragraph>
      <Paragraph position="10"> = 0 .1 9 Fallout is a measure of the false alarm rate . The number of false alarms could only be measured for slots for which we knew the number of possible incorrect responses .</Paragraph>
      <Paragraph position="11"> A subset of the slots in the template fill task were filled from finite sets . The rest of the slots are filled from possibly infinite sets . Fallout measures were calculated for the finite set fill slots as follows .</Paragraph>
      <Paragraph position="12"> fallout = Incorrect + spuriouspossible incorrec t where &amp;quot;possible incorrect&amp;quot; is the number of possible incorrect answers which coul d be given in the response . The number of possible incorrect is not shown in the scor e report but a tally is kept internally by the scoring system . The method for keepin g this tally of possible incorrect has evolved during the course of the evaluation . actua l  In order to describe this evolution, a simple calculation of fallout for a singl e slot in a single template will be given .</Paragraph>
      <Paragraph position="13"> The instrument type slot has 16 possibl e fillers.</Paragraph>
      <Paragraph position="14"> If the key contains the filler GUN and the response contains the fille r GRENADE, then fallout would b e</Paragraph>
      <Paragraph position="16"> The number of possible incorrect is the cardinality of the set of possibl e answers minus the number correct in the key which is 16 - 1, or 15 .</Paragraph>
      <Paragraph position="17"> In phase one, the fallout measure assumed that the system was essentiall y choosing a subset of the finite set of possible fills when it gave a response . For example, if the key for the instrument type slot contained GUN and GRENADE and the response contained BOMB, GRENADE, and CUTTING DEVICE, the phase one fallout woul d  The number of possible incorrect was the cardinality of the set minus the total number of slot fills given in the key .</Paragraph>
      <Paragraph position="18"> During phase two, it was noticed that this simple approach to fallout was in fact erroneous for several reasons. Some finite set slots allowed multiple uses of set members due to cross-referencing requirements . For example, the slot fill CIVILIA N might be used multiple times in specifying the human target type for differen t human targets .</Paragraph>
      <Paragraph position="19"> CIVILIAN: &amp;quot;MARIO FLORES &amp;quot; CIVILIAN : &amp;quot;JOSE RODRIGUEZ &amp;quot; Further complications arose when alternatives were given in the key for eac h such slot fill. In order to solve all of these problems, the calculation of the possible correct for the slot fills was revised to coincide more closely with the calculation use d in information retrieval .</Paragraph>
      <Paragraph position="20"> Each separate slot fill item is now thought of as bein g chosen from the entire finite set of possible fill items .</Paragraph>
      <Paragraph position="21">  In general, the number of possible incorrect is given by the followin g formula.</Paragraph>
      <Paragraph position="23"> where keyval stands for each of the key values including blanks, IUI is th e cardinality of the finite set U of possible slot fillers, and Ikeyvall is the number of key values corresponding to the response . If there are alternative key values for a response, then Ikeyvall &gt; 1 . If the key is blank, then there are no corresponding key values and the contribution to the number of possible incorrect is the cardinality o f the finite set .</Paragraph>
      <Paragraph position="24"> Returning to our example of instrument types with the key containing GU N and GRENADE and the response containing BOMB, GRENADE, and CUTTING DEVICE , fallout will be recalculated using the new method of determining the possibl e incorrect . The number of possible incorrect is calculated by summing over the slot fills . For GUN, the number of possible incorrect is the cardinality of the set, which i s 16, minus the number of slot fill alternatives given in the key, which in this case i s 1 . For GRENADE, the number of possible incorrect is also 15 . So the number of possible incorrect for this slot is 15 + 15, or 30 . Since there is 1 incorrect and 1 spurious response, fallout is 2/30, or 7% . In phase one, fallout was 14% for this same example.</Paragraph>
      <Paragraph position="25"> If there are alternatives to a single slot fill in the key, the contribution to the number of possible incorrect by that slot fill is the cardinality of the finite set minu s the number of alternatives given . For example, if the key is GUN/GRENADE, the number of possible incorrect is 16 - 2, or 14 .</Paragraph>
      <Paragraph position="26"> If the key is blank, the number of possible incorrect is the cardinality of the finite set . For example, if the instrument type slot is blank in the key and th e response is GUN and GRENADE, then the fallout i s  = 0 .1 3 Notice that if the number of spurious responses is great enough, fallout can be more than 100% .</Paragraph>
      <Paragraph position="27"> Meaning of Metric s Recall is a measure of completeness in the sense that it measures the amount o f relevant data extracted relative to the total available . It is the true positive rate . A mnemonic for recall can be constructed by imagining that you have been asked t o read the entire answer key, then fill in templates with all that you hav e  &amp;quot;remembered&amp;quot; or &amp;quot;recalled .&amp;quot; Your score would be the total correctly recalled out o f the total possible.</Paragraph>
      <Paragraph position="28"> Precision is the accuracy with which a system extracts data. It is the amount of relevant data relative to the total put in by the system . A mnemonic for precision is to imagine that each time a system fills a slot it is throwing a dart at a dartboard . All of the bull's-eyes are correct . Precision is a measure of the number of bull's-eyes relative to the number of darts thrown . Precision can also be described as th e tendency of a system to avoid assigning bad fillers as it assigns more good fillers . Fallout is a measure of the false positive rate . It is the tendency of the system to assign incorrect fillers as the number of potential incorrect fillers increases . So, for a mnemonic, if you are imagining the dartboard again, fallout measures th e number of darts that &amp;quot;fall outside&amp;quot; of the bull's-eye relative to the size of the are a outside the bull's-eye .</Paragraph>
      <Paragraph position="29"> Fallout can only be assigned for slots with a calculabl e number of possible incorrect .</Paragraph>
      <Paragraph position="30"> Only some of the slots have a finite set of slot fill s associated with them .</Paragraph>
      <Paragraph position="31"> The others have fills that come from potentially infinite set s and hence cannot be assigned a fallout score .</Paragraph>
      <Paragraph position="32"> Overgeneration is a measure of spurious generation .</Paragraph>
      <Paragraph position="33"> It is the amount of spurious fillers assigned in relation to the total assigned . Overgeneration wa s calculated to deter overgeneration as an approach to higher scores . A mnemonic -for overgeneration can be constructed by imagining that required fills and extra fill s are in a box. Overgeneration is represented by the area that the extra fills take up i n relation to the total area .</Paragraph>
      <Paragraph position="34"> Summary Score s The last four rows of the score report in Figure 1 are summary score rows . In phase one, there was one summary score row that represented the totals of th e columns for the scoring categories including possible and actual . The metrics were then calculated based on those totals and appeared in the appropriate columns in th e lower righthand portion of the chart . The summary metrics are always calculate d from the items in the summary totals and are never the result of averaging th e metrics for the slots .</Paragraph>
      <Paragraph position="35"> In phase two, it was decided that the scoring system should keep the interna l tallies needed to supply several summary score rows, only one of which would be the total of the slot scores shown in the columns of the score report . The scoring of slot s in the missing and spurious templates was the issue which gave rise to multipl e summary rows.</Paragraph>
      <Paragraph position="36"> In phase one, spurious templates were scored as spurious in th e template id slot only. The spurious slot fillers aside from the template id slot fille r were not scored as spurious . Missing templates, however, were scored in the templat e id slot and in the individual missing slots . This method of scoring did not penalize a s much for overpopulating the database as it did for underpopulating it .</Paragraph>
      <Paragraph position="37"> In phase two, we wanted to find out how the systems scored if overpopulatin g and underpopulating the database were treated equally . Two summary rows were added, one of which scored spurious and missing in the template id only and th e other of which scored spurious and missing templates for all of the spurious and missing slot fills. The official scores were still taken from the same summary row a s in phase one, but the other two rows were there for analysis .</Paragraph>
      <Paragraph position="38">  The global summary rows are listed on the score report in order of strictnes s based on the scoring of missing and spurious templates . The MATCHED ONLY row has missing and spurious templates only scored in the template id slot . This row contains the least strict of the scores for the system . The MATCHED/MISSING row contains the official test results . The missing template slots are scored as missing . The spuriou s templates are scored only in the template id slot . The totals in this row are the total s of the tallies in the columns as shown. The ALL TEMPLATES row has missing templat e slots scored as missing and spurious template slots scored as spurious . This row contains the strictest scores for the system .</Paragraph>
      <Paragraph position="39"> A fourth summary row was added to allow analysis of system performance o n only the set fill slots . The SET FILLS ONLY row contains totals for slots with finite se t fills only . A global fallout score is calculated for these slots and given in the fallou t column of this row .</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="145" end_page="145" type="metho">
    <SectionTitle>
CONCLUSIONS AND FURTHER RESEARCH
</SectionTitle>
    <Paragraph position="0"> The evaluation metrics for MUC-3 had utility for system development and fo r the reporting and analysis of the results of the evaluation. The metrics were adapte d from simpler task models and were still evolving when the evaluation wa s performed . There has been consistent agreement on the necessity of basi c measurements of completeness, accuracy, false alarm rate, and overgeneration .</Paragraph>
    <Paragraph position="1"> These measurements were accomplished through the metrics of recall, precision , fallout, and overgeneration as defined for MUC-3 .</Paragraph>
    <Paragraph position="2"> The global summary score s provide several different views of system performance . However, further analysis o f the current results is possible based on the information in the official score reports . The template by template scores are officially reported and can be used as a basis for further analysis .</Paragraph>
    <Paragraph position="3"> For example, performance at the message level can be calculate d from the template by template scores for the systems .</Paragraph>
    <Paragraph position="4"> While the metrics of recall, precision, fallout, and overgeneration have bee n defined for MUC-3, further research into the metrics and their implementation need s to be done. Additional measurements may be required . More refined definitions of the current measurements are probably needed . The complexities of optional fills , alternatives in the key, partial credit, and distribution of partial credit over ke y values, to name a few, still need to be examined more closely with consideration given to their effects on the metrics.</Paragraph>
    <Paragraph position="5"> These complexities have made it difficult t o fully test the scoring system software and require more attention to be paid t o detecting and isolating subtle errors .</Paragraph>
    <Paragraph position="6"> A different treatment of the slots will need to be attempted. For example, the template id slot is unique among the slots and will b e kept separate when the summary measures are calculated in the future.</Paragraph>
    <Paragraph position="7"> A single overall measure of performance may be possible in the future once the roles o f recall and precision are more fully determined. All of these avenues of furthe r research have been opened up by the definition of a set of metrics for MUC-3 and th e development of a scoring system embodying those metrics .</Paragraph>
  </Section>
  <Section position="5" start_page="145" end_page="145" type="metho">
    <SectionTitle>
ACKNOWLEDGEMENTS
</SectionTitle>
    <Paragraph position="0"> Many of the scoring issues were debated and resolved in consultation wit h members of the Program Committee . David Lewis provided technical guidance . Pete Halverson developed the scoring system . The participants provided feedback. Beth Sundheim was a sounding board for many of the issues that arose during th e evolution of the MUC-3 scoring .</Paragraph>
  </Section>
class="xml-element"></Paper>