<?xml version="1.0" standalone="yes"?>
<Paper uid="M92-1044">
  <Title>APPENDIX C: GUIDELINES FOR SCORING MISMATCHES</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
APPENDIX C:
GUIDELINES FOR SCORING MISMATCHES
BETWEEN SYSTEM RESPONSES AND ANSWER KEY
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> This document, although fairly extensive, is not intended to give you an exhaustive list of &amp;quot;do's&amp;quot; and &amp;quot;don'ts&amp;quot; about doing the interactive scoring of the templates. Instead, it presents you with guidelines and some examples, in order to imbue you with the spirit of the enterprise. It is up to you to carefully consider your reasons before judging mismatching responses to be &amp;quot;completely&amp;quot; or &amp;quot;partially&amp;quot; correct. If you have any doubt whether any given system response deserves to be judged completely/partially correct, count it incorrect.</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. SETTING UP THE SCORING PROGRAM IN INTERACTIVE MODE
</SectionTitle>
    <Paragraph position="0"> You must use the latest official version of the scoring program together with the latest slotconfig.el file. You are not permitted to make any modifications of your own to the scoring software or the files it uses, except to define the pathnames in the config.el file for the files that it reads in.</Paragraph>
    <Paragraph position="1"> The configuration (config.el) files supplied with the test package set the :queryverbose option on, which places the scoring program in interactive mode. (See MUC Scoring System User's Manual, section 5.2.) The only feature of the interactive scoring that you are *not* permitted to take advantage of is the option to change a key or response template! This feature is controlled by the :disable-edit option, which is set on in the config.el files supplied in the test package and should not be modified.</Paragraph>
    <Paragraph position="2"> Although there may be errors in the key templates, you are not permitted to fix them, as we do not have sufficient time to make the corrections known to all sites. Score your system under the assumption that the answer key is correct, make note of any perceived errors in the key, and email them to NRaD along with your results. If there is sufficient evidence that errors were made that affect the scores obtained, a new key will be prepared after the conference, and sites will be given the opportunity to rescore their system responses. The new scores will replace the old ones as the official results.</Paragraph>
    <Paragraph position="3"> Included among your options for interactive scoring is the manual realignment of response templates with key templates (see section 3.2.1 below and section 4.7 of User's Manual). If you are not already comfortable using the interactive scoring features of the scoring program, take some time to practice on some texts in the training set before you attempt to do the scoring for the test set. Also be sure to read the document on test procedures carefully to learn how to save your history buffer to a file for use in other scoring sessions required for completing the test procedure.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. SCORING MISMATCHED SLOT FILLERS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 By Type of Fill
</SectionTitle>
      <Paragraph position="0"> These subsections deal in turn with string fills, set fills, and other types of fills.</Paragraph>
      <Paragraph position="1"> Following that is a section concerning cross-reference tags.</Paragraph>
      <Paragraph position="2">  In the case of a mismatch on fillers for string-fill slots, the scoring program will permit you to score the response as fully correct, partially correct, or incorrect.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3.1.1.1 Fully Correct
</SectionTitle>
    <Paragraph position="0"> NRaD has attempted to provide a choice of good string options for each string slot.</Paragraph>
    <Paragraph position="1"> If you get a mismatch, before you score a filler fully correct you should consider carefully whether your system's filler is both complete enough and precise enough to show that the system found exactly the right information. It is reasonable, for example, to assign full credit if your system picks up a string that is equivalent in meaning to the one in the key (e.g., &amp;quot;urban guerrillas&amp;quot; vs. &amp;quot;urban terrorists&amp;quot; in the PERP: INDIVIDUAL ID slot) but comes from a portion of the text that is distant from the portion containing most of the slot-filler information.</Paragraph>
    <Paragraph position="2"> The most likely situation where &amp;quot;fully correct&amp;quot; would be justified is in a case where the system or the key includes &amp;quot;nonessential modifiers&amp;quot; such as articles, quantifiers, and adjectivals for nationalities (e.g., SALVADORAN). The scoring program attempts to do this automatically, but it does not have an exhaustive list of nonessential modifiers.</Paragraph>
    <Paragraph position="3"> EXAMPLE (slot 19): RESPONSE &amp;quot;THE 3 PEASANTS&amp;quot;</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
KEY &amp;quot;PEASANTS&amp;quot;
</SectionTitle>
    <Paragraph position="0"> In filling the key templates, such nonessential modifiers were generally included in the individual perpetrator ID slot (since there are no slots specifically for the number and nationality of the perpetrators). They were generally excluded from fillers for the other string slots, unless they seemed to be part of a proper name (e.g., THE EXTRADITABLES).</Paragraph>
    <Paragraph position="2"> &amp;quot;Fully correct&amp;quot; is also warranted if the system response contains more modifying words and phrases than the answer key, as long as all the modifiers are modifiers of the noun phrase. However, in most cases the answer key should already contain options such as these.</Paragraph>
    <Paragraph position="3"> EXAMPLE (slot 19): RESPONSE &amp;quot;OLD PEASANTS WHO WERE WITNESSES&amp;quot;</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
KEY &amp;quot;PEASANTS&amp;quot; / &amp;quot;OLD PEASANTS&amp;quot;
</SectionTitle>
    <Paragraph position="0"> Finally, if your system does not generate an escape (backslash) character in front of the inner double quote marks of a filler that is surrounded by double double quotes, you may score the system response as completely correct if it would otherwise match the key.</Paragraph>
    <Paragraph position="1"> EXAMPLE: RESPONSE &amp;quot;&amp;quot;FOO&amp;quot;&amp;quot;</Paragraph>
    <Paragraph position="2"> KEY &amp;quot;\&amp;quot;FOO\&amp;quot;&amp;quot; / &amp;quot;FOO&amp;quot;</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3.1.1.2 Partially Correct
</SectionTitle>
    <Paragraph position="0"> You may score a filler partially correct, but not fully correct, if your system goes overboard and includes adjuncts in the response string that aren't part of the desired noun phrase.</Paragraph>
    <Paragraph position="1"> EXAMPLE (slot 19): RESPONSE</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
KEY
</SectionTitle>
    <Paragraph position="0"> &amp;quot;THE 3 PEASANTS, WHICH THE GOVERNMENT</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ADMITTED WAS A MISTAKE&amp;quot;
&amp;quot;PEASANTS&amp;quot;
</SectionTitle>
    <Paragraph position="0"> Scoring a filler partially correct is also appropriate in cases where the key contains a proper name (in the most complete form found in the text) and the response contains only part of the name (i.e., uses an incomplete form found in the text).</Paragraph>
    <Paragraph position="1">  As described in section 5.2 of the MUC Scoring System User's Manual, the scoring program allows the user to &amp;quot;distribute&amp;quot; a partially correct score for a response across multiple key values. This action causes the scoring program to give the system credit for multiple partially correct fillers even though it only generated one. This is not allowed for set-fill slots, which are scored fully automatically, but it is allowed for other types of slots. The user is likely to find occasion to make use of this functionality primarily when scoring the target id/description/number slots.</Paragraph>
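The effect of distributing a score can be seen in a small arithmetic sketch. The 0.5 weight for a partially correct filler follows the usual MUC convention; the function below is an illustration, not the scorer's actual bookkeeping.

```python
# Arithmetic sketch of why distributing one partially correct response
# across several key values raises recall.  The 0.5 weight for a
# partially correct filler follows the usual MUC convention; the rest
# is an assumption made for illustration.
def recall(fully_correct, partially_correct, possible):
    return (fully_correct + 0.5 * partially_correct) / possible

# Key has three fillers; the system generated a single response.
print(recall(0, 1, 3))  # partial credit against one key value only
print(recall(0, 3, 3))  # the same response distributed across all three
```

Distribution thus lets a single response earn partial credit against each key value it plausibly covers, rather than leaving the other key values scored as missing.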
      <Paragraph position="2"> In the case of a mismatch on fillers for set-fill slots, the scoring program normally will automatically count the filler incorrect. But under certain conditions it will automatically assign partial credit instead (see subsections of section 3.2).
In the case of a mismatch on fillers for slots requiring other types of fills, the scoring program will normally query you to score the fillers as fully correct, partially correct, or incorrect. (However, assignment of partial credit for the LOCATION slot is sometimes assigned automatically -- see section 3.2.3.) Section 3.1.1.3, above, describes &amp;quot;distributed&amp;quot; partially correct score assignment. The only non-set-fill slots that include cross-reference tags are HUM TGT: DESCRIPTION, HUM TGT: NUMBER, and PHYS TGT: NUMBER. Notes on scoring these slots are found in the appropriate subsections of section 3.2.</Paragraph>
    <Paragraph position="3"> NRaD has attempted to offer all the possible alternative correct fillers as options in the key; however, scoring a filler completely or partially correct may be justified in certain cases. See the appropriate subsections of section 3.2 below.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 By Individual Slot
</SectionTitle>
      <Paragraph position="0"> The guidelines here concern the manual realignment of templates in the case where the automatic template mapping facility provided by the scoring program fails to identify the optimal mapping between the set of response templates for a message and the set of key templates for that message. Guidelines are needed because it is possible for the user to elect not to map a response template to any key template at all, i.e., to map a response template to NIL and a key template to NIL rather than mapping the templates to each other. The user may wish to do this in cases where the match between the response and the key is so poor and the number of mismatching fillers so large that the user would rather penalize the system's recall and overgeneration (by mapping to NIL) than penalize the system's precision.</Paragraph>
      <Paragraph position="1"> However, to ensure the validity of the performance measures and to ensure comparability among the systems being evaluated, it is important that this option not be overused. The basic rule is that the user must permit a mapping between a response template and a key template if there is a full or partial match on the incident type. (The condition concerning a partial match covers the two basic situations described in the section below on INCIDENT: TYPE.) If there is no match on the incident type, manually mapping to NIL is allowed, at the discretion of the user.</Paragraph>
      <Paragraph position="2"> If the user wishes to make a template map to a different one than the one determined by the automatic mapping algorithm, the scoring program will permit it as long as the content-based mapping conditions are met. The content-based  mapping conditions require at least a partial match on INCIDENT: TYPE, plus at least a partial match on at least one of the perpetrator slots (INDIV ID or ORG ID), one of the physical target slots (ID or TYPE), or one of the human target slots (NAME, DESCRIPTION, or TYPE).</Paragraph>
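The content-based mapping conditions above can be summarized as a predicate. The slot names follow the template definition given in this appendix; the partial-match helper below (any shared word) is a crude stand-in assumed purely for illustration.

```python
# Sketch of the content-based mapping conditions for manual remapping.
# Templates are modeled as dicts of slot name to filler string; the
# partial_match helper is a hypothetical stand-in, not the scorer's
# actual matching logic.
PERP_SLOTS = ("PERP: INDIVIDUAL ID", "PERP: ORGANIZATION ID")
PHYS_TGT_SLOTS = ("PHYS TGT: ID", "PHYS TGT: TYPE")
HUM_TGT_SLOTS = ("HUM TGT: NAME", "HUM TGT: DESCRIPTION", "HUM TGT: TYPE")

def partial_match(a, b):
    """Hypothetical partial match: the two fills share at least one word."""
    if not a or not b:
        return False
    return len(set(a.split()).intersection(b.split())) > 0

def remap_allowed(response, key):
    """Manual remapping requires at least a partial match on INCIDENT:
    TYPE, plus a partial match on at least one perpetrator, physical
    target, or human target slot."""
    if not partial_match(response.get("INCIDENT: TYPE"), key.get("INCIDENT: TYPE")):
        return False
    for slot in PERP_SLOTS + PHYS_TGT_SLOTS + HUM_TGT_SLOTS:
        if partial_match(response.get(slot), key.get(slot)):
            return True
    return False

response = {"INCIDENT: TYPE": "BOMBING", "HUM TGT: TYPE": "CIVILIAN"}
key = {"INCIDENT: TYPE": "BOMBING", "HUM TGT: TYPE": "CIVILIAN"}
print(remap_allowed(response, key))  # True
```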
    </Section>
  </Section>
  <Section position="10" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3.2.2 Slot 2 -- INCIDENT: DATE
FULLY CORRECT OR PARTIALLY CORRECT:
</SectionTitle>
    <Paragraph position="0"> System response is close to the key's date or range of dates (if the date is difficult to calculate). In the example below, the system's response may be judged fully correct, since the system has calculated a more precise date than what was expected by the key.</Paragraph>
    <Paragraph position="1"> PARTIALLY CORRECT:
1. System response is part of the date contained in the key (either if an incident occurred between two dates or if the filler in the key is a default value, i.e., consists of a range with the date from the message dateline as the upper anchor).
3.2.3 Slot 3 -- INCIDENT: LOCATION
PARTIALLY CORRECT:
1. The key expresses a range between two known locations, and the system response contains only one location.</Paragraph>
    <Paragraph position="2"> EXAMPLE: RESPONSE COLOMBIA: MEDELLIN (CITY)
KEY COLOMBIA: MEDELLIN (CITY) - CALI (CITY)
2. The response is completely correct except for the country.
EXAMPLE: RESPONSE BOLIVIA: ANTIOQUIA (DEPARTMENT): MEDELLIN (CITY)
KEY COLOMBIA: ANTIOQUIA (DEPARTMENT): MEDELLIN (CITY)
NOTE: The scoring program will automatically score a response partially correct when it contains the correct country but no specific place. Partial credit can be interactively assigned when the response contains the correct country and an incorrect specific place.</Paragraph>
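The automatic part of the LOCATION rule can be sketched as follows. Modeling fills as colon-separated strings, and the function name, are assumptions made for illustration.

```python
# Sketch of the automatic LOCATION scoring: a response with the correct
# country but no more specific place scores partially correct.  Fills
# are modeled as colon-separated strings ("COUNTRY: REGION: CITY"),
# which is an assumption for illustration.
def location_score(response, key):
    resp_parts = [p.strip() for p in response.split(":")]
    key_parts = [p.strip() for p in key.split(":")]
    if resp_parts == key_parts:
        return "correct"
    if resp_parts[0] == key_parts[0] and len(resp_parts) == 1:
        return "partial"    # correct country, no specific place (automatic)
    return "incorrect"      # other cases are left to interactive judgment

print(location_score("COLOMBIA", "COLOMBIA: ANTIOQUIA (DEPARTMENT)"))  # partial
```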
    <Paragraph position="3">  The scoring system will automatically score all mismatches as incorrect, with the following exception: The scoring program will automatically score the slot partially correct in the case where the filler in the response is ATTACK and the filler in the key is any other incident type.</Paragraph>
    <Paragraph position="4">  The scoring program will automatically score mismatching set fills incorrect, with the following exception: The scoring program will automatically score the fill partially correct when the system response is a set list item that is a superset of the  filler in the key, as determined by the shallow hierarchy of instrument types provided in the task documentation. This scoring is done irrespective of the correctness of the cross-reference tag.</Paragraph>
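The superset exception for instrument types can be sketched with a toy hierarchy. The entries below are hypothetical stand-ins for the shallow hierarchy of instrument types given in the task documentation.

```python
# Sketch of the superset rule for the instrument type slot.  The toy
# hierarchy below is a hypothetical stand-in for the shallow hierarchy
# of instrument types in the task documentation, not the official list.
PARENT = {
    "MACHINE GUN": "GUN",
    "HANDGUN": "GUN",
    "DYNAMITE": "EXPLOSIVE",
}

def auto_partial(response_fill, key_fill):
    """Partial credit when the response names the superset (parent type)
    of the more specific instrument in the key."""
    return PARENT.get(key_fill) == response_fill

print(auto_partial("GUN", "MACHINE GUN"))  # True
```

Note that the credit flows one way only: a response more specific than the key does not qualify under this rule.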
    <Paragraph position="5">  1. See sections 3.1.1.2 and 3.1.1.3.</Paragraph>
    <Paragraph position="6"> 2. Key contains rather general data and the response contains consistent, but inferior, general strings.</Paragraph>
    <Paragraph position="7"> EXAMPLE: RESPONSE &amp;quot;TERRORIST ACTIONS&amp;quot; KEY &amp;quot;URBAN TERRORISTS&amp;quot; 3.2.10 Slot 10 -- PERP: ORGANIZATION ID FULLY CORRECT: 1. In general, the guidelines in section 3.1.1.1 do not apply to this slot, since this slot is intended to be filled only with proper names. However, the term &amp;quot;proper names&amp;quot; is not completely defined, especially with respect to the expected fillers in the case of STATE-SPONSORED TERRORISM. You have more leeway to score fillers as fully correct in such cases.</Paragraph>
    <Paragraph position="8"> EXAMPLE: RESPONSE &amp;quot;POLICE&amp;quot; KEY &amp;quot;SECRET POLICE&amp;quot; 2. Response string includes both acronym and expansion (where they appear juxtaposed in the text) instead of just one or the other.</Paragraph>
    <Paragraph position="9"> EXAMPLE: RESPONSE &amp;quot;ARMY OF NATIONAL LIBERATION (ELN)&amp;quot; KEY &amp;quot;ARMY OF NATIONAL LIBERATION&amp;quot; / &amp;quot;ELN&amp;quot; PARTIALLY CORRECT: See sections 3.1.1.2 and 3.1.1.3.</Paragraph>
    <Paragraph position="10"> 3.2.11 Slot 11 -- PERP: ORGANIZATION CONFIDENCE
All mismatching set fills will automatically be scored incorrect, with the following exception: The scoring program will automatically score the system response partially correct in the case where the system generates SUSPECTED OR
The scoring program will automatically score mismatching set fills incorrect, with the following exception: The scoring program will automatically score the system response partially correct in the case where the system generates POLITICAL</Paragraph>
  </Section>
  <Section position="11" start_page="0" end_page="0" type="metho">
    <SectionTitle>
FIGURE OFFICE OR RESIDENCE instead of GOVERNMENT OFFICE OR RESIDENCE. This
</SectionTitle>
    <Paragraph position="0"> scoring is done irrespective of the correctness of the cross-reference tag.</Paragraph>
    <Paragraph position="1"> 3.2.14 Slot 14 -- PHYS TGT: NUMBER</Paragraph>
  </Section>
  <Section position="12" start_page="0" end_page="0" type="metho">
    <SectionTitle>
PARTIALLY CORRECT:
</SectionTitle>
    <Paragraph position="0"> The number of cases where it is justifiable to score this slot partially correct should be extremely limited; partial credit is warranted mainly in the following cases: the response has a single number and the key has a range that includes that number as an anchor, or the response has a single number and the key has a tilde in front of that same number. In such cases, partial credit may be assigned irrespective of the correctness of the cross-reference tag.</Paragraph>
    <Paragraph position="1"> EXAMPLE: RESPONSE 7: &amp;quot;PYLONS&amp;quot; KEY 5 - 7: &amp;quot;PYLONS&amp;quot;</Paragraph>
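The two narrow cases above can be sketched as a predicate. The string formats for ranges and tildes, and the function name, are assumptions made for illustration.

```python
# Sketch of the two narrow cases where partial credit on a NUMBER slot
# is justified: the key holds a range with the response as an anchor,
# or the key holds a tilde (approximate) form of the same number.  The
# string formats are assumptions for illustration.
def number_partial(response, key):
    key = key.strip()
    if "-" in key:
        low, high = [p.strip() for p in key.split("-")]
        return response in (low, high)      # response is an anchor of the range
    if key.startswith("~"):
        return response == key[1:].strip()  # same number, approximate in key
    return False

print(number_partial("7", "5 - 7"))  # True
```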
    <Paragraph position="3"> It is also possible to &amp;quot;distribute&amp;quot; a partially correct score across multiple key values, as described in section 3.1.1.3. It would be justifiable to do this only in those cases where distribution of a partially correct score had already been done on the referenced filler in the PHYS TGT: ID slot.</Paragraph>
  </Section>
  <Section position="13" start_page="0" end_page="0" type="metho">
    <SectionTitle>
EXAMPLE: RESPONSE
KEY
</SectionTitle>
    <Paragraph position="0"> 3: &amp;quot;VEHICLES&amp;quot; 1: &amp;quot;AMBULANCE&amp;quot; 1: &amp;quot;FUEL TRUCK&amp;quot; 1: &amp;quot;STATION WAGON&amp;quot; 3.2.15 Slot 15 -- PHYS TGT: FOREIGN NATION  The scoring program will automatically score mismatching set fills incorrect. 3.2.16 Slot 16 -. PHYS TGT: EFFECT OF INCIDENT The scoring program will automatically score mismatching set fills incorrect, with the following exception: The scoring program will automatically score the fill partially correct if the system response is DESTROYED instead of SOME DAMAGE. (The reasoning here is that an understandable error would be to generate DESTROYED C-8 rather than SOME DAMAGE if a text says that a bomb destroyed part of a target (e.g., a few offices in a building that is identified as a target) and doesn't explicitly say that this implies that the target as a whole was merely damaged.) This scoring is done irrespective of the correctness of the cross-reference tag. 3.2.17 Slot 17 -- PHYS TGT: TOTAL NUMBER</Paragraph>
  </Section>
  <Section position="14" start_page="0" end_page="0" type="metho">
    <SectionTitle>
PARTIALLY CORRECT:
</SectionTitle>
    <Paragraph position="0"> The number of cases where it is justifiable to score this slot partially correct should be extremely limited; partial credit is warranted mainly in the following cases: the response has a single number and the key has a range that includes that number as an anchor, or the response has a single number and the key has a tilde in front of that same number.</Paragraph>
    <Paragraph position="1">  1. See section 3.1.1.1.</Paragraph>
    <Paragraph position="2"> 2. Response is a correct proper name, but person's title/role is included as part of name, rather than in the HUM TGT: DESCRIPTION slot.</Paragraph>
  </Section>
  <Section position="15" start_page="0" end_page="0" type="metho">
    <SectionTitle>
FULLY CORRECT: See section 3.1.1.1. However, when the filler for this slot includes a
</SectionTitle>
    <Paragraph position="0"> cross-reference tag, you may score the entire filler as fully correct only if the filler of the slot indicated by the cross-reference tag was also scored as fully correct.</Paragraph>
    <Paragraph position="1"> EXAMPLE: RESPONSE &amp;quot;MAYOR&amp;quot;: &amp;quot;TORRES&amp;quot;</Paragraph>
  </Section>
  <Section position="16" start_page="0" end_page="0" type="metho">
    <SectionTitle>
KEY &amp;quot;MAYOR OF ACHI&amp;quot;: &amp;quot;TORRES&amp;quot;
PARTIALLY CORRECT:
</SectionTitle>
    <Paragraph position="0"> 1. See sections 3.1.1.2 and 3.1.1.3.</Paragraph>
    <Paragraph position="1"> 2. Filler has the correct title or role but includes the person's name.
EXAMPLE: RESPONSE &amp;quot;MR. XYZ&amp;quot;
KEY &amp;quot;MR. &amp;quot;: &amp;quot;XYZ&amp;quot;
3. The non-tag portion of the filler doesn't match the key but is deemed completely correct, and the cross-reference tag is incorrect or missing.</Paragraph>
  </Section>
</Paper>