<?xml version="1.0" standalone="yes"?>
<Paper uid="M92-1043">
  <Title>APPENDIX B: PROCEDURE FOR MUC-4 FINAL TESTING NOTE:</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
APPENDIX B:
PROCEDURE FOR MUC-4 FINAL TESTING
NOTE:
</SectionTitle>
    <Paragraph position="0"> This test procedure references the following files to be ftp'ed fro m /pub/muctest . READ THE TEST PROCEDURE BEFORE ACCESSING THESE FILES .</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. SCHEDULE
</SectionTitle>
    <Paragraph position="0"> You are not to ftp the test files until you are ready to start testing . Testing may be done any time between 26-31 May . The only requirement is that all reports (se e section 7, below) be submitted by first thing Monday morning, 1 June . Permission to attend MUC-4 in June may be revoked if you do not meet this deadline ! If you intend to carry out any of the optional testing (see below, section 4), you must report the planned optional test(s) to NRaD before starting the test procedure .</Paragraph>
    <Paragraph position="1"> This means that you should describe, in some meaningful terms, specifically how yo u will alter the behavior of the system and what kind of performance differences yo u expect to obtain .</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. PERFORMANCE OBJECTIVES
</SectionTitle>
    <Paragraph position="0"> In reporting the results of MUC-4, we will be focusing on three aspects of th e scoring : a. Recall and precision in the Matched Only (MO), Matched/Missing (M/M) , Matched/Spurious (M/S), and All Templates (AT) rows . When displayed together in a scatter plot, the four data points form the corners of a rectangle that we are calling a system's basic &amp;quot;region of performance . &amp;quot; b. The overgeneration scores in the MO, M/M, M/S, and AT rows .</Paragraph>
    <Paragraph position="1"> c. The recall and precision scores in the Text Filtering row .</Paragraph>
    <Paragraph position="2"> When it is necessary to single out one set of scores from among the MO, M/M, an d M/S, and AT rows, we will usually single out the AT scores, since they penaliz e equally for missing and spurious data (unlike M/M and M/S) and they penalize bot h at the template level and at the slot-filler level (unlike MO) . Statistical significance testing of the overall results will be done on the basis of the AT scores .</Paragraph>
    <Paragraph position="3"> When it is advisable to present a scientifically valid means of determining a ranking of the systems, we will use a formula for calculating what is known as the F measure. Given two systems whose overall recall and precision sum up to the same  number and given equal weights for recall and precision in the F-measure formula , the formula will rank the system whose recall and precision are more equal highe r than the system whose recall and precision are more divergent . In order to sho w how the rankings may vary depending on the relative weight assigned to recall v s precision, we will present three different calculations of the F-measure, one i n which recall and precision are weighted equally, one in which recall is weighte d twice as heavily as precision, and one in which precision is weighted twice a s heavily as recall. We intend to conduct statistical significance testing at least for the version of the formula in which the weight of recall is equal to the weight o f precision .</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. REQUIRED TESTING
</SectionTitle>
    <Paragraph position="0"> The final test has three required components : a. a template-by-template and message-by-message performance test on TST3 , which is a test set of 100 articles taken from the same source and covering the sam e time period as those that comprise DEV, TST1, and TST2 ; b. a template-by-template and message-by-message performance test on TST4 , which is a test set of 100 articles taken from the same source as the other sets but representing incidents from a somewhat different time period ; c. an &amp;quot;adjunct&amp;quot; performance test on TST3 in which selected messages in the test set have been sorted into different categories and are to be scored separatel y template by template . See the README-adjunct-testl file for further information o n the nature of this test .</Paragraph>
    <Paragraph position="1"> [A second adjunct test will be carried out by GE and will require no additional effort on the part of the other participants. A description of this adjunct test i s provided in README-adjunct-test2. Please note that if you do not wish to participate in this test, you must notify NRaD by June 1 .] To complete the required testing, you will need approximately the same amoun t of time as it would normally take the system to produce templates for two sets of 10 0 new texts and for you to interactively score them once (to do any manual remapping s and produce template-by-template score reports) and to non-interactively mak e several more runs (a) to produce message-by-message score reports (total of 2 scoring runs) and (b) to use the remaining configuration files to produce template-by-template score reports for the adjunct test (total of 4 short scoring runs) and fo r measuring progress since MUC-3 (1 scoring run) .</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. OPTIONAL TESTING
</SectionTitle>
    <Paragraph position="0"> You are encouraged to design interesting experiments in which you hypothesiz e significant performance differences that can be obtained by such means a s removing a module or inserting one that's not part of the basis system or by alterin g the control structure of the system such that it produces templates more aggressivel y or more conservatively . An experiment may result in a single set of new scores or i n a continuous recall-precision &amp;quot;curve&amp;quot; .</Paragraph>
    <Paragraph position="1"> The objective of the optional testing is to learn more about the controlle d tradeoffs that some systems may be designed to make between recall and precision . If  your system meets one of the following two criteria, it is a candidate for optiona l testing : a) if the system can control the tradeoff between recall and precision in order t o produce a set of data points sufficient to plot the outline of a recall-precision curve ; b) if the system's recall and precision can be consciously manipulated by th e loosening or tightening of analysis constraints, etc ., in order to produce at least on e data point that contrasts in an interesting way with the results produced by th e required testing .</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. TEST PROCEDURE
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Freezing the System and FTP'ing the Test Package
</SectionTitle>
      <Paragraph position="0"> When you are ready to run the test, ftp the files in the test package fro m /pub/muctest. You are on your honor not to do this until you have completely froze n your system and are ready to conduct the test . You must stop all system development once you have ftp'ed the test package .</Paragraph>
      <Paragraph position="1"> Note : If you expect to be running the test over the weekend and are concerned that a host or network problem might interfere with your ability to ftp, you may ftp th e files on Friday . However, for your own sake, minimize the accessibility of those files , e.g., put them in a protected directory of someone who is not directly involved i n system development .</Paragraph>
      <Paragraph position="2">  Generating the System Response Template s There are 100 texts in tst3-muc4 and 100 texts in tst4-muc4 . Without looking at the texts, run your system against the files and name the output files response .tst3 and response .tst4, respectively . (For your information, the format of the message ID s is TST3-MUC4-nnnn and TST4-MUC4-nnnn .) You are to run the test only once -- you are not permitted to make any changes t o your system until the test is completed. If you get part way through the test and ge t an error that requires user intervention, you may intervene only to the extent tha t you are able to continue processing with the NEXT message . You are not allowed to  back up ! Notes : 1) If you run short on time and wish to break up the test sets and run portion s of them in parallel, that ' s fine as long as you are truly running in parallel with a single system or can completely simulate a parallel environment, i .e ., the systems are identically configured . You must also be sure to concatenate the outputs befor e submitting them to the scoring program .</Paragraph>
      <Paragraph position="3"> 2) No debugging of linguistic capability can be done when the system breaks . For example, if your system breaks when it encounters an unknown word and you r only option for a graceful recovery is to define the word, then abort processing an d start it up again on the next test message.</Paragraph>
      <Paragraph position="4"> B-3 3) If you get an error that requires that you reboot the system, you may do so, bu t  you must pick up processing with the message FOLLOWING the one that was bein g processed when the error occurred. If, in order to pick up processing at that point , you need to create a new version of the test set that excludes the messages alread y processed or you need to start a new output file, that 's ok. Be sure to concatenate th e output files before submitting them to the scoring program .</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Editing Config Files to Supply Proper Pathnames
</SectionTitle>
      <Paragraph position="0"> Follow the instructions in this section before initializing the scoring program .</Paragraph>
      <Paragraph position="1"> The scoring program configuration (config) files contain arguments to th e define-muc-configuration-options function, which you will have to edit to supply the proper pathnames . Make no further edits to the config files .</Paragraph>
      <Paragraph position="2"> Also included in the test package are slotconfig-tst3 .el and slotconfig-tst4 .el , which have been updated to recognize the message IDs that are used in the test sets .</Paragraph>
      <Paragraph position="3"> Be sure that you have put the right pathname to each slotconfig file in each confi g file .</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Remapping Templates
</SectionTitle>
      <Paragraph position="0"> It is recommended that this step be carried out BEFORE you start scoring .</Paragraph>
      <Paragraph position="1"> After the scoring program has been initialized using config-tst3 .el or config-tst4 .el (or config files for any optional tests) as the argument to initialize-mucscorer, you may wish to browse through the templates to see if there are an y mappings you wish to change using the manual template remapping feature of th e scoring program. When you have finished updating the mappings, exit the browse r and continue with the instructions given below .</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.5 Scoring the System Response Templates
</SectionTitle>
      <Paragraph position="0"> Follow the instructions in this section each time the scoring program i s initialized .</Paragraph>
      <Paragraph position="1"> 5.5.1 For the Basic Tes t Scoring for the basic test is done by using config-tst3 .el, config-tst4.el, and  This section applies to config-tst3 .el and config-tst4 .el. Having started up the scoring program using config-tst3 .el or config-tst4.el as the argument to initialize-muc-scorer, type C-u s (i .e., Control-u followed by the letter &amp;quot;s&amp;quot;) so that the scoring program will produce template-by-template scor e reports .</Paragraph>
      <Paragraph position="2"> Refer to the interactive scoring guidelines while scoring . When you have finished scoring, save the score buffer (*MUC Score Display*) to the appropriate fil e  This section applies to config-tst3 .el and config-tst4.el. Before you reinitialize the scoring program with the next config file, type C-u 1 (i.e ., Control-u followed by the letter &amp;quot;1&amp;quot;) so that the scoring program will do a message-by-message scoring and the final summary table in the score buffer wil l include the TEXT FILTERING row .</Paragraph>
      <Paragraph position="3"> When you have finished scoring, save the score buffer (*MUC Score Display*) t o the appropriate file name : a) for config-tst3 .el, save it to scores .tst3-pass2 ; b) for config-tst4 .el, save it to scores .tst4-pass2 .</Paragraph>
      <Paragraph position="4"> After saving the score buffer, save the history to file using the &amp;quot;h&amp;quot; comman d (this will overwrite the version saved at the end of the template-by-template run) . 5.5.1.3 Scoring to Measure Progress Since MUC-3 This section applies only to config-progress-tst3 .el .</Paragraph>
      <Paragraph position="5"> This scoring run is to be done only for TST3 ; it makes use of config-progresstst3.el. The results of this scoring will be used as a point of comparison with MUC-3 . Therefore, the &amp;quot;display-type&amp;quot; option is set to &amp;quot;matched-missing&amp;quot; rather than to &amp;quot;all templates.&amp;quot; This run scores only the slots that are not in conflict with the templat e design that was used last year for MUC-3. This means that it does not score th e instrument slots nor any of the number slots .</Paragraph>
      <Paragraph position="6"> Even if you did not participate in MUC-3, you are asked to make this scoring run . (NRaD is using a similar config file to rescore an updated version of last year' s response templates for MUC-3 veteran sites.) The config file specifies that this scoring run will make use of the history fil e that you created when you originally scored TST3 . Thus, no interaction should b e needed when scoring .</Paragraph>
      <Paragraph position="7"> Having started up the scoring program using config-progress-tst3 .el as the argument to initialize-muc-scorer, type C-u s (i .e ., Control-u followed by the lette r &amp;quot;s&amp;quot;) so that the scoring program will produce template-by-template score reports . When you have finished scoring, save the score buffer (*MUC Score Display*) t o  b) config-lMT-tst3 .el , c) config-NST-tst3 .el , d) config-2MT-tst3 .el .</Paragraph>
      <Paragraph position="8"> Each config file specifies that the scoring run will make use of the history fil e that you created when you originally scored TST3.</Paragraph>
      <Paragraph position="9"> Thus, no interaction should b e needed when scoring.</Paragraph>
      <Paragraph position="10"> Furthermore, each run will score only a small number o f templates; thus, it should take little time to complete each run . Having started up the scoring program using the appropriate config file as th e argument to initialize-muc-scorer, type C-u s (i .e., Control-u followed by the lette r &amp;quot;s&amp;quot;) so that the scoring program will produce template-by-template score reports . When you have finished scoring, save the score buffer (*MUC Score Display*) t o the appropriate file name : a) for config-1ST-tst3 .el, save it to scores-1ST .tst3; b) for config-lMT-tst3 .el, save it to scores-IMT .tst3 ; c) for config-NST-tst3 .el, save it to scores-NST .tst3; d) for config-2MT-tst3 .el, save it to scores-2MT .tst3.</Paragraph>
      <Paragraph position="11"> You do not need to save the history file, and you do not need to do message-by-message scoring.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6. SPECIAL INSTRUCTIONS FOR OPTIONAL TESTING
</SectionTitle>
    <Paragraph position="0"> For each optional run, modify the system as you described in advance to NRaD .</Paragraph>
    <Paragraph position="1"> Then follow the applicable procedures in section 5 to produce and score ne w templates for TST3, using modified versions of config-tst3 .el. Depending on the objectives of your optional testing, you should produce template-by-template scores , message-by-message scores, or both .</Paragraph>
    <Paragraph position="2"> To yield these additional data points, you will generate and score new syste m response templates for TST3, using the history file generated during the require d testing. NO SYSTEM DEVELOPMENT IS PERMITTED BETWEEN OFFICIAL TESTING AND</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
OPTIONAL TESTING -- ONLY MODIFICATION OF SYSTEM CONTROL PARAMETERS AND/O R
REINSERTION OR DELETION OF EXISTING CODE THAT AFFECTS THE SYSTEM'S BEHAVIO R
WITH RESPECT TO THE TRADEOFF BETWEEN RECALL AND PRECISION .
</SectionTitle>
    <Paragraph position="0"> If, as a consequence of altering the system's behavior, templates are generate d that weren't generated during the required testing or slots are filled differently, yo u may find it necessary to add to the history file and to change some of the manua l template remappings . Start the scoring of each optional test with the history fil e generated during the previous run, minus the manual template remappings ; save any updated histories to new file names .</Paragraph>
    <Paragraph position="1"> In order to obtain these data points, you may wish to conduct a number of test s and throw out all but the best ones . Remember, however, that you are to notify NRa D  of ALL the planned experiments in advance (see section 1) . Thus, it would be wise t o experiment on the training data and use the results to know what different runs are worth making during the test .</Paragraph>
    <Paragraph position="2"> If you wish to conduct your optional tests on TST4 as well as on TST3, you may d o so, but please submit the results for TST4 only if you find the differences in scores t o be significant.</Paragraph>
    <Paragraph position="3"> Once you have determined which of the optional runs to submit to NRaD for th e official record, name the files for those runs in some meaningful, easily-understoo d fashion, fitting these patterns :  trace-tst4 (system trace for the 100 TST4 messages ) -- You may submit whatever you think is appropriate, i .e., whatever woul d serve to help validate the results of testing .</Paragraph>
    <Paragraph position="4"> Additional response, history, score, and trace files are expected for each optiona l test that you wish to have included in the official record .</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.2 How to Submit Files
</SectionTitle>
      <Paragraph position="0"> Before you submit the expected files, PLEASE TAR AND COMPRESS THE FILES .</Paragraph>
      <Paragraph position="1"> Please help us identify your files by labeling the compressed tar file as follows : &lt;sitename&gt;-muctest .TAR .Z.</Paragraph>
      <Paragraph position="2"> B-7 The compressed tar file is to be submitted via anonymous ftp to the director y named /incoming. Please notify NRaD and SAIC by email after the file has been successfully transferred.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
8. RESCORING OF RESULTS
</SectionTitle>
    <Paragraph position="0"> The interactive scoring should be done in strict conformance to the scorin g guidelines . If you perceive errors or other problems in the guidelines or in th e answer keys as you are doing the scoring, please make note of them and send a summary to NRaD.</Paragraph>
    <Paragraph position="1"> We will rescore everyone's response-muc4 .tst3 files, creating a cumulativ e history file. Your notes on perceived errors will be useful to us when we prepare t o do that rescoring. We will then distribute the cumulative history file to all sites an d will send the individual sites our version of their system's complete score report . The rescored version of the summary score reports will be labeled anonymously an d distributed to the MUC-4 sites prior to the conference .</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML