<?xml version="1.0" standalone="yes"?> <Paper uid="M91-1038"> <Title>APPENDIX B: TEST PROCEDURES</Title>
<Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> APPENDIX B: TEST PROCEDURES 1. GENERAL INSTRUCTIONS </SectionTitle> <Paragraph position="0"> Testing may be done any time during the week of 6-12 May. The only requirement is that all reports (see section 4, below) be received by NOSC by first thing Monday morning, 13 May. Permission to attend MUC-3 at NOSC on 21-23 May may be revoked if you do not meet this deadline! To complete the required testing, you will need approximately the same amount of time as it would normally take you to run 100 texts in DEV and interactively score them, plus some time to permit you to be extra careful doing the interactive scoring (since the resulting history file is to be used for all passes through the scoring program) and some time for the initializations of the scoring program with the different configuration files required for the various linguistic phenomena tests. If you carry out the optional testing, you will need to allow time to generate at least a couple of new sets of response templates. In that case, you will also need time to add to the history file as needed during the additional scoring runs.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> IF YOU INTEND TO CARRY OUT ANY OF THE OPTIONAL TESTING, YOU MUST REPORT THE PLANNED &quot;PARAMETER SETTINGS&quot; TO NOSC FOR BOTH THE REQUIRED TEST AND THE OPTIONAL TESTING BEFORE STARTING THE TEST PROCEDURE.
This means that you should </SectionTitle> <Paragraph position="0"> describe, in some meaningful terms, SPECIFICALLY how you will alter the behavior of the system so that it will produce each of the different tradeoffs in metrics described in the sections below.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 REQUIRED TESTING: MAXIMIZED RECALL/PRECISION TRADEOFF </SectionTitle> <Paragraph position="0"> To ensure comparability among the test results for all systems, THE REQUIRED</Paragraph> </Section> </Section>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> TESTING MUST BE CONDUCTED WITH THE SYSTEM SET TO MAXIMIZE THE TRADEOFF BETWEEN RECALL AND PRECISION IN THE MATCHED/MISSING ROW IN THE SCORE SUMMARY REPORT. The maximum of recall and precision does not mean an ADDITIVE </SectionTitle> <Paragraph position="0"> maximization, but that the total scores for each of the two metrics should be as close together and as high as possible. For most systems, this is probably the normal way the system operates.</Paragraph> <Paragraph position="1"> Several passes through the scoring program will be required: one for the official test on generating templates for the whole test set, and the others for the experimental tests on generating the specific slots called out by the linguistic phenomena tests.</Paragraph> <Paragraph position="2"> You generate only one set of system responses, and only the first pass through the scoring program will require user interaction.</Paragraph> <Paragraph position="3"> The history file produced during this interaction will be used in the scoring of the linguistic phenomena tests.</Paragraph> <Paragraph position="4"> (It will also serve as the basis for scoring any optional tests that are conducted.
)</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 1.2 OPTIONAL TESTING: OTHER RECALL/PRECISION TRADEOFFS </SectionTitle> <Paragraph position="0"> The objective of the optional testing is to learn more about the tradeoffs that some systems may be designed to make between recall and precision. It is intended to elicit extra data points only on those systems that are currently designed to make some theoretically interesting tradeoffs in some controlled fashion.</Paragraph> <Paragraph position="1"> Thus, we are interested in having you conduct the optional testing in either of the two following cases, but not otherwise: 1) if the system can control the tradeoff between recall and precision in order to produce a set of data points sufficient to plot the outline of a recall-precision curve; 2) if the system's recall and precision can be consciously manipulated by the loosening or tightening of analysis constraints, etc., in order to produce at least one data point that contrasts in an interesting way with the results produced by the required testing.</Paragraph> <Paragraph position="2"> To yield these additional data points, you will generate and score new system response templates, using the history file generated during the required testing. NO</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> SYSTEM DEVELOPMENT IS PERMITTED BETWEEN OFFICIAL TESTING AND OPTIONAL TESTING -- ONLY MODIFICATION OF SYSTEM CONTROL PARAMETERS AND/OR REINSERTION OR DELETION OF EXISTING CODE THAT AFFECTS THE SYSTEM'S BEHAVIOR WITH RESPECT TO THE TRADEOFF BETWEEN RECALL AND PRECISION.
</SectionTitle> <Paragraph position="0"> If, as a consequence of altering the system's behavior, templates are generated that weren't generated during the required testing or slots are filled differently, you may find it necessary to add to the history file and to change some of the manual template remappings. START THE SCORING OF EACH OPTIONAL TEST WITH THE HISTORY</Paragraph> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> FILE GENERATED DURING THE REQUIRED TESTING, MINUS THE MANUAL TEMPLATE REMAPPINGS; SAVE ANY UPDATED HISTORIES TO NEW FILE NAMES. </SectionTitle> <Paragraph position="0"> In order to obtain these data points, you may wish to conduct a number of tests and throw out all but the best ones. Remember, however, that you are to notify NOSC of ALL the planned parameter settings in advance (see section 1). Thus, it would be wise to experiment on the training data and use the results to know what different runs are worth making during the test. If, among the &quot;throwaways&quot; there are some results that you find significant, you may wish to include them in your site report for the MUC-3 proceedings, but they will not be part of the official record.</Paragraph> <Paragraph position="1"> You may submit results for the experimental linguistic phenomena tests as part of the optional testing if you wish, but please do so only if you find the differences in scores to be significant.</Paragraph> </Section>
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. SPECIFIC PROCEDURES FOR THE REQUIRED TESTING 2.1 FREEZING THE SYSTEM AND FTP'ING THE TEST PACKAGE </SectionTitle> <Paragraph position="0"> When you are ready to run the test, ftp the files in the test package from /pub/tst2. You are on your honor not to do this until you have completely frozen your system and are ready to conduct the test.
You must stop all system development once you have ftp'ed the test package.</Paragraph> <Paragraph position="1"> Note: If you expect to be running the test over the weekend and are concerned that a host or network problem might interfere with your ability to ftp, you may ftp the files on Friday. However, for your own sake, minimize the accessibility of those files, e.g., put them in a protected directory of someone who is not directly involved in system development.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 GENERATING THE SYSTEM RESPONSE TEMPLATES </SectionTitle> <Paragraph position="0"> There are 100 texts in tst2-muc3, and the message IDs have the following format: TST2-MUC3-nnnn. Without looking at the texts, run your system against the file and name the output file response-max-tradeoff.tst2.</Paragraph> <Paragraph position="1"> You are to run the required test only once -- you are not permitted to make any changes to your system until the test is completed. If you get part way through the test and get an error that requires user intervention, you may intervene only to the extent that you are able to continue processing with the NEXT message. You are not allowed to back up!</Paragraph> <Paragraph position="3"> 1) If you run short on time and wish to break up tst2-muc3 and run portions of it in parallel, that's fine as long as you are truly running in parallel with a single system or can completely simulate a parallel environment,</Paragraph> <Paragraph position="5"> systems are identically configured.</Paragraph> <Paragraph position="6"> You must also be sure to concatenate the outputs before submitting them to the scoring program. 2) No debugging of linguistic capability can be done when the system breaks.
For example, if your system breaks when it encounters an unknown word and your only option for a graceful recovery is to define the word, then abort processing and start it up again on the next test message. 3) If you get an error that requires that you reboot the system, you may do so, but you must pick up processing with the message FOLLOWING the one that was being processed when the error occurred. If, in order to pick up processing at that point, you need to create a new version of tst2-muc3 that excludes the messages already processed or you need to start a new output file, that's ok. Be sure to concatenate the output files before submitting them to the scoring program.</Paragraph> <Paragraph position="7"> 2.3 SCORING THE SYSTEM RESPONSE TEMPLATES 2.3.1 SCORING ALL SYSTEM RESPONSES FOR OFFICIAL, REQUIRED TEST Run the scoring program on the system response templates, using key-tst2 as the answer key and entering config.el as the argument to initialize-muc-scorer. (The config file contains arguments to the define-muc-configuration-options function, which you will have to edit to supply the proper pathnames.) When you enter the scoring program, type &quot;is&quot; so that the score buffer will contain detail tables (template by template) as well as the final summary table. Save the score buffer (*MUC Score Display*) to a file called scores-max-tradeoff.tst2.</Paragraph> <Paragraph position="8"> Refer to the scoring guidelines (distributed separately) for interactively assigning full and partial credit. Also refer to key-tst2notes (in the ftp directory) for NOSC's comments on how the answer key was generated.
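The concatenation step required after a parallel or restarted run (section 2.2, above) can be sketched in a short script. This is an illustrative aid only, not part of the official tools: the per-portion file names used here are hypothetical, and only the final name, response-max-tradeoff.tst2, comes from these instructions.

```python
# Sketch: joining per-portion response-template files, in original
# message order, into the single file the scoring program expects.
# The "response-part*.tst2" naming scheme below is an assumption.
from pathlib import Path

def concatenate_responses(part_files, output_path):
    """Append each partial response file, in the order given,
    to a single output file."""
    with open(output_path, "w") as out:
        for part in part_files:
            out.write(Path(part).read_text())

# Usage: the parts must be listed in original message order, e.g.
#   concatenate_responses(
#       sorted(Path(".").glob("response-part*.tst2")),
#       "response-max-tradeoff.tst2")
```

Whatever naming you use for the portions, the point is that the scoring program sees one file covering all 100 messages.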
See section 5, below, for information on the plans for handling the rescoring of results.</Paragraph> <Paragraph position="9"> Following the instructions in the user manual for the scoring program, save the history to a file called history-max-tradeoff.tst2.</Paragraph> </Section> </Section>
<Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 2.3.2 SCORING SPECIFIC SETS OF SLOTS FOR THE EXPERIMENTAL, REQUIRED LINGUISTIC PHENOMENA TESTS </SectionTitle> <Paragraph position="0"> Read the file readme.phentest. Run the scoring program again for each of the linguistic phenomena tests, i.e., type the configuration file names that appear in the test package in sequence as the argument to the function initialize-muc-scorer.</Paragraph> <Paragraph position="1"> (These files must be edited to provide the proper pathnames for your environment.) Scoring for the phenomena testing should be done using the history file created when all templates were scored. No updates to the history file should be made during these runs. Save each score buffer (*MUC Score Display*) to the file name scores-<phenomenon test name>-max-tradeoff.tst2, where <phenomenon test name> matches the names in the config files.</Paragraph> </Section>
<Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. SPECIFIC PROCEDURES FOR OPTIONAL TESTING </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 WITH MODIFIED SYSTEM CONTROL PARAMETERS FOR ALL TEMPLATES </SectionTitle> <Paragraph position="0"> For each optional run, modify the system as specified IN ADVANCE to NOSC. Then follow the procedures described in section 1.2 and section 2. Save the system response templates to files with unique, meaningful names.
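The naming conventions for these files (given later in this section) can be captured in a small helper. This is an illustration only: the pattern strings come from the instructions, while the function name and the example run name are invented for the sketch.

```python
# Illustrative helper (not part of the official tools): builds the
# three file names for one optional test run, following the section
# 3.1 patterns response-<name>.tst2, scores-<name>.tst2, and
# history-<name>.tst2. The function name is our own.
def optional_run_filenames(meaningful_name):
    """Return the matched response/scores/history names for a run."""
    return {
        "response": f"response-{meaningful_name}.tst2",
        "scores": f"scores-{meaningful_name}.tst2",
        "history": f"history-{meaningful_name}.tst2",
    }

# Example with a hypothetical run name:
#   optional_run_filenames("loose-constraints")
```

Deriving all three names from one meaningful string keeps each scores and history file trivially matched to its response file, which is what the scoring procedure below requires.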
When you do the scoring, start the scoring program each time with the history file generated during the required testing (minus the manually remapped templates, since you may wish to change them), and save the history when you have finished scoring (whether it was updated or not) and the scores to files with names that permit them to be matched up with the corresponding system response template file.</Paragraph> <Paragraph position="1"> Once you have determined which of the optional runs to submit to NOSC for the official record, name the files for those runs in some meaningful, easily understood fashion (fitting these patterns: response-<meaningful name here>.tst2, scores-<meaningful name here>.tst2, and history-<meaningful name here>.tst2) and provide them along with a readme file that explains the significance of the files and identifies their corresponding parameter settings.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 FOR LINGUISTIC PHENOMENA TESTS, USING MODIFIED SYSTEM CONTROL PARAMETERS </SectionTitle> <Paragraph position="0"> After you have produced the files listed at the end of section 3.1, above, follow the procedures in section 2.3.2 if you wish to produce separate linguistic phenomena test results for any/all of them. Use the history file corresponding to each of those response files.</Paragraph> <Paragraph position="1"> Please submit these linguistic phenomena test scores to NOSC only if they are significantly different from those produced for the required testing. If you do submit these scores, name the file for each of the phenomena tests to correspond with the appropriate response file, using the following pattern: scores-<phenomenon test name>-<meaningful name here>.tst2.</Paragraph> </Section> </Section>
<Section position="10" start_page="0" end_page="0" type="metho"> <SectionTitle> 4.
REPORTS TO BE SUBMITTED TO NOSC BY MONDAY MORNING, MAY 13 </SectionTitle> <Paragraph position="0"> All results submitted to NOSC are considered &quot;official,&quot; with the exception of the results of the linguistic phenomena testing, which are considered &quot;experimental.&quot; All results, whether official or experimental, may be included, in part or in full, in publications resulting from MUC-3. However, only the official results may be used for any comparative ranking or rating of systems. The proper means of using the official results for that purpose will be discussed during the conference at NOSC. The results of the linguistic phenomena testing are to be used only to gain insight into the linguistic performance of individual systems and into the testing methodology.</Paragraph> <Paragraph position="1"> The files listed below are to be submitted to NOSC by Monday morning, May 13, via email to sundheim@nosc.mil. TO HELP NOSC FILE THE MESSAGES ACCURATELY, PLEASE</Paragraph> </Section>
<Section position="11" start_page="0" end_page="0" type="metho"> <SectionTitle> SUBMIT EACH FILE IN A SEPARATE MESSAGE, AND IDENTIFY YOUR ORGANIZATION AND THE FILE NAME IN THE SUBJECT LINE OF THE MESSAGES. 4.1 REQUIRED TESTING (MAXIMIZED RECALL/PRECISION TRADEOFF) </SectionTitle> <Paragraph position="0"> 1. response-max-tradeoff.tst2 2. history-max-tradeoff.tst2 3. scores-max-tradeoff.tst2 4. trace-max-tradeoff.tst2 (system trace for the 100 messages) -- You may submit whatever you think is appropriate, i.e., whatever would serve to help validate the results of testing. If the traces are voluminous and you do not wish to email them, please compress them and ftp them to the /pub directory; send sundheim@nosc.mil an email message to identify the file name. 5.
scores-<phenomenon test name>-max-tradeoff.tst2 -- where <phenomenon test name> matches the names in the config files (see readme.phentest)</Paragraph> </Section>
<Section position="12" start_page="0" end_page="0" type="metho"> <SectionTitle> 4.2 OPTIONAL TESTING (OTHER RECALL/PRECISION TRADEOFFS) </SectionTitle> <Paragraph position="0"> Items 1-5, below, are required for EACH optional test run that is reported to NOSC.</Paragraph> <Paragraph position="1"> 1. history-<meaningful name here>.tst2 2. response-<meaningful name here>.tst2 3. scores-<meaningful name here>.tst2 4. readme-optional-testing.tst2 -- See section 3.1, above.</Paragraph> <Paragraph position="2"> 5. trace-<meaningful name here>.tst2 -- See note in section 4.1, above. 6. scores-<phenomenon test name>-<meaningful name here>.tst2 -- where <phenomenon test name> matches the names in the config files (see readme.phentest).</Paragraph> <Paragraph position="3"> Submit these scores only if significantly different from those obtained for the required testing.</Paragraph> </Section>
<Section position="13" start_page="0" end_page="0" type="metho"> <SectionTitle> 5.0 RESCORING OF RESULTS </SectionTitle> <Paragraph position="0"> The interactive scoring that is done during testing should be done in strict conformance to the scoring guidelines. If you perceive errors in the guidelines or in the answer keys as you are doing the scoring, please make note of them and send a summary to NOSC along with the items listed in section 4, above. When all the results are in, NOSC will attempt to merge everyone's history-max-tradeoff.tst2 files and rescore everyone's response-max-tradeoff.tst2 files.</Paragraph> <Paragraph position="1"> Your notes on perceived errors may be useful to NOSC at that time.
If the errors are not easy to rectify and if they appear serious enough to significantly affect the legitimacy of the scoring, we may have to wait until after the conference to rectify them and rescore the response templates at that time. THE RESULTS OF RESCORING BEFORE AND/OR AFTER THE</Paragraph> </Section>
<Section position="14" start_page="0" end_page="0" type="metho"> <SectionTitle> CONFERENCE WILL BECOME THE OFFICIAL RESULTS. </SectionTitle> <Paragraph position="0"/> </Section> </Paper>