<?xml version="1.0" standalone="yes"?> <Paper uid="H89-1032"> <Title>PLANS FOR A TASK-ORIENTED EVALUATION OF NATURAL LANGUAGE UNDERSTANDING SYSTEMS</Title> <Section position="4" start_page="197" end_page="198" type="metho"> <SectionTitle> THE TEXT CORPUS </SectionTitle> <Paragraph position="0"> It is important that the amount of time and effort required for a system to be able to participate in the evaluation be as short as possible. For this reason, a serous effort has been made to collect texts in a narrow domain and to provide types of documentation that will reduce the amount of knowledge acquisition and engineering required. We have selected and prepared documentation on a set of 155 Navy messages written in a format known as OPREP-3 Pinnacle/Front Burner (OPREP-3 PFB), whose use and format are prescribed in OPNAVINST 3100.6D, &quot;Special Incident Reporting,&quot; an unclassified Navy instruction. The examples selected concern encounters among aircraft, surface ships, and/or submarines. The encounters range in intensity from simple detection through overt hostility directed toward one of these &quot;platform&quot; types or an ashore facility. The nature of these messages is felt to be constrained in domain but not overly specialized.</Paragraph> <Paragraph position="1"> OPREP-3 PFBs consist of several different paragraphs, each containing a prescribed type of information. The format of the information provided in each paragrph is generally unrestricted, and much of the information is supplied by message originators as free-form English text. The three major free-text paragraphs are (1) a narrative account of the incident, (2) a description of casualties suffered by personnel and equipment, and (3) miscellaneous remarks on the incident.</Paragraph> <Paragraph position="2"> The OPREP-3 PFBs in the corpus have many features which make them tractable texts for current NLP systems: . They usually report on one or more closely-related general topics. The reported events fit into a fairly circumscribed set of scenarios concerning basic kinds of interaction between opposing forces of different types. Thus, the vocabulary is relatively limited, and so are the semantics of the domain.</Paragraph> <Paragraph position="3"> .</Paragraph> <Paragraph position="4"> .</Paragraph> <Paragraph position="5"> They contain little speculation. At least in the narrative line, the author is attempting to report events as they occurred and not to speculate on those events. Thus, there is not too much in the way of complex constructions that convey an analysis (e.g. &quot;\[I\] Believe that \[the\] attack was successful.&quot;). They contain little embellishing information. They typically give only time, location and sensor/weapon information to supplement the recounting of the events. The succinct style preferred for Navy messages discourages the use of nonessential descnptve or qualifying expressions. This further reduces the number of different English constructions that a system would need to be able to syntactically parse, and restricts semantic interpretation mainly to representing fundamental attributes of agent, object, time, place, and instrument.</Paragraph> <Paragraph position="6"> 4. They stick basically to one topic per message. For the most part, it is not necessary to unravel a complex story, matching various events with different agents, objects, etc., and figuring out the time sequence. 
<Paragraph position="3"> Of course, there are also reasons why the text portions of OPREP-3 PFBs are in some ways very difficult to analyze. Some of the more superficial features that distinguish them from standard expository texts are
1. Poorer than average use of punctuation. Periods, especially, are sometimes omitted, leading to run-on sentences and increased amounts of ambiguity.
2. Heavy evidence of ellipsis (telegraphic style). Subjects, objects, articles, and prepositions are frequently omitted.
3. Use of special constructions, e.g., for representing time, date, and location.
4. Frequent misspellings. These are much more evident than in highly edited texts.</Paragraph>
<Paragraph position="4"> Some of the difficult distinguishing semantic features of OPREP-3 PFBs are
1. Assumption of knowledge of a specialized domain. The events, objects, and relationships in the Navy domain, e.g., what types of weapons can be used by what type of ship for what purpose, are not common knowledge. Frequently, the meaning of some part of a narrative will be somewhat ambiguous or vague to a nonspecialist, but completely clear to a knowledgeable person. Until a system developer has acquired a sound knowledge of the domain and has imparted it to the system's knowledge bases, the system is unlikely to perform any task very well.
2. Assumption of knowledge of the contents of other paragraphs in the message. The narrative paragraphs are not intended to stand alone. The first paragraph of the message, for example, alerts the reader to the general subject of the message, so the narrative may omit some information that it would otherwise have included. That information may not be absolutely necessary for understanding the narrative in isolation but would help at least to reduce the degree of vagueness and ambiguity that the reader or system must resolve.</Paragraph>
<Paragraph position="5"> INPUTS TO NLP SYSTEMS: DEVELOPMENT AND TEST SETS </Paragraph>
<Paragraph position="6"> A total of 155 OPREP-3 PFBs are in the current corpus. Of these, 105 have been designated as development (i.e., training) data, and 50 have been set aside as test data. The current plan is to divide the test data into two sets of 25 messages each so that they can be used at different times in the future.</Paragraph>
<Paragraph position="7"> The corpus has been subdivided into four groups, according to the types of platforms involved in the interaction. There is one group each for incidents involving aircraft, surface ships, submarines, and land targets. The test data includes examples from each of these groups, in numbers proportional to the number of messages the development set contains for each group.</Paragraph>
<Paragraph position="8"> The inputs to the NLP systems are expected to be the OPREP-3 PFB narrative lines only. The intent is to limit the input to free text and to about one paragraph in length. In that way, the task will focus on text understanding capabilities in general rather than on the understanding of a specialized message format, and it will include some, but not overwhelming, challenges for discourse-level processing.</Paragraph>
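As a rough illustration of the proportional allocation described above, the sketch below divides the 50 test messages among the four platform groups in proportion to development-set counts. The per-group counts used here are hypothetical (the actual breakdown of the 105 development messages is not reported), and the function is an invented helper, not part of the evaluation materials.

```python
# Illustrative only: the per-group counts below are hypothetical stand-ins
# for the actual (unreported) breakdown of the 105 development messages.
from math import floor

def proportional_test_allocation(dev_counts: dict[str, int], test_total: int) -> dict[str, int]:
    """Allocate test messages to groups in proportion to development counts."""
    dev_total = sum(dev_counts.values())
    # Start with the floor of each group's proportional share.
    alloc = {g: floor(test_total * n / dev_total) for g, n in dev_counts.items()}
    # Give any remainder to the groups with the largest fractional shares.
    remainder = test_total - sum(alloc.values())
    by_fraction = sorted(dev_counts,
                         key=lambda g: (test_total * dev_counts[g] / dev_total) % 1,
                         reverse=True)
    for g in by_fraction[:remainder]:
        alloc[g] += 1
    return alloc

dev_counts = {"aircraft": 40, "surface": 30, "submarine": 20, "land": 15}  # hypothetical
print(proportional_test_allocation(dev_counts, test_total=50))
# {'aircraft': 19, 'surface': 14, 'submarine': 10, 'land': 7}
```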
<Paragraph position="9"> As an alternative to the verbatim narrative lines, a set of modified versions is being prepared. The purpose is to allow systems that have not dealt extensively with the problems of telegraphic, often ill-formed texts to participate in the evaluation without having to undergo the extensive development effort that would be required before they could be expected to have much success with the original narratives. Modifications will be made that minimize the superficial problems identified in the previous section (ellipsis, bad punctuation, specialized notation, and misspellings). The evaluation of a system may be carried out using either the verbatim narratives or the modified versions, or both. For those systems which can analyze the unmodified inputs, a partial measurement of system utility can be obtained.</Paragraph> </Section> <Section position="5" start_page="198" end_page="199" type="metho"> <SectionTitle> OUTPUTS FROM NLP SYSTEMS: DESCRIPTION OF THE TEMPLATE FILL TASK </SectionTitle>
<Paragraph position="0"> The outputs are in the form of templates, simulating a simple database. No formal database management system is required. The software which must be developed especially for the benchmark test is a back end that takes the results of the analysis and extracts or derives the desired information to fill the slots in the template. This process is portrayed graphically below:</Paragraph> </Section> <Section position="6" start_page="199" end_page="199" type="metho"> <SectionTitle> NARRATIVES -> MODULES -> DERIVER -> OUTPUTS: TEMPLATE FILLS (DB UPDATES) </SectionTitle>
<Paragraph position="0"> The intention is that the back-end module required for the task be quite small and simple, since the test is meant to focus on understanding capabilities, not on the sophistication of the system's database update capabilities.</Paragraph>
<Paragraph position="1"> Systems will have to have mechanisms for mapping many kinds of data into canonical forms (see below), but there is no requirement for performing calculations on the data or for other non-linguistic manipulation of the data.</Paragraph>
<Paragraph position="2"> The simulated database that will be created by the NLP systems is intended to capture basic information about events that are of significant interest. The events that will cause the system to fill in a template concern hostile or potentially hostile encounters between one or more members of the U.S. forces and one or more members of an enemy force -- detecting the enemy, tracking it, targeting it, harassing it, or attacking it. A template is also to be filled in if the action goes the opposite direction, i.e., where it is the enemy platform that is detecting, tracking, targeting, harassing, or attacking. Thus, the simulated database consists of the equivalent of two tables, one where the U.S. force carries out the action, and one where the enemy force carries out the action.</Paragraph>
<Paragraph position="3"> Each time a new template is filled out, the equivalent of a new record is created for that table.</Paragraph>
<Paragraph position="4"> Not all OPREP-3 PFBs report one of the events mentioned above, however. There are some which report intentions rather than past events, and others which report events that are &quot;not of interest&quot; to the database. Only the MESSAGE ID and EVENT slots (see below) should be filled out in these cases. This provides a check on the degree of understanding that a system is capable of, since there are times when a system that depended too heavily on key words, such as &quot;attack,&quot; would mistakenly fill out a template.</Paragraph> </Section>
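The following is a minimal sketch of the kind of small, simple back-end "deriver" the test calls for. The `Analysis` structure is a hypothetical stand-in for whatever semantic representation a participating NLP system produces (the paper prescribes only the template output, not any internal representation), and all slot names other than MESSAGE ID and EVENT are assumptions made for readability.

```python
from dataclasses import dataclass

EVENTS_OF_INTEREST = ("DETECT", "TRACK", "TARGET", "HARASS", "ATTACK")

@dataclass
class Analysis:
    """Hypothetical output of the NLP modules, for illustration only."""
    message_id: str
    event_type: str
    agent: str | None = None
    object_: str | None = None
    instrument: str | None = None
    location: str | None = None
    time: str | None = None

def derive_template(a: Analysis) -> dict:
    """Map one analysis into template slot fills (one simulated DB record)."""
    template = {"MESSAGE ID": a.message_id}
    if a.event_type not in EVENTS_OF_INTEREST:
        # Reports of intentions, or events "not of interest": only the
        # MESSAGE ID and EVENT slots are filled out.
        template["EVENT"] = "OTHER"
        return template
    template["EVENT"] = a.event_type
    # Remaining slots fall back to NO DATA when the narrative is silent.
    for slot, value in (("AGENT", a.agent), ("OBJECT", a.object_),
                        ("INSTRUMENT", a.instrument), ("LOCATION", a.location),
                        ("TIME", a.time)):
        template[slot] = value if value is not None else "NO DATA"
    return template
```

The point of keeping this module so small is exactly the one the paper makes: the benchmark should measure understanding, not database machinery.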
<Section position="7" start_page="199" end_page="199" type="metho"> <SectionTitle> SPECIFICATION OF THE TEMPLATE SLOTS </SectionTitle>
<Paragraph position="0"> The template used in the benchmark test bears little resemblance to a comprehensive template schema such as that used by Logicon's Data Base Generator system for storing information on space event messages. It is intentionally simple, in an attempt to limit the amount of specialized back-end software the task requires, to limit the anticipated confusion and debate among system developers over what the expected &quot;right answers&quot; are, and to increase the comprehensibility of the output for all concerned. Unfortunately, by keeping the template simple, some specificity is lost that one would like to have in a database.</Paragraph>
<Paragraph position="1"> There are ten main slots in the template, plus one to identify the message that the data comes from. The slots and their fill requirements are given below. The slots are meant to provide answers to the questions What? Who? How? Where? When? With what outcome? The expected fill for each slot falls into one of two categories: selection of an item from a set list of possible answers, or strings (phrases) from the input text. As many of the fills as possible will come from predefined sets of possible names and categories. For the nomenclature identifying specific agents, objects, instruments, and locations, there will be correspondence tables that can be implemented to output a canonical form of identification.</Paragraph>
<Paragraph position="2"> Slot #1, which answers the question What?, is intended to indicate how serious the incident is by identifying the greatest level of hostility reported. In ascending order of hostility, the events are DETECT, TRACK, TARGET, HARASS, and ATTACK. The other possible fill for that slot is OTHER, meaning that the event is not of interest to the database. The remainder of the template should be left blank in that case. If the event is of interest to the database, the rest of the slots should be filled in; if information is not available for any of them, the phrase NO DATA should be entered. The fill requirements for the remaining slots are:
Slot #2: FRIENDLY, HOSTILE, else NO DATA
Slot #3: AIR, SURF, SUB, else NO DATA
Slot #4: AIR, SURF, SUB, LAND, else NO DATA
Slot #5: Canonical form of name(s), else taxonomic category name(s) or organizational entity I.D., else NO DATA
Slot #6: Same as slot 5
Slot #7: Same as slot 5, where item(s) is/are: 1. sensor - for CONTACT, TRACK, TARGET; 2. weapon - for HARASS, ATTACK
Slot #8: Canonical form of location name(s), or text string with absolute or relative location(s), else NO DATA
Slot #9: String with absolute time(s) of 1. use of sensor - for DETECT, TRACK, TARGET; 2. weapon launch or impact - for HARASS, ATTACK; else NO DATA
Slot #10: 1. RESPONSE BY OPPOSING FORCE; 2. HOLDING CONTACT, LOST CONTACT; 3. CONTINUING TO TRACK, STOPPED TRACKING; 4. HOLDING TARGET, LOST TARGET; 5. (NO) DAMAGE OR LOSS TO AGENT, (NO) DAMAGE OR LOSS TO OBJECT; 6. else, NO DATA</Paragraph>
<Paragraph position="3"> A number of problems arose in preparing examples of filled templates, e.g., questions of how many templates were warranted and cases where the answers were unclear or did not fit the requirements exactly. On the other hand, there were many cases where the task showed promise of providing significant insights into the ability of NLP systems to correlate data, make inferences, filter out negative cases, and accommodate complex or ill-formed structures.</Paragraph> </Section>
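Below is a sketch of how the template and the correspondence tables might be realized, keyed to the slot numbers and fill vocabularies above. Only MESSAGE ID, EVENT, the hostility ordering, and the fill vocabularies come from the text; the descriptive field names, the helper functions, and the sample table entries are assumptions made for illustration.

```python
from dataclasses import dataclass

# Ascending order of hostility for slot #1.
HOSTILITY_ORDER = ("DETECT", "TRACK", "TARGET", "HARASS", "ATTACK")

def greatest_hostility(reported_events: list[str]) -> str:
    """Slot #1 fill: the greatest level of hostility reported, else OTHER."""
    ranked = [e for e in reported_events if e in HOSTILITY_ORDER]
    return max(ranked, key=HOSTILITY_ORDER.index) if ranked else "OTHER"

# A correspondence table mapping nomenclature variants to a canonical
# identification, as slots 5-7 call for. The entries are invented examples.
CANONICAL_NAMES = {
    "USS SARATOGA": "SARATOGA (CV 60)",  # hypothetical entry
    "SARA": "SARATOGA (CV 60)",          # hypothetical entry
}

def canonical_form(mention: str | None) -> str:
    """Prefer the canonical name; fall back to the text string, else NO DATA."""
    if not mention:
        return "NO DATA"
    return CANONICAL_NAMES.get(mention.upper(), mention)

@dataclass
class Template:
    """One simulated database record. Field names beyond message_id and
    event are descriptive assumptions tied to the slot numbers above."""
    message_id: str
    event: str = "OTHER"          # slot 1: DETECT..ATTACK, or OTHER
    agent_force: str = "NO DATA"  # slot 2: FRIENDLY, HOSTILE
    agent_type: str = "NO DATA"   # slot 3: AIR, SURF, SUB
    object_type: str = "NO DATA"  # slot 4: AIR, SURF, SUB, LAND
    agent_id: str = "NO DATA"     # slot 5: canonical name(s)
    object_id: str = "NO DATA"    # slot 6: same as slot 5
    instrument: str = "NO DATA"   # slot 7: sensor or weapon
    location: str = "NO DATA"     # slot 8: canonical or text location
    time: str = "NO DATA"         # slot 9: absolute time string
    outcome: str = "NO DATA"      # slot 10: outcome phrases
```

For example, `greatest_hostility(["DETECT", "HARASS"])` returns `"HARASS"`, matching the requirement that slot #1 record only the most hostile level reported.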
<Section position="8" start_page="199" end_page="201" type="metho"> <SectionTitle> TEST PARAMETERS AND MEASURES </SectionTitle>
<Paragraph position="0"> Several different measurements can be obtained from tests using the OPREP-3 PFB corpus. These can be termed &quot;recall,&quot; &quot;precision,&quot; &quot;generality/potential,&quot; &quot;utility,&quot; and &quot;progress.&quot; The table below describes how measurements of them will be obtained and summarizes their significance as evaluation measures. Tests will be conducted by the system developers at their own sites at two different times. They will test the system upon receipt of 25 test narratives, which will come after a two-month period of updates for the development set. At that time, tests will be run separately for the development and test sets. After an additional month of updating to better handle the test set, the test will be rerun. As a final data point and stimulus for discussion, approximately 10 previously unseen narratives will be run by developers at the meeting following the period of updating. These narratives will be manufactured to be variations of narratives already seen, using the same situations and terminology in novel ways.</Paragraph> </Section> </Paper>
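This section names recall and precision but does not spell out how they are computed over template fills, so the sketch below assumes the conventional slot-fill definitions: recall as correct fills divided by fills expected by the answer key, and precision as correct fills divided by fills the system generated. The example templates are invented.

```python
# Assumed scoring definitions, not a specification from the paper:
# recall = correct fills / expected fills; precision = correct fills / generated fills.

def score_templates(system: dict[str, str], key: dict[str, str]) -> tuple[float, float]:
    """Compare a system-filled template against an answer-key template.

    Slots the key leaves as NO DATA are not counted as expected fills;
    slots the system leaves as NO DATA are not counted as generated fills.
    """
    expected = {s for s, v in key.items() if v != "NO DATA"}
    generated = {s for s, v in system.items() if v != "NO DATA"}
    correct = sum(1 for s in expected & generated if system[s] == key[s])
    recall = correct / len(expected) if expected else 1.0
    precision = correct / len(generated) if generated else 1.0
    return recall, precision

key = {"EVENT": "ATTACK", "AGENT": "SARATOGA (CV 60)", "TIME": "NO DATA"}
sys_out = {"EVENT": "ATTACK", "AGENT": "NO DATA", "TIME": "121300Z"}
print(score_templates(sys_out, key))
# (0.5, 0.5): one of two expected slots was filled correctly, and one of the
# two slots the system filled matches the key.
```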