<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1064">
<Title>NEAL-MONTGOMERY NLP SYSTEM EVALUATION METHODOLOGY</Title>
<Section position="3" start_page="0" end_page="323" type="intro">
<SectionTitle> 1. INTRODUCTION </SectionTitle>
<Paragraph position="0"> An appreciable drawback to current corpus-based (e.g., [BBN, 1988], [Flickinger et al., 1987], [Hendrix et al., 1976], [Malhotra, 1975]) and task-based (e.g., ["Proceedings", 1991]) methodologies for evaluating Natural Language Processing systems is the requirement that the system be transported to a test domain. Porting is expensive and time-consuming, and because the port may be minimal or incomplete, the evaluation may be based on a demonstration of less than the full potential of the system. Further, current evaluation methodologies do not fully elucidate NLP system capabilities for possible future applications.</Paragraph>
<Paragraph position="1"> Under contract to Rome Laboratory, Dr. Jeannette Neal (Calspan Corporation) and Dr. Christine Montgomery (Language Systems Incorporated) are in the final months of developing an NLP system evaluation methodology that produces descriptive, objective profiles of system linguistic capabilities without requiring adaptation of the system to a new domain. The evaluation methodology is intended to produce consistent results across varied human users.</Paragraph>
<Section position="1" start_page="0" end_page="323" type="sub_section">
<SectionTitle> 1.1. Evaluation Methodology Description </SectionTitle>
<Paragraph position="0"> Within the Neal-Montgomery NLP System Evaluation Methodology, each identified linguistic feature (lexical, syntactic, semantic, or discourse) is first carefully defined and explained in order to establish a standard delimitation of the feature. Illustrative language patterns and sample sentences then guide the human evaluator in formulating an input that tests the feature on the NLP system within the system's native domain.</Paragraph>
<Paragraph position="1"> Based on clear and specific evaluation criteria for test item inputs, NLP system responses are scored as follows: S: The system successfully met the stated criteria and demonstrated understanding with respect to the feature under test.</Paragraph>
<Paragraph position="2"> C: The system responded in a way that was correct (that is, it correctly answered the question posed), but the criteria were not met.</Paragraph>
<Paragraph position="3"> P: The system responded in a way that was only partially correct.</Paragraph>
<Paragraph position="4"> F: The system responded in a way that was incorrect, failing to meet the criteria.</Paragraph>
<Paragraph position="5"> N: The system was unable to accept the input or form a response (for example, the system vocabulary lacks appropriate words to complete a test input).</Paragraph>
<Paragraph position="6"> Each linguistic feature is tested by more than one methodology item to ensure that results are not based on spurious responses, and each item examines only one as-yet-untested capability or one as-yet-untested combination of capabilities. Test inputs that depend on capabilities previously shown to be unsuccessful are avoided. Scores are then aggregated into percentages for hierarchically structured classes of linguistic capabilities, producing descriptive profiles of NLP systems. The profiles can be viewed at varying levels of granularity. Figure 1 shows a sample system profile from the top level of the hierarchy.</Paragraph>
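To make the aggregation concrete, here is a minimal Python sketch of how per-item scores might be rolled up a capability hierarchy into percentage profiles. The score codes (S, C, P, F, N) come from the methodology above; the capability class names, the decision to count only 'S' as a success, and all identifiers are illustrative assumptions rather than the project's actual implementation.

```python
from collections import Counter

# Score codes from the methodology: S, C, P, F, N.
# Assumption: only 'S' (criteria met) counts toward the percentage.
SUCCESS_CODES = {"S"}

def profile(items):
    """items: (capability_path, score) pairs, where capability_path is a
    tuple running from top-level class down to the detailed feature,
    e.g. (("Syntax", "Relative Clauses"), "S")."""
    totals, successes = Counter(), Counter()
    for path, score in items:
        # Credit each item to every ancestor class so that percentages
        # roll up from the most detailed level to the top of the hierarchy.
        for depth in range(1, len(path) + 1):
            cls = path[:depth]
            totals[cls] += 1
            successes[cls] += score in SUCCESS_CODES
    return {cls: 100.0 * successes[cls] / totals[cls] for cls in totals}

# Hypothetical scored items for two top-level capability classes.
scores = [
    (("Syntax", "Relative Clauses"), "S"),
    (("Syntax", "Coordination"), "F"),
    (("Semantics", "Quantifiers"), "P"),
]
for cls, pct in sorted(profile(scores).items()):
    print("/".join(cls), f"{pct:.0f}%")
```

Viewing the profile at a coarser granularity then amounts to reading off the percentages for shorter capability paths.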
<Paragraph position="7"> Note that the scoring nomenclature above has been refined and expanded since the project experiments that produced the profiles and results presented in this paper. In Figures 1 and 2, "Unable to Compose Input" is equivalent to an 'N' in the newer nomenclature. A score of "Indeterminate" in the earlier nomenclature meant that the human evaluator could not determine whether the NLP system had correctly processed the test input. The new system of scores will be applied during the final project self-assessment activities.</Paragraph>
<Paragraph position="8"> The columns at the far right of Figure 1 display the total time (in hours and minutes) the user required to complete that section of the evaluation, and the average time per item (hours:minutes:seconds) for the section.</Paragraph>
<Paragraph position="9"> Figure 2 displays part of the evaluation at the methodology's most detailed level of granularity.</Paragraph>
</Section>
</Section>
</Paper>