<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2019">
<Title>A PROPOSAL FOR SLS EVALUATION</Title>
<Section position="2" start_page="0" end_page="135" type="intro">
<SectionTitle>1 INTRODUCTION</SectionTitle>
<Paragraph position="0"> The DARPA community has recently moved forward in beginning to define methods for common evaluation of spoken language systems. We consider the existing consensus to include at least the following points:
* Common evaluation involves working on a common domain (or domains). A common corpus of development queries (in both spoken and transcribed form), and answers to those queries in some canonical format, are therefore required.</Paragraph>
<Paragraph position="1"> * One basis for system evaluation will be answers to queries from a common database, perhaps in addition to other measures.</Paragraph>
<Paragraph position="2"> * Automatic evaluation methods should be used whenever they are feasible.</Paragraph>
<Paragraph position="3"> * System output will be scored by NIST, though all sites will be able to use the evaluation program internally.
* Development and test corpora should be subdivided into several categories to support different kinds of evaluation (particularly concerning discourse phenomena).
An implicit assumption here is that we are considering database query systems, rather than any of the various other natural language processing domains (message understanding, command and control, etc.). Evaluating systems for these other domains will naturally require other evaluation procedures.</Paragraph>
<Paragraph position="4"> Building on the points of consensus listed above, this proposal presents an evaluation procedure for the DARPA Common Task which is essentially domain-independent. The key component is a program, designated the Comparator, for comparing canonical answers to the answers supplied by a Spoken Language System. A specification for such answers, which incorporates the requirements of the Comparator, is presented in Section 2. This specification, called the Common Answer Specification (CAS), is not intended to be suitable for interactive systems; rather, it is designed to facilitate automatic evaluation. While we have attempted to cover as broad a range of queries and phenomena as possible, data which fall outside the scope of the CAS can simply be left out of test corpora for now.</Paragraph>
<Paragraph position="5"> Section 3 presents some of the justification supporting the proposal in Section 2, as well as amplifying several points. Details on the Comparator are given in Section 4. Section 5 concludes with a discussion of corpus development, looking at what kind of data should be collected and how corpora should be annotated to facilitate testing various types of natural language.</Paragraph>
</Section>
</Paper>