<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1005">
  <Title>Experiments in Evaluating Interactive Spoken Language Systems</Title>
  <Section position="8" start_page="279" end_page="279" type="concl">
    <SectionTitle>
CONCLUSIONS
</SectionTitle>
    <Paragraph position="0"> The results of these experiments are very encouraging. We believe that it is possible to define metrics that measure the performance of interactive systems in the context of interactive problem solving. We have had considerable success in designing end-to-end task completion tests. We have shown that it is possible to design such scenarios, that subjects can successfully perform the designated task in most cases, and that we can define objective metrics, including time to task completion, number of queries, and number of system non-responses. In addition, these metrics appear to be correlated. To assess correctness of system response, we have shown that evaluators can reach better than 90% agreement when judging the correctness of responses based on examination of query/answer pairs from the log file. We have implemented an interactive tool to support this evaluation and have used it in two separate experiments. Finally, we demonstrated the utility of these metrics in characterizing two systems. There was good correspondence between how effective the system was in helping the user arrive at a correct answer for a given task and metrics such as time to task completion, number of queries, and percentage of correctly answered queries (based on log file evaluation). These metrics also indicated that system behavior may not be uniform over a range of scenarios: the robust parsing system performed better on three scenarios but had a worse DARPA score on the fourth (and probably most difficult) scenario. Based on these experiments, we believe that these metrics provide a basis for evaluating spoken language systems in a realistic interactive problem-solving context.</Paragraph>
  </Section>
</Paper>