File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/01/w01-0902_intro.xml
Size: 8,043 bytes
Last Modified: 2025-10-06 14:01:16
<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0902">
<Title>Empirical Methods for Evaluating Dialog Systems</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle> 2 Purpose </SectionTitle>
<Paragraph position="0"> Performance can be measured in myriad ways.</Paragraph>
<Paragraph position="1"> Indeed, for evaluating dialog systems, the one problem designers do not encounter is lack of choice. Dialog metrics come in a diverse assortment of styles. They can be subjective or objective, deriving from questionnaires or log files. They can vary in scale, from the utterance level to the overall dialog (Glass et al., 2000).</Paragraph>
<Paragraph position="2"> They can treat the system as a "black box," describing only its external behavior (Eckert et al., 1998), or as a "glass box," detailing its internal processing. If no single metric suffices, dialog metrics can be combined. For example, the PARADISE framework allows designers to predict user satisfaction from a linear combination of objective metrics such as mean recognition score and task completion (Kamm et al., 1999; Litman & Pan, 1999; Walker et al., 1997).</Paragraph>
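To make "a linear combination of objective metrics" concrete, the following is a rough sketch of the general form of the PARADISE performance function; the notation is ours, and the exact set of cost metrics varies across the cited studies:

\[
\mathit{Performance} \;=\; \alpha \cdot \mathcal{N}(\kappa) \;-\; \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i)
\]

Here \(\kappa\) is a task-completion measure, each \(c_i\) is a cost metric such as the number of utterances or the mean recognition score, \(\mathcal{N}\) denotes Z-score normalization, and the weights \(\alpha\) and \(w_i\) are estimated by multiple linear regression against user satisfaction ratings.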
<Paragraph position="3"> Why so many metrics? The answer has to do with more than just the absence of agreed-upon standards in the research community, notwithstanding significant efforts in that direction (Gibbon et al., 1997). Part of the reason lies in the purpose a dialog metric serves. Designers often have multiple, and sometimes inconsistent, needs. Four of the most typical needs are:
* Provide an accurate estimate of how well a system meets the goals of the domain task.</Paragraph>
<Paragraph position="4"> * Allow for comparative judgments of one system against another, and, if possible, across different domain tasks.</Paragraph>
<Paragraph position="5"> * Identify factors or components in the system that can be improved.</Paragraph>
<Paragraph position="6"> * Discover tradeoffs or correlations between factors.</Paragraph>
<Paragraph position="7"> The above list is, of course, not intended to be exhaustive. Its point is to highlight the kinds of obstacles designers are likely to face in trying to satisfy even these typical needs. Consider the first need.</Paragraph>
<Paragraph position="8"> Providing an accurate estimate of how well a system meets the goals of the domain task depends on how well the designers have delineated all the possible goals of interaction. Unfortunately, users often have finer-grained goals than those anticipated by designers, even for domain tasks that seem well defined, such as airline ticket reservation. For example, a user may be leisurely hunting for a vacation and not care about destination or time of travel, or the user may be frantically looking for an emergency ticket and not care about price. The "appropriate" dialog metric should reflect this kind of subtlety: while "time to completion" is more appropriate for emergency tickets, "concept efficiency rate" is more appropriate for the savvy vacationer. As psychologists have long recognized, when people engage in conversation, they make sure that they mutually understand the goals, roles, and behaviors that can be expected (Clark, 1996; Clark & Brennan, 1991; Clark & Schaefer, 1989; Paek & Horvitz, 1999, 2000). They evaluate the "performance" of the dialog based on this mutual understanding and these expectations.</Paragraph>
<Paragraph position="9"> Not only do different users have different goals; they sometimes have multiple goals, and, more often, their goals change dynamically in response to system behavior such as communication failures (Danieli & Gerbino, 1995; Paek & Horvitz, 1999). Because goals engender expectations that in turn influence evaluation at different points in time, usability ratings are notoriously hard to interpret, especially if the system is not equipped to infer and keep track of user goals (Horvitz & Paek, 1999; Paek & Horvitz, 2000).</Paragraph>
<Paragraph position="10"> The second typical need, allowing for comparative judgments, introduces yet further obstacles. In addition to unanticipated, dynamically changing user goals, different systems employ different dialog strategies operating under different architectural constraints, rendering the search for dialog metrics that generalize across systems a lofty, if not unattainable, pursuit. While the PARADISE framework facilitates some comparison of dialog systems in different domain tasks, generalization is limited because different architectural constraints preclude certain factors in the statistical model (Kamm et al., 1997). For example, although the ability to "barge in" turns out to be a significant predictor of usability, many systems do not support it. Task completion based on the kappa statistic (sketched below) appears to be a good candidate for a common measure, but only if every dialog system represents the domain task as an Attribute-Value Matrix (AVM). Unfortunately, that requirement excludes systems that use Bayesian networks or other non-symbolic representations. This has prompted some researchers to argue that a "common inventory of concepts" is necessary to have standard metrics for evaluation across systems and domain tasks (Kamm et al., 1997; Glass et al., 2000). As we discuss in the next section, the argument is actually backwards; we can use the metrics we already have to define a common inventory of concepts. Furthermore, with the proper set of descriptive statistics, we can exploit these metrics to address the third and fourth typical needs of designers: identifying contributing factors, along with their tradeoffs, and optimizing them.</Paragraph>
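For readers unfamiliar with the kappa statistic mentioned above, the following is a minimal sketch of how kappa-based task completion can be computed from an AVM confusion matrix. The function name, the toy matrix, and the use of Cohen's standard chance-agreement term are our own illustrative assumptions, not code from any of the cited systems.

```python
def kappa(confusion):
    """Cohen's kappa over an AVM confusion matrix.

    confusion[i][j] = number of attribute values for which the scenario
    key held value i while the dialog log recorded value j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # P(A): observed agreement, i.e., the proportion of values on the diagonal.
    p_agree = sum(confusion[i][i] for i in range(n)) / total
    # P(E): agreement expected by chance, from the row and column marginals.
    p_chance = sum(
        (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
        for i in range(n)
    )
    return (p_agree - p_chance) / (1 - p_chance)


# Hypothetical attribute with three possible values, for illustration only.
matrix = [[10, 1, 0],
          [2, 8, 1],
          [0, 1, 7]]
print(round(kappa(matrix), 3))  # 1.0 would indicate perfect task completion
```

Because kappa corrects for the agreement expected by chance, it is less sensitive than raw task success to the size and skew of the attribute-value space, which is precisely what makes it attractive as a common measure.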
<Paragraph position="11"> This is not to say that comparative judgments are impossible; rather, it takes careful work to make them meaningful. When research papers describe evaluation studies of the performance of dialog systems, it is imperative that they provide a baseline comparison from which to benchmark their systems. Even when readers understand the scale of the metrics being reported, without a baseline the numbers convey very little about the quality of experience users can expect of the system. For example, suppose a paper reports that a dialog system received an average usability score of 9.5/10, a high concept efficiency rate of 90%, and a low word error rate of 5%. The numbers sound terrific, but they could simply reflect low user expectations stemming from a simplistic interface. Practically speaking, to make sense of the numbers, readers either have to experience interacting with the system themselves, or have a baseline comparison for the domain task. This is true even if the paper reports a statistical model for predicting one or more of the dialog metrics from the others; such a model may reveal tradeoffs, but not how well the system performs relative to a baseline.</Paragraph>
<Paragraph position="12"> To sum up, in considering the purpose a dialog metric serves, we examined four typical needs and discussed the kinds of obstacles designers are likely to face in finding a dialog metric that satisfies those needs. The obstacles themselves present distinct challenges: first, keeping track of user goals and the performance expectations based on those goals; and second, establishing a baseline from which to benchmark systems and make comparative judgments.</Paragraph>
<Paragraph position="13"> Assuming that designers equip their systems to handle the first challenge, we now propose empirical methods that allow them to handle the second. These methods do not require new dialog metrics, but instead take advantage of existing ones through experimental design and a basic set of descriptive statistics. They also provide a practical means of optimizing the system.</Paragraph>
</Section>
</Paper>