<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0902">
  <Title>Empirical Methods for Evaluating Dialog Systems</Title>
  <Section position="4" start_page="0" end_page="2" type="metho">
    <SectionTitle>
3 Empirical methods
</SectionTitle>
    <Paragraph position="0"> If designers want to make comparative judgments about the performance of a dialog system relative to another system so that readers unacquainted with either system can understand the reported metrics, they need a baseline.</Paragraph>
    <Paragraph position="1"> Fortunately, in evaluating dialog between humans and computers, the &amp;quot;gold standard&amp;quot; is oftentimes known; namely, human conversation.</Paragraph>
    <Paragraph position="2"> The most intuitive and effective way to substantiate performance claims is to compare a dialog system on a particular domain task with how human beings perform on the same task.</Paragraph>
    <Paragraph position="3"> Because human performance constitutes an ideal benchmark, readers can make sense of the reported metrics by assessing how close the system approaches the gold standard.</Paragraph>
    <Paragraph position="4"> Furthermore, with a benchmark, designers can Figure 1. Wizard-of-Oz study for the purpose of establishing a baseline comparison.</Paragraph>
    <Paragraph position="5"> optimize their system through component analysis and cost valuation.</Paragraph>
    <Paragraph position="6"> In this section, we outline an experimental protocol for obtaining human performance data that can serve as a gold standard. We then highlight a basic set of descriptive statistics for substantiating performance claims, as well as for optimization.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Experimental protocol
</SectionTitle>
      <Paragraph position="0"> Collecting human performance data for establishing a gold standard requires conducting a carefully controlled wizard-of-oz (WOZ) experiment. The general idea is that users communicate with a human &amp;quot;wizard&amp;quot; under the illusion that they are interacting with a computational system. For spoken dialog systems, maintaining the illusion usually involves utilizing a synthetic voice to output wizard responses, often through voice distortion or a text-to-speech (TTS) generator.</Paragraph>
      <Paragraph position="1"> The typical use of a WOZ study is to record and analyze user input and wizard output. This allows designers to know what to expect and what they should try to support. User input is especially critical for speech recognition systems that rely on the collected data for acoustic training and language modeling. In iterative WOZ studies, previously collected data is used to adjust the system so that as the performance of the system improves, the studies employ less of the wizard and more of the system (Glass et al., 2000). In the process, design constraints in the interface may be revealed, in which case, further studies are  conducted until acceptable tradeoffs are found (Bernsen et al., 1998).</Paragraph>
      <Paragraph position="2"> In contrast to the typical use, a WOZ study for establishing a gold standard prohibits modifications to the interface or experimental &amp;quot;curtain.&amp;quot; As shown in Figure 1, all input and output through the interface must be carefully controlled. If designers want to use previously collected performance data as a gold standard, they need to verify that all input and output have remained constant. The protocol for establishing a gold standard is straightforward:  * Select a dialog metric to serve as an objective function for evaluation and optimization.</Paragraph>
      <Paragraph position="3"> * Vary the component or feature that best matches the desired performance claim for the dialog metric.</Paragraph>
      <Paragraph position="4"> * Hold all other input and output through the interface constant so that the only unknown variable is who does the internal processing.</Paragraph>
      <Paragraph position="5"> * Repeat using different wizards, making  sure that each wizard follows strict guidelines for interacting with subjects.</Paragraph>
      <Paragraph position="6"> To motivate the above protocol, consider how a WOZ study might be used to evaluate spoken dialog systems. As almost every designer has found, the &amp;quot;Achilles' heel&amp;quot; of spoken interaction is the fragility of the speech recognizer. System performance depends highly on the quality of the recognition. Suppose a designer is interested in bolstering the robustness of a dialog system by exploiting different types of repair strategies. Using task completion rate as an objective function, the designer varies the repair strategies utilized by the system. To make claims about the robustness of particular types of repair strategies, the designer must keep all other input and output constant. In particular, the protocol demands that the wizard in the experiment must receive utterances through the same speech recognizer as the dialog system. The performance of the wizard on the same quality of input as the dialog system constitutes the gold standard. The designer may also wish to keep the set of repair strategies constant while varying the use or disuse of the speech recognizer to estimate how much the recognizer alone degrades task completion rate.</Paragraph>
      <Paragraph position="7"> A deep intuition underlies the experimental control of the speech recognizer. As researchers have observed, people with impaired hearing or non-native language skills still manage to communicate effectively despite noisy or uncertain input. Unfortunately, the same cannot be said of computers with analogous deficiencies. People overcome their deficiencies by collaboratively working out the mutual belief that their utterances have been understood sufficiently for current purposes, a process referred to as &amp;quot;grounding&amp;quot; (Clark, 1996). Repair strategies based on grounding indeed show promise for improving the robustness of spoken dialog systems (Paek &amp; Horvitz, 1999; Paek &amp; Horvitz, 2000).</Paragraph>
      <Paragraph position="8">  In following the above protocol, we point out a few precautions. First, WOZ studies for establishing a gold standard work best with dialog systems that are highly modular. The more modular the architecture of the dialog system, the easier it will be to test components by replacing a particular module of interest with the wizard. Without modularity, it will be harder to guarantee that all other inputs and outputs have remained constant because component boundaries are blurred. Ironically, after a certain point, a high degree of modularity may in fact preclude the experimental protocol; components may be so specialized and quickly accessed by a system that it may not be feasible to replace that component with a human.</Paragraph>
      <Paragraph position="9"> A second precaution deals with the concept of a gold standard. What allows the performance of the wizard to be used as a gold standard is not the wizard, but rather the fact that the performance constitutes an upper bound. If an upper bound of performance has already been identified, then that is the gold standard. For example, graphical user interfaces (GUI) or touch-tone systems may represent a better gold standard for task completion rate if users finish their interactions with such systems ore often than with human operators. With spoken dialog systems, the question of when the use of speech interaction is truly compelling is often ignored. If a dialog designer runs the experimental protocol and observes that even human wizards cannot perform the domain task very well, that suggests that perhaps a gold standard may be  with respect to the gold standard.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Descriptive statistics
</SectionTitle>
      <Paragraph position="0"> After collecting data using the experimental protocol, designers can make comparative judgments about the performance of their system relative to other systems with a basic set of descriptive statistics. The statistics build on the initial step of fitting a statistical model on the data fro both wizards and the dialog system. We discuss precautions later. Plotting the fitted curves on the same graph sheds light on how best to substantiate any performance claims. The graph displays the performance of the dialog system along a particular dimension of interest with the wizard data constituting a gold standard for comparison. Consider how this kind of &amp;quot;benchmark graph&amp;quot; could benefit the evaluation of spoken dialog systems.</Paragraph>
      <Paragraph position="1"> Referring to previous example, suppose a designer is interested in evaluating the robustness of two dialog systems utilizing two sets of repair strategies. The designer varies which set is implemented, while holding constant the use of the speech recognizer. In general, as speech recognition errors increase, task completion rate, or dialog success rate, decreases. Not surprisingly, several researchers have found an approximately linear relationship in plotting task completion rate as a function of word error rate (Lamel et al., 2000; Rudnicky, 2000). Keeping this in mind, Figure 2 displays a benchmark graph for two dialog systems A and B, utilizing different repair strategies. The fitted curve for A is characteristically linear, while the curve for B is polynomial. Because wizards are presumably more capable of anticipating and recovering from speech recognition errors, their  systems from the gold standard.</Paragraph>
      <Paragraph position="2"> performance data comprise the gold standard.</Paragraph>
      <Paragraph position="3"> As such, the fitted curve for the gold standard in Figure 2 stays close to the upper right hand corner of the graph in a monotonically decreasing fashion; that is, task completion rate remains relatively high as word error rate increases and then gracefully degrades before the error rate reaches its highest level.</Paragraph>
      <Paragraph position="4"> Looking at the benchmark graph, readers immediately get a handle on substantiating performance claims about robustness. For example, by noticing that task completion rate for the gold standard rapidly drops from around 65% at the 80% mark to about 15% by 100%, readers know that at 80% word error rate, even wizards, with human level intelligence, cannot recover from failures with better than 65% task completion rate. In short, the task is not trivial. This means that if A and B report low numbers for task completion rate beyond the 80% mark for word error rate, they may be still performing relatively well compared to the gold standard.</Paragraph>
      <Paragraph position="5"> Numbers themselves are deceptive, unless they are put side by side with a benchmark.</Paragraph>
      <Paragraph position="6"> Of course, a designer might not have access to data all along the word error rate continuum as in Figure 2. If this presents a problem, it may be more appropriate to measure task completion rate as a function of concept error rate. The choice, as stated in the experimental protocol, depends on the performance claim a designer is interested in making. In spoken dialog, however, where speech recognition errors abound, another particularly useful benchmark graph is to plot word or concept error rate against user frustration. This experiment reveals any inherent</Paragraph>
      <Paragraph position="8"> bias users may have towards speaking with a computer in the first place.</Paragraph>
      <Paragraph position="9"> In making comparative judgments, designers can also benefit from plotting the absolute difference in performance from the gold standard as a function of the same independent variable as the benchmark graph. Figure 3 displays the difference in task completion rate, or &amp;quot;gold impurity,&amp;quot; for systems A and B as a function of word error rate. The closer a system is to the gold standard, the smaller the &amp;quot;mass&amp;quot; of the gold impurity on the graph. Anomalies are easier to see, as they noticeably show up as bumps or peaks. If a dialog system reports low numbers but evinces little gold impurity, reader can be assured that the system is as good as it can possibly be.</Paragraph>
      <Paragraph position="10"> Any crosses in performance can be revealing as well. For example, in Figure 3, although B performs worse at lower word error rates than A, after about the 35% mark, B stays closer to the gold standard. Hence, the designer in this case could not categorically prefer one system to the other. In fact, assuming that the only difference between A and B is the choice of repair strategies, the designer should prefer A to B if theaverageworderrorrateforthespeech recognizer is below 35%, and B to A, if the average error rate is about 40%. Of course, other cost considerations come into play, as we describe later.</Paragraph>
      <Paragraph position="11"> The final point to make about comparing dialog systems to a gold standard is that readers are able to substantiate performance claims across different domain tasks. They need only to look at how close each system approaches their respective gold standard in a benchmark graph, or how much mass each system puts out in a goldimpuritygraph.Theycanevendothis without having the luxury of experiencing any of the compared systems.</Paragraph>
      <Paragraph position="12">  Without a gold standard, making comparative judgments of dialog systems across different domain tasks poses a problem for two reasons: task complexity and interaction complexity.</Paragraph>
      <Paragraph position="13"> Tutoring physics is a generally more complex domain task than retrieving email. On the other hand, task complexity alone does not explain what makes one dialog more complex than another; interaction complexity also plays a significant role. Tutoring physics can be less challenging than retrieving email if the system accepts few inputs, essentially constraining users to follow a predefined script. Any dialog system that engages in &amp;quot;mixed initiative&amp;quot; will be more complex than one that utilizes &amp;quot;system-initiated&amp;quot; prompts because users have more actions at their disposal at any point in time.</Paragraph>
      <Paragraph position="14"> The way to evaluate complexity in a benchmark graph is to measure the distance of the gold standard to the absolute upper bound of performance. If wizards with human level intelligence cannot themselves perform reasonably close to the absolute upper bound, then either the task is very complex, or the interaction afforded by the dialog interface is too restrictive for wizards, or perhaps both. Because complexity is measured only in connection with the gold standard ceteris paribus, &amp;quot;benchmark complexity&amp;quot; can be computed as:  where U is the upper bound value of a performance metric, n is the upper bound value for an independent variable X,andg(x)isthe gold standard along that variable.</Paragraph>
      <Paragraph position="15"> Designers can use benchmark complexity to compare systems across different domain tasks if they are not too concerned about discriminating between task complexity and interaction complexity. Otherwise, they can treat benchmark complexity as an objective function and vary the interaction complexity of the dialog interface to scrutinize the effect of task complexity on wizard performance, or vice versa. In short, they need to conduct another experimental study.</Paragraph>
      <Paragraph position="16">  Before substantiating performance claims with a benchmark graph, designers must exercise prudence in model fitting. One precaution is to beware of insufficient data. Without collecting enough data, designers cannot be certain that differences in the performance of a dialog system from the gold standard cannot be explained simply by the variance in the fitted models. To determine when there is enough data to generate reliable models, designers can conduct WOZ studies in an iterative fashion.</Paragraph>
      <Paragraph position="17"> First, collect some data and fit a statistical model. Second, plot the least squares distance,  model, against the iteration. Keep collecting more data until the plot seems to asymptotically converge. Designers may need to report R  sfor the curves in their benchmark graphs to inform readers of the reliability of their models.</Paragraph>
      <Paragraph position="18"> Another precaution is to use different wizards, making sure that each wizard follows strict guidelines for interacting with subjects. The experimental protocol included this precaution because designers need to consider whether a consistent gold standard is even possible with a given dialog interface. Indeed, difference between wizards may uncover serious design flaws in the interface. Furthermore, using different wizards compels designers to collect more data for the gold standard.</Paragraph>
      <Paragraph position="19"> As a final precaution, designers need to watch out for violations of model assumptions regarding residual errors. These are typically well covered in most statistics textbooks. For example, because task completion rate as a performance metric has an upper bound of 100%, it is unlikely that residual errors will be equally spread out along the word error rate continuum. In regression analysis, this is called &amp;quot;heteroscedasticity.&amp;quot; Another common violation occurs with the non-normality of the residual errors. Designers would do well to take advantage of corrective measures for both.</Paragraph>
      <Paragraph position="20">  A gold standard naturally lends itself to optimization. With a gold standard, designers can identify which components are contributing the most to a performance metric by examining the gold impurity graph of the system with and without a particular component. This kind of test is similar to how dissociations are discovered in neuroscience through &amp;quot;lesion&amp;quot; experiments. Carrying out stepwise comparisons of the components, designers can check for tradeoffs, andevenuseallorpartofthegoldimpurityas an optimization metric. For example, suppose a designer endeavors to improve a dialog system from its current average task completion rate of 70% to 80%. In Figure 2, suppose B incorporates a component that A does not.</Paragraph>
      <Paragraph position="21"> Looking at the corresponding word error rates in the gold impurity graph for both systems, the  for improvements to task completion rate.</Paragraph>
      <Paragraph position="22"> mass under the curve for B is slightly greater than that for A. The designer can optimize the performance of the system by selecting components that minimize that mass, in which case, the component in B would be excluded.</Paragraph>
      <Paragraph position="23"> Because components often interact with each other in terms of their statistical effect on the performance metric, designers may wish to carry out a multi-dimensional analysis to weed out those components with weak main and interaction effects.</Paragraph>
      <Paragraph position="24">  Another optimization use of a gold standard is to minimize the amount of &amp;quot;gold&amp;quot; expended in developing a dialog system. Gold here includes more than just dollars, but time and effort as well. Designers can determine where to invest their research focus by calculating &amp;quot;average marginal cost.&amp;quot; To do this, they must first elicit a cost function that conveys what they are willing to pay, in terms of utility, to achieve various levels of performance in a dialog metric (Bell et al., 1988). Figure 4 displays what cost a designer might be willing to incur for various rates of task completion. The average marginal cost can be computed by weighting gold impurity by the cost function. In other words, average marginal cost can be computed as:  where f(x) is the performance of the system on a particular dialog metric X, g(x)isthegold standard on that metric, and c(x) is elicited cost function.</Paragraph>
      <Paragraph position="25"> Following the previous example, if the designer endeavors to improve a system that is currently operating at an average task completion rate of 70% to 80%, then the average marginal cost for that gain is simply the area under the cost function for that interval multiplied by the gold impurity for that interval. In deciding between systems or components, designers can exploit average marginal cost to drive down their expenditure.</Paragraph>
    </Section>
  </Section>
</Paper>