<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1062"> <Title>A Proposal for Incremental Dialogue</Title> <Section position="3" start_page="319" end_page="320" type="metho"> <SectionTitle> WHY EVALUATING DIALOGUE SYSTEMS BY SEEING WHETHER THEY REPRODUCE WIZARD SCENARIOS IS A BAD IDEA </SectionTitle> <Paragraph position="0"> We have been trying to think about dialogue evaluation in terms of measuring whether our systems can reproduce virtually the same answers that the wizard produced for an entire dialogue sequence. This is like asking one chess expert to exactly reproduce every move that some other expert made in a past game! There are several reasons why this cannot work in the human-computer environment either. We will briefly discuss three of these: ambiguous queries, failure to answer by the system, and wrong interpretations by the system.</Paragraph> <Section position="1" start_page="319" end_page="319" type="sub_section"> <SectionTitle> Ambiguous Queries </SectionTitle> <Paragraph position="0"> Ambiguous queries abound in the training sentences, and will certainly appear in test sets as well. As soon as a system can give a valid answer that is different from the wizard's, there is the potential for a significant divergence in the paths taken by the wizard's session and the system's execution of the dialogue.</Paragraph> <Paragraph position="1"> This means that the query following the ambiguous query in the test set (i.e., from the wizard's session) may not make sense as a follow-up query when the system processes it! Consider the following examples: Q1: Show me flights from San Francisco to Atlanta.</Paragraph> <Paragraph position="2"> Q2: Show me flights that arrive after noon.</Paragraph> <Paragraph position="3"> Q3: Show me flights that arrive before 5pm.</Paragraph> <Paragraph position="4"> A3' (List of flights between noon and 5pm) A3&quot; (List of flights before 5pm, including morning arrivals) Q4: Show me the fare of AA123.</Paragraph> <Paragraph position="5"> The answer A3&quot; is a superset of the answer A3'. Therefore, if the wizard answers as in A3&quot;, but a system being evaluated produces A3', and the flight referred to in Q4 is in A3&quot; but not A3', then the answer produced by the system is likely to be officially &quot;wrong&quot;, although it is perfectly correct, given the actual context available to the system at the time Q4 was processed. Of course, some ambiguities in the Class A sentences in this domain may in practice not affect the course of the dialogue this way. However, in extended dialogues with context-dependencies, substantial ambiguities can quickly develop.</Paragraph>
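<Paragraph> To make the scoring problem concrete, the following minimal Python sketch (our illustration, not part of the paper; the flight numbers, fares, and helper function are entirely hypothetical, and the real ATIS answer format is not reproduced) shows one way the divergence plays out: the system, whose context does not contain AA123, gives a contextually sensible response to Q4 that strict comparison against the wizard's reference answer nevertheless scores as wrong.

# Hypothetical contexts after Q3: the wizard interpreted Q3 as "before 5pm"
# (A3''), the system as "between noon and 5pm" (A3').
wizard_context = {"AA123", "DL456", "UA789"}   # A3'': includes morning arrival AA123
system_context = {"DL456", "UA789"}            # A3':  noon-to-5pm flights only

fares = {"AA123": 420, "DL456": 515, "UA789": 388}  # made-up fares

def answer_fare(flight_id, context, fares):
    """Answer Q4 ("Show me the fare of AA123") relative to a context set."""
    if flight_id not in context:
        return None   # contextually sensible: "AA123 is not in the list I showed you"
    return fares[flight_id]

reference_answer = answer_fare("AA123", wizard_context, fares)   # 420
system_answer = answer_fare("AA123", system_context, fares)      # None

# Strict scoring marks the system "wrong", even though its response is the
# right one given the context it actually established for the user.
strict_score = "right" if system_answer == reference_answer else "wrong"
print(strict_score)   # wrong
</Paragraph>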
<Paragraph position="6"> No Answer Another source of dialogue path divergence occurs when the system is unable to answer a query which the wizard answered. For example: Q1: Show me flights from San Francisco to Atlanta.</Paragraph> <Paragraph position="7"> Q2: Which is the cheapest one if I want to travel Monday? Q3: What is the fare on that flight? Q4: Show me its restrictions.</Paragraph> <Paragraph position="8"> If the system were unable to understand Q2, producing no answer, then Q3 and Q4 would not be understandable at all. But what would happen in a real human-computer dialogue (instead of the highly artificial task of trying to copy a pre-specified dialogue)? If the system clearly indicated that it didn't understand Q2 at all, a real live user with even moderate intelligence would never ask Q3 at all, because s/he would realize that the system did not have the proper context to answer it. Instead, s/he might continue the dialogue quite differently, for example: Q3: Give me the restrictions and fare of the cheapest Monday flight.</Paragraph> <Paragraph position="9"> Notice that by following this alternate path, the user may actually be able to get the data s/he ultimately wants sooner than the wizard scenario provided it (in 3 queries instead of 4, even though one of those 3 was not understood). How can one say that such a system is less good than one that mimics the wizard perfectly in this case?</Paragraph> </Section> <Section position="2" start_page="319" end_page="320" type="sub_section"> <SectionTitle> Wrong Interpretation </SectionTitle> <Paragraph position="0"> As in the previous two cases, an incorrect interpretation by the system causes a divergence in the dialogue tree. In addition, as in the No Answer case, in practice a wrong interpretation is likely to result in a glaring error that elicits follow-up clarification queries from the user. Even if the user does not follow up in exactly that way, s/he will take into account that the system did something unexpected, and will use that knowledge in the formulation of subsequent queries. Just as in the No Answer case, the dialogue branch followed after a wrong interpretation could easily be fundamentally different from the branch which the wizard would follow, but may lead to the same point (all the information the user wanted).</Paragraph> <Paragraph position="1"> What does it mean? Our goal should not be to produce systems that behave exactly like the wizard, but rather systems that respond reasonably to every input and that allow the user to reach his or her goal.</Paragraph> <Paragraph position="2"> Perhaps if we give up the notion of &quot;the right answer&quot; in favor of &quot;a reasonable answer&quot;, then we can develop more effective and meaningful evaluation methodologies. The following two proposals provide some concrete suggestions as to how to go about this process. The first deals with the aspect of breadth in the dialogue tree; the second deals with reasonable responses to partially understood input.</Paragraph> </Section> </Section> <Section position="4" start_page="320" end_page="321" type="metho"> <SectionTitle> PROPOSAL: DIALOGUE BREADTH TEST </SectionTitle> <Paragraph position="0"> One of the central problems of dialogue evaluation is that there is no single path of questions and answers to which a system must conform, but rather a plethora of possible questions at each point in the dialogue. A very important capability of a good system is to be able to handle many different queries at each point in a dialogue.</Paragraph> <Paragraph position="1"> We suggest here a methodology that deals with one important aspect of dialogue; specifically, it attempts to compare systems on their ability to handle diverse utterances in context. This method builds on our existing methodology without imposing unrealistic constraints on the system design. It also meets the requirement for objective evaluation that it must be possible to agree on what constitutes a reference answer.</Paragraph> <Paragraph position="2"> Consider a dialogue that begins with two queries.
At this point, if you ask ten different people what question they would ask next, you will likely get ten different queries; call them Q3a through Q3j. It is reasonable to ask how many of these natural, possible dialogue continuation utterances a particular NL system can understand, and it is also reasonable to compare one system with another based on which one can handle more of the possible continuation utterances.</Paragraph> <Paragraph position="3"> In this type of evaluation, the object is to limit the initial part of a dialogue to one that a number of different systems can handle, and then to see how many possible alternative utterances (from a given set) each system could handle at that point in the dialogue. Of course, the continuation utterances should be as meaningful and natural as possible and should not necessarily be context dependent, as in Q3e and Q3f.</Paragraph> <Paragraph position="4"> Dialogue Breadth Test Methodology We present here an example of how an evaluation to assess a system's capabilities in handling dialogue breadth would take place. We call this test the Dialogue Breadth (DB) test. The numbers and other details are for illustrative purposes only.</Paragraph> <Paragraph position="5"> Each site would be given a set of 15 dialogue starters (initially, let us assume these are 15 Q1 utterances). With each Q1, there would be a set of 10 alternative Q2 utterances; this would form a test set of 150 Q1-Q2 dialogue pairs. Sites would run the complete dialogues through their systems, and would return 100 test items and answers for scoring. (The reduction from 150 would enable sites to remove many cases in which Q1 was not processed correctly, thus focusing the test on the issue of dialogue analysis, not Class A processing.) These 100 answers would be automatically compared to reference answers and scores computed in the usual way: as the number (and percentage) of the continuation utterances that were answered correctly, incorrectly, or not answered at all.</Paragraph> <Paragraph position="6"> How would the sets of &quot;next utterances&quot; for the test set be obtained? It could easily be done just as described above -- by showing at least 10 people the original problem scenario that the original wizard subject was trying to solve, showing them the initial dialogue context, and then asking them to add a single utterance.</Paragraph> <Paragraph position="7"> The first time the evaluation is tried, it might make sense to use a single Class A utterance as the context, as described above, and consider the breadth that is possible at just the second step of a dialogue. Later evaluations could be carried out on slightly longer dialogue fragments, where each context is a pair of Q1 and Q2 utterances, followed by a set of 10 alternate Q3 utterances.</Paragraph> <Paragraph position="8"> Advantages of the Dialogue Breadth Test There are a number of advantages to be gained by adding the dialogue breadth test to the SLS community's growing set of evaluation tools.</Paragraph> <Paragraph position="9"> 1. This methodology builds on the Class D methodology which the community has developed thus far.</Paragraph> <Paragraph position="10"> 2. It examines an extremely important aspect of dialogue systems: the ability to handle a variety of utterances at a point in mid-dialogue.</Paragraph> <Paragraph position="11"> 3. As long as short dialogue contexts are used, it does not depend on each site building systems that try to duplicate the output of the wizard, since intermediate answers to the initial dialogue utterances are not scored.</Paragraph> <Paragraph position="12"> 4. It requires no more training data than our current dialogue evaluation mechanism. Although it would be useful for sites to have a small amount of dialogue breadth training data, most system development can proceed using the data that has already been collected, or more data of that type.</Paragraph> <Paragraph position="13"> 5. It requires no changes to the classification scheme, or to other information associated with the training data, such as the reference answers.</Paragraph>
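<Paragraph> As a concrete illustration of the scoring step described in the methodology above, the following minimal Python sketch (our illustration, not part of the proposal; the data layout, the use of simple equality in place of the official answer comparator, and the randomly generated example items are all assumptions) computes the counts and percentages of continuation answers that were answered correctly, incorrectly, or not at all.

def score_db_test(items):
    """Score Dialogue Breadth (DB) test items.
    items: list of dicts with keys 'system_answer' and 'reference_answer';
    a system_answer of None means the system produced no answer.
    Only the continuation answer of each Q1-Q2 pair is scored."""
    counts = {"correct": 0, "incorrect": 0, "no_answer": 0}
    for item in items:
        if item["system_answer"] is None:
            counts["no_answer"] += 1
        elif item["system_answer"] == item["reference_answer"]:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    total = len(items)
    percentages = {key: 100.0 * value / total for key, value in counts.items()}
    return counts, percentages

# Example with 100 submitted items (the items a site dropped because Q1
# failed are simply not submitted, so they never reach the scorer).
import random
random.seed(0)
fake_items = [{"system_answer": random.choice(["answer_a", "answer_b", None]),
               "reference_answer": "answer_a"} for _ in range(100)]
print(score_db_test(fake_items))

The point the sketch makes explicit is that the scorer never sees the intermediate Q1 answers, which is what keeps the test focused on dialogue analysis rather than Class A processing.
</Paragraph>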
</Section> <Section position="5" start_page="321" end_page="321" type="metho"> <SectionTitle> SUGGESTION: MODIFY METRICS TO ENCOURAGE PARTIAL UNDERSTANDING </SectionTitle> <Paragraph position="0"> In human-human dialogue (let us call the participants the questioner and the information agent), it is often the case that one party only partially understands the other, and realizes that this is the case. There are several things that the information agent can do at this point: 1. ask for clarification; 2. provide an answer based on partial understanding; 3. provide an answer based on partial understanding, while indicating that it may not be precisely what was desired; 4. decline to answer, and make the other party ask again. Clearly, 2 and 4 are the least preferred responses, since they may mislead or frustrate the questioner, respectively. But both 1 and 3 are often reasonable responses.</Paragraph> <Paragraph position="1"> For example, if an information agent hears, &quot;What are the flights from BOS to DFW that ..... lunch&quot; (where some language in the middle was not heard or understood for some reason), it is reasonable to respond with either a request for clarification, such as, &quot;Do you want BOS to DFW flights that serve lunch?&quot;, or with a qualified answer, such as, &quot;I didn't entirely understand you, but I think you were asking for BOS to DFW flights with lunch, so here they are: TWA 112 ...&quot;.</Paragraph> <Paragraph position="2"> Either response is acceptable, even to a questioner who asked for flights &quot;that do not serve lunch&quot; or flights &quot;that serve breakfast or lunch&quot;, because the system made clear that its understanding was uncertain.</Paragraph> <Paragraph position="3"> The idea of allowing our SLS systems to respond, in effect, with an answer qualified by &quot;I'm not sure I caught everything you said, but here's my best guess&quot; is a powerful one that is clearly oriented toward making systems useful in application.</Paragraph> <Paragraph position="4"> A Suggestion for Change The need for permitting systems some leeway in responding to partially understood input is clear, but the mechanism for doing so is less clear, and would require some thought by all of those involved in developing the SLS evaluation methodology.</Paragraph> <Paragraph position="5"> For example, a new class of system response, called &quot;Qualified Answer&quot;, could be allowed, and two new categories, Qualified Answer Reasonable and Qualified Answer Not Reasonable, could be added to the current set of Right, Wrong, and No Answer. Judging a qualified answer as reasonable or unreasonable would almost certainly have to be done by a human judge or judges, since it is unlikely to be possible to anticipate all the possible reasonable answers to a query. It would also be necessary to develop an explicit scoring metric for the new categories that would not penalize qualified answers too harshly.</Paragraph>
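<Paragraph> Purely as a sketch of what such a metric might look like (the category names come from the text above; the numeric weights and the Python scoring function are illustrative assumptions, not a proposal the paper makes), one could extend the usual Right/Wrong/No Answer scoring so that a reasonable qualified answer earns partial credit and an unreasonable one is penalized less severely than a confident wrong answer.

# Illustrative weighting of the extended answer categories. The qualified
# categories would be assigned by human judges, as discussed in the text above.
CATEGORY_WEIGHTS = {
    "right": 1.0,
    "qualified_reasonable": 0.5,       # useful answer, clearly hedged
    "no_answer": 0.0,
    "qualified_not_reasonable": -0.5,  # penalized, but less than an unhedged wrong answer
    "wrong": -1.0,
}

def dialogue_score(judgements):
    """judgements: list of category labels, one per scored query."""
    return sum(CATEGORY_WEIGHTS[j] for j in judgements) / len(judgements)

# A system that hedges when unsure vs. one that silently guesses:
hedging = ["right", "right", "qualified_reasonable", "no_answer"]
guessing = ["right", "right", "wrong", "wrong"]
print(dialogue_score(hedging), dialogue_score(guessing))   # 0.625 0.0

Whatever the exact weights, the requirement stated above is simply that the two new categories fall between Right and Wrong, so that hedged partial understanding is encouraged rather than punished.
</Paragraph>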
<Paragraph position="6"> The evaluation committee and the SLS steering committee should consider these suggestions for possible inclusion in future common SLS evaluations.</Paragraph> </Section> </Paper>