<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0603"> <Title>GENERALITY AND OBJECTIVITY Central Issues in Putting a Dialogue Evaluation Tool into Practical Use</Title> <Section position="4" start_page="17" end_page="18" type="metho"> <SectionTitle> 2. Tool Development </SectionTitle> <Paragraph position="0"> DET was developed in the course of designing, implementing and testing the dialogue model for the Danish dialogue system (Bernsen et al. 1997b). The system is a walk-up-and-use prototype SLDS for over-the-phone ticket reservation for Danish domestic flights. The system's dialogue model was developed using the Wizard of Oz (WOZ) simulation method. Based on the problems of dialogue interaction observed in the WOZ corpus, we established a set of guidelines for the design of cooperative spoken dialogue. Each observed problem was considered a case in which the system, in addressing the user, had violated a guideline of cooperative dialogue.</Paragraph> <Paragraph position="1"> The WOZ corpus analysis led to the identification of 14 guidelines of cooperative spoken human-machine dialogue, based on analysis of 120 examples of user-system interaction problems. If those guidelines were observed in the design of the system's dialogue behaviour, we assumed, this would increase the smoothness of user-system interaction and reduce the amount of user-initiated meta-communication needed for clarification and repair.</Paragraph> <Paragraph position="2"> The guidelines were refined and consolidated through comparison with a well-established body of maxims of cooperative human-human dialogue, which turned out to form a subset of our guidelines (Grice 1975, Bernsen et al. 1996). The resulting 22 guidelines were grouped under seven different aspects of dialogue, such as informativeness and partner asymmetry, and split into generic guidelines and specific guidelines. 
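The generic/specific split just described can be pictured as a small taxonomy in which each guideline carries an identifier, an aspect of dialogue, and (for specific guidelines) a pointer to the generic guideline it specialises. The sketch below is illustrative only: the record shape, the `parent` field, and the choice of GG1 as the parent of SG2 are assumptions for the example, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Minimal sketch of the guideline taxonomy: generic guidelines (GG)
# may subsume specific guidelines (SG), and every guideline belongs
# to one of the seven aspects of dialogue. Entries are illustrative.
@dataclass(frozen=True)
class Guideline:
    gid: str                      # e.g. "GG1" or "SG2"
    aspect: str                   # e.g. "informativeness"
    text: str
    parent: Optional[str] = None  # generic guideline an SG specialises

GUIDELINES = {
    "GG1": Guideline("GG1", "informativeness",
                     "Make your contribution as informative as is required."),
    "SG2": Guideline("SG2", "informativeness",
                     "Provide feedback on each piece of information "
                     "provided by the user.", parent="GG1"),
}

def classify(violated_gid: str) -> list:
    """Classify an observed problem as a guideline violation; for a
    specific guideline, also report the generic guideline it specialises."""
    g = GUIDELINES[violated_gid]
    return [g.gid] if g.parent is None else [g.gid, g.parent]
```

Recording each observed interaction problem against such identifiers is what makes the later tallies (14 guidelines from 120 problem examples, 22 after consolidation) reproducible.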
A generic guideline may subsume one or more specific guidelines which specialise the generic guideline to a certain class of phenomena. The guidelines are presented in Figure 1.</Paragraph> <Paragraph position="3"> The consolidated guidelines were then tested as a tool for the diagnostic evaluation of a corpus of 57 dialogues collected during a scenario-based, controlled user test of the implemented system. The fact that we had the scenarios meant that problems of dialogue interaction could be objectively detected through comparison between expected (according to the scenario) and actual user-system exchanges. For each detected problem, (a) its symptom was characterised, (b) a diagnosis was made, sometimes through inspection of the log of system module communication, and (c) one or several cures were proposed. The 'cure' part of diagnostic analysis suggests ways of repairing system dialogue behaviour.</Paragraph> <Paragraph position="4"> The diagnostic analysis may demonstrate that new guidelines of cooperative dialogue design must be added, thus enabling continuous assessment of the scope of DET.</Paragraph> <Paragraph position="5"> We found that nearly all dialogue design errors in the user test could be classified as violations of our guidelines. Two specific guidelines on meta-communication, SG10 and SG11, had to be added, however. This was no surprise, as meta-communication had not been simulated and thus was mostly absent in the WOZ corpus.</Paragraph> </Section> <Section position="5" start_page="18" end_page="19" type="metho"> <SectionTitle> 3. Generalising the Tool </SectionTitle> <Paragraph position="0"> As pointed out in Section 2, success in early tool development is not enough if the aim is to be able to recommend the tool to other SLDS developers on a solid basis. The early development phase focused on one SLDS with one particular dialogue structure, in one particular domain of application, designed for a particular type of task, i.e. 
reservation, in one particular development phase, i.e. evaluation of an implemented system, and in circumstances of controlled user testing where we had available the scenarios used by the subjects as well as the full design specification of the system. To test and increase the generality of the tool, we are currently applying DET as a dialogue design guide to a WOZ corpus from the Sundial project (Peckham 1993).</Paragraph> <Paragraph position="1"> Ideally, testing DET on the Sundial corpus will increase the generality that can be claimed for the tool in four different ways: (1) the system dialogue is different from that of the Danish dialogue system; (2) the task type is different, i.e. information vs. reservation; (3) the test type/tool purpose pairs are different: whereas in the case of the Danish dialogue system, DET was used for diagnostic evaluation in a controlled user test, the tool is being used as an early dialogue design guide in the case of Sundial; and (4) circumstances are different because</Paragraph> <Paragraph position="3"> *Make your contribution as informative as is required (for the current purposes of the exchange).</Paragraph> <Paragraph position="4"> SG 1 Be fully explicit in communicating to users the commitments they have made.</Paragraph> <Paragraph position="5"> SG2 Provide feedback on each piece of information provided by the user. *Do not make your contribution more informative than is required. *Do not say what you believe to be false.</Paragraph> <Paragraph position="6"> *Do not say that for which you lack adequate evidence.</Paragraph> <Paragraph position="7"> *Be relevant, i.e. 
be appropriate to the immediate needs at each stage of the transaction.</Paragraph> <Paragraph position="8"> *Avoid obscurity of expression.</Paragraph> <Paragraph position="9"> *Avoid ambiguity.</Paragraph> <Paragraph position="10"> SG3 Provide same formulation of the same question (or address) to users everywhere in the system's dialogue turns.</Paragraph> <Paragraph position="11"> *Be brief (avoid unnecessary prolixity).</Paragraph> <Paragraph position="12"> *Be orderly.</Paragraph> <Paragraph position="13"> Inform the dialogue partners of important non-normal characteristics which they should take into account in order to behave cooperatively in dialogue. Ensure the feasibility of what is required of them.</Paragraph> <Paragraph position="14"> SG4 Provide clear and comprehensible communication of what the system can and cannot do.</Paragraph> <Paragraph position="15"> SG5 Provide clear and sufficient instructions to users on how to interact with the system.</Paragraph> <Paragraph position="16"> Take partners' relevant background knowledge into account. Take into account possible (and possibly erroneous) user inferences by analogy from related task domains.</Paragraph> <Paragraph position="17"> Separate whenever possible between the needs of novice and expert users (user-adaptive dialogue).</Paragraph> <Paragraph position="18"> Take into account legitimate partner expectations as to your own background knowledge.</Paragraph> <Paragraph position="19"> The generic guidelines are expressed at the same level of generality as are the Gricean maxims (marked with an *). Each specific guideline is subsumed by a generic guideline. The left-hand column characterises the aspect of dialogue addressed by each guideline.</Paragraph> <Paragraph position="20"> we do not have the scenarios used in Sundial and do not have access to the early design specification of the Sundial system. 
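For contrast with the Sundial situation, when scenarios are available (as in the controlled user test described in Section 2), objective problem detection reduces to comparing expected and actual user-system exchanges. A rough sketch of that comparison follows; the turn representation (plain strings, one list per dialogue) is a simplifying assumption for illustration, not the actual transcript format.

```python
# Rough sketch of scenario-based problem detection: with the scenario
# in hand, the expected exchanges are known, so any divergence in the
# transcript flags a candidate dialogue interaction problem.
def detect_problems(expected, actual):
    """Return indices of turns where the transcript diverges from
    the scenario, including extra or missing turns at the end."""
    problems = [i for i, (exp, act) in enumerate(zip(expected, actual))
                if exp != act]
    # Turns present in only one of the two sequences also count.
    problems.extend(range(min(len(expected), len(actual)),
                          max(len(expected), len(actual))))
    return problems
```

Without scenarios, no such reference sequence exists, which is exactly why detection then has to rest on analysers independently spotting the same problems.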
If DET works well under circumstances (4), we shall know more about how to use it for the analysis of corpora produced without scenarios, such as in field tests, or without the scenarios being available.</Paragraph> <Paragraph position="21"> The important generalisation (4) poses a particular problem of objectivity. When, as in controlled user testing, the scenarios used by subjects are available, it is relatively straightforward to detect the dialogue design errors that are present in the transcribed corpus using objective methods. The objectivity problem then reduces to that of whether different analysers arrive at the same classifications of the identified problems. When, as in many realistic cases in which DET might be used, no scenarios exist or are available, an additional problem arises of whether the corpus analysers are actually able to detect the same problems in a dialogue prior to classifying them. If not, DET will not necessarily be useless but will be less useful in circumstances in which the objective number of dialogue design errors matters. In the test case, objectivity of detection will have to be based on the empirical fact, if it is a fact, that developers who are well-versed in using the tool actually do detect the same problems.</Paragraph> </Section> <Section position="6" start_page="19" end_page="20" type="metho"> <SectionTitle> 4. The Simulated System </SectionTitle> <Paragraph position="0"> The Sundial dialogues are early WOZ dialogues in which subjects seek time and route information on British Airways flights and sometimes on other airline flights as well. The emerging system seems to understand the following types of domain information: 1. Departure airport including terminal.</Paragraph> <Paragraph position="1"> 2. Arrival airport including terminal.</Paragraph> <Paragraph position="2"> 3. Time-tabled departure date.</Paragraph> <Paragraph position="3"> 4. Time-tabled departure time.</Paragraph> <Paragraph position="4"> 5. 
Time-tabled arrival date.</Paragraph> <Paragraph position="5"> 6. Time-tabled arrival time.</Paragraph> <Paragraph position="6"> 7. Flight number.</Paragraph> <Paragraph position="7"> 8. Actual departure date (not verified). 9. Actual departure time.</Paragraph> <Paragraph position="8"> 10. Actual arrival date (not verified). 11. Actual arrival time.</Paragraph> <Paragraph position="9"> 12. Distinction between BA flights which it knows about, and other flights which it does not know about but for which users are referred to airport help desks, sometimes by being given the phone numbers of those desks.</Paragraph> <Paragraph position="10"> By contrast with the Danish dialogue system, the Sundial system being developed through the use of the analysed corpus uses a delayed feedback strategy. Instead of providing immediate feedback on each piece of information provided by the user, the system waits until the user has provided the information necessary for executing a query to its database. It then provides implicit feedback through answering the query. Until the user has built up a full query, which of course may be done in a single utterance but sometimes takes several utterances, the system would only respond by asking for more information or by correcting errors in the information provided by the user. The delayed feedback strategy is natural in human-human communication but might be considered somewhat dangerous in SLDSs because of the risk of accumulating system misunderstandings which the user will only discover rather late in the dialogue. We would not argue, however, that the delayed feedback strategy is impossible to implement and successfully use for flight information systems of the complexity of the intended Sundial system. Still, this complexity is considerable, in particular, perhaps, due to the system's intended ability to distinguish between timetabled and actual points in time. 
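The delayed feedback strategy just described can be sketched as a dialogue manager that collects query slots silently and gives only implicit feedback, by answering the query, once it is complete. The slot names and response wording below are assumptions for the example; they are not taken from the Sundial design.

```python
# Sketch of the delayed-feedback strategy: no per-slot confirmation;
# the system either elicits missing information or answers the query,
# the answer itself serving as implicit feedback on what was understood.
REQUIRED_SLOTS = {"departure", "arrival", "date"}

def next_response(filled):
    """Given the slots filled so far (a dict), produce the next system turn."""
    missing = sorted(REQUIRED_SLOTS - filled.keys())
    if missing:
        # Delayed feedback: do not confirm anything yet, just ask for more.
        return "Please state your {}.".format(missing[0])
    # Query complete: answering it implicitly confirms all understood slots.
    return "Flights from {} to {} on {}: ...".format(
        filled["departure"], filled["arrival"], filled["date"])
```

The risk the text notes falls out of the sketch directly: a misrecognised slot value sits uncorrected in `filled` until the final answer exposes it, whereas an immediate-feedback manager would have echoed it back on the turn it was captured.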
It is not an easy design task to get the system's dialogue contributions right at all times when this distinction has to be transparently present throughout.</Paragraph> <Paragraph position="11"> Another point about the corpus worth mentioning is that the simulated system understands the user amazingly well and in many respects behaves just like a human travel agent. The implication is that several of the guidelines in Figure 1, such as GG11/SG6/SG7 on background knowledge, and GG13, SG9/SG10/SG11 on meta-communication, are not likely to be violated in the transcribed dialogues. It is no accident that exactly these guidelines are unlikely to be violated: it is difficult to realistically simulate the limited meta-communication and background-understanding abilities of implemented systems. As to the novice/expert distinction (SG7), this is hardly relevant to sophisticated flight information systems such as the present one. A final guideline which is not likely to be violated in the transcriptions is SG1 on user commitments. The reason is simply that users seeking flight information do not make any commitments: they merely ask for information.</Paragraph> </Section> </Paper>