<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0603"> <Title>GENERALITY AND OBJECTIVITY: Central Issues in Putting a Dialogue Evaluation Tool into Practical Use</Title> <Section position="7" start_page="20" end_page="24" type="evalu"> <SectionTitle> 5. Methodology and Results </SectionTitle>

<Paragraph position="0"> The Sundial WOZ corpus comprises approx. 100 flight travel information dialogues concerning British Airways flights. The corpus was produced by 10 subjects who each performed 9 or 10 dialogues based on scenarios selected from a set of 24 scenarios. We do not have these scenarios. The transcriptions came with a header which identifies each dialogue, markup of user and system utterances, consecutive numbering of the lines in each dialogue transcription, and markup of pauses, ahs, hmms and coughs. For the first generality test of DET, we selected 33 dialogues. Three dialogues were used for initial discussions between the two analysers. The remaining 30 dialogues were split into two sub-corpora of 15 dialogues each. Each sub-corpus was analysed by both analysers. Methodologically, we analysed each system utterance in isolation as well as in its dialogue context to identify violations of the guidelines. Utterances which reflected one or more dialogue design problems were annotated with an indication of the guideline(s) violated and a brief explanation of the problem(s). Using TEI, we have changed the existing markup of utterances to make each utterance unique across the entire corpus.</Paragraph>

<Paragraph position="1"> In addition, we have added markup for guideline violation. An example is shown in Figure 2.</Paragraph>

<u id="U1:7-1"> (0.4) #h yes I'm enquiring about flight number bee ay two eight six flying in later today from san francisco (0.4) could you tell me %coughs% 'scuse me which airport and terminal it's arriving at and what
....
london heathrow terminal four at thirteen ten
<violation ref="S1:7-6" guideline="SG2"> Date not mentioned. The tabled arrival time is probably always the same for a given flight number but there may be days on which there is no flight with a given number.
<violation ref="S1:7-6" guideline="GG7"> It is not clear if the time provided is that of the timetable or the actual (expected) arrival time of the flight.

Figure 2. Markup of part of a dialogue from the Sundial corpus. The excerpt contains a user question and the system's answer to that question. The user's query was first misunderstood, but this part of the dialogue has been left out in the figure (indicated as: ....). The system's answer violates two guidelines, SG2 and GG7, as indicated in the markup.
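To make the violation markup concrete, here is a minimal sketch of how such annotations could be extracted programmatically. It is a reconstruction, not part of DET: the tag and attribute names (u, violation, ref, guideline) follow Figure 2, while the sample input (including the system utterance id) and the extraction code are ours.

```python
import re

# Hypothetical input modelled on Figure 2; the markup is not well-formed XML
# (the <u> and <violation> elements are unclosed), so a regex scan is used.
TRANSCRIPT = '''
<u id="S1:7-6"> london heathrow terminal four at thirteen ten
<violation ref="S1:7-6" guideline="SG2"> Date not mentioned.
<violation ref="S1:7-6" guideline="GG7"> It is not clear if the time provided
is that of the timetable or the actual (expected) arrival time of the flight.
'''

VIOLATION = re.compile(
    r'<violation\s+ref="(?P<ref>[^"]+)"\s+guideline="(?P<guideline>[^"]+)">'
    r'\s*(?P<comment>[^<]*)')

def violations(markup):
    """Yield (utterance id, guideline, explanation) triples from the markup."""
    for m in VIOLATION.finditer(markup):
        yield (m.group('ref'), m.group('guideline'),
               ' '.join(m.group('comment').split()))  # normalise whitespace

for ref, guideline, comment in violations(TRANSCRIPT):
    print(f"{ref} violates {guideline}: {comment}")
```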
<Paragraph position="2"> Having independently analysed the two sub-corpora of 15 dialogues each, the analysers discussed each of the 384 claimed guideline violations and sought to reach consensus on as many classifications-by-guideline as possible. This led to the following 10 status descriptors for the claimed guideline violations:</Paragraph>

<Paragraph position="7"> (id) Identity = The same design error case identified by both annotators.</Paragraph>
<Paragraph position="8"> (c) Complementarity = A design error case identified by one annotator.</Paragraph>
<Paragraph position="9"> (cv) Consequence violations = Design error cases that would not have arisen had a more fundamental design error been avoided.</Paragraph>
<Paragraph position="10"> (us) User symptoms = Symptoms of design errors as evidenced from user dialogue behaviour.</Paragraph>
<Paragraph position="11"> (a) Alternatives = Alternative classifications of the same design error case by the two annotators.</Paragraph>
<Paragraph position="12"> (re) Reclassification = Agreed reclassification of a design error case.</Paragraph>
<Paragraph position="13"> (rce) Reclassification to already identified case = Agreed reclassification of a design error case as being identical to one that had already been identified.</Paragraph>
<Paragraph position="14"> (ud) Undecidable = Agreed undecidable design error classification. (deb) Debatable = The annotators disagreed on a higher-level issue involved in whether to classify a system utterance as a design error case.</Paragraph>
<Paragraph position="15"> (rej) Rejects = Agreed rejections of attributed design error cases.</Paragraph>

<Paragraph position="16"> Based on the consensus discussion, the analysers created two tables, one for each sub-corpus. The tables were structured by guideline and showed the violations of a particular guideline that had been identified by one of the two analysers, each violation being characterised, in addition, by its unique utterance identifier, its status descriptor and a brief description (Figure 3).</Paragraph>

Figure 3. [Table of claimed GG1 violations in one sub-corpus, by annotator and status descriptor.] NOB is NOB's annotation of the NOB sub-corpus. LD-NOB is LD's annotation of that sub-corpus. The table contains 18 cases of which 16 are agreed violations of GG1 (id, c and a), one is undecidable (ud) and one was rejected (rej). The table shows that 4 cases were reclassified (re), that the two cases of alternative classifications involved GG1 and GG7, and that an agreed classification involved a debate on a component issue (id/c+deb).

<Paragraph position="17"> Of the 384 claimed guideline violations, 344 were agreed upon as constituting actual guideline violations, comprising the status descriptors identity, complementarity, consequence violations, user symptoms, alternatives, reclassification (re) and reclassification (rce). 40 claimed guideline violations were undecidable, not agreed upon, or jointly rejected by the analysers. These figures are not very meaningful in themselves, however, because many identified design guideline violations were identical. This is illustrated in Figure 3, in which the case of offer/give phone no. recurs no less than 8 times. The analysers agreed that the system should always offer the phone number of an alternative information service when it was not itself able to provide the desired information, instead of merely telling users to ring that alternative service. The analysers disagreed, however, on whether the system should start by offering the phone number or provide the phone number right away (cf. deb in Figure 3). What we need as SLDS developers is not a tool which tells us many times about the same dialogue design error but a tool which helps us find as many different dialogue design errors as quickly as possible.</Paragraph>
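The consensus bookkeeping described above amounts to tallying one status descriptor per claimed violation and splitting the descriptors into those that count as actual violations and those that do not. A minimal sketch, assuming one descriptor string per case; the grouping follows the text (id, c, cv, us, a, re and rce count as agreed, while ud, deb and rej do not):

```python
from collections import Counter

AGREED = {'id', 'c', 'cv', 'us', 'a', 're', 'rce'}   # count as actual violations
NOT_AGREED = {'ud', 'deb', 'rej'}                    # undecidable, disputed, rejected

def summarise(statuses):
    """statuses: one status descriptor per claimed guideline violation."""
    tally = Counter(statuses)
    agreed = sum(n for code, n in tally.items() if code in AGREED)
    not_agreed = sum(n for code, n in tally.items() if code in NOT_AGREED)
    return tally, agreed, not_agreed

# Tiny invented example; with the full Sundial annotation the totals would
# come out as agreed == 344 and not_agreed == 40 over the 384 claimed cases.
tally, agreed, not_agreed = summarise(['id', 'id', 'c', 'a', 'rej'])
print(agreed, not_agreed)   # 4 1
```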
<Paragraph position="18"> We take this to mean that when annotating spoken dialogue transcriptions, it can be a waste of time and effort to annotate the same design error twice. A single annotation, once accepted, will lead to a different and improved design. However, if resource limitations enforce restrictions on the number of dialogue design errors which can be repaired, the number and severity of the different dialogue design errors will have to be taken into account. Following the reasoning of the preceding paragraph, the analysers proceeded to distil the different types of guideline violations or dialogue design errors identified in the corpus. This led to a much simpler picture, as shown in Figure 4.</Paragraph>

Figure 4. [Table: guideline / no. of agreed violations / no. of types], sorted by guideline violated. Note that Figure 4 does not include the cases and types that were either undecidable, disagreed upon, or rejected (see Figure 5).
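The distillation step behind Figure 4 can be pictured as a simple deduplication of cases into types. The sketch below keys types on the guideline violated plus a normalised description; that key choice is our assumption, since the paper does not spell out how type boundaries were drawn, and the sample cases (apart from S1:7-6 of Figure 2) are invented:

```python
from collections import defaultdict

def distil_types(cases):
    """cases: iterable of (utterance_id, guideline, description) triples.

    Collapses individual violation cases into types, each type keeping the
    list of utterances in which it occurred (useful for severity judgements).
    """
    types = defaultdict(list)
    for utterance_id, guideline, description in cases:
        types[(guideline, description.strip().lower())].append(utterance_id)
    return types

cases = [
    ('S1:7-6', 'SG2', 'Date not mentioned'),
    ('S3:2-4', 'SG2', 'date not mentioned'),      # same design error, new case
    ('S1:7-6', 'GG7', 'timetable vs. actual arrival time unclear'),
]
types = distil_types(cases)
print(len(cases), 'cases ->', len(types), 'types')   # 3 cases -> 2 types
```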
<Paragraph position="19"> Figure 5 shows the nature of the types of guideline violation referred to in Figure 4 as well as the types that were undecidable, disagreed upon or rejected. It should be noted that the term "type" is in this context rather vague. Some of the types of guideline violation in Figure 5 are very important to the design of a habitable human-system spoken dialogue, such as the demand for a more informative opening presentation of what the system can and cannot do; others are of less importance because they appear to be rather special cases, such as when the system offers a phone number to a user who has already told the system that s/he has this phone number. Some types cover a wealth of different individual cases, such as the many differences in phrasing the same message to the user; others cover just a single case or a number of identical cases; and, of course, some types are more difficult to repair than others. However, common to all these guideline violations is that they should be remedied in the implemented system if at all possible. Jointly, Figures 4 and 5 show that 15 guideline violation types were found by both analysers, 9 types were found by one analyser only, one type (in fact, a single case) was undecidable on the evidence provided by the transcription, 3 types were disagreed upon, and 6 types were rejected during the consensus discussion. No types were found that demanded revision or extension of the guidelines. The Sundial corpus was analysed by two of the DET tool developers. It cannot be excluded, therefore, that others might have found types in the corpus that demanded revision or extension of the guidelines.</Paragraph>

<Paragraph position="20"> This will have to be tested in a future exercise. However, on the evidence provided, the guidelines generalise well to a different dialogue and task type (cf. Section 3). We also found that the guidelines generalise well to the different test type/tool purpose pair of the Sundial corpus. In fact, using the guidelines for early evaluation during WOZ is not much different from using them for diagnostic evaluation of an implemented system. In both cases, one works with transcribed data to which the guidelines are then applied.</Paragraph>

<Paragraph position="21"> Turning now to the objectivity or intersubjectivity of the performed analysis, we mentioned earlier that this raises two issues wrt. the Sundial corpus: (a) to what extent do the analysers identify the same cases/types of guideline violation? and (b) to what extent do the analysers classify the identified cases/types in the same way? During DET development, we never tested for objectivity of annotation.</Paragraph>

Figure 5. Types of guideline violation (cf. Figure 4), including the types that were undecidable, disagreed upon or rejected:
- no feedback on arrival/departure day, on BA and/or on route
- missing/ambiguous feedback on time
- U: has phone no., S: offers phone no.
- departure time instead of arrival time provided
- phone number provided although user has it already
- S: handles all flights - "BA does not handle Airline X."
- S: "no flights are leaving Crete today"
- scheduled vs. actual arrival/dep. time not distinguished
- AM and PM not distinguished
- many variations in S's phrases
- too little said on what the system can and cannot do: BA often missing, time-table enquiries always missing
- S should specify the information it needs
- S provides insufficient information for the user to determine if it is the wanted answer
- S repeats more than the 4 phone no. digits asked for
- "flight info." known to be false: S knows only BA
- S: encourages inquiry on airline unknown to it
- S: "flights between London and Aberdeen are not part of the BA shuttle service, there is a service from London Heathrow terminal one" (rc from GG5 to GG6)
- U: arriving flights?, S: leaving flights: imprecise feedback
- system says it is not sure of the information it provided
- open S intro requires interaction instructions on waiting, verbosity etc.
- 2 different interpretations possible of S8:1-5
- whether to just offer or actually give phone no.
- delayed feedback strategy
- BA to Zurich: when
- open meta-communication
Rejected:
- no need to mention arrival airport
- no S-goodbye: U hung up!
- delayed feedback strategy could defend this case
- response package OK
- no BA flights from Warsaw
- the system need not have recent events info

<Paragraph position="23"> As to (a), the comparatively high number of guideline violation types found by one analyser but not by the other, i.e. 9 types compared to the 15 types found by both analysers, either shows that we are not yet experts in applying the guidelines to novel corpora, or that the tool is inherently only "62.5% objective". This needs further investigation. However, a different consideration is pertinent here. Consider, for instance, analyser A1. A1 found the 15 guideline violation types which were also found by A2 plus another 6 guideline violation types. Compared to these 21 types, analyser A2 only managed to add 3 new guideline violation types. Suppose that, on average, either of two expert analysers finds equally many guideline violation types not found by the other. In the present case, this number would be 4.5 guideline violation types. A single expert in using the tool would then find 19.5/24 or 81% of the guideline violation types found by two analysers together. Still, we don't know how many new guideline violations a third expert might find and whether we would see rapid convergence towards zero new guideline violations. It would of course be encouraging if this proved to be the case. The 3 types disagreed upon and the 6 rejected types illustrate, we suggest, that dialogue design is not an exact science! Taken together, however, the 4.5 guideline violation types added by the second analyser and the 9 disagreed or rejected types suggest the usefulness of having two different developers apply the tool to a transcribed corpus.</Paragraph>
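The coverage estimate above is easy to verify. A worked version, using only the figures quoted in the text:

```python
# Under the stated assumption that each of two expert analysers finds, on
# average, the same number of types the other misses, a single analyser's
# expected coverage of the jointly found types is (shared + avg_unique) / total.
shared = 15                                      # types found by both analysers
unique_a1, unique_a2 = 6, 3                      # types found by one analyser only
total = shared + unique_a1 + unique_a2           # 24 types in all
avg_unique = (unique_a1 + unique_a2) / 2         # 4.5

print(f"{(shared + avg_unique) / total:.1%}")    # 81.2%, the paper's 81%

# The pessimistic reading quoted in the text: agreement on only 15 of the
# 24 types makes the tool "62.5% objective".
print(f"{shared / total:.1%}")                   # 62.5%
```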
<Paragraph position="24"> Finally, the single undecidable case was one in which the (non-transcribed) prosody of what the user said might have made the difference. Following the system's statement that "I'm sorry there are no flights leaving Crete today", the user asked "did you say there aren't any flights leaving Crete today?" One analyser took the user's question to be a simple request to have the system's statement repeated, in which case no guideline violation would have been committed by the system. The other analyser took the user's question to be an incredulous request for more information ("did you say there AREN'T ANY flights leaving Crete today?"), in which case the system's subsequent reply "Yes" would have been a violation of GG1.</Paragraph>

<Paragraph position="25"> As to (b), Figure 5 shows that the two analysers produced several alternative classifications. It should be noted, however, that the number of these disagreements has been exaggerated by the data abstraction that went into the creation of a small number of types as shown in Figures 4 and 5. In fact, alternative classifications were only made in 7 cases. It appears to be a simple fact that there will always be data on guideline violation which legitimately may be classified in different ways. Depending on the context, the fact that the system says too little about what it can and cannot do can be a violation of either SG4 or SG8. If it says too little up front, this is an SG4 violation, but if it later demonstrates that it has said too little, this should be an SG8; it is comparatively innocuous, however, if an analyser happens to classify the violation as an SG4. GG1 (say enough) and GG7 (don't be ambiguous) are sometimes two faces of the same coin: if you don't say enough, what you say may be ambiguous. Similarly, GG1 (say enough) and GG5 (be relevant) may on occasion be two faces of the same coin: if you don't say enough, what you actually do say may be irrelevant. The same applies to GG2 (superfluous information) and GG5 (relevance): superfluous information may be irrelevant information. SG2 (provide feedback) and GG7 (don't be ambiguous) may also overlap in particular cases: missing feedback on, e.g., time may imply that the utterance becomes ambiguous. Finally, GG6 (avoid obscurity) and GG7 (don't be ambiguous) may on occasion be difficult to distinguish: obscure utterances sometimes lend themselves to a variety of interpretations.</Paragraph>
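One way to operationalise these legitimate overlaps when comparing two annotators would be to score alternative classifications within an overlapping guideline pair as agreement rather than as conflict. In the sketch below the pair list comes directly from the discussion above, while the scoring policy itself is our suggestion, not part of DET:

```python
# Guideline pairs that the text identifies as legitimately overlapping.
OVERLAPPING = {
    frozenset({'SG4', 'SG8'}),   # too little up front vs. shown too little later
    frozenset({'GG1', 'GG7'}),   # saying too little can make an utterance ambiguous
    frozenset({'GG1', 'GG5'}),   # saying too little can make what is said irrelevant
    frozenset({'GG2', 'GG5'}),   # superfluous information may be irrelevant
    frozenset({'SG2', 'GG7'}),   # missing feedback can create ambiguity
    frozenset({'GG6', 'GG7'}),   # obscurity invites multiple readings
}

def compatible(g1, g2):
    """True if two classifications of the same case should not count as conflict."""
    return g1 == g2 or frozenset({g1, g2}) in OVERLAPPING

assert compatible('GG1', 'GG7') and not compatible('SG4', 'GG7')
```

</Section></Paper>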