<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1607"> <Title>Comparing Several Aspects of Human-Computer and Human-Human Dialogues</Title>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle>
Expert: WHAT TIME DO YOU NEED TO DEPART [exp-init]
User: AS SOON AS POSSIBLE AFTER FIVE P.M. [exp-init]
</SectionTitle>
<Paragraph position="0">
Expert: THE FIRST FLIGHT AFTER FIVE P.M. ON THAT DATE IS AT FIVE THIRTY FIVE P.M., ARRIVING IN CHICAGO AT SIX OH SIX P.M. ON U.S. AIR [exp-init]
User: IS THAT O'HARE [user-init]
</Paragraph>
<Paragraph position="1">
(1) Expert: i have an American Airlines flight departing Seattle at twelve fifty five p.m., arrives Tokyo at three p.m. the next day. Is that OK? [exp-init]
(2) User: yes I'll take it [exp-init]
(3) Expert: Will you return to seattle from tokyo? [exp-init]
(4) User: what airport [user-init]
(5) Expert: Will you return to seattle from tokyo? [exp-init]
Table 4: Initiative tagging in an HC Exchange
</Paragraph>
<Paragraph position="2"> Our Kappa scores for interannotator agreement on the initiative tagging were somewhat lower than for DA tagging. Here, κ = 0.68. In fact, our agreement was rather high, at 87%, but because there were so few instances of user initiative in the HC dialogues, our agreement would have to be exceptional to be reflected in a higher Kappa score. While we had believed this to be the easier task, with quite clear guidelines and only a binary tagging choice, it in fact proved to be quite difficult. We still believe that this tag set can give us useful insights into our data, but we would be interested in attempting further revisions to the tagging guidelines, particularly as regards the definition of an &quot;answer&quot;, i.e. when an answer is responsive and when it is not.</Paragraph>
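<Paragraph>
To make the relationship between raw agreement and Kappa concrete, the following minimal Python sketch computes Cohen's kappa from a two-by-two confusion matrix over the two initiative tags. The counts are invented purely for illustration (they are not the matrix behind our 87% agreement and kappa of 0.68); they simply show how a skewed tag distribution inflates chance agreement, so that high raw agreement need not translate into a high Kappa.

    # Minimal sketch of Cohen's kappa for a two-way (exp-init vs. user-init) tagging task.
    # The confusion-matrix counts below are invented for illustration; they are NOT the
    # counts behind the agreement figures reported in the text.

    def cohen_kappa(matrix):
        """matrix[i][j] = number of turns annotator 1 tagged i and annotator 2 tagged j."""
        total = sum(sum(row) for row in matrix)
        p_observed = sum(matrix[i][i] for i in range(len(matrix))) / total
        # Chance agreement comes from the two annotators' marginal distributions.
        row_marginals = [sum(row) / total for row in matrix]
        col_marginals = [sum(matrix[i][j] for i in range(len(matrix))) / total
                         for j in range(len(matrix))]
        p_expected = sum(r * c for r, c in zip(row_marginals, col_marginals))
        return (p_observed - p_expected) / (1 - p_expected)

    # Rows/columns: [exp-init, user-init].  87 of 100 turns agree, but because the
    # exp-init class dominates, chance agreement is already high and kappa stays modest.
    confusions = [[80, 5],
                  [8, 7]]
    print(cohen_kappa(confusions))   # about 0.44 for these invented counts
</Paragraph>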
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Analysis </SectionTitle>
<Paragraph position="0"> We found a number of interesting differences between the HH and HC dialogues. While we have not yet been able to test our hypotheses about why these differences appear, we will discuss our ideas about them and what sorts of further work we would like to do to subject those ideas to empirical validation.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.1 Initiative Distribution </SectionTitle>
<Paragraph position="0"> Based on researchers' descriptions of their systems (i.e. for the most part, &quot;highly mixed-initiative&quot;), we had expected to find some variance in the distribution of initiative across systems. As is evident from Table 5, the HC systems do not differ much from each other, but taken as a whole, the dialogues differ dramatically from the HH dialogues. In the HH dialogues, users and expert share the initiative relatively equitably, while in the HC data the experts massively dominate in taking the initiative. Here, we are simply counting the number of turns tagged as user-initiative or expert-initiative (see footnote 8). We also show turns to completion and overall user satisfaction scores for each system as a reference point. User satisfaction was calculated from five questions asked of each user after each dialogue. The questions use a 5-point Likert scale.</Paragraph>
<Paragraph> Footnote 8: A cautionary note is warranted here. We are not suggesting that more user-initiative is intrinsically preferable; it may well turn out to be the case that a completely system-directed dialogue is more pleasant/efficient/etc. Rather, we are seeking to quantify and assess what it means to be &quot;mixed-initiative&quot; so that we can better evaluate the role of initiative in effective (task-oriented) dialogues.</Paragraph>
<Paragraph position="1"> Turns to completion measures the total number of on-task turns. We found no significant correlations here, but cf. Walker et al. (2001), which provides more detailed analyses of the Communicator dialogues using user satisfaction and other metrics, within the PARADISE framework. It is worth noting, however, that HC D has both the highest percentage of expert initiative and the highest satisfaction scores, so we should not conclude that more user initiative will necessarily lead to greater user satisfaction.</Paragraph>
<Paragraph> Table 5: Initiative distribution in HH and HC Dialogues.</Paragraph>
<Paragraph position="2"> In the HC dialogues, we also see a difference in success rate for user-initiative turns. By our definition, the user &quot;succeeds&quot; in taking the initiative in the dialogue if the system responds to the initiative on the first possible turn. The rate of success is the ratio of successful user-initiative attempts to total user-initiative attempts. There appears to be a negative relationship between the number of initiative attempts and their success rate.</Paragraph>
<Paragraph position="3"> There is no determinable relationship between user experience (i.e., the number of calls per system) and either the amount of user-initiative or the success rate of user-initiative.</Paragraph>
<Paragraph position="4"> We also looked at user-initiative with respect to dialogue act type. Most user-initiatives are request-action (26%) and request-information (19%). Request-information dialogue acts (e.g., What cities do you know in Texas?, Are there any other flights?, Which airport is that?) are handled well by the systems (83% success rate), while request-action dialogue acts (e.g., start over, scratch that, book that flight) are not (48%). Most of the user-initiatives that are request-action dialogue acts are the start over command (16% of the total user-initiatives). Corrections to flight information presented by the systems make up 20% of the total user-initiatives.</Paragraph>
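<Paragraph>
The two measures used in this subsection can be read directly off the initiative tags. The following sketch is a minimal illustration, assuming each turn is stored as a small record with a speaker label, an initiative tag, and (for user-initiative turns) an annotation of whether the system responded on the first possible turn; the record format and the toy dialogue are hypothetical, not our actual corpus encoding.

    # Minimal sketch of the two measures used here: the share of turns carrying user vs.
    # expert initiative, and the user-initiative success rate.  The turn records below
    # are hypothetical stand-ins for the annotated corpus, not its real format.

    turns = [
        {"speaker": "expert", "init": "exp-init"},
        {"speaker": "user",   "init": "exp-init"},    # responsive answer, so exp-init
        {"speaker": "user",   "init": "user-init", "answered_on_first_turn": True},
        {"speaker": "expert", "init": "exp-init"},
        {"speaker": "user",   "init": "user-init", "answered_on_first_turn": False},
    ]

    def initiative_distribution(turns):
        """Percent of all turns tagged exp-init vs. user-init (Table 5-style numbers)."""
        total = len(turns)
        dist = {}
        for t in turns:
            dist[t["init"]] = dist.get(t["init"], 0) + 1
        return {tag: 100.0 * n / total for tag, n in dist.items()}

    def user_init_success_rate(turns):
        """Successful user-initiatives / total user-initiatives, where 'successful'
        means the system responded to the initiative on the first possible turn."""
        attempts = [t for t in turns if t["init"] == "user-init"]
        successes = [t for t in attempts if t.get("answered_on_first_turn")]
        return len(successes) / len(attempts) if attempts else None

    print(initiative_distribution(turns))   # {'exp-init': 60.0, 'user-init': 40.0}
    print(user_init_success_rate(turns))    # 0.5 for this toy exchange
</Paragraph>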
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.2 Overall Verbosity </SectionTitle>
<Paragraph position="0"> In counting the number of words used, we find that the computer experts are much more verbose than their human users, and are relatively more verbose than their human travel agent counterparts. In the HH dialogues, experts average 10.1 words/turn, while users average 7.2. In the HC dialogues, the systems average from 16.65 to 33.1 words/turn, vs. the users' 2.8 to 4.8 words/turn. Figure 2 shows these differences for each of the four systems and for the combined HH data.</Paragraph>
<Paragraph position="1"> One DA which is a basic conversational tool, and therefore an interesting candidate for analysis, is the use of confirmations. Instances of short confirmation, typically back-channel utterances such as okay and uh huh, were tagged as acknowledge, while instances of long confirmation, as when one participant explicitly repeats something that the other participant has said, were tagged as verify-X, where X = conversation-action, task-information, or task-action. This tagging allows us to easily calculate the distribution of short and long confirmations.</Paragraph>
<Paragraph position="2"> Overall, we found in the HC dialogues a rather different confirmation profile from the HH dialogues. In the HC dialogues, the systems use both types of confirmation far more than the users do (246 total system, 8 total user). Moreover, systems use long confirmation about five times more often (210 vs. 36) than they use short confirmation. In contrast, the experts in the HH dialogues use somewhat more confirmations than users (247 vs. 173), but both parties use far more short than long confirmations (340 vs. 80), just the reverse of the HC situation. This difference partially accounts for the total word count differences we saw in the previous section. Tables 6 and 7 show the breakdowns in these numbers for each system and for the two sets of HH data, and begin to quantify the striking contrasts between human and computer confirmation strategies.</Paragraph>
<Paragraph> Table: verify-X (percentage of total dialogue acts).</Paragraph>
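<Paragraph>
Because short and long confirmations correspond to distinct DA labels, the confirmation profile is a simple tabulation over the tagged turns. The sketch below assumes turns are available as (speaker, dialogue act) pairs; the sample turns are hypothetical, not drawn from the corpus.

    # Minimal sketch of the confirmation profile: short confirmations are acknowledge
    # acts, long confirmations are any of the verify-X acts.  The tagged turns below
    # are hypothetical stand-ins for the annotated dialogues.

    LONG_CONFIRMATIONS = {"verify-conversation-action",
                          "verify-task-information",
                          "verify-task-action"}

    def confirmation_profile(tagged_turns):
        """Count short (acknowledge) and long (verify-X) confirmations per speaker.
        tagged_turns is an iterable of (speaker, dialogue_act) pairs."""
        profile = {}
        for speaker, act in tagged_turns:
            row = profile.setdefault(speaker, {"short": 0, "long": 0})
            if act == "acknowledge":
                row["short"] += 1
            elif act in LONG_CONFIRMATIONS:
                row["long"] += 1
        return profile

    sample = [("expert", "verify-task-information"), ("user", "give-task-info"),
              ("expert", "request-task-info"), ("user", "acknowledge")]
    print(confirmation_profile(sample))
    # {'expert': {'short': 0, 'long': 1}, 'user': {'short': 1, 'long': 0}}
</Paragraph>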
</Section>
<Section position="4" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.3 Number of Dialogue Acts </SectionTitle>
<Paragraph position="0"> Another observation is that the computer experts appear to be trying to do more. They have significantly more DAs per turn than do their human users, whereas in the HH dialogues the two participants have nearly the same number of DAs per turn (just over 1.3). In the HC dialogues, systems have, on average, 1.6 DAs per turn where users have just 1.0, as Figure 3 shows. If we take a DA as representing a single dialogue &quot;move&quot;, then users in the HC dialogues are managing one move per turn, where the systems have at least one and often more. A common sequence for the computer experts is a verify-task-information followed by a request-task-information, such as A flight to Atlanta. What city are you departing from?</Paragraph>
</Section>
<Section position="5" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.4 Types of Dialogue Acts </SectionTitle>
<Paragraph position="0"> One of our main questions going into this work was whether there would be interestingly different distributions of DAs in the HH and HC dialogues, and whether different distributions of DAs across systems would be correlated with user satisfaction. Unfortunately, we do not have user satisfaction scores for the HH data, but if new data were to be collected, this would be an essential addition.</Paragraph>
<Paragraph position="1"> Tables 8 and 9 illustrate some of the main differences between the HH and HC dialogues and, as regards our first research question, definitely give an interesting view of the differences between the HH and HC conversations.</Paragraph>
<Paragraph> Table 9: Computer dialogues, by percent of total DAs for column.</Paragraph>
<Paragraph position="2"> As expected in this domain, all DAs involving the exchange of task information (give-task-info, request-task-info, and verify-task-info) are frequent in both sets of dialogues. However, in the HH dialogues, acknowledge (e.g. the tag for back-channel responses and general confirmations such as right, uh huh and okay) is the second most common DA, and it does not even appear in the top five for the HC dialogues. The DA for positive responses, affirm, is also in the top ranking for the HH dialogues, but does not appear in the list for the HC dialogues. Finally, offer and apology appear frequently in the HC dialogues and not in the top HH DAs. The appearance of these two is a clear indication that the systems are doing things quite differently from their human counterparts.</Paragraph>
<Paragraph position="3"> Turning to differences between experts and users in these top categories, we can see that human users and experts are about equally likely to ask for or give task-related information (give-task-info and request-task-info). In contrast, in the HC dialogues nearly half of the users' DAs are giving task information and hardly any are requesting such information, while almost a quarter of expert DAs are requesting information. There is some inequity in the use of verify-task-info in the HH dialogues, where experts perform about twice as many verifications as users; however, in the HC dialogues, virtually all verification is done by the expert. All of these patterns reinforce our finding about initiative distribution: in the HC dialogues, one disproportionately finds the expert doing the asking and verification of task information, and the user doing the answering, while in the HH dialogues the exchange of information is shared far more evenly.</Paragraph>
<Paragraph position="4"> We can also look at each system in terms of its overall distribution of DAs. These numbers are reflective of the system designers' decisions for their systems, and that means not all DAs are going to be used by all systems (i.e. 0.0% may mean that that DA is not part of the system's repertoire).</Paragraph>
<Paragraph position="5"> We will concentrate here on the best and worst received systems in terms of their overall user satisfaction (see footnote 9), HC D and HC A; the relevant numbers are boldfaced. They also have very different dialogue strategies, and that is partially reflected in the table. HC D's dialogue strategy does not make use of the 'social nicety' DAs employed by other systems (acknowledge, apologize, not-understand), and yet it still had the highest user satisfaction of the four. This system also has the highest proportion of affirm (more than three times as many as the next highest system) and req-task-info DAs, which suggests that quite a lot of information is being solicited and the users (because we know from Table 9 that it is primarily the users responding) are more often than average responding affirmatively. The fact that the percentage of give-task-infos is somewhere in the middle of the range and affirms is so high may indicate that HC D uses more yes/no than content questions.</Paragraph>
<Paragraph> Footnote 9: This figure combines the scores on five user satisfaction questions. A perfect score is 100%.</Paragraph>
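<Paragraph>
The per-system comparisons above rest on normalizing each system's dialogue-act counts to percentages of that system's total DAs, which is also why a 0.0% entry can simply mean the act is absent from a system's repertoire. A minimal sketch of that normalization follows; the counts and system labels are invented for illustration, not taken from our tables.

    # Minimal sketch of a per-system DA profile: each system's dialogue-act counts are
    # normalised to percent of that system's total DAs, so 0.0% simply means the act is
    # absent from (or unused in) that system's repertoire.  Counts below are invented.

    from collections import Counter

    def da_profile(da_counts):
        """da_counts: Counter mapping dialogue-act label to count for one system.
        Returns percent of that system's total DAs for each label."""
        total = sum(da_counts.values())
        return {act: round(100.0 * n / total, 1) for act, n in da_counts.items()}

    hc_d = Counter({"request-task-info": 90, "verify-task-info": 60, "affirm": 45, "offer": 25})
    hc_a = Counter({"request-task-info": 70, "give-task-info": 80, "apology": 30, "offer": 5})

    print(da_profile(hc_d))   # e.g. {'request-task-info': 40.9, 'verify-task-info': 27.3, ...}
    print(da_profile(hc_a))
</Paragraph>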
<Paragraph position="7"> Looking at the lower scoring system, HC A, we see very different patterns. HC A has most of the demand-task-infos, the second highest percentage of req-task-infos, and by far the most give-task-infos, so its dialogue strategy must involve a large number of attempts to extract information from the user, and yet it has the fewest offer DAs, so these don't appear to be resulting in suggestions of particular travel options.</Paragraph>
<Paragraph position="8"> Turning to correlations between DA use by expert and user (combined across systems) and user satisfaction, we see some expected results but also some rather surprising correlations.</Paragraph>
<Paragraph position="9"> Not unexpectedly, apologies and signals of non-understanding by the system are highly negatively correlated with satisfaction (-0.7 and -0.9, respectively). While it may seem counter-intuitive that open-close by the user is negatively correlated (at -0.8), those familiar with these data will undoubtedly have noticed that users often try to say Goodbye repeatedly to try to end a dialogue that is going badly. Discussion of situational information (e.g. phone use) by the expert is highly negatively correlated, but by the user, the DA req-situation-info is perfectly positively correlated. We cannot account for this finding.</Paragraph>
</Section>
<Section position="6" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.5 Unsolicited Information </SectionTitle>
<Paragraph position="0"> In the HC data we noticed that users often provided more information than was explicitly solicited; we call this 'unsolicited information'.</Paragraph>
<Paragraph position="1"> For example, when a system asks for one piece of information, On what day would you be departing Portland?, the user might respond with additional information such as Thursday, October 5th before six pm from Portland back to Seattle.</Paragraph>
<Paragraph position="2"> 78% of that unsolicited information is offered in response to open-ended questions (e.g., How can I help you? or What are your travel plans?). While our initiative tagging partially captures this, there are cases where the answer may be considered responsive (i.e. initiative does not shift away from the participant asking the question) and yet unsolicited information has been offered. Thus, this category is somewhat orthogonal to our characterization of initiative, although it is clearly one way of seizing control of the conversation (see footnote 10). To get at this information, we developed a third tagging scheme for annotating unsolicited information. We began by examining just the HC dialogues, because the phenomenon is prevalent in these data; we hope to perform a similar analysis on the HH data as well. We found that the systems we examined in general handle unsolicited information well. 70% of all unsolicited information is handled correctly by the systems, 22% is handled incorrectly, and the rest could not be accurately classified. Information offered in response to open-ended questions is handled correctly more often by the systems than unsolicited information offered at other points in the dialogue (74% versus 56%). The former figure is not surprising, since the systems are designed to handle &quot;unsolicited&quot; information following open prompts. However, we were surprised that the systems did as well as they did on unsolicited information in contexts where it was not expected. Figure 4 shows the relationship between the frequency of various types of unsolicited information and how well the system incorporates that information. There appears to be some correlation between the frequency of unsolicited information and the rate of success, but we do not have enough data to make a stronger claim.</Paragraph>
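<Paragraph>
The correlation figures quoted in the previous subsection, and the tentative relationship just noted between the frequency of unsolicited information and the rate of success, are presumably correlation coefficients of this general kind. The sketch below shows a standard Pearson computation on invented vectors, not our actual analysis; with so few data points, such correlations are suggestive at best.

    # Minimal sketch of the kind of correlation reported here: Pearson's r between a
    # per-system usage rate (e.g. apologies per dialogue) and a user-satisfaction score.
    # The two vectors below are invented; they are not the study's data.

    import math

    def pearson_r(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    apology_rate = [0.00, 0.08, 0.15, 0.30]   # hypothetical, one value per system
    satisfaction = [78.0, 70.0, 62.0, 50.0]   # hypothetical satisfaction scores

    print(round(pearson_r(apology_rate, satisfaction), 2))   # close to -1 for these numbers
</Paragraph>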
<Paragraph position="3"> Furthermore, systems vary in their response delay to pieces of unsolicited information. We define response delay as the number of system turns it takes before the information is acknowledged by the system (either correctly or incorrectly). If a system responds immediately to the unsolicited information, a count of zero turns is recorded. Figure 5 shows the difference among systems in responding to unsolicited information. We graphed both the average total number of system turns as well as the average number of turns minus repetitions. HC B responds almost immediately to unsolicited information, while HCs A and C take more turns to respond. HC D has trouble understanding the unsolicited information, and either keeps asking for clarification or continues to ignore the human and prompts multiple times for some other piece of information.</Paragraph>
<Paragraph> Footnote 10: This issue may also be related to where in the dialogue errors occur. We are pursuing another line of research which looks at automatic error detection, described in (Aberdeen et al., 2001). We believe we may also be able to detect unsolicited information automatically, as well as to see whether it is likely to trigger errors by the system.</Paragraph>
<Paragraph position="4"> The systems also differ in how quickly they acknowledge unsolicited information for different fields. For example, departure city is recognized and validated almost immediately. Return date and flight type are incorporated fairly quickly when the system understands what is being said.</Paragraph>
<Paragraph position="5"> If we look at the effects of experience on the amount of unsolicited information offered, as shown in Figure 7, we can see that users tend to provide more unsolicited information over time (i.e., as they make more calls to the systems). This effect may be the result of increased user confidence in the systems' handling of unsolicited information. It may also be attributed to user boredom; as time goes on, users may be trying to finish the task as quickly as possible. Even if this is true, however, it demonstrates attempts by users to take more control of the interactions as they gain experience. Our data also show that the success rate of incorporating unsolicited information improves with user experience: the ratio of successes to failures increases in later calls to the systems (Figure 8).</Paragraph>
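<Paragraph>
Response delay, as defined above, is just a count of intervening system turns. The following sketch makes the bookkeeping explicit; the turn encoding and the example dialogue are hypothetical stand-ins for the annotation, not our corpus format.

    # Minimal sketch of the response-delay measure: the number of system turns that pass
    # before the system acknowledges (correctly or not) a piece of unsolicited information.
    # A system that reacts in its very next turn scores zero.

    def response_delay(turns, offered_at, acknowledged_at):
        """turns: list of speaker labels in order ('user' or 'system').
        offered_at: index of the user turn containing the unsolicited information.
        acknowledged_at: index of the system turn that first acknowledges it."""
        intervening = turns[offered_at + 1 : acknowledged_at]
        return sum(1 for speaker in intervening if speaker == "system")

    dialogue = ["system", "user", "system", "user", "system", "user", "system"]
    # Unsolicited information offered at turn 1; first acknowledged at turn 6:
    print(response_delay(dialogue, offered_at=1, acknowledged_at=6))   # 2 intervening system turns
</Paragraph>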
In general, the conversation was much more balanced between traveler and expert in the HH setting, in terms of amount of speech, types of dialogue acts and with respect to initiative. In the HC conversations, the system dominated, in number of words and dialogue acts and in initiative.</Paragraph> <Paragraph position="1"> We are very interested in the selection of the 'right' tag set for a given task. As we noted in our discussion of DA tagging, we had very different outcomes with two closely related tag sets. Clearly the choice of tag set is highly dependent on the use the tagged data will be put to, how easily the task can be characterized in the set of tagging guidelines, and what trade-o s in accuracy vs. richness of representation are acceptable. A central question we are left with is \Why don't the users talk more in HC dialogues?&quot; Is it that they are happy to just give short, speci c answers to very directed questions? Or do they \learn&quot; that longer answers are likely to cause the systems problems? Or perhaps users have preconceived notions (often justi ed) that the computer will not understand long utterances? We may speculate that poor speech recognition performance is a major factor shaping this behavior, leading system designers to attempt to constrain what users can say, while simultneously attempting to hold onto the initiative. (Walker et al. (2001) found sentence accuracy to be one of the signi cant predictors of user satisfaction in the Summer 2000 DARPA Communicator data collection.) There are some cases where the experts in the HC dialogues say things their human counterparts need not. One obvious case, which appears in even the small example dialogues we are using here, is that the systems tend to repeat utterances when there is some processing di culty. In the same vein, errors and misunderstandings are more frequent in the HC data, resulting in (some fairly verbose) e orts by the systems to identify the problem and get the conversation back on track.</Paragraph> </Section> class="xml-element"></Paper>