<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1009">
<Title>Human-Machine Problem Solving Using Spoken Language Systems (SLS): Factors Affecting Performance and User Satisfaction</Title>
<Section position="4" start_page="50" end_page="52" type="evalu">
<SectionTitle> 3. EXPERIMENTS </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="50" end_page="52" type="sub_section">
<SectionTitle> 3.1. The Effects of Speed and Accuracy Trade-offs on User Satisfaction </SectionTitle>
<Paragraph position="0"> Since, in general, speech understanding systems can trade accuracy for speed, we first assessed how these parameters might affect user behavior and acceptance of the system. The software version of the recognizer was slower than the hardware version (2.5 compared to 0.42 times the utterance duration), but was substantially more accurate (with a word error rate of 16.1% as compared with 24.8% on the same sound files).</Paragraph>
<Paragraph position="1"> To assess user satisfaction, we compared questionnaire responses for 46 subjects who used the hardware, 23 who used the software, and 46 who used the earlier wizard-mediated system. Mean responses are shown in Figure 1. The questionnaire items were:
1. Were the answers provided quickly enough?
2. Did the system understand your requests the first time?
3. I focused most of my attention on solving the problems, rather than trying to make the system understand me.
4. Do you think a person unfamiliar with computers could use the system easily?
5. Would you prefer this method to looking up the information in a book?
In general, user satisfaction with the speed of the system correlated with the response time of the system they used; when asked, "Were the answers provided quickly enough?" 69.6% of the hardware users responded "Yes." In contrast, only 34.8% of the software users and a mere 11.1% of the wizard-system users gave "Yes" responses, a significant difference from the hardware result, χ² (df=4) = 35.6, p < .001. Although hardware users were pleased with the speed of the system, they were less likely than wizard-system and software users to say they focused their attention on solving the problem rather than on trying to make the system understand them (33.3% as compared with 61.4% and 56.5%, respectively), a marginally significant effect, χ² (df=4) = 7.8, p < .10.</Paragraph>
<Paragraph position="2"> On several other measures, users found the wizard-based system preferable to either the software or the hardware. More wizard-system users said that the system usually understood them the first time (47.8% as compared with 13.0% and 8.7% for the software and hardware users, respectively), χ² (df=4) = 22.5, p < .001. Overall, the wizard-system users were more likely to say the system could be easily used by a person who was unfamiliar with computers (78% compared with 43.5% and 35.6% for the software and hardware, respectively), χ² (df=4) = 20.5, p < .001. However, in terms of general satisfaction, as expressed in whether the subjects said they would prefer using the system to looking the information up in a book, there was no significant difference between the groups, with 52.3%, 60.9%, and 55.6% "Yes" answers for the three groups, respectively.</Paragraph>
<Paragraph position="3"> Because the hardware system was least satisfying to users in terms of recognition accuracy, we concluded that the hardware would provide the greatest potential for user adaptation to the system. For this reason, we used the hardware system to collect data on the effects of user experience and instructions regarding hyperarticulation.</Paragraph>
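The group comparisons above are chi-square tests of independence on the questionnaire response counts. The sketch below (Python, using SciPy) shows the shape of such a test. The counts are invented for illustration only, since the paper reports percentages rather than raw tables, and the assumption of three response categories per question is ours; it is consistent with the reported df = 4, since (3 groups - 1) x (3 categories - 1) = 4.

    # Illustrative chi-square test of independence for one questionnaire item.
    # The counts below are hypothetical; the paper reports only percentages.
    from scipy.stats import chi2_contingency

    # Rows: hardware (n=46), software (n=23), wizard (n=46) user groups.
    # Columns: assumed response categories (e.g., "yes" / "not sure" / "no").
    observed = [
        [32,  8,  6],   # hardware
        [ 8,  9,  6],   # software
        [ 5, 20, 21],   # wizard
    ]

    chi2, p, dof, expected = chi2_contingency(observed)
    print(f"chi2(df={dof}) = {chi2:.1f}, p = {p:.4f}")  # dof = (3-1)*(3-1) = 4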
<Paragraph position="4"> 3.2. Effect of User Experience on Recognition
User experience was evaluated in a within-subjects design, counterbalanced for scenario, that compared 24 users' first and second sessions. As a global measure of adaptation, we looked at how long it took subjects to complete their two scenarios. Although subjects were not told to solve the scenarios as quickly as possible, they nevertheless took less time (10.5 compared to 13.0 minutes) to complete their second scenarios, F(1,23) = 5.78, p < .05. This difference was partially but not completely attributable to a lower number of total utterances in the second scenario.</Paragraph>
<Paragraph position="5"> The users also elicited fewer recognition errors in the second scenario. The mean word error rate was 20.4% for the first scenario but fell to 16.1% for the second, F(1,22) = 5.60, p < .05. However, not all users decreased their recognition error rate. There was a significant interaction between initial error rate and change in error rate from the first scenario to the second, F(1,22) = 10.98, p < .01. Subjects who had recognition error rates of 20% or worse in the first scenario (N=11) tended to improve recognition performance, while subjects who had better initial performance (N=13) did not (Figure 2). Subjects with initial error rates of 20% or higher went from an average of 31.3% errors down to 19.6%, while subjects with initially lower error rates showed no statistically significant change. For those subjects who did improve recognition performance, the improvement could only be due to user adaptation, since the same SLS version was used for both scenarios.</Paragraph>
<Paragraph position="6"> The improvement in recognition may be due in part to user adaptation to the language models used. As a measure of deviation from the system's language models, we used test-set perplexity, which was based on the bigram probabilities of the observed word sequences. As would be expected, there was a significant, positive average correlation between utterance word error and perplexity: mean r = .28, t = 4.55, p < .001. Thus, one way for subjects to improve recognition accuracy would be to change their language to conform to that of the system model. Perplexity may therefore play a role in the decrease in recognition error rates observed over time for those subjects who had an error rate of 20% or worse in their first scenario. For this group of subjects, there was a tendency to produce queries with lower sentence perplexity in the second scenario (Figure 3). Using the median as a measure of central tendency (a more stable measure, given the inherent positive skew of perplexity), we found that the average median sentence perplexity was 25.3 for the first scenario and 19.4 for the second.</Paragraph>
<Paragraph position="8"> In addition to decreasing perplexity, subjects who had initial error rates of greater than 20% also tended to decrease their use of out-of-vocabulary words in the second scenario, whereas subjects who had lower error rates did not, a significant interaction, F(1,22) = 6.10, p < .05. Overall, however, the use of out-of-vocabulary words was rare.</Paragraph>
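As a rough illustration of the test-set perplexity measure used above: under a bigram language model, the perplexity of an utterance is the inverse geometric mean of the bigram probabilities of its word sequence, so utterances that fit the model's expectations score low and unusual phrasings score high. The sketch below is a minimal Python version; the bigram_prob lookup, the begin-of-sentence token, and the per-scenario median summary are our assumptions about the bookkeeping, not a description of the actual DECIPHER language model.

    import math
    import statistics

    def sentence_perplexity(words, bigram_prob, bos="<s>"):
        """Test-set perplexity of one utterance under a bigram model.

        bigram_prob(prev, word) is assumed to return a smoothed P(word | prev)
        that is never zero; the real system's model is not specified here.
        """
        logprob = 0.0
        prev = bos
        for word in words:
            logprob += math.log2(bigram_prob(prev, word))
            prev = word
        return 2.0 ** (-logprob / len(words))

    def median_perplexity(utterances, bigram_prob):
        """Per-scenario summary: the median is more stable than the mean,
        given the positive skew of perplexity noted in the text."""
        return statistics.median(
            sentence_perplexity(u, bigram_prob) for u in utterances)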
<Paragraph position="9"> These findings indicate that, at least to some degree, subjects adapted to the language models of the system and, in doing so, managed to improve the recognizer's performance.</Paragraph>
<Paragraph position="10"> Quite possibly, subjects were finding ways to phrase their queries that produced successful answers, and then reproducing these phrases in subsequent queries. In future work, further analyses (for example, looking at dialogue) will address this issue in greater detail.</Paragraph>
<Paragraph position="11"> 3.3. Effect of Instructions on Speech Style
Another potential source of recognition errors arises when the speech of the user deviates from the acoustic models of the system. Since the vast majority of the data used to train the DECIPHER recognizer came from wizard-mediated data collection [6], where recognition performance was nearly perfect, examples of "frustrated" speech were rare. In human-human interaction, when an addressee (such as a foreigner) has difficulty understanding, speakers change their speech style to enunciate more clearly than usual (Ferguson [3]). We suspected that a similar effect might occur for people speaking to a machine that displayed feedback showing less than perfect understanding. We noticed that, when using an SLS as opposed to a wizard-mediated system, subjects tended to hyperarticulate: releasing stops, emphasizing initial word segments, pausing between words, and increasing vocal effort.</Paragraph>
<Paragraph position="12"> Although hyperarticulation is a multifaceted behavior, it was nevertheless possible to make global judgments about individual utterances. Hyperarticulation was coded for each utterance on a three-point scale by listening to the utterances: (1) clearly natural sounding, (2) strongly hyperarticulated, or (3) somewhat hyperarticulated. The coding was done blindly, without reference to session context or system performance.</Paragraph>
<Paragraph position="13"> Using a within-subjects design, so that any differences in recognition performance could be attributed to a change in speech style rather than to speaker effects, we analyzed the speech style of 24 subjects' first scenarios (future analyses will also examine repeat scenarios). These subjects (of whom 20 were also included in the previous analysis of user experience) all used the hardware system. The subjects averaged about 10 natural-sounding, 4 somewhat hyperarticulated, and 5 strongly hyperarticulated utterances each. For the 13 subjects who had at least three natural and three strongly hyperarticulated utterances, we compared recognition performance within subjects and found that the strongly hyperarticulated utterances resulted in higher word error rates, F(1,12) = 5.19, p < .05.</Paragraph>
<Paragraph position="14"> Hyperarticulation was reduced, however, by giving users instructions not to "overenunciate" and by explaining that the system was trained on "normal" speech. We calculated a hyperarticulation score for each subject by weighting "strongly hyperarticulated" utterances as 1, "somewhat hyperarticulated" utterances as 0.5, and "nonhyperarticulated" utterances as 0, and taking the mean weight across all utterances in the scenario. The 12 subjects who heard the instructions (the "instruction group") had lower mean hyperarticulation scores, 0.22 as compared with 0.60 for the 12 subjects who received no special instructions (the "no-instruction group"), a significant difference, F(1,22) = 11.97, p < .01.</Paragraph>
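A minimal sketch of the per-subject hyperarticulation score described above, assuming each utterance has already been coded with one of the three labels; the label strings and data layout are illustrative, not the paper's.

    # Hyperarticulation score: mean over a subject's utterances of
    # 1.0 for strongly, 0.5 for somewhat, and 0.0 for non-hyperarticulated.
    WEIGHTS = {"strong": 1.0, "somewhat": 0.5, "natural": 0.0}

    def hyperarticulation_score(codes):
        """codes: per-utterance labels for one subject's scenario."""
        return sum(WEIGHTS[c] for c in codes) / len(codes)

    # Example: a subject with 10 natural, 4 somewhat, and 5 strongly
    # hyperarticulated utterances (the average mix reported above).
    codes = ["natural"] * 10 + ["somewhat"] * 4 + ["strong"] * 5
    print(round(hyperarticulation_score(codes), 2))  # (4*0.5 + 5*1.0)/19 = 0.37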
<Paragraph position="15"> Given that the instruction group had significantly fewer hyperarticulated utterances, and given that hyperarticulation is associated with lower recognition accuracy, we would expect the instruction group to have better recognition performance overall. However, although the trend was in that direction (18.1% word error for the instruction group versus 22.5% for the no-instruction group), the difference was not reliable. One possible explanation is a lack of power in the analysis, as a result of the small number of subjects and large individual differences in error rates. A second, not necessarily conflicting, explanation is that the subjects given the instructions to "speak naturally" used somewhat less planned and less formal speech. We noticed that these subjects tended to have more spontaneous speech effects, such as verbal deletions, word fragments, lengthenings, and filled pauses. Overall, spontaneous speech effects occurred in 15% of the 232 utterances for the instruction group, compared with 10% of the 229 utterances for the no-instruction group. Although these baseline rates are low, they may nevertheless have contributed to poorer recognition rates (see Butzberger et al. [2]). They may also be indicative of subtle speech style differences between the two groups not captured by the coding of hyperarticulation.</Paragraph>
</Section> </Section> </Paper>