<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1045"> <Title>Data-driven Generation of Emphatic Facial Displays</Title>
<Section position="4" start_page="353" end_page="354" type="metho"> <SectionTitle> 3 Recording and Annotation </SectionTitle>
<Paragraph position="0"> The recording script for the data collection consisted of 444 sentences in the domain of the COMIC multimodal dialogue system; all of the sentences described one or more features of one or more bathroom-tile designs. The sentences were generated by the full COMIC output planner, and were selected to provide coverage of all of the syntactic patterns available to the system. In addition to the surface text, each sentence included all of the contextual information from the COMIC planner: the predicted pitch accents--selected according to Steedman's (2000) theory of information structure and intonation--along with any information from the user model and dialogue history. The sentences were presented one at a time to the speaker, who was instructed to read each sentence out loud as expressively as possible while looking into a camera directed at his face. The segments for which the presentation planner specified pitch accents were highlighted, and any applicable user-model and dialogue-history information was included. Figure 1 shows a sample prompt slide; the slide reads "46. More about the current design", gives the user-model information "they dislike the first feature, but like the second one", and shows the sentence "There are GEOMETRIC SHAPES on the decorative tiles, but the tiles ARE [...]" with the accented segments highlighted.</Paragraph>
<Paragraph position="1"> The recorded videos were annotated by the first author, using a purpose-built tool that allowed any set of facial displays to be associated with any segment of the sentence. First, the video was split into clips corresponding to each sentence. After that, the facial displays in each clip were annotated.</Paragraph>
<Paragraph position="2"> The following displays were considered: eyebrow raising and lowering; eye squinting; head nodding (up, small down, large down); head leaning (left and right); and head turning (left and right). Figure 2 shows examples of two typical display combinations: (a) a right turn with a brow raise, and (b) a left lean with brow lowering. Any combination of these facial displays could be associated with any of the relevant segments in the text. The relevant segments included all mentions of tile-design properties (e.g., colours, designers), modifiers such as once again and also, deictic determiners (this, these), and verbs in contrastive contexts (e.g., are in Figure 1). The annotation scheme treated all facial displays as batons rather than underliners (Ekman, 1979); that is, each display was associated with a single segment. If a facial display spanned a longer phrase in the speech, it was annotated as a series of identical batons on each of the segments.</Paragraph>
<Paragraph position="3"> Any predicted pitch accents and dialogue-history and user-model information from the COMIC presentation planner were also associated with each segment, as appropriate. We chose not to restrict our annotation to those segments with predicted pitch accents, because the speaker also made a large number of facial displays on segments with no predicted pitch accent; instead, we incorporated the predicted accent as an additional contextual factor. For the most part, the pitch accents used by the speaker followed the specifications on the slides. We did not explicitly consider the rhetorical or syntactic structure, as did, e.g., de Carolis et al. (2000); in general, the structure was fully determined by the context.</Paragraph>
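To make the annotation format concrete, here is a minimal sketch of how one annotated segment could be represented. The record layout, field names, and example values are our own illustration (the paper does not describe the tool's data format); they simply collect the information listed above: the displays on the segment plus the pitch-accent, user-model, and dialogue-history context.

```python
# Illustrative only: a possible record for one annotated segment.
# Field names and values are assumptions, not the annotation tool's format.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AnnotatedSegment:
    text: str                                   # surface form, e.g. "are"
    semantic_class: Optional[str] = None        # e.g. "DECORATION" for "artwork"
    displays: List[str] = field(default_factory=list)  # batons on this segment
    pitch_accent: Optional[str] = None          # accent predicted by the planner
    user_model: Optional[str] = None            # expected evaluation: "positive"/"negative"
    mentioned_before: bool = False              # dialogue history: already described?
    contrastive: bool = False                   # contrasted with a previous design?

# A contrastive verb segment carrying a small downward nod plus a brow raise
# (one of the frequent combinations reported in Section 3).
example = AnnotatedSegment(
    text="are",
    displays=["nod_small_down", "brow_raise"],
    pitch_accent="H*",          # hypothetical label; the paper follows Steedman (2000)
    contrastive=True,
)
```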
<Paragraph position="4"> There were a total of 1993 relevant segments in the recorded sentences. Overall, the most frequent display combination was a small downward nod on its own, which occurred on just over 25% of the segments. The second-largest class was no motion at all (20% of the segments), followed by downward nods (large and small) accompanied by brow raises. Further down the order come the various lateral motions; for this speaker, these were primarily turns to the right (Figure 2(a)) and leans to the left (Figure 2(b)).</Paragraph>
<Paragraph position="5"> The distribution of facial displays in specific contexts differed from the overall distribution. The biggest influence was the user-model evaluation: left leans, brow lowering, and eye squinting were all relatively more frequent on objects with negative user-model evaluations, while right turns and brow raises occurred more often in positive contexts. Other factors also had an influence: for example, nodding and brow raises were both more frequent on segments for which the COMIC planner specified a pitch accent. Foster (2006) gives a detailed analysis of these recordings.</Paragraph> </Section>
<Section position="5" start_page="354" end_page="355" type="metho"> <SectionTitle> 4 Modelling the Corpus Data </SectionTitle>
<Paragraph position="0"> We built a range of models using the data from the annotated corpus to select facial displays to accompany generated text. For each segment in the text, a model selected a display combination from among the displays used by the speaker in a similar context. All of the models used the corpus counts of displays associated with the segments directly, with no back-off or smoothing. The models differed from one another in two ways: the amount of context that they used, and the way in which they made a selection within a context. There were three levels of context. No context: these models used the overall corpus counts for all segments.</Paragraph>
<Paragraph position="1"> Surface only: these models used only the context provided by the word(s)--or, in some cases, a domain-specific semantic class. For example, a model would use the class DECORATION rather than the specific word artwork. Full context: in addition to the surface form, these models also used the pitch-accent specifications and contextual information supplied by the COMIC presentation planner. The contextual information was associated with the tile-design properties included in the sentence and indicated (a) whether that property had been mentioned before, (b) whether it was explicitly contrasted with a property of a previous design, and (c) the expected user evaluation of that property.</Paragraph>
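As a rough sketch of how the three context levels might be realised, the functions below map a segment (using the illustrative AnnotatedSegment record sketched after Section 3 above) to a lookup key for the corpus counts. The key format is an assumption for illustration, not a detail taken from the paper.

```python
# Illustrative only: one way to key the corpus counts at each context level.

def no_context_key(seg):
    # Every segment shares a single key, so selection uses the overall counts.
    return ()

def surface_key(seg):
    # The word(s) of the segment, or a domain-specific semantic class where one
    # applies (e.g. DECORATION rather than the specific word "artwork").
    return (seg.semantic_class or seg.text,)

def full_context_key(seg):
    # Surface form plus the planner's pitch accent and the contextual factors:
    # prior mention, explicit contrast, and expected user evaluation.
    return (seg.semantic_class or seg.text,
            seg.pitch_accent,
            seg.mentioned_before,
            seg.contrastive,
            seg.user_model)
```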
<Paragraph position="2"> Within a context, there were two strategies for selecting a facial display. Majority: choose the combination that occurred the largest number of times in the context.</Paragraph>
<Paragraph position="3"> Weighted: make a random choice from all combinations seen in the context, weighting the choice according to the relative frequency.</Paragraph>
<Paragraph position="4"> For example, in the no-context case, a majority-choice model would choose the small downward nod (the majority option) for every segment, while a weighted-choice model would choose a small downward nod with probability 0.25, no motion with probability 0.2, and the other displays with correspondingly decreasing probabilities.</Paragraph>
<Paragraph position="5"> These two factors produced a set of 6 models in total (3 context levels × 2 selection strategies). Throughout the rest of this paper, we will use two-character labels to refer to the models. The first character of each label indicates the amount of context that was used, while the second indicates the selection method within that context: for example, SM corresponds to a model that used the surface form only and made a majority choice.</Paragraph> </Section>
<Section position="6" start_page="355" end_page="356" type="metho"> <SectionTitle> 5 Evaluation 1: Cross-validation </SectionTitle>
<Paragraph position="0"> We first compared the performance of the models using 10-fold cross-validation against the corpus.</Paragraph>
<Paragraph position="1"> For each fold, we built models using 90% of the sentences in the corpus, and then used those models to predict the facial displays for the sentences in the other 10% of the corpus. We measured the recall and precision on a sentence by comparing the predicted facial displays for each segment to the actual displays used by the speaker and averaging those scores across the sentence. We then used the recall and precision scores for a sentence to compute a sentence-level F score.</Paragraph>
<Paragraph position="2"> Averaged across all of the cross-validation folds, the NM model had the highest recall score, while the FM model scored highest for precision and F score. Figure 3 shows the average sentence-level F score for all of the models. All but one of the differences shown are significant at the p < 0.01 level on a paired t-test; the performance of the NM and FW models was indistinguishable on F score, although the FW model scored higher on precision and the NM model on recall.</Paragraph>
<Paragraph position="3"> That the majority-choice models generally scored better on this measure than the weighted-choice models is not unexpected: a weighted-choice model is more likely to choose a less common display, and if it chooses one in a context where the speaker did not, the score for that sentence is decreased. It is also not surprising that, within a selection strategy, the models that take into account more of the context did better than those that use less of it; this is simply an indication that there are patterns in the corpus, and that all of the contextual information contributes to the selection of displays.</Paragraph> </Section>
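Pulling Sections 4 and 5 together, the sketch below illustrates a count table over context keys, the two selection strategies, and one plausible reading of the sentence-level scoring. The helper names, the count-table layout, and the handling of segments with no displays are our own assumptions, not details from the paper.

```python
# Illustrative only: majority vs. weighted selection over per-context counts,
# plus a possible sentence-level F score along the lines of Section 5.
import random
from collections import Counter, defaultdict

counts = defaultdict(Counter)          # counts[context_key][display_combination]

def train(segments, key_fn):
    for seg in segments:
        counts[key_fn(seg)][tuple(sorted(seg.displays))] += 1

def majority_choice(key):
    # The most frequent combination in this context (the small downward nod
    # on its own, in the no-context case).
    return counts[key].most_common(1)[0][0]

def weighted_choice(key):
    # A random draw weighted by relative frequency (roughly 0.25 for the small
    # downward nod and 0.2 for no motion, in the no-context case).
    combos, freqs = zip(*counts[key].items())
    return random.choices(combos, weights=freqs, k=1)[0]

def sentence_f(predicted, actual):
    # Per-segment precision/recall of the predicted display set against the
    # speaker's displays, averaged over the sentence, then combined into F.
    # (Treating two empty display sets as a perfect match is our assumption.)
    precisions, recalls = [], []
    for pred, act in zip(predicted, actual):
        pred, act = set(pred), set(act)
        overlap = len(pred & act)
        precisions.append(overlap / len(pred) if pred else float(not act))
        recalls.append(overlap / len(act) if act else float(not pred))
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return 2 * p * r / (p + r) if p + r else 0.0
```

Under this reading, building one of the six models amounts to calling train with one of the three key functions and then using either majority_choice or weighted_choice at generation time.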
<Section position="7" start_page="356" end_page="357" type="metho"> <SectionTitle> 6 Evaluation 2: User Ratings </SectionTitle>
<Paragraph position="0"> The majority-choice models performed better on the cross-validation study than the weighted-choice ones did; however, this does not mean that users will necessarily like their output in practice. In a number of contexts, much of the lateral motion and eyebrow movement falls into the second- or third-largest class, and is therefore less likely to be selected by a majority-choice model. If users like to see motion other than simple nodding, it might be that the schedules generated by the weighted-choice models are actually preferred. To address this question, we performed a user evaluation.</Paragraph>
<Section position="1" start_page="356" end_page="356" type="sub_section"> <SectionTitle> 6.1 Experiment Design </SectionTitle>
<Paragraph position="0"> Materials: For this study, we generated 30 new sentences from the COMIC system. The sentences were selected to ensure that they covered the full range of syntactic structures available to COMIC and that none of them was a duplicate of anything from the recording script. We then generated a facial schedule for each sentence using each of the six models. Note that, for some of the sentences, more than one model produced an identical sequence of facial displays, either because the majority choice in a broader context was the same as in a narrower context, or because a weighted-choice model ended up selecting the majority option in every case. All such identical schedules were retained in the set of materials; in Section 6.2, we discuss their impact on the results.</Paragraph>
<Paragraph position="1"> We then made videos of every schedule for every sentence, using the Festival speech synthesiser (Clark et al., 2004) and the RUTH talking head (DeCarlo et al., 2004). Figure 4 shows synthesised versions of the facial displays from Figure 2.</Paragraph>
<Paragraph position="2"> Procedure: 33 subjects took part in the experiment: 17 female and 16 male. They were primarily undergraduate students, between 20 and 24 years old, native speakers of English, with an intermediate amount of computer experience. Each subject was shown videos of all 30 sentences in an individually chosen random order. For each sentence, the subject saw two versions, each generated by a different model, and was asked to choose which version they liked better. The displayed versions were counterbalanced so that every subject performed each pairwise comparison of models twice, once in each order. The study was run over the web.</Paragraph> </Section>
<Section position="2" start_page="356" end_page="357" type="sub_section"> <SectionTitle> 6.2 Results </SectionTitle>
<Paragraph position="0"> Figure 5(a) shows the overall preference rates for all of the models. (We do not include those trials where both videos were identical; if these are included, the results are similar, but the distinctions described here just fail to reach significance.)
For each model, the value shown on that graph indicates the proportion of the time that model was chosen over any of the alternatives. For example, in all of the trials where the FW model was one of the options, it was chosen over the alternative 55% of the time. Note that the values on that graph should not be directly compared against one another; instead, each should be individually compared with 0.5 (the dotted line) to determine whether it was chosen more or less frequently than chance. A binomial test on these values indicates that the schedules generated by the FW and NW models were chosen significantly above chance, while those generated by the SM and NM models were chosen significantly below chance (all p < 0.05). The choices on the FM and SW models were indistinguishable from chance.</Paragraph>
<Paragraph position="1"> If we examine the preferences within a context, we also see a preference for the weighted-choice models. Figure 5(b) shows the preferences for selection strategy within each context. For example, when choosing between schedules both generated by models using the full context (FM vs. FW), subjects chose the one generated by the FW model 60% of the time. The trend in both the full-context and no-context cases is in favour of the weighted-choice models, and the combined values over all such trials (the rightmost pair of bars in the figure) show a significant preference for weighted choice over majority choice across all contexts (binomial test; p < 0.05).</Paragraph>
<Paragraph position="2"> Gender differences: There was a large gender effect on the users' preferences: overall, the male subjects (n = 16) tended to choose the majority and weighted versions with almost equal probabilities, while the female subjects (n = 17) strongly preferred the weighted versions in any context, and chose the weighted versions significantly more often in head-to-head comparisons (p < 0.001). In fact, all of the overall preference for weighted-choice models came from the responses of the female subjects. The graphs in Figure 6 show the head-to-head preferences in all contexts for both groups of subjects.</Paragraph> </Section> </Section>
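The chance-level comparisons in Section 6.2 amount to a binomial test of each model's preference rate against 0.5. The sketch below shows such a test using SciPy; the paper does not name the software it used, and the counts here are invented purely for illustration.

```python
# Illustrative only: was this model chosen more often than chance (0.5)?
# Requires SciPy >= 1.7 for binomtest; the counts are hypothetical.
from scipy.stats import binomtest

chosen, trials = 190, 330    # hypothetical: trials in which the model appeared
result = binomtest(chosen, trials, p=0.5)
print(f"preference rate = {chosen / trials:.2f}, p = {result.pvalue:.4f}")
```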
<Section position="8" start_page="357" end_page="358" type="metho"> <SectionTitle> 7 Discussion </SectionTitle>
<Paragraph position="0"> The predicted rankings from the cross-validation study differ from those in the human evaluation: while the cross-validation gave the highest scores to the majority-choice models, the human judges actually showed an overall preference for the weighted-choice models. This supports our hypothesis that humans would prefer generated output that reproduces more of the variation in the corpus, even if the choices made on specific sentences differ from those made in the corpus. When Belz and Reiter (2006) performed a similar study comparing natural-language generation systems that used different text-planning strategies, they found similar results: automated measures tended to favour majority-choice strategies, while human judges preferred those that made weighted choices. In general, this sort of automated measure will always tend to favour strategies that, on average, do not diverge far from what is found in the corpus, which indicates a drawback to using such measures alone to evaluate generation systems where variation is expected.</Paragraph>
<Paragraph position="1"> The current study also suggests a further drawback to corpus-based evaluation: users may vary systematically amongst themselves in what they prefer. All of the overall preference for weighted-choice models came from the female subjects; the male subjects did not express any significant preference either way, but had a mild preference for the majority-choice models. Previous studies on embodied conversational agents have found gender effects that appear related to this result: Robertson et al. (2004) found that, among schoolchildren, girls preferred a tutoring system that included an animated agent, while boys preferred one that did not; White et al. (2005) found that a more expressive talking head decreased male subjects' task performance when using the full COMIC system; and Bickmore and Cassell (2005) found that women trusted the REA agent more in embodied mode, while men trusted her more over the telephone. Taken together, these results imply that male users prefer, and perform better with, an embodied agent that is less expressive and shows less variation in its motions, and may even prefer a system that does not have an agent at all. These results are independent of the gender of the agent: the COMIC agent is male, REA is female, and the gender of Robertson's agents was mixed. In any case, there is more general evidence that females have superior abilities in facial expression recognition (Hall, 1984).</Paragraph> </Section> </Paper>