<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4039"> <Title>Converting Text into Agent Animations: Assigning Gestures to Text</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Linguistic Theories and Gesture Studies </SectionTitle> <Paragraph position="0"> In this section we review linguistic theories and discuss the relationship between gesture occurrence and syntactic information.</Paragraph> <Paragraph position="1"> Linguistic quantity for reference: McNeill (McNeill, 1992) used communicative dynamism (CD), which represents the extent to which the message at a given point is 'pushing the communication forward' (Firbas, 1971), as a variable that correlates with gesture occurrence. The greater the CD, the more probable the occurrence of a gesture. As a measure of CD, McNeill chose the amount of linguistic material used to make the reference (Givon, 1985). Pronouns have less CD than full nominal phrases (NPs), which have less CD than modified full NPs. This implies that the CD can be estimated by looking at the syntactic structure of a sentence.</Paragraph> <Paragraph position="2"> Theme/Rheme: McNeill also asserted that the theme (Halliday, 1967) of a sentence usually has the least CD and is not normally accompanied by a gesture. Gestures usually accompany the rhemes, which are the elements of a sentence that plausibly contribute information about the theme, and thus have greater CD. In Japanese grammar there is a device for marking the theme explicitly. Topic marking postpositions (or &quot;topic markers&quot;), typically &quot;wa,&quot; mark a nominal phrase as the theme. This facilitates the use of syntactic analysis to identify the theme of a sentence. Another interesting aspect of information structure is that in English grammar, a whinterrogative (what, how, etc.) at the beginning of a sentence marks the theme and indicates that the content of the theme is the focus (Halliday, 1967). However, we do not know whether such a special type of theme is more likely to co-occur with a gesture or not.</Paragraph> <Paragraph position="3"> Given/New: Given and new information demonstrate an aspect of theme and rheme. Given information usually has a low degree of rhematicity, while new information has a high degree. This implies that rhematicity can be estimated by determining whether the NP is the first mention (i.e., new information) or has already been mentioned (i.e., old or given information).</Paragraph> <Paragraph position="4"> Contrastive relationship: Prevost (1996) reported that intonational accent is often used to mark an explicit contrast among the salient discourse entities. On the basis of this finding and Kendon's theory about the relationship between intonation phrases and gesture placements (Kendon, 1972), Cassell & Prevost (1996) developed a method for generating contrastive gestures from a semantic representation. In syntactic analysis, a contrastive relation is usually expressed as a coordination, which is a syntactic structure including at least two conjuncts linked by a conjunction.</Paragraph> <Paragraph position="5"> Figure 1 shows an example of the correlation between gesture occurrence and the dependency structure of a Japanese sentence. Bunsetsu units (8)-(9) and (10)-(13) in the figure are conjuncts. A &quot;bunsetsu unit&quot; in Japanese corresponds to a phrase in English, such as a noun phrase or a prepositional phrase. Each conjunct is accompanied by a gesture. 
<Section position="4" start_page="0" end_page="2" type="metho"> <SectionTitle> 3 Empirical Study </SectionTitle>
<Paragraph position="0"> To identify linguistic features that might be useful for judging gesture occurrence, we videotaped seven presentation talks and transcribed three minutes of each. The collected data included 2124 bunsetsu units and 343 gestures.</Paragraph>
<Paragraph position="1"> Gesture Annotation: Three coders discussed how to code half of the data and reached a consensus on gesture occurrence. After this consensus on the coding scheme was established, one of the coders annotated the rest of the data. Inter-coder reliability among the three coders in categorizing the gestures (beat, iconic, etc.) was sufficiently high (Kappa = 0.81); although we did not measure agreement on gesture occurrence itself, this result suggests that the coders had very similar schemes for recognizing gestures. A gesture consists of preparation, stroke, and retraction (McNeill, 1992), and a stroke co-occurs with the most prominent syllable (Kendon, 1972). Thus, we annotated the stroke time as well as the start and end time of each gesture.
Linguistic Analysis: Each bunsetsu unit was automatically annotated with linguistic information using a Japanese syntactic analyzer (Kurohashi & Nagao, 1994). To prevent the effects of parsing errors, errors in syntactic dependency analysis were corrected manually for about 13% of the data.</Paragraph>
<Paragraph position="2"> The information was determined by asking the following questions for each bunsetsu unit.</Paragraph>
<Paragraph position="3"> (a) If it is an NP, is it modified by a clause or a complement?
(b) If it is an NP, what type of postpositional particle marks its end (e.g., &quot;wa&quot;, &quot;ga&quot;, &quot;wo&quot;)?
(c) Is it a wh-interrogative?
(d) Have all the content words in the bunsetsu unit been mentioned in a preceding sentence?
(e) Is it a constituent of a coordination?
Moreover, as we noticed that some lexical entities frequently co-occurred with a gesture in our data, we used the syntactic analyzer to annotate additional lexical information based on the following questions.
(f) Is the bunsetsu unit an emphatic adverbial phrase (e.g., very, extremely), or is it modified by a preceding emphatic adverb (e.g., very important issue)?
(g) Does it include a cue word (e.g., now, therefore)?
(h) Does it include a numeral (e.g., thousands of people, 99 times)?
We then investigated the correlation between these lexical and syntactic features and the occurrence of gesture strokes (see the sketch below).</Paragraph>
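The following is a minimal sketch of the per-bunsetsu feature extraction behind questions (a)-(h). The class, field names, and tiny lexicons are our illustrative assumptions, not the paper's code; in practice these fields would be filled by the output of the Japanese dependency parser.

```python
# Sketch (assumed, not the paper's implementation) of extracting the
# boolean/categorical features (a)-(h) for one bunsetsu unit.
from dataclasses import dataclass
from typing import List, Optional, Set

EMPHATIC_ADVERBS = {"totemo", "hijouni"}   # e.g., "very", "extremely" (assumed lexicon)
CUE_WORDS = {"sate", "dakara"}             # e.g., "now", "therefore" (assumed lexicon)

@dataclass
class Bunsetsu:
    content_words: List[str]
    is_np: bool = False
    modified_by_clause: bool = False        # (a)
    final_particle: Optional[str] = None    # (b): "wa", "ga", "wo", ...
    is_wh: bool = False                     # (c)
    in_coordination: bool = False           # (e)
    has_numeral: bool = False               # (h)

def extract_features(b: Bunsetsu, mentioned: Set[str]) -> dict:
    """Answer questions (a)-(h); `mentioned` holds words from preceding sentences."""
    return {
        "np_modified": b.is_np and b.modified_by_clause,             # (a)
        "case_marker": b.final_particle if b.is_np else None,        # (b)
        "wh_interrogative": b.is_wh,                                 # (c)
        "given": all(w in mentioned for w in b.content_words),       # (d)
        "coordination": b.in_coordination,                           # (e)
        "emphatic_adv": any(w in EMPHATIC_ADVERBS
                            for w in b.content_words),               # (f), approximated
        "cue_word": any(w in CUE_WORDS for w in b.content_words),    # (g)
        "numeral": b.has_numeral,                                    # (h)
    }

# Usage: an accusative NP ("wo") whose content words are all unmentioned is new.
b = Bunsetsu(content_words=["hon"], is_np=True, final_particle="wo")
print(extract_features(b, mentioned=set())["given"])   # False: "hon" is new information
```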
<Paragraph position="4"> Result: The results are summarized in Table 1. The baseline gesture occurrence frequency was 10.1% per bunsetsu unit (a gesture occurred about once every ten bunsetsu units). A gesture most frequently co-occurred with a bunsetsu unit forming a coordination (47.7%). When an NP was modified by a full clause, it was accompanied by a gesture 38.2% of the time. For the other types of noun phrases, including pronouns, when an accusative NP marked with the case marker &quot;wo&quot; conveyed new information (i.e., it had not been mentioned in a previous sentence), a gesture co-occurred with the phrase 28.1% of the time. Moreover, gesture strokes frequently co-occurred with wh-interrogatives (41.4%), cue words (41.5%), and numerals (39.3%). Gesture strokes occurred more frequently right after an emphatic adverb (35.0%) than on the adverb itself (24.4%). The cases listed in Table 1 had a 3 to 5 times higher probability of gesture occurrence than the baseline and accounted for 75% of all the gestures observed in the data. Our results suggest that these types of lexical and syntactic information can be used to distinguish where a gesture should be assigned from where one should not. They also indicate that the syntactic structure of a sentence affects gesture occurrence more strongly than theme/rheme or given/new status signaled by local grammatical cues, such as topic markers and case markers.</Paragraph> </Section>
<Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 System Implementation </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.1 Overview </SectionTitle>
<Paragraph position="0"> We used our results to build a presentation agent system, SPOC (Stream-oriented Public Opinion Channel). This system enables a user to embody a story (written text) as a multimodal presentation featuring video, graphics, speech, and character animation. A snapshot of the SPOC viewer is shown in Figure 2.</Paragraph>
<Paragraph position="1"> To implement a storyteller in SPOC, we developed an agent behavior generation system called &quot;CAST&quot; (Conversational Agent System for neTwork applications). Taking text input, CAST automatically selects agent gestures and other nonverbal behaviors, calculates an animation schedule, and produces synthesized voice output for the agent. As shown in Figure 2, CAST consists of four main components: (1) the Agent Behavior Selection Module (ABS), (2) the Language Tagging Module (LTM), (3) the agent animation system, and (4) a text-to-speech engine (TTS). The received text input is first sent to the ABS. The ABS selects appropriate gestures and facial expressions based on the linguistic information calculated by the LTM. It then obtains timing information from the TTS and calculates a time schedule for the set of agent actions. The output of the ABS is a set of animation instructions that can be interpreted and executed by the agent animation system.</Paragraph> </Section>
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.2 Determining Agent Behaviors </SectionTitle>
<Paragraph position="0"> Tagging linguistic information: First, the LTM parses the input text and calculates the linguistic information described in Sec. 3. For example, bunsetsu (9) in Figure 1 has the following feature set.</Paragraph>
<Paragraph position="1"> {Text-ID: 1, Sentence-ID: 1, Bunsetsu-ID: 9, Govern: 8, Depend-on: 13, Phrase-type: VP, Linguistic-quantity: NA, Casemarker: NA, WH-interrogative: false, Given/New: new, Coordinate-with: 13, Emphatic-Adv: false, Cue-Word: false, Numeral: false} The text ID of this bunsetsu unit is 1, the sentence ID is 1, and the bunsetsu ID is 9. The bunsetsu governs bunsetsu 8 and depends on bunsetsu 13. It conveys new information and, together with bunsetsu 13, forms a parallel phrase.</Paragraph>
<Paragraph position="2"> Assigning gestures: Then, for each bunsetsu unit, the ABS decides whether to assign a gesture based on the empirical results shown in Table 1. For example, bunsetsu unit (9) shown above matches case C4 in Table 1, in which a bunsetsu unit is a constituent of a coordination; in this case, the system assigns a gesture to the bunsetsu with 47.7% probability. In the current implementation, if a specific gesture for an emphasized concept is defined in the gesture animation library (e.g., a gesture animation expressing &quot;big&quot;), it is preferred to a &quot;beat gesture&quot; (a simple flick of the hand or fingers up and down (McNeill, 1992)). If a specific gesture is not defined, a beat gesture is used as the default.</Paragraph>
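Here is a minimal sketch of how this assignment step might look. The probabilities come from the figures reported in Section 3; the rule table, function names, and gesture library are illustrative assumptions, not the actual CAST/ABS implementation.

```python
# Sketch (assumed): probabilistic gesture assignment per bunsetsu unit.
# Each matching Table 1 case fires with its observed gesture probability;
# a concept-specific animation is preferred over the default beat gesture.
import random

# Illustrative subset of Table 1: (case name, feature predicate, probability).
RULES = [
    ("coordination", lambda f: f["Coordinate-with"] is not None, 0.477),  # case C4
    ("modified_np",  lambda f: f["Linguistic-quantity"] == "large", 0.382),
    ("wh",           lambda f: f["WH-interrogative"], 0.414),
    ("cue_word",     lambda f: f["Cue-Word"], 0.415),
    ("numeral",      lambda f: f["Numeral"], 0.393),
]

GESTURE_LIBRARY = {"big": "iconic_big"}   # assumed concept-specific animations

def assign_gesture(features: dict, concept=None):
    """Return a gesture animation name for one bunsetsu unit, or None."""
    for name, predicate, prob in RULES:
        if predicate(features) and random.random() < prob:
            # Prefer a concept-specific animation; fall back to a beat gesture.
            return GESTURE_LIBRARY.get(concept, "beat")
    return None

# Bunsetsu (9) from Figure 1: coordinated with bunsetsu 13, new information.
features = {"Coordinate-with": 13, "Linguistic-quantity": None,
            "WH-interrogative": False, "Cue-Word": False, "Numeral": False}
print(assign_gesture(features))   # "beat" with probability 0.477, else None
```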
<Paragraph position="3"> The output of the ABS is stored in XML format. The type of each action and its start and end times are indicated by XML tags. In the example shown in Figure 3, the agent first gazes toward the user. It then performs contrastive gestures at the second and sixth bunsetsu units and a beat gesture at the eighth bunsetsu unit. Finally, the ABS transforms the XML into a time schedule by accessing the TTS engine and estimating the phoneme and bunsetsu boundary timings. The scheduling technique is similar to that described by Cassell et al. (2001). The ABS also assigns visemes for lip-sync, as well as facial and head behaviors such as eye gaze, blinking, eyebrow movement, and head movement.</Paragraph> </Section> </Section> </Paper>