File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/w98-0311_metho.xml
Size: 10,830 bytes
Last Modified: 2025-10-06 14:15:09
<?xml version="1.0" standalone="yes"?> <Paper uid="W98-0311"> <Title>Some Exotic Discourse Markers of Spoken Dialog</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 A Cue for Back-channel Feedback </SectionTitle> <Paragraph position="0"> In Japanese and in English, the prosody of the speaker's utterances can mark times when the listener is welcome to produce back-channel feedback.</Paragraph> <Paragraph position="1"> One specific cue is a region of low pitch. Behavior of Japanese listeners can be modeled with the following rule: Upon detection of a) a region of pitch less than the 28th-percentile pitch level and b) continuing for at least 110ms,</Paragraph> <Paragraph position="3"> Bunkyo-ku, Tokyo 113-8656, Japan.</Paragraph> <Paragraph position="4"> Thanks to Wataru Tsukahara, Dan Jurafsky, and anonymous reviewers for suggestions, and to the Hayao Nakayama Foundation and the Inamori Foundation for support.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Corpus-based Evaluation </SectionTitle> <Paragraph position="0"> We have tested the predictions of the above rule against a corpus of casual, mostly friendly Japanese conversations, mostly between students (80 minutes total).</Paragraph> <Paragraph position="1"> The rule gave a coverage of 56% (of the 873 back-channels in the corpus, the rule predicted 496, using a 500 ms tolerance) and an accuracy of 34% (of 1447 predictions, 496 were correct).
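The two clauses of the low pitch rule that survive in this section, together with the coverage and accuracy scoring just described, can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The pitch track is assumed to be a list of per-frame values in Hz with 0 marking unvoiced frames. The 10 ms frame step is hypothetical. The 28th percentile is assumed to be taken over the speaker's own voiced frames. Clauses c, d, and e, which are not reproduced here, are omitted. A prediction is assumed to match a back-channel when the two lie within 500 ms of each other in either direction. The helper names (FRAME_MS, nearest_rank_percentile, low_pitch_predictions, score) are illustrative only.

```python
import math

FRAME_MS = 10  # hypothetical frame step of the pitch tracker


def nearest_rank_percentile(values, p):
    """Nearest-rank p-th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def low_pitch_predictions(pitch_hz, frame_ms=FRAME_MS, percentile=28, min_ms=110):
    """Times (ms) at which clauses a) and b) of the low pitch rule fire.

    pitch_hz has one value per frame, with 0 for unvoiced frames.
    Assumption: the percentile is taken over this speaker's voiced frames.
    Clauses c), d) and e) are not modeled.
    """
    voiced = [f for f in pitch_hz if f > 0]
    if not voiced:
        return []
    threshold = nearest_rank_percentile(voiced, percentile)
    needed = math.ceil(min_ms / frame_ms)  # frames the region must last
    predictions, run = [], 0
    for i, f in enumerate(pitch_hz):
        # Clause a): a voiced frame below the percentile threshold extends the run.
        run = run + 1 if f > 0 and threshold > f else 0
        # Clause b): fire once, at the end of the frame where the run reaches min_ms.
        if run == needed:
            predictions.append((i + 1) * frame_ms)
    return predictions


def score(predicted_ms, actual_ms, tolerance_ms=500):
    """Coverage, accuracy, and figure of merit (their product).

    Assumption: a prediction and a back-channel match if they lie within
    tolerance_ms of each other; each side is counted independently.
    """
    def near(t, others):
        return any(tolerance_ms >= abs(t - o) for o in others)

    hits_actual = sum(near(a, predicted_ms) for a in actual_ms)
    hits_pred = sum(near(p, actual_ms) for p in predicted_ms)
    coverage = hits_actual / len(actual_ms) if actual_ms else 0.0
    accuracy = hits_pred / len(predicted_ms) if predicted_ms else 0.0
    return coverage, accuracy, coverage * accuracy


if __name__ == "__main__":
    track = [200] * 20 + [100] * 15 + [200] * 20
    preds = low_pitch_predictions(track)
    print(preds)
    print(score(preds, [500, 5000]))
```

With the example track, the rule fires once, at 310 ms, when the 100 Hz stretch has lasted 110 ms. Counting back-channels and predictions independently is a simplification; a one-to-one matching would be stricter. As an arithmetic check, (496/873) * (496/1447) is about 0.195, the figure of merit reported for the low pitch rule.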
The low pitch rule did better than a rule which makes predictions at random (while obeying clauses c, d, and e), both on average (Table 1) and for most speakers in most conversations: in 34 cases out of 36 (2 sides times 18 conversations) the figure of merit (namely the product of coverage and accuracy) was higher for the low pitch rule.
Table 1 (Japanese corpus):
Predictions from | Coverage | Accuracy | Figure of merit
low pitch regions | 56% (496/873) | 34% (496/1447) | .195
random | 25% (222/873) | 24% (222/915) | .062
utterance end | 68% (593/873) | 22% (593/2751) | .146
utterance end and low pitch region | 36% (314/873) | 32% (277/978) | .115
utterance end and no low pitch region | 32% (279/873) | 16% (279/1773) | .050
eavesdropping human judge (estimate) | 95% | 67% | .64
</Paragraph> <Paragraph position="2"> Analysis of the false predictions for Japanese suggests that roughly half are due to inter-speaker differences in back-channel behavior; this is the source of the &quot;eavesdropping&quot; estimate.</Paragraph> <Paragraph position="3"> Similar results were obtained for a corpus of 68 minutes of English conversation data for the English rule (Table 2), although the accuracy was much lower, suggesting that in English, factors other than low pitch are relatively more important than in Japanese.</Paragraph> </Section> <Section position="5" start_page="0" end_page="62" type="metho"> <SectionTitle> 4 Informal Experiments </SectionTitle> <Paragraph position="0"> We have also tested the performance of the rules in live conversation, mostly over the telephone. We used a human decoy to start the conversation. After the subject started talking, the decoy shut up and the system took over, producing back-channel feedback in response to regions of low pitch. This feedback was un for Japanese and a random selection between uh-huh and mm for English. We used pre-recorded samples of the decoy's voice to make it impossible for subjects to distinguish between the decoy's live voice and the system's output. If the conversation flagged,
the decoy would speak up with another question or comment to get the subject talking again.</Paragraph> <Paragraph position="1"> We have done a few dozen runs in both Japanese and English. In general, third-party judges listening to the conversations could distinguish the low-pitch-based aizuchis (back-channels) from randomly produced ones: the former sounded natural and the latter sounded odd, with clear cases of inappropriate aizuchis and of inappropriate silences when an aizuchi was called for. However, those who were actually talking to the system were apparently seldom aware of, or much affected by, when or whether back-channels were produced.</Paragraph> </Section> <Section position="6" start_page="62" end_page="62" type="metho"> <SectionTitle> 5 Communicative Functions </SectionTitle> <Paragraph position="0"> The 110ms low pitch region has no single fixed meaning or function, at least in our data.</Paragraph> <Paragraph position="1"> It often co-occurs with the completion of a grammatical clause. Here it often seems to indicate that the speaker considers that he has transmitted some new information, so the hearer is welcome to confirm receipt, understanding, or interest with a back-channel. (We can think of it as conveying &quot;this completes that thought, did you follow?&quot;) Sometimes what has been transmitted is a complete new fact or proposition, but often, especially in Japanese, it is the introduction of just enough information for the listener to infer the speaker's point. In such cases back-channel feedback can appear before the speaker has completed a grammatical phrase or logical proposition, and sometimes it takes the form of completing the speaker's thought or sentence.</Paragraph> <Paragraph position="2"> The low pitch region also often co-occurs with repetitions of a word previously spoken, produced for emphasis or clarity and/or when recovering from a false start, especially in English.
In such cases it often welcomes back-channel feedback, perhaps conveying &quot;I said it again, did you get it that time?&quot; The low pitch region also occurs frequently with disfluencies and markers of formulation difficulty, especially in English. In these cases we can imagine that the low pitch region conveys &quot;I'm stuck, but keep listening; something meaningful will come out soon.&quot; It also occasionally occurs as a speaker takes the floor. In some of these cases, especially in Japanese, it elicits back-channel feedback, presumably as encouragement to continue.</Paragraph> <Paragraph position="3"> The low pitch region also often occurs together with back-channel feedback itself.</Paragraph> <Paragraph position="4"> Such low-pitch back-channels themselves occasionally elicit a confirmatory word or sigh.</Paragraph> </Section> <Section position="7" start_page="62" end_page="63" type="metho"> <SectionTitle> 6 Co-occurring Markers </SectionTitle> <Paragraph position="0"> The low pitch cue tends to occur together with other discourse markers.</Paragraph> <Paragraph position="1"> It co-occurs frequently with specific lexical items, as one would expect from the communicative functions identified in the previous section. In Japanese it often occurs with clause connectives (most commonly kara, -te, and kedo), with 'agreement-seeking sentence-final particles', especially ne, and with the back-channel un. In English the association with specific lexical items is less strong, but the low pitch region falls most frequently on 'the' (almost always in the lengthened, unreduced pronunciation indicating difficulty finding the next word to say), 'and', and 'um'. The low pitch region is often followed by silence at the end of a speaker's utterance.
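The comparison with silence made in this section can be sketched in the same style. This is a hypothetical sketch, not the authors' code. It assumes a per-frame energy track in dB with a 10 ms step and treats frames quieter than a made-up -40 dB threshold (SILENCE_DB) as silent. It fires once a silence has lasted 150ms, the duration used for the paper's silence rule, with clauses c, d, and e again omitted. It calls a silence-triggered prediction "with low pitch" if a low pitch prediction occurred up to 500 ms before it; that window is arbitrary and chosen only for illustration. Splitting the silence predictions this way mirrors the "utterance end and (no) low pitch region" rows and is one way to ask how much the silence cue adds beyond the low pitch cue. The function names are illustrative.

```python
import math

SILENCE_DB = -40.0  # hypothetical energy threshold; not given in the paper


def silence_predictions(energy_db, frame_ms=10, min_ms=150, silence_db=SILENCE_DB):
    """Times (ms) at which a silence has lasted min_ms (clauses c-e omitted)."""
    needed = math.ceil(min_ms / frame_ms)
    predictions, run = [], 0
    for i, e in enumerate(energy_db):
        # A frame quieter than the threshold extends the current silence.
        run = run + 1 if silence_db > e else 0
        # Fire once, at the end of the frame where the silence reaches min_ms.
        if run == needed:
            predictions.append((i + 1) * frame_ms)
    return predictions


def split_by_low_pitch(silence_ms, low_pitch_ms, window_ms=500):
    """Partition silence-triggered predictions by whether a low pitch
    prediction fell within window_ms before them (hypothetical window)."""
    with_low, without_low = [], []
    for t in silence_ms:
        if any(window_ms >= t - p >= 0 for p in low_pitch_ms):
            with_low.append(t)
        else:
            without_low.append(t)
    return with_low, without_low
```

Each of the two groups could then be scored for coverage and accuracy against the hand-labeled back-channels.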
The energy drop that marks the start of such a silence is, counterintuitively, not much of a cue for back-channel feedback: it provides little or no information beyond that provided by the low pitch cue, as seen in the utterance-end rows of the tables (these results are for a rule which predicts a back-channel in response to 150ms of silence, subject to clauses c, d, and e of the corresponding low pitch rule).</Paragraph> <Paragraph position="2"> This also implies that the low pitch region is often a valid cue even when it appears in the middle of an utterance.</Paragraph> <Paragraph position="3"> The low pitch region occasionally segues into a rise in intonation (uptalk). This seems to turn an invitation for feedback into a demand for it.</Paragraph> <Paragraph position="4"> The low pitch cue sometimes co-occurs with vowel lengthening. This may be a consequence of the need to produce a low pitch region of sufficient length in cases where there is only a single syllable of lexical content to work with, for example with ne ('you know').</Paragraph> <Paragraph position="5"> Gaze, posture, and facial and hand gestures may also correlate with low pitch regions.</Paragraph> <Paragraph position="6"> Given all these correlations, it is natural to wonder whether it is necessary to invoke the notion of a low pitch cue to explain the data.
We have found that no other single factor can account for all the occurrences of back-channels that low pitch regions can; to say more will require further analysis.</Paragraph> </Section> <Section position="8" start_page="63" end_page="63" type="metho"> <SectionTitle> 7 Responses Evoked </SectionTitle> <Paragraph position="0"> There have been many studies of the lexical items used in back-channels and the types of semantic functions they serve.</Paragraph> <Paragraph position="1"> This section discusses one problematic subset of back-channels: sounds which do not seem to be words. For example, in the Japanese corpus, in addition to the ubiquitous un, there are also uu, uh, uun, ununun, huun, huh, hmmm, hm-um, and over a hundred other items not found in dictionaries, with diverse prosody and voicing. In English, there is one family containing uh-huh, um-hm, uh-hm, hm, hmm, mm, un, and ahhh, and another containing okay, kay, and n-kay.</Paragraph> <Paragraph position="2"> Rather than consider each of these a distinct lexical item, a more parsimonious account may be reached by means of the following hypotheses: A: these sounds are not fixed sequences of phonemes, but are formed for each occasion from basic acoustic components; B: these acoustic components individually bear meanings; C: the meaning of a combination of acoustic components is the combination of the meanings of each component.</Paragraph> <Paragraph position="3"> Some specific hypothesized meanings, for Japanese and possibly for English too, are agreement for nasalization, contemplation for m, deference for breathiness and h, willingness to listen for number of syllables, and coldness for sharpness of the final energy drop. In addition, energy, pitch height, and pitch slope appear to bear their usual meanings.</Paragraph> <Paragraph position="4"> If these hypotheses are correct, then these conversational sounds are &quot;iconic&quot;, or, in other words, involve &quot;sound symbolism&quot; or &quot;synaesthesia&quot;.
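Hypotheses A to C amount to a compositional mapping from acoustic components to meanings. The toy Python sketch below is offered only as an illustration of that idea, not as the author's model. It encodes the hypothesized component meanings listed above in a lookup table and combines them by collecting the meanings of the components present (hypothesis C). The component names (COMPONENT_MEANINGS keys), the function name meaning_of, the choice to treat the syllable count as a degree of willingness to listen, and the output format are all assumptions. Energy, pitch height, and pitch slope are not modeled.

```python
# Hypothesized component meanings from the text; the keys are illustrative names.
COMPONENT_MEANINGS = {
    "nasalization": "agreement",
    "m": "contemplation",
    "breathiness": "deference",
    "h": "deference",
    "sharp_final_drop": "coldness",
}


def meaning_of(components, syllables):
    """Combine component meanings (hypothesis C) for one back-channel sound.

    components: set of component names detected in the sound (hypothesis A).
    syllables: syllable count, read here as degree of willingness to listen.
    """
    # Each known component contributes its own meaning (hypothesis B);
    # the set removes duplicates such as breathiness together with h.
    meanings = sorted({COMPONENT_MEANINGS[c] for c in components if c in COMPONENT_MEANINGS})
    meanings.append(f"willingness to listen (degree {syllables})")
    return meanings
```

Under this toy encoding, a sound analyzed as nasalized, containing m, and three syllables long maps to agreement, contemplation, and willingness to listen of degree 3.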
Such a discourse-functional system of sound symbolism appears to be distinct from the onomatopoeic and mimetic systems of sound symbolism.</Paragraph> </Section> </Paper>