<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-3001">
  <Title>Incorporating Gesture and Gaze into Multimodal Models of Human-to-Human Communication</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In human communication, ideas tend to unfold in a structured way. For example, an individual speaker organizes his or her utterances into sentences. When a speaker makes errors during the dynamic process of speech production, those errors may be corrected through a speech repair scheme. A group of speakers in a meeting organizes its utterances according to a floor control scheme. All of these structures are helpful for building better models of human communication, but they are not explicit in spontaneous speech or in the corresponding transcription word string. In order to utilize these structures, it is necessary first to detect them, and to do so as efficiently as possible. Exploiting various kinds of knowledge is important; for example, lexical and prosodic knowledge (Liu, 2004; Liu et al., 2005) has been used to detect structural events.</Paragraph>
    <Paragraph position="1"> Human communication tends to utilize not only speech but also visual cues such as gesture and gaze. Some studies (McNeill, 1992; Cassell and Stone, 1999) suggest that gesture and speech stem from a single underlying mental process and are related both temporally and semantically.</Paragraph>
    <Paragraph position="2"> Gestures play an important role in human communication but use quite different expressive mechanisms from spoken language. Gaze has been found to be widely used in coordinating multi-party conversations (Argyle and Cook, 1976; Novick, 2005).</Paragraph>
    <Paragraph position="3"> Given the close relationship between non-verbal cues and speech, and the special expressive capacity of non-verbal cues, we believe that these cues are likely to provide additional important information that can be exploited when modeling structural events. Hence, in my Ph.D. thesis, I have been investigating the combination of lexical, prosodic, and non-verbal cues for the detection of the following structural events: sentence units, speech repairs, and meeting floor control.</Paragraph>
    <Paragraph position="4"> This paper is organized as follows: Section 1 has described the research goals of my thesis. Section 2 summarizes the work completed toward these goals.</Paragraph>
    <Paragraph position="5"> Section 3 lays out the research work needed to complete my thesis.</Paragraph>
  </Section>
</Paper>