<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2711">
  <Title>The SAMMIE Multimodal Dialogue Corpus Meets the Nite XML Toolkit</Title>
  <Section position="3" start_page="0" end_page="69" type="metho">
    <SectionTitle>
2 Experiment Setup
</SectionTitle>
    <Paragraph position="0"> 24 subjects in SAMMIE-1 and 35 in SAMMIE-2 performed several tasks with an MP3 player application simulated by a wizard. For SAMMIE-1 we had two, for SAMMIE-2 six wizards. The tasks involved searching for titles and building playlists satisfying various constraints. Each session was 30 minutes long. Both users and wizards could speak freely. The interactions were in German (although most of the titles and artist names in the database were English).</Paragraph>
    <Paragraph position="1"> SAMMIE-2 had a more complex setup. The tasks the subjects had to fulfill were divided in two classes: with vs. without operating a driving simulator. When presenting the search results, the wizards were free to produce monoor multimodal output as they saw fit; they could speak freely and/or select one of four automatically generated screen outputs, which contained tables and lists of found songs/albums. The  natural language and/or selecting items on the screen. Both wizard and user utterances were immediately transcribed. The wizard's utterances were presented to the user via a speech synthesizer. To simulate acoustic understanding problems, the wizard sometimes received only part of the transcribed user's utterance, to elicit CRs. (See (Kruijff-Korbayov'a et al., 2005) for details.)</Paragraph>
  </Section>
  <Section position="4" start_page="69" end_page="69" type="metho">
    <SectionTitle>
3 Collected Data
</SectionTitle>
    <Paragraph position="0"> The SAMMIE-2 data for each session consists of a video and audio recording and a log file.5 The gathered logging information per session consists of Open Agent Architecture (Martin et al., 1999) (OAA) messages in chronological order, each marked by a timestamp. The log files contain various information, e.g., the transcriptions of the spoken utterances, the wizard's database query and the number of results, the screen option chosen by the wizard, classification of clarification requests (CRs), etc.</Paragraph>
  </Section>
  <Section position="5" start_page="69" end_page="70" type="metho">
    <SectionTitle>
4 Annotation Methods and Tools
</SectionTitle>
    <Paragraph position="0"> The rich set of features we are interested in naturally gives rise to a multi-layered view of the corpus, where each layer is to be annotated independently, but subsequent investigations involve exploration and automatic processing of the integrated data across layers.</Paragraph>
    <Paragraph position="1"> There are two crucial technical requirements that must be satisfied to make this possible: (i) stand-off annotation at each layer and (ii) alignment of base data across layers. Without the former, we could not keep the layers separate, without the latter we would not be able to align the separate layers. An additional equally important requirement is that elements at different layers of annotation should be allowed to have overlapping spans; this is crucial because, e.g., prosodic units and syntactic phrases need not coincide.</Paragraph>
    <Paragraph position="2"> Among the existing toolkits that support multi-layer annotation, it was decided to use NXT (Carletta et al., 2003)6 in the TALK project. The NXT-based SAMMIE-2 corpus we  are demonstrating has been created in several steps: (1) The speech data was manually transcribed using the Transcriber tool.7 (2) We automatically extracted features at various annotation layers by parsing the OAA messages in the log files. (3) We automatically converted the transcriptions and the information from the log files into our NXT-based data representation format; features annotated in the transcriptions and features automatically extracted from the log files were assigned to elements at the appropriate layers of representation in this step.</Paragraph>
    <Paragraph position="3"> Manual annotation: We use tools specifically designed to support the particular annotation tasks. We describe them below.</Paragraph>
    <Paragraph position="4"> As already mentioned, we used Transcriber for the manual transcriptions. We also performed certain relatively simple annotations directly on the transcriptions and coded them in-line by using special notation. This includes the identification of self-speech, the identification of expressions referring to domain objects (e.g., songs, artists and albums) and the identification of utterances that convey the results of database queries. For other manual annotation tasks (the annotation of CRs, task segmentation and completion, referring expressions and the relations between them) we have been building specialized tools based on the NXT library of routines for building displays and interfaces based on Java Swing (Carletta et al., 2003). Although NXT comes with a number of example applications, these are tightly coupled with the architecture of the corpora they were built for. We therefore developed a core basic tool for our own corpus; we modify this tool to suite each annotation task. To facilitate tool development, NXT provides GUI elements linked directly to corpora elements and support for handling complex multi-layer corpora. This proved very helpful.</Paragraph>
    <Paragraph position="5"> Figure 4 shows a screenshot of our CR annotation tool. It allows one to select an utterance in the left-hand side of the display by clicking on it, and then choose the attribute values from the pop-down lists on the right-hand side. Cre- null ating relations between elements and creating elements on top of other elements (e.g., words or utterances) are extensions we are currently implementing (and will complete by the time of the workshop). First experiences using the tool to identify CRs are promising.8 When demonstrating the system we will report the reliability of other manual annotation tasks.</Paragraph>
    <Paragraph position="6"> Automatic annotation using indexing: NXT also provides a facility for automatic annotation based on NiteQL query matches (Carletta et al., 2003). Some of our features, e.g., the dialogue history ones, can be easily derived via queries.</Paragraph>
  </Section>
  <Section position="6" start_page="70" end_page="70" type="metho">
    <SectionTitle>
5 The SAMMIE NXT Data Model
</SectionTitle>
    <Paragraph position="0"> NXT uses a stand-off XML data format that consist of several XML files that point to each other.</Paragraph>
    <Paragraph position="1"> The NXT data model is a multi-rooted tree with arbitrary graph structure. Each node has one set of children, and can have multiple parents.</Paragraph>
    <Paragraph position="2"> Our corpus consists of the following layers.</Paragraph>
    <Paragraph position="3"> Two base layers: words and graphical output events; both are time-aligned. On top of these, structural layers correspond to one session per subject, divided into task sections, which consist of turns, and these consist of individual utterances, containing words. Graphical output events will be linked to turns at a featural layer.</Paragraph>
    <Paragraph position="4"> Further structural layers are defined for CRs and dialogue acts (units are utterances), domain objects and discourse entities (units are expressions consisting of words). We keep independent layers of annotation separate, even when they can in principle be merged into a single hierarchy.</Paragraph>
    <Paragraph position="5"> Figure 2 shows a screenshot made with Amigram (Lauer et al., 2005), a generic tool for browsing and searching NXT data. On the left-hand side one can see the dependencies between the layers. The elements at the respective layers are displayed on the right-hand side.</Paragraph>
    <Paragraph position="6"> Below we indicate the features per layer: * Words: Time-stamped words and other sounds; we mark self-speech, pronunciation, deletion status, lemma and POS.</Paragraph>
    <Paragraph position="7"> 8Inter-annotator agreement of 0.788 (k corrected for prevalence).</Paragraph>
    <Paragraph position="8"> * Graphical output: The type and amount of information displayed, the option selected by the wizard, and the user's choices.</Paragraph>
    <Paragraph position="9"> * Utterances: Error rates due to word deletion, and various features describing the syntactic structure, e.g., mood, polarity, diathesis, complexity and taxis, the presence of marked syntactic constructions such as ellipsis, fronting, extraposition, cleft, etc. * Turns: Time delay, dialogue duration so far, and other dialogue history features, i.e.</Paragraph>
    <Paragraph position="10"> values which accumulate over time.</Paragraph>
    <Paragraph position="11"> * Domain objects and discourse entities: Properties of referring expressions reflecting the type and information status of discourse entities, and coreference/bridging links between them.</Paragraph>
    <Paragraph position="12"> * Dialogue acts: DAs based on an agent-based approach to dialogue as collaborative problem-solving (Blaylock et al., 2003), e.g., determining joint objectives, finding and instantiating recipes to accomplish them, executing recipes and monitoring for success. We also annotate propositional content and the database queries.</Paragraph>
    <Paragraph position="13"> * CRs: Additional features including the source and degree of uncertainty, and characteristics of the CRs strategy.</Paragraph>
    <Paragraph position="14">  * Tasks: A set of features for estimating user satisfaction online for reinforcement learning (Rieser et al., 2005).</Paragraph>
    <Paragraph position="15"> * Session: Subject and wizard information, user questionnaire aswers, and accumulating attribute values from other layers.</Paragraph>
  </Section>
class="xml-element"></Paper>