<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-2027">
  <Title>Information structure and pauses in a corpus of spoken Danish</Title>
  <Section position="2" start_page="0" end_page="191" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Interest in corpora annotated with information structure has recently been raised by several authors. Kruijff-Korbayová and Kruijff (2004) describe a method where a rich discourse-level annotation is used to investigate information structure, while both Postolache (2005) and Diderichsen and Elming (2005) study the application of machine learning to the problem of automatic identification of topic and focus. In this study, by contrast, information structure is annotated manually, and the annotation is used to investigate the correlation between information structure tags and intra-clausal pauses.</Paragraph>
    <Paragraph position="1"> 2 Annotating information structure The starting point for this study was the corpus of spoken Danish 'DanPass' (Grønnum, 2005), a collection of 54 monologues produced by 18 different subjects dealing with three well-defined tasks, following the methodology established in Terken (1985). In the first task, the subjects describe a geometrical network; in the second, the process of assembling the drawing of a house out of existing pieces; and in the third, they solve a map task. The corpus has been annotated on several tiers, including orthography, phonetic transcription, pauses and PoS tags. Two independent annotators then added tags for focus and topic, following a set of simple guidelines and using the Praat tool to carry out the annotation.</Paragraph>
    <Paragraph position="2"> The annotation reflects the assumption that a sentence can be divided into an obligatory focus part, which expresses the non-presupposed information, and a presupposed background part.</Paragraph>
    <Paragraph position="3"> A referent in the background part may function as the sentence topic in the sense of Lambrecht (1994). For each sentence in the corpus, the annotators were asked to identify what they intuitively considered non-presupposed information and annotate it as belonging to the focus. Technically, a focus tag is added to each word belonging to the focus. The annotators were also asked to test whether they could single out a sentence referent by means of the &quot;What about X&quot; test (Reinhart, 1981). If they could, they were asked to add topic tags to all the words making up the corresponding expression. Words not bearing any tag are considered part of the background.</Paragraph>
    <Paragraph position="4"> The guidelines did not contain any reference to pausing, nor did the annotators know that their work would be used to study the correlation between pauses and information structure. In fact, that was not the purpose of the annotation work, which is of more general interest. It should also be noted that the annotators were not explicitly instructed to code phrases, since we did not want to make the assumption that topic or focus necessarily correspond to syntactic phrases. Approximately two person-months were spent annotating two sections of the corpus. The kappa score varied between 0.7 and 0.8 depending on the corpus section, indicating acceptable inter-annotator agreement. Most disagreements relate to the identification of the left-hand boundary of the focus, where one of the annotators sometimes identified wider focus domains than the other. These differences have not been inspected yet, but will be used to revise the guidelines to produce a single, consistent annotation. Table (1) shows the number of tags assigned by the two coders (C1 and C2) in the two sections of the corpus coded so far.</Paragraph>
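The agreement figure reported above can be illustrated with a minimal sketch of Cohen's kappa computed over per-word labels. The source does not say which kappa variant was used or what the unit of comparison was, so both choices are assumptions here. The function name `cohen_kappa`, the label set F (focus), T (topic) and B (background), and the toy label sequences are hypothetical and are not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both coders must label the same items")
    n = len(labels_a)
    if n == 0:
        raise ValueError("no items to compare")
    # Observed agreement: share of items that received the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of the two
    # coders' marginal proportions for that label.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used one and the same label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data (invented): one label per word, F = focus, T = topic,
# B = background. The coders disagree on one focus left-hand boundary.
coder_1 = ["B", "B", "F", "F", "F", "B", "T", "T", "B", "F"]
coder_2 = ["B", "B", "B", "F", "F", "B", "T", "T", "B", "F"]

print(cohen_kappa(coder_1, coder_2))
```

In the toy data the single disagreement is a word that one coder includes in the focus and the other leaves in the background, which mirrors the kind of boundary disagreement described in the text.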
    <Paragraph position="5"> Below, an example of an annotated tier is shown in a linearised format (the textgrids output by Praat also contain time intervals that link the transcription to the sound file):</Paragraph>
    <Paragraph position="7"> 'PAUSE above PAUSE there is [F a PAUSE green circle] PAUSE and above [T the green circle] there is [F a PAUSE purple triangle]' The example consists of two sentences. In the first, the annotator has tagged 'en grøn cirkel' (a green circle) as the focus; in the second, 'den grønne cirkel' (the green circle) has been tagged as the topic, while 'en lilla trekant' (a purple triangle) is tagged as the focus. Pauses are indicated by '+' and '='. The former is a silent pause, and the latter a pause accompanied by a sound, like 'hmm'. Pauses were already available in the orthographic transcription of the corpus, which was produced earlier by different annotators.</Paragraph>
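The linearised format above can be read back into per-word tags with a short parser. The following is a minimal sketch, assuming that pause markers appear as whitespace-separated PAUSE tokens, that brackets are never nested, and that untagged words count as background (B), as the annotation scheme states. The names `parse_tier`, `pause_counts` and `EXAMPLE` are hypothetical, and the sketch does not read Praat textgrid files.

```python
from collections import Counter

# English gloss of the example tier, with pause markers separated by spaces.
EXAMPLE = (
    "PAUSE above PAUSE there is [F a PAUSE green circle] PAUSE and above "
    "[T the green circle] there is [F a PAUSE purple triangle]"
)


def parse_tier(line):
    """Return (token, tag) pairs; tag is 'F', 'T' or 'B' (background)."""
    # Make sure every bracket marker becomes a token of its own.
    tokens = line.replace("[", " [").replace("]", " ] ").split()
    current = "B"
    pairs = []
    for tok in tokens:
        if tok in ("[F", "[T"):
            if current != "B":
                raise ValueError("nested brackets are not expected")
            current = tok[1]
        elif tok == "]":
            if current == "B":
                raise ValueError("closing bracket without an opening one")
            current = "B"
        else:
            pairs.append((tok, current))
    if current != "B":
        raise ValueError("unclosed bracket")
    return pairs


def pause_counts(pairs, pause_token="PAUSE"):
    """Count pause tokens by the information structure tag they fall in."""
    return dict(Counter(tag for tok, tag in pairs if tok == pause_token))


pairs = parse_tier(EXAMPLE)
print(pairs[:8])
print(pause_counts(pairs))
```

Counting pauses by tag in this way is one simple starting point for relating pauses to focus and topic domains; it is an illustration only and not the analysis procedure of the study.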
  </Section>
</Paper>