<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1318">
  <Title>Measuring annotator agreement in a complex hierarchical dialogue act annotation scheme</Title>
  <Section position="3" start_page="0" end_page="126" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The DIT++ tagset (Bunt, 2005) was designed to combine in one comprehensive annotation scheme the communicative functions of dialogue acts distinguished in Dynamic Interpretation Theory (DIT; Bunt, 2000; Bunt and Girard, 2005) and many of those in DAMSL (Allen and Core, 1997) and in other annotation schemes. An important difference between the DIT++ and DAMSL schemes is the more elaborate and fine-grained set of functions for feedback and other aspects of dialogue control available in DIT, partly inspired by the work of Allwood (Allwood et al., 1993). As it is often thought that more elaborate and fine-grained annotation schemes are difficult for annotators to apply consistently, we decided to address this issue in the annotation experiment on which we report in this paper. A frequently used way of evaluating human dialogue act classification is inter-annotator agreement. Agreement is sometimes measured as the percentage of cases on which the annotators agree, but more often expected agreement is taken into account by using the kappa statistic (Cohen, 1960; Carletta, 1996), which is given by:</Paragraph>
    <Paragraph position="1"> k = (po - pe) / (1 - pe) </Paragraph>
    <Paragraph position="2"> where po is the observed proportion of agreement and pe is the proportion of agreement expected by chance. Ever since its introduction in general (Cohen, 1960) and in computational linguistics (Carletta, 1996), many researchers have pointed out that there are quite a few problems with the use of k (e.g.</Paragraph>
    <Paragraph position="3"> (Di Eugenio and Glass, 2004)), one of which is the discrepancy between po and k for skewed class distributions.</Paragraph>
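As a concrete illustration of this discrepancy (a minimal Python sketch, not part of the original paper; the label sequences are invented), Cohen's k can be computed directly from two annotators' label sequences:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    n = len(labels_a)
    # Observed proportion of agreement.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the product of the annotators' label marginals.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[lab] * cb[lab] for lab in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Skewed class distribution: 19 of 20 labels are "inform" for each annotator.
a = ["inform"] * 18 + ["confirm", "inform"]
b = ["inform"] * 18 + ["inform", "confirm"]
print(cohen_kappa(a, b))  # p_o = 0.90, yet kappa = -1/19 (about -0.053)
```

Although the annotators agree on 90% of the items, chance agreement on the dominant label is even higher (p_e = 0.905), so kappa comes out slightly negative.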
    <Paragraph position="4"> Another is that the degree of disagreement is not taken into account, which is relevant for any non-nominal scale. To address this problem, a weighted k has been proposed (Cohen, 1968) that penalizes disagreements according to their degree rather than treating all disagreements equally. Arguably, in a similar way, characteristics of dialogue acts in a particular taxonomy, and possible pragmatic similarity between them, should be taken into account when expressing annotator agreement. For dialogue act taxonomies that are structured in a meaningful way, such as those that express hierarchical relations between the concepts in the taxonomy, the taxonomic structure can be exploited to express how much annotators disagree when they choose different concepts that are directly or indirectly related. Recent work that accounts for some of these aspects is a metric for automatic dialogue act classification (Lesch et al., 2005) that uses distance in a hierarchical structure of multidimensional labels.</Paragraph>
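To make the idea concrete, here is a minimal sketch of a weighted kappa whose disagreement cost depends on taxonomic distance between labels. This is not the paper's ktw: the toy two-level taxonomy, its parent map, and the 0 / 0.5 / 1 distance weights are invented for illustration.

```python
def weighted_kappa(labels_a, labels_b, weight):
    """Weighted kappa (Cohen, 1968): 1 - d_o / d_e, where d_o and d_e are
    the observed and chance-expected mean disagreement weights."""
    n = len(labels_a)
    d_o = sum(weight(x, y) for x, y in zip(labels_a, labels_b)) / n
    # Expected disagreement if the two annotators labelled independently.
    d_e = sum(weight(x, y) for x in labels_a for y in labels_b) / (n * n)
    return 1 - d_o / d_e

# Hypothetical toy taxonomy: sibling labels under the same parent count as
# only half a disagreement; unrelated labels count fully.
parent = {"ack": "feedback", "check": "feedback", "inform": "task"}

def taxonomy_weight(x, y):
    if x == y:
        return 0.0
    return 0.5 if parent[x] == parent[y] else 1.0

a = ["ack", "check", "inform"]
b = ["check", "check", "inform"]
# The single confusion (ack vs check) is between siblings, so it is
# penalized less than an unrelated-label disagreement would be.
print(weighted_kappa(a, b, taxonomy_weight))
```

With identity weights (0 for equal labels, 1 otherwise) this reduces to unweighted kappa, which is the sense in which a taxonomy-dependent weighting generalizes it.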
    <Paragraph position="5"> In the following sections of this paper, we first briefly describe the dimensions of the DIT++ scheme and highlight the taxonomic characteristics that will turn out to be relevant at a later stage. We then introduce ktw, a variant of weighted k for inter-annotator agreement that adopts a taxonomy-dependent weighting, and discuss its use.</Paragraph>
  </Section>
</Paper>