<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1318"> <Title>Measuring annotator agreement in a complex hierarchical dialogue act annotation scheme</Title> <Section position="4" start_page="126" end_page="127" type="metho"> <SectionTitle> 2 Annotation using DIT </SectionTitle> <Paragraph position="0"> DIT is a context-change (or information-state update) approach to the analysis of dialogue, which describes utterance meaning in terms of context update operations called 'dialogue acts'. A dialogue act in DIT has two components: (1) the semantic content, being the objects, events, properties, relations, etc. that are considered; and (2) the communicative function, which describes how the addressee is intended to use the semantic content for updating his context model when he understands the utterance correctly. DIT takes a multidimensional view of dialogue in the sense that speakers may use utterances to address several aspects of the communication simultaneously, as reflected in the multifunctionality of utterances. One such aspect is the performance of the task or activity for which the dialogue takes place; another is the monitoring of each other's attention, understanding and uptake through feedback acts; others include the turn-taking process and the timing of communicative actions; yet another aspect is formed by the social obligations that may arise, such as greeting, apologising, or thanking. The various aspects of communication that can be addressed independently are called dimensions (Bunt and Girard, 2005; Bunt, 2006). The DIT++ tagset distinguishes 11 dimensions, each containing a number of communicative functions that are specific to that dimension, such as TURN GIVING, PAUSING, and APOLOGY.</Paragraph> <Paragraph position="1"> Besides dimension-specific communicative functions, DIT also distinguishes a layer of communicative functions that are not specific to any particular dimension but can be used to address any aspect of communication. These functions, which include questions, answers, statements, and commissive as well as directive acts, are called general-purpose functions. A dialogue act falls within a specific dimension if it has a communicative function specific to that dimension, or if it has a general-purpose function and a semantic content relating to that dimension. Dialogue utterances can in principle have a function (but never more than one) in each of the dimensions, so annotators using the DIT++ scheme can assign at most one tag for each of the 11 dimensions to any given utterance.</Paragraph> <Paragraph position="2"> Both within the set of general-purpose communicative function tags and within the sets of dimension-specific tags, tags can be hierarchically related in such a way that a label lower in a hierarchy is more specific than a label higher in the same hierarchy. Tag F1 is more specific than tag F2 if F1 defines a context update operation that includes the update operation corresponding to F2.</Paragraph> <Paragraph position="3"> For instance, consider a part of the taxonomy for general-purpose functions (Figure 1).</Paragraph> <Paragraph position="4"> For an utterance to be assigned a YN-QUESTION, we assume the speaker believes that the addressee knows the truth value of the proposition presented.</Paragraph> <Paragraph position="5"> For an utterance to be assigned a CHECK, we assume the speaker additionally has a weak belief that the proposition that forms the semantic content is true. And for a POSI-CHECK, there is the additional assumption that the speaker believes (weakly) that the hearer also believes that the proposition is true.1 Similar to the hierarchical relations between YN-QUESTION, CHECK, and POSI-CHECK, other parts of the annotation scheme contain hierarchically related functions.</Paragraph>
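These inheritance relations can be made concrete in a minimal sketch. The condition strings below paraphrase the definitions just given (S = speaker, A = addressee, p = the proposition); the set representation is purely illustrative and not DIT++ notation:

```python
# Precondition sets for the Yes/No-question family, paraphrased from the text.
PRECONDITIONS = {"YN-QUESTION": {"S believes A knows the truth value of p"}}
PRECONDITIONS["CHECK"] = PRECONDITIONS["YN-QUESTION"] | {"S weakly believes p"}
PRECONDITIONS["POSI-CHECK"] = PRECONDITIONS["CHECK"] | {"S weakly believes that A believes p"}

def more_specific(f1, f2):
    """F1 is more specific than F2 iff F1's update conditions strictly include F2's."""
    return PRECONDITIONS[f1] > PRECONDITIONS[f2]

assert more_specific("POSI-CHECK", "CHECK") and more_specific("CHECK", "YN-QUESTION")
```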
<Paragraph position="6"> The following example illustrates the use of DIT++ communicative functions for a very simple (translated) dialogue fragment2.</Paragraph> <Paragraph position="8"/> </Section> <Section position="5" start_page="127" end_page="558" type="metho"> <SectionTitle> 3 Agreement using k
3.1 Related work </SectionTitle> <Paragraph position="0"> Inter-annotator agreement has been calculated with the purpose of qualitatively evaluating tagsets and individual tags. For DAMSL, the first agreement results were presented in (Core and Allen, 1997), based on the analysis of TRAINS 91-93 dialogues (Gross et al., 1993; Heeman and Allen, 1995). In this analysis, 604 utterances were tagged, mostly by two annotators. Following the suggestions in (Carletta, 1996), Core et al. consider kappa scores above 0.67 to indicate significant agreement and scores above 0.8 reliable agreement. Another, more recent analysis was performed for 8 dialogues of the MONROE corpus (Stent, 2000), counting 2897 utterances in total, processed by two annotators for 13 DAMSL dimensions. Other analyses apply DAMSL-derived schemes (such as SWITCHBOARD-DAMSL) to various corpora (e.g. Di Eugenio et al., 1998; Shriberg et al., 2004). For the comprehensive DIT++ taxonomy, the work reported here represents the first investigation of annotator agreement.</Paragraph> <Section position="1" start_page="127" end_page="558" type="sub_section"> <SectionTitle> 3.2 Experiment outline </SectionTitle> <Paragraph position="0"> As noted, existing work on annotator agreement analysis has mostly involved only two annotators.</Paragraph> <Paragraph position="1"> It may be argued that, especially for the annotation of rather complex concepts, an odd number of annotators is desirable. First, it guarantees majority agreement unless all annotators choose entirely differently. Second, it makes it easier to deal with the undesirable situation in which one annotator chooses quite differently from the others. The agreement scores reported in this paper are all calculated on the basis of the annotations of three annotators, using the method proposed in (Davies and Fleiss, 1982).</Paragraph> <Paragraph position="2"> The dialogues that were annotated are task-oriented and are all in Dutch. To account for different complexities of interaction, both human-machine and human-human dialogues are considered. Moreover, the dialogues analyzed are drawn from different corpora: OVIS (Strik et al., 1997), DIAMOND (Geertzen et al., 2004), and a collection of Map Task dialogues (Caspers, 2000); see Table 1, where the number of annotated utterances is also indicated.</Paragraph> <Paragraph position="3">
corpus     domain                                          type   #utt
OVIS       TRAINS-like interactions on train connections   H-M    193
DIAMOND1   interactions on how to operate a fax device     H-M    131
DIAMOND2   interactions on how to operate a fax device     H-H    114
MAPTASK    HCRC Map Task-like interaction                  H-H    120

Table 1: Annotated dialogue material per corpus.

Six undergraduate students annotated the selected dialogue material. They had been introduced to the DIT++ annotation scheme and the underlying theory while participating in a course on pragmatics. During this course they were exposed to approximately four hours of lecturing and a few small annotation exercises. For all dialogues, the audio recordings were transcribed, and the annotators annotated presegmented utterances for which full agreement had been established on the segmentation level beforehand. During the annotation sessions the annotators had access, apart from the transcribed speech, to the audio recordings, to the on-line definitions of the communicative functions in the scheme, and to a very brief, 1-page set of annotation guidelines3. The task was facilitated by the use of an annotation tool that had been built for this occasion; this tool allowed the subjects to assign each utterance one DIT++ tag for each dimension, without any further constraints. In total 1,674 utterances were annotated.</Paragraph> </Section>
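As a minimal sketch of how scores over three annotators can be computed, the snippet below averages standard Cohen's kappa over annotator pairs, which is how the generalization is described in Section 4; the exact estimator of Davies and Fleiss (1982) may pool chance agreement differently, so treat this as an approximation:

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two annotators' label sequences.
    (Undefined when chance agreement is 1, i.e. both use a single label.)"""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    fa, fb = Counter(a), Counter(b)
    p_e = sum((fa[c] / n) * (fb[c] / n) for c in fa | fb)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def mean_pairwise_kappa(annotators):
    """Average kappa over all annotator pairs (three pairs for three annotators)."""
    pairs = list(combinations(annotators, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy labels for one dimension:
ann1 = ["YN-QUESTION", "CHECK", "ANSWER"]
ann2 = ["CHECK", "CHECK", "ANSWER"]
ann3 = ["YN-QUESTION", "CHECK", "ANSWER"]
print(mean_pairwise_kappa([ann1, ann2, ann3]))
```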
<Section position="2" start_page="558" end_page="558" type="sub_section"> <SectionTitle> 3.3 Problems with standard k </SectionTitle> <Paragraph position="0"> If we were to apply the standard k statistic to DIT++ annotations, we would not do justice to an important aspect of the annotation scheme: alternative tags can differ to different degrees, and hence so can the disagreement between annotators who choose alternative tags. An aspect in which the DIT++ scheme differs from other taxonomies for dialogue acts is that, as noted in Section 2, communicative functions (CFs) within a dimension, as well as general-purpose CFs, are often structured into hierarchies in which a difference in level represents a relation of specificity. When annotators assign tags that both belong to the same hierarchy, they may differ in the degree of specificity that they want to express, but they agree to the extent that these tags inherit the same elements from tags higher in the hierarchy. Inter-annotator disagreement is in such a case much smaller than if they had chosen two unrelated tags. This is for instance obvious in the following example of the annotations of two utterances by two annotators:</Paragraph> <Paragraph position="2"> With utterance 1, the annotators should be said simply to disagree (in fact, annotator 2 incorrectly assigns a YNQ function). Concerning utterance 2 the annotators also disagree, but Figure 1 and the definitions given in Section 2 tell us that the disagreement in this case is quite small, as a CHECK inherits the properties of a YNQ. We therefore should not use a black-and-white measure of agreement like the standard k, but should have a measure for partial annotator agreement.</Paragraph> <Paragraph position="3"> In order to measure partial (dis-)agreement between annotators in an adequate way, we should not just take into account whether two tags are hierarchically related, but also how far apart they are in the hierarchy, to reflect that two tags which are only one level apart are semantically more closely related than tags that are several levels apart. We will take this additional requirement into account when designing a weighted disagreement statistic in the next section.</Paragraph> </Section> </Section>
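To make "hierarchically related" and "levels apart" concrete, here is a minimal sketch over a toy fragment of the Figure 1 hierarchy; the parent links are assumptions, since the figure itself is not reproduced here. The helpers are reused in the sketch after Section 4.2:

```python
# Child -> parent links for a toy fragment of the CF taxonomy (assumed shape).
PARENT = {"CHECK": "YN-QUESTION", "POSI-CHECK": "CHECK"}

def ancestors(c):
    """All CFs above c in its hierarchy, nearest first."""
    out = []
    while c in PARENT:
        c = PARENT[c]
        out.append(c)
    return out

def same_branch(c1, c2):
    """Two CFs share a branch iff they are identical or one is an ancestor of the other."""
    return c1 == c2 or c1 in ancestors(c2) or c2 in ancestors(c1)

def levels_apart(c1, c2):
    """Number of levels between two CFs on the same branch (0 if identical)."""
    if c1 in ancestors(c2):
        return ancestors(c2).index(c1) + 1
    if c2 in ancestors(c1):
        return ancestors(c1).index(c2) + 1
    return 0  # callers should check same_branch first

assert levels_apart("YN-QUESTION", "POSI-CHECK") == 2  # two levels apart
```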
<Section position="6" start_page="558" end_page="558" type="metho"> <SectionTitle> 4 Agreement based on structural taxonomic properties </SectionTitle> <Paragraph position="0"> The agreement coefficient we are looking for should in the first place be weighted, in the sense that it takes into account the magnitude of disagreement. Two such coefficients are weighted kappa (kw, (Cohen, 1968)) and alpha (Krippendorff, 1980). For our purposes, we adopt kw because it takes into account a probability distribution typical of each annotator; we generalize it to the case of multiple annotators by taking the average over the scores of annotator pairs, and define a function to be used as distance metric.</Paragraph> <Section position="1" start_page="558" end_page="558" type="sub_section"> <SectionTitle> 4.1 Cohen's weighted k </SectionTitle> <Paragraph position="0"> Assuming the case of two annotators, let pij denote the proportion of utterances for which the first and second annotator assigned categories i and j, respectively, and let po and pe denote the observed and expected proportions of agreement, so that:

k = (po − pe) / (1 − pe)    (1)

Cohen defines kw in terms of disagreement rather than agreement: writing qo = 1 − po and qe = 1 − pe, Equation 1 can be rewritten as:

k = 1 − qo/qe    (2)

To arrive at kw, the proportions qo and qe in Equation 2 are replaced by weighted sums over all possible category pairs:

qo = Σ_{i,j} vij pij,    qe = Σ_{i,j} vij pi· p·j    (3)

where vij denotes the disagreement weight, and pi· and p·j denote the marginal proportions of category i for the first annotator and category j for the second. To calculate this weight we need to specify a distance function as metric.</Paragraph> </Section>
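A minimal sketch of Equations 2 and 3, assuming the two annotators' labels come as equal-length lists and the caller supplies the disagreement weight v (with v(c, c) = 0, rising to 1 for maximal disagreement):

```python
def weighted_kappa(a, b, v):
    """Cohen's kw per Equations 2-3: kw = 1 - qo/qe with weighted sums."""
    n = len(a)
    cats = sorted(set(a) | set(b))
    p = {(i, j): sum(x == i and y == j for x, y in zip(a, b)) / n
         for i in cats for j in cats}                       # joint proportions pij
    pa = {i: sum(x == i for x in a) / n for i in cats}      # marginals pi. (annotator 1)
    pb = {j: sum(y == j for y in b) / n for j in cats}      # marginals p.j (annotator 2)
    q_o = sum(v(i, j) * p[i, j] for i in cats for j in cats)
    q_e = sum(v(i, j) * pa[i] * pb[j] for i in cats for j in cats)
    return 1 - q_o / q_e

# With v(i, j) = 0 if i == j else 1, this reduces to standard kappa:
print(weighted_kappa(["A", "B", "A"], ["A", "B", "B"],
                     lambda i, j: 0.0 if i == j else 1.0))  # 0.4
```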
<Section position="2" start_page="558" end_page="558" type="sub_section"> <SectionTitle> 4.2 A taxonomic metric </SectionTitle> <Paragraph position="0"> The task of defining a function to calculate the difference between a pair of categories requires us to determine the semantic-pragmatic relatedness between the CFs in the taxonomy. For any annotation scheme, whether hierarchically structured or not, we could assign to each possible pair of categories a value that expresses the semantic-pragmatic relatedness of the two categories compared to all other possible pairs. However, it seems quite difficult to find universal characteristics of CFs that could be used to express relatedness on a ratio scale. When we consider a taxonomy that is structured in a meaningful way, in this case one that expresses hierarchical relations between CFs based on their effect on information states, the taxonomic structure can be exploited to express in a systematic fashion how much annotators disagree when they choose different concepts that are directly or indirectly related.</Paragraph> <Paragraph position="1"> The assignment of different CFs to a specific utterance by two annotators represents full disagreement in the following cases: 1. the two CFs belong to different dimensions; 2. one of the two CFs is general-purpose, the other dimension-specific;4 3. the two CFs belong to the same dimension but not to the same hierarchy; 4. the two CFs belong to the same hierarchy but are not located in the same branch. Two CFs are said to be located in the same branch when one of the two CFs is an ancestor of the other.</Paragraph> <Paragraph position="2"> If, by contrast, the two CFs take part in a parent-child relation within a hierarchy (either within a dimension or among the general-purpose CFs), then the CFs are related and this assignment represents partial disagreement. Let C be the (unordered) set of CFs.5 A metric d that measures this (dis)agreement should have the following properties: 1. d should be a real number normalized in the range [0...1]; 2. for every two CFs c1, c2 ∈ C, d(c1,c2) = 0 when c1 and c2 are not related; 3. for every communicative function c ∈ C, d(c,c) = 1; 4. for every two CFs c1, c2 ∈ C, d(c1,c2) = d(c2,c1).</Paragraph> <Paragraph position="3"> Furthermore, when c1 and c2 are related, we should specify how the distance between them in the hierarchy is expressed in terms of partial disagreement. For this, we should take the following aspect into account: the difference between c1 and c2 being located at two levels of depths n and n+1 might be considered greater than the difference between two levels of depth n+1 and n+2. If this were the case, the deeper two levels are located in the tree, the smaller the differences between the nodes on those levels. For the hierarchies in DIT, we keep the magnitude of disagreement linear with the difference in levels, and independent of level depth. Given the considerations above, we propose the following metric:

d(ci,cj) = a^(Δ(ci,cj) · b^Γ(ci,cj)) if ci and cj are located in the same branch (including ci = cj), and d(ci,cj) = 0 otherwise, where:

* a is a constant for which 0 < a ≤ 1, expressing how much distance there is between two adjacent levels in the hierarchy; a plausible value for a could be 0.75;
* Δ is a function that returns the difference in depth between the levels of ci and cj;
* b is a constant for which 0 < b ≤ 1, expressing at what rate differences should become smaller when the depth in the hierarchy gets larger. If there is no reason to assume that differences at a greater depth in the hierarchy are of smaller magnitude than differences at a smaller depth, then b = 1;
* Γ(ci,cj) is a function that returns the minimal depth of ci and cj.</Paragraph> <Paragraph position="6"> To provide some examples of how d would be calculated, let us consider the general-purpose functions in Figure 1. Consider also Figure 2, which represents two hierarchies of CFs in the auto-feedback dimension6, and let us assume for the various parameters the values suggested above. We then get the following calculations:</Paragraph> <Paragraph position="8"/> <Paragraph position="9"> To conclude, we can use d to determine the weighting in Cohen's kw and arrive at a coefficient which we will call taxonomically weighted kappa, denoted by ktw. Since d expresses relatedness (d(c,c) = 1, and d = 0 for unrelated CFs) whereas vij in Equation 3 is a disagreement weight, we take vij = 1 − d(ci,cj), which gives:

ktw = 1 − (Σ_{i,j} (1 − d(ci,cj)) pij) / (Σ_{i,j} (1 − d(ci,cj)) pi· p·j)</Paragraph> <Paragraph position="10"> 5 Strictly, such a tag is a pair consisting of the name of a general-purpose function and the name of a dimension. However, in view of the simplification mentioned in the previous note, for the sake of this paper we may as well consider tags containing a general-purpose function as simply consisting of that function.</Paragraph> </Section>
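A sketch of d and of the corresponding ktw weight, reusing the toy hierarchy and helpers from the sketch in Section 3.3; parameter values follow the suggestions above (a = 0.75, b = 1), and the depths are assumptions tied to that toy fragment:

```python
def depth(c):
    """Depth of a CF, taking the root of its hierarchy to be at depth 0 (assumption)."""
    return len(ancestors(c))

def d(c1, c2, a=0.75, b=1.0):
    """Taxonomic relatedness: 1 for identical CFs, 0 for unrelated ones,
    a ** (Delta * b ** Gamma) for distinct CFs on the same branch."""
    if c1 == c2:
        return 1.0
    if not same_branch(c1, c2):
        return 0.0
    delta = levels_apart(c1, c2)        # Delta: difference in depth
    gamma = min(depth(c1), depth(c2))   # Gamma: minimal depth
    return a ** (delta * b ** gamma)

# With a = 0.75 and b = 1 (level-independent differences):
print(d("YN-QUESTION", "CHECK"))       # 0.75   (one level apart)
print(d("YN-QUESTION", "POSI-CHECK"))  # 0.5625 (two levels apart)

# The disagreement weight for ktw, to plug into weighted_kappa above:
v_tw = lambda c1, c2: 1.0 - d(c1, c2)
```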
<Section position="3" start_page="558" end_page="558" type="sub_section"> <SectionTitle> 4.3 ktw statistics for DIT </SectionTitle> <Paragraph position="0"> Considering the DIT++ taxonomy, it may be argued that, due to the many hierarchies in the topology of the general-purpose functions, this is the part where most is to be gained by employing ktw.</Paragraph> <Paragraph position="1"> Table 2 shows the statistics for each dimension, averaged over all annotation pairs. By an annotation pair we understand the pair of assignments an utterance received from two annotators for a particular dimension. The figures in the table are based on those cases in which both annotators assigned a function to a specific utterance for a specific dimension. Cases where one annotator does not assign a function while the other does, or where both annotators do not assign a function, are not considered. Scores for standard k and ktw can be found in the first two columns. The column #pairs indicates on how many annotation pairs the statistics are based. The last column shows the ap-ratio. This figure indicates which fraction of all annotated functions in that dimension is represented by annotation pairs. When #ap denotes the number of annotation pairs and #pa denotes the number of partial annotations (annotations in which one annotator assigned a function and the other did not), the ap-ratio is calculated as #ap/(#pa + #ap). We can observe that, due to the taxonomic weighting, both feedback dimensions and the task dimension gain substantially in annotator agreement.</Paragraph> <Paragraph position="2"> 6 Auto-feedback: feedback on the processing (perception, understanding, evaluation, ...) of previous utterances by the speaker. DIT also distinguishes allo-feedback, where the speaker provides or elicits information about the addressee's processing.</Paragraph> <Paragraph position="3">
dimension                     k      ktw    #pairs  ap-ratio
own com. management           1.00   1.00      2    0.08
partner com. management       nav    nav       1    0.07
dialogue struct. management   0.74   0.74     15    0.31
social obl. management        1.00   1.00     61    0.80

Table 2: Scores for corrected k and ktw per DIT dimension.</Paragraph> <Paragraph position="4"> When we look at the agreement statistics and consider k scores above 0.67 to be significant and scores above 0.8 reliable, as is usual for k statistics, we find the dimensions TURN-MANAGEMENT, CONTACT MANAGEMENT, and SOCIAL-OBLIGATIONS-MANAGEMENT to be reliable and DIALOGUE STRUCT. MANAGEMENT to be significant. For some dimensions, the occurrences of functions in the annotated dialogue material were too few to draw conclusions. When we also take the ap-ratio into account, only the dimensions TASK, TIME MANAGEMENT, and SOCIAL-OBLIGATIONS-MANAGEMENT combine fair agreement on functions with fair agreement on whether or not to annotate in these dimensions. Especially for the other dimensions, the question should be raised in which cases and for what reasons the ap-ratio is low. This question calls for further qualitative analysis, which is beyond the scope of this paper7.</Paragraph> </Section> </Section>
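The ap-ratio itself is a one-line proportion; the check below uses the social obligations row of Table 2 (61 pairs at ap-ratio 0.80), which implies roughly 15 partial annotations:

```python
def ap_ratio(n_pairs, n_partial):
    """#ap / (#pa + #ap): fraction of annotated functions covered by full pairs."""
    return n_pairs / (n_partial + n_pairs)

print(round(ap_ratio(61, 15), 2))  # 0.8
```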
<Section position="7" start_page="558" end_page="558" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> In the previous sections, we showed how the taxonomically weighted ktw that we proposed can be more suitable for taxonomies that contain hierarchical structures, like the DIT++ taxonomy. However, there are some specific and general issues that deserve more attention.</Paragraph> <Paragraph position="1"> A question that might be raised in using ktw as opposed to ordinary k is whether the interpretations of k proposed in the literature in terms of reliability are also valid for ktw statistics. This is ultimately an empirical issue, to be decided by which ktw scores researchers find to correspond to fair or near agreement between annotators.</Paragraph> <Paragraph position="2"> Another point of discussion is the arbitrariness of the values of the parameters that can be chosen in d. In this paper we proposed a = 0.75 and b = 0.5. Choosing different values may change the disagreement of two distinct CFs located in the same hierarchy considerably. Still, we think that by interpolating smoothly between the intuitively clear cases at the two extreme ends of the scale, it is possible to choose reasonable values for the parameters that scale well, given the average hierarchy depth.</Paragraph> <Paragraph position="3"> A more general problem, inherent in almost any (dialogue act) annotation activity, is that the possible factors influencing agreement scores are numerous. Starting with the tagset, unclear definitions and vague concepts are a major source of disagreement. Other factors are the quality and extensiveness of the annotation instructions, and the experience of the annotators. These were kept constant throughout the experiment reported in this paper, but clearly the use of more experienced or better trained annotators could have a great influence. Then there is the influence that the use of an annotation tool can have. Does the tool give hints on annotation consistency (e.g. that an ANSWER should be preceded by a QUESTION), does it enforce consistency, or does it not consider annotation consistency at all? Are the possible choices for annotators presented in such a way that each choice is equally visible and accessible? Clearly, when we do not control these factors sufficiently, we run the risk that what we measure does not express what we try to quantify: (dis)agreement among annotators about the description of what happens in a dialogue.</Paragraph> </Section> </Paper>