File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/w04-2313_metho.xml
Size: 31,853 bytes
Last Modified: 2025-10-06 14:09:21
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2313"> <Title>Towards Automatic Identification of Discourse Markers in Dialogs: The Case of Like</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Sometimes, the POS tagging of a whole utterance can be ru- </SectionTitle> <Paragraph position="0"> ined by an incorrect tagging of the DM (cf. section 7), not to mention its parsing.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Case of Like </SectionTitle> <Paragraph position="0"> The discourse marker like is probably one of the most difficult to detect automatically because of the large number of functions of the word like. Apart from a DM, like can be used as a preposition, as in example (1) below, an adjective (2), a conjunction (3), an adverb (4), a noun (5) and a verb (6)5: 1. He was like a son to me.</Paragraph> <Paragraph position="1"> 2. Cooking, ironing and like chores.</Paragraph> <Paragraph position="2"> 3. Nobody can sing that song like he did.</Paragraph> <Paragraph position="3"> 4. It's nothing like as nice as their previous house! 5. Scenes of unrest the like(s) of which had never been seen before in the city.</Paragraph> <Paragraph position="4"> 6. I like chocolate very much.</Paragraph> <Paragraph position="5"> The DM like is sometimes analyzed simply as a &quot;filler&quot;, a hesitation word like uhmm that has no contribution to the meaning of an utterance6. However, other studies have shown that like has a much more complex role in dialogue. At a general level, like can be described as a &quot;loose talk&quot; marker (Andersen 2001). The function of like is to make explicit to the hearer that what follows the marker (for instance a noun phrase) is in fact a loose interpretation of the speaker's belief. Consider the following examples from the ICSI corpus: 1. It took like twenty minutes.</Paragraph> <Paragraph position="6"> 2. They had little carvings of like dead people on the walls or something.</Paragraph> <Paragraph position="7"> In the first example, by using like, the speaker intends to communicate that the duration mentioned is an approximation. In the second example, the approximation concerns the expression that was used (&quot;dead people&quot;). By using like, the speaker informs the audience that this term doesn't exactly match what she has in mind. But like as a DM has also other functions, for example introducing a quotation (reported speech) and serving as a discourse link introducing a correction or a reformulation7. We will not elaborate on these functions, since the remainder of this paper will be dedicated to the identification of DM like, regardless of its precise functions.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Disambiguation of Like by Humans </SectionTitle> <Paragraph position="0"> Before trying to extract automatically the pragmatic occurrences of like, we have designed two experiments involving human judges. These preliminary experiments are useful indicators of the difficulty of this task, and the human scores will be used to assess more accurately the scores obtained by automatic methods systems.</Paragraph> <Paragraph position="1"> 5 Adapted from the Dictionnaire Hachette Oxford. 
Oxford: OUP, 1994, 1943p.</Paragraph> <Paragraph position="2"> 6 See for instance the Collins Cobuild English Language Dictionary (1987: 842).</Paragraph> <Paragraph position="3"> 7 For a detailed analysis of like, see Andersen (2001).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Description of the Experiments </SectionTitle> <Paragraph position="0"> In the first experiment, human judges used only the written transcription of utterances containing like. In the second experiment, we explored the possibility to improve the level of inter-annotator agreement by using prosodic information: the human judges were also able to listen to the meeting recordings.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 First Experiment: Annotation Based on Writ- ten Transcription Only </SectionTitle> <Paragraph position="0"> The first experiment involved 6 human judges, 3 men and 3 women whose age ranged from 25 to 40. They were divided in two groups of equal size: one of native English speakers, and one of French speakers with a very good knowledge of English.</Paragraph> <Paragraph position="1"> Every judge was asked to annotate a number of utterances containing like, taken from two different sources: 26 occurrences came from the transcription of movie dialogs (from Pretty Woman) and 49 occurrences corresponded to one ICSI-MR meeting.</Paragraph> <Paragraph position="2"> The participants were asked to decide for every occurrence of like whether it represented a DM or not. They were also asked to specify their degree of certainty on a three-point scale (1 = certain, 2 = reasonably sure, 3 = hesitating). Answers were simply written on paper.</Paragraph> <Paragraph position="3"> At the beginning, participants received written indications concerning the role of like as a DM as well as examples of pragmatic and non-pragmatic uses.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Second Experiment: Use of Prosodic Cues </SectionTitle> <Paragraph position="0"> In the second experiment, a group of 3 judges (2 French speakers and 1 English speaker) were asked to perform the same type of task, but in addition to the written transcription, they were also allowed to listen to the recording of the meeting when needed. This second experiment did not include dialogs from a movie but only from a one-hour ICSI-MR meeting, containing 55 occurrences of like8. The participants received the same set of instructions as in the first experiment, and in addition some explanation about the prosody of like as a DM. No time constraints were imposed, so the subjects could listen to the recording as many times as needed.</Paragraph> <Paragraph position="1"> On average, they completed the task in a half an hour.</Paragraph> <Paragraph position="2"> Access to the recording was provided through a hyper-text transcript synchronized to the sound file at the utterance level (a multimedia solution developed for the IM2 project).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Results and Discussion </SectionTitle> <Paragraph position="0"> Results show that annotating DMs is a difficult task even for human judges. 
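The agreement figures reported below are kappa values computed over the judges' binary DM/non-DM decisions. As a purely illustrative sketch (the labels are invented, and the paper's multi-judge figures may rest on a generalized coefficient rather than the pairwise Cohen's kappa shown here), such a coefficient can be computed as follows:

# Illustrative sketch only: pairwise Cohen's kappa over invented DM / non-DM labels.
def cohen_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement from each judge's marginal label distribution
    chance = 0.0
    for category in set(labels_a) | set(labels_b):
        chance += (labels_a.count(category) / n) * (labels_b.count(category) / n)
    return (observed - chance) / (1 - chance)

judge_1 = ["DM", "DM", "other", "DM", "other", "other", "DM", "other"]
judge_2 = ["DM", "other", "other", "DM", "other", "DM", "DM", "other"]
print(round(cohen_kappa(judge_1, judge_2), 2))   # 0.5 on these invented labels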
In the first experiment, the level of inter-annotator agreement measured by the Kappa coefficient is quite low (k = 0.40) for the natural dialogs of the ICSI-MR corpus, and average for the movie transcription (k = 0.65)9.</Paragraph> <Paragraph position="1"> In the second experiment, with the help of prosodic cues, inter-annotator agreement increases, and the annotation becomes much more reliable (k = 0.74). Therefore, the identification of DM like is an empirically valid task, which can be accomplished at a reasonable performance level by untrained annotators. 8 Two of the participants had already participated in the first experiment, but the meeting was not the one used in the previous experiment.</Paragraph> <Paragraph position="2"> However, access to the prosodic information (from recordings) appears to be required. The inter-annotator agreement scores also set an initial boundary on automatic performances, which should not be expected to reach much higher levels. These results should be confirmed by experiments on longer transcripts, also involving annotators with specific training for DMs.</Paragraph> <Paragraph position="3"> The results obtained in these experiments shed an interesting empirical light on a number of predictions that were made before the experiments.</Paragraph> <Paragraph position="4"> First, it appears that DMs are easier to annotate in pre-planned dialogs, because such dialogs are less ambiguous than natural ones. Indeed, the level of agreement reached for the movie transcription is much higher than for the ICSI-MR meeting in the same conditions (0.65 vs. 0.42). This result confirms that even if movie dialogs are made to reproduce the naturalness of naturally occurring dialogs, they are never as ambiguous, mainly because they only reflect the global communicative intention of one person (the author).</Paragraph> <Paragraph position="5"> The second hypothesis we tested concerned the difference between native and non-native speakers' ability to annotate DMs. We believed that the group of native English speakers would have a better level of agreement. This prediction has not been confirmed: the group of non-native English speakers obtained nearly the same level of agreement as the native English speakers, for both types of corpora: k = 0.67 vs. k = 0.63 for the movie transcription and k = 0.4 vs. k = 0.43 for the meeting corpus. So it seems that non-native English speakers with a very good command of English are just as reliable as native English speakers at annotating DMs.</Paragraph> <Paragraph position="6"> The third prediction we tested concerned a possible correlation between the annotators' degree of certainty and the level of agreement. We have not been able to find any significant correlation for either type of corpus, in either experiment. Thus, the capacity of human judges to evaluate their own intuition doesn't seem to be very high for this task. However, it should be mentioned that in general, the subjects were much more confident in the second experiment, when they were able to use prosodic cues. The percentage of answers given with maximal certainty by the two annotators who took part in both experiments grew from 45% to 60% and from 65% to 87% respectively.</Paragraph> <Paragraph position="7"> 9 We use Krippendorff's scale to assess intercoder agreement. This scale discounts any result with k < 0.67, allows tentative conclusions when 0.67 < k < 0.8 and definite conclusions when k >= 0.8.</Paragraph> <Paragraph position="8"> When looking more closely at the utterances upon which annotators do not agree, we can see that some types of occurrences of like seem to be much more difficult to annotate in both experiments. In most of these cases, like had the function of a preposition. For example, one subject mistakenly annotated all occurrences of the type sounds like, seems like, feels like as DMs. This observation is not so surprising if we bear in mind that the pragmatic uses of like seem to have emerged (historically) in a grammaticalization process.</Paragraph> <Paragraph position="9"> According to Andersen (2001, p. 294): &quot;the fundamental assumption here is that the pragmatic marker like originates in a lexical item, that is, a preposition with the inherent meaning 'similar to'&quot;. This suggests that more detailed explanations regarding the role of the DM like, as well as some more training, would probably improve the reliability of annotation.</Paragraph> <Paragraph position="10"> To sum up, these two experiments have enabled us to quantify the level of agreement between human annotators and to confirm the usefulness of prosodic cues in order to efficiently detect the DM like.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Automatic Detection of Like as a DM </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 A Priori Cues </SectionTitle> <Paragraph position="0"> We have defined three linguistic criteria to be used for the disambiguation of DMs in general, which we will apply to the disambiguation of like in section 6 below.</Paragraph> <Paragraph position="1"> The first criterion is the presence of collocations.</Paragraph> <Paragraph position="2"> For instance, when well is used to mark a change of topic, it is nearly always used in a cluster of markers such as: well you know, well now, well I think or oh well. On the contrary, when used to close a topic, well can very often be found in clusters like OK well or well anyway/anyhow. The criterion of collocations can also be applied the other way round, to establish cases where a given element cannot be a DM: for instance, when like is used in collocations such as I/you like, seems/feels like, just like, or when well is used in constructions like very well, as well, quite well, etc.</Paragraph> <Paragraph position="3"> The second criterion is the position in the utterance.</Paragraph> <Paragraph position="4"> Again, depending on the word, this criterion can be used to ascertain that an element is a DM or, on the contrary, to rule out this possibility. For instance, well as a DM is nearly always placed at the beginning of an utterance or, at least, at the beginning of a prosodic unit. In other cases, the use of this criterion implies that to be a DM, an element must not commence the utterance. According to Aijmer (2002, p. 30): &quot;Some of the discourse particles [...] (actually, sort of) can, for instance, be inserted parenthetically or finally, often with little difference in meaning, after a sentence, clause, turn, tone unit as a post-end field constituent.&quot; The third criterion is prosody. According to Schiffrin (1987, p.
328) &quot;[a discourse particle] has to have a range of prosodic contours e.g. tonic stress and followed by a pause, phonological reduction&quot;.</Paragraph> <Paragraph position="5"> However, even though these three criteria can help a human annotator to extract DMs successfully most of the time, some rare occurrences remain ambiguous.</Paragraph> <Paragraph position="6"> Some occurrences are at the boundary between a pragmatic and a non-pragmatic use. In these rare cases, both interpretations remain equally possible.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Application of A Priori Cues to NLP </SectionTitle> <Paragraph position="0"> Some of the criteria we propose seem relatively easy to automate. For instance, it is rather easy to extract a set of collocations once a list is made. Although some collocations imply the presence of a DM, and some others its absence, in some cases this criterion is in fact much more efficient in its second form, to rule out the presence of a DM. It is also rather easy to automate the criterion involving a certain position in the utterance, especially when the position is strongly constrained (for instance, at the beginning or end of the utterance). As far as prosody is concerned, the detection of pitch variations (for instance amounting to a correct transcription of commas) seems feasible for good quality recordings.</Paragraph> <Paragraph position="1"> However, used independently from the others, none of these criteria can suffice to completely automate the extraction of DMs, even though in some cases a single criterion can be enough to get good results. For example, in the case of well, the position in the utterance can often be sufficient to correctly extract a significant proportion of all occurrences. Nevertheless, it will not solve all occurrences, since well is not always used at the beginning of an utterance but also at the beginning of a prosodic phrase, as in: &quot;And I said, well I have to think about it&quot;. In these cases, the use of prosody to detect prosodic phrases becomes necessary. Similarly, the exclusion of some collocations like very well, as well, etc. is necessary to solve the last problematic cases.</Paragraph> <Paragraph position="2"> In sum, these criteria seem to be sufficient to partially automate the disambiguation of DMs, which could serve to reduce the burden of human annotators.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Evaluation of NLP Performance </SectionTitle> <Paragraph position="0"> The evaluation of DM detection requires a &quot;gold standard&quot; (correct annotation) and the implementation of comparison metrics. The correct annotation of DMs was discussed in the experiments above, in the case of like, a highly versatile marker. In order to have enough data for our NLP experiment, one of the authors manually annotated all occurrences of like in 50 one-hour dialogs from the ICSI-MR corpus, yielding 2,116 occurrences of like, of which 792 are DMs. About 20 occurrences of like could not be reliably disambiguated and were removed from the reference annotation.</Paragraph> <Paragraph position="1"> We have already compared the annotations produced by human judges using the kappa metric. This metric can also be used to score the performance of a system at distinguishing pragmatic from non-pragmatic uses.
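As a concrete reference for the scores reported below, the following sketch (with invented labels, not the paper's data) shows how accuracy, precision, recall and the baseline precision reduce to simple counts over a gold-standard annotation:

# Illustrative sketch only: scoring a hypothetical DM detector against invented gold labels.
gold   = ["DM", "other", "DM", "other", "other", "DM", "other", "DM"]
system = ["DM", "DM",    "DM", "other", "other", "other", "other", "DM"]

tp = sum(g == "DM" and s == "DM" for g, s in zip(gold, system))
fp = sum(g != "DM" and s == "DM" for g, s in zip(gold, system))
fn = sum(g == "DM" and s != "DM" for g, s in zip(gold, system))

accuracy = sum(g == s for g, s in zip(gold, system)) / len(gold)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
baseline_precision = gold.count("DM") / len(gold)   # frequency of the DM use
print(accuracy, precision, recall, baseline_precision)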
Note that kappa compensates the scores by taking into account the probability of agreement by chance. A simpler but useful metric is the percentage of occurrences correctly identified, or accuracy. Unlike kappa, accuracy does not factor out agreement by chance, but provides a more interpretable score.10 Furthermore, if the task to be evaluated is the retrieval of pragmatic uses among all uses of the lexical item (which are trivial to detect), then recall and precision are also relevant. For instance, to evaluate techniques that filter out non-DMs, we will require them to reach nearly 100% recall, and a reasonable precision, say more than 0.6 or 0.7 for like, i.e. twice the baseline precision, which is the frequency of the DM use.</Paragraph> <Paragraph position="2"> 6 Filters for the Disambiguation of Like We first explore the possibility of using a list of collocations in order to identify occurrences of like as a DM in two different corpora, ICSI-MR and a transcription of Switchboard telephone conversations. The best use of this criterion is to maximize precision while keeping recall as close as possible to 100%, i.e. to rule out a maximal number of occurrences that are not pragmatic while keeping all the pragmatic ones. Such a partial identification can be used as a filter to reduce the number of occurrences that must be processed manually.</Paragraph> <Paragraph position="3"> The list of collocations that exclude the presence of a DM contains for example collocations such as: something like that, I like, looks like, etc. The full list contains 26 collocations and was tested on two different corpora: first, on a subpart of the ICSI-MR corpus, with 6 hours of recording, and approximately 60,000 words; then on the Switchboard data, transcribed and annotated with DMs (Meteer 1995), with ca. 2,500 conversations and about 3 million words.</Paragraph> <Paragraph position="4"> Our method reaches 0.75 precision with 100% recall on the ICSI-MR corpus, and 0.44 precision with 0.99 recall on Switchboard. The main goal of the filter is thus achieved: recall remains very high on both corpora. A precision of 0.75 for ICSI-MR means that a significant number of occurrences are correctly ruled out - the initial proportion of pragmatic uses is about 1/3, while after the application of the filter it reaches 3/4, and none of the pragmatic uses was missed in the process.</Paragraph> <Paragraph position="5"> 10 Note that the probability of agreement by chance is here close to 0.5, given that 20-40% of the occurrences of like are DMs. When the proportion of DM occurrences is a, the probability of agreement by chance is (a^2 + (1 - a)^2), hence 0.68 for 20% and 0.52 for 40%.</Paragraph> <Paragraph position="6"> The efficiency of the filter is smaller on the Switchboard data (0.44 precision vs. 0.75 for ICSI). In the ICSI-MR corpus, the precision obtained is probably the highest possible one with this filter, since the corpus was used as a development corpus, from which we have extracted our set of collocations. On the other hand, in the Switchboard corpus, the lower precision might also be due to inconsistent annotation. Indeed, we used the annotation of DMs that was already present in Switchboard, and this annotation is not entirely reliable.</Paragraph> <Paragraph position="7"> In fact, no real theoretical assumptions seem to underlie this annotation and according to Meteer (1995) the criterion to decide if an ambiguous case was a DM was &quot;[...]
if the speaker is a heavy discourse like user, count ambiguous cases as discourse markers, if not, assume they are not.&quot; In such circumstances, we can expect that the low precision of our system on Switchboard can at least be partly attributed to this lack of reliability.</Paragraph> <Paragraph position="8"> Finally, our system has performed the same task as the human judges in the first experiment (see section 4) on 49 occurrences of like in one ICSI-MR meeting. Interestingly, if we compare the average kappa obtained between humans and the kappa obtained between the system and all human judges, we get the same value (k = 0.42). Even though the results obtained by this preliminary system are quite tentative, this comparison with human judges seems to indicate that the performance is quite acceptable.</Paragraph> <Paragraph position="9"> 7 Use of a Part-of-speech Tagger The use of a POS tagger for disambiguating pragmatic vs. non-pragmatic uses of like is a straightforward idea.</Paragraph> <Paragraph position="10"> Indeed, if the accuracy of the taggers on colloquial speech transcripts were very high, this would help filter out many (if not all) of the non-pragmatic uses, such as cases when like is simply a verb.</Paragraph> <Paragraph position="11"> We experimented using QTag, a freely available probabilistic POS tagger for English (Mason 2000)11.</Paragraph> <Paragraph position="12"> The tagger assigns one of the following tags to occurrences of like: preposition (IN, 1,412 occurrences), verb (VB, 509), subordinating conjunction (CS, 134), general adjective (JJ, 52), and general adverb (RB, 9).</Paragraph> <Paragraph position="13"> These tags must then be interpreted in terms of DM uses. A simple attempt is to use the tagger as a filter, to remove verbal occurrences. Hence, a VB tag is interpreted as non-DM, and all the other tags as (possible) DMs. Unfortunately, the evaluation shows that such a filter is unreliable: recall is 0.77, precision is 0.38, accuracy 44%, and kappa is only 0.02, i.e. near random correlation. 11 QTag uses a variant of the Brown/UPenn tagsets, and was trained on a million-word subset of the BNC (written material): http://web.bham.ac.uk/o.mason/software/tagger/.</Paragraph> <Paragraph position="14"> As expected, other interpretations of the tags do not lead to better overall results. The most significant figures are obtained when selecting only adjectival uses of like (tagged JJ) as potential DMs: the recall is of course very low, but precision is 0.74, which means that the JJ tag could be used as a cue for the presence of a DM use.</Paragraph> <Paragraph position="15"> The main reason that explains the failure of the tagger to detect DM uses of like is that it was not trained on speech transcription, where like is quite frequent. A tagger trained on speech (supposing annotated data is available) could use some punctuation from the transcription to improve its accuracy, such as marks for interruptions and pauses that sometimes appear around DM uses of like. This could help it to avoid marking some of those occurrences as VB.
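Both the collocation lists of section 6 and the tag-based filtering above amount to simple pattern tests around each token of like. A minimal sketch of such an exclusion filter is given below; the pattern list is a small invented subset rather than the 26 collocations actually used, and a real implementation would test each occurrence of like rather than whole utterances:

# Illustrative sketch only: ruling out non-DM uses of "like" with a few collocation patterns.
import re

NON_DM_PATTERNS = [
    r"\bsomething like that\b", r"\bI like\b", r"\byou like\b", r"\bjust like\b",
    r"\blooks? like\b", r"\bsounds? like\b", r"\bseems? like\b", r"\bfeels? like\b",
]

def candidate_dm(utterance):
    """Return True if no ruling-out collocation is found around 'like'."""
    return not any(re.search(p, utterance, re.IGNORECASE) for p in NON_DM_PATTERNS)

for u in ["It took like twenty minutes.", "It looks like rain.", "I like chocolate."]:
    print(u, "->", "possible DM" if candidate_dm(u) else "ruled out")

The point of such a filter, as discussed in section 6, is to keep recall at or very near 100% while raising precision above the baseline frequency of the DM use.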
A study by Heeman, Byron and Allen (1997) has shown that when specific tags are assigned to DMs and the tagging is done in the process of speech recognition, both the quality of tagging and the correct identification of DMs are significantly improved.</Paragraph> </Section> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 8 Statistical Training of DM Classifiers </SectionTitle> <Paragraph position="0"> The relevance of machine learning techniques to detect DMs and to improve manually-derived classification models has already been emphasized by Litman (1996).</Paragraph> <Paragraph position="1"> We have conducted machine learning experiments with the 2,116-occurrence data set, and confirmed the relevance of the filters defined in section 6 above, as well as the role of several additional features. The results obtained with like are also compared, at the end of this section, with an analysis of well as a DM.</Paragraph> <Paragraph position="2"> 8.1 Features for the Classification of Like For each occurrence of like, we extracted the following features that we thought relevant to the DM/non-DM classification problem:
* presence of a collocation that rules out the occurrence as a DM; since like can be either the first word or the second word in the collocation, we separated this into two features;
* duration of the spoken word like, computed from the timing provided with the ICSI-MR transcriptions, which was generated automatically;
* duration of the pause before like: 0 or more, or -1 if the utterance begins with like (the segmentation into prosodic utterances was also provided with the transcription);
* duration of the pause after like: 0 or more, or -1 if the utterance ends with like.</Paragraph> <Paragraph position="3"> In order to classify each of the occurrences of like as either a DM or a non-DM, we used decision trees as provided with the machine learning toolkit WEKA (Witten and Frank 2000)12. Since not all the features are discrete, we used the C4.5 decision tree learner (Quinlan 1993), or J48 in WEKA. For testing, we experimented both with separate training and test sets derived from the data (e.g. 1,500 vs. 616 instances) and with 10-fold cross-validation of classifiers as provided by WEKA. Results being similar, we report the latter scores below.</Paragraph> </Section> <Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 8.2 Results for the Classification of Like </SectionTitle> <Paragraph position="0"> The best performance obtained by a C4.5 classifier is 0.95 recall and 0.68 precision for the DM occurrences, corresponding to 81% correctly classified instances and a kappa of 0.63. This is a significant performance, but it appears to be in the same range as the filter-based method (tested only on a smaller data set). And indeed, the classifier tree (see Figure 1 in the Appendix) exhibits as its first nodes the two classes of collocation filters defined a priori in section 6. This is strong empirical confirmation of the relevance of these filters. Note that this criterion has not been used by Litman (1996), who focuses on a much more detailed analysis of the prosody along with some textual features.</Paragraph> <Paragraph position="1"> Moreover, the next feature in the tree is the duration of the pause before like ('pause_avant'): it appears that a relatively long pause before like (greater than 240 ms) characterizes a DM in most remaining cases (70 out of 78).
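The experiments reported here use WEKA's J48 implementation of C4.5. As a rough, hedged analog, the sketch below trains a CART-style decision tree from scikit-learn on the four kinds of features just listed; all feature values and labels are invented rather than taken from the ICSI-MR data:

# Illustrative sketch only: a decision tree over collocation-filter flags and durations.
# scikit-learn's CART tree stands in here for WEKA's J48/C4.5.
from sklearn.tree import DecisionTreeClassifier

# One row per occurrence of "like":
# [filter_like_first_word, filter_like_second_word, word_duration_s, pause_before_s, pause_after_s]
# A pause of -1 means the utterance begins (or ends) with "like".
X = [
    [0, 0, 0.25, 0.30, 0.10],   # long pause before "like"
    [1, 0, 0.10, 0.00, 0.00],   # e.g. "I like ..." -> ruled out by a collocation
    [0, 1, 0.12, 0.05, 0.00],   # e.g. "... looks like" -> ruled out by a collocation
    [0, 0, 0.30, -1.0, 0.20],   # utterance-initial "like"
    [0, 0, 0.11, 0.00, 0.05],
]
y = ["DM", "non-DM", "non-DM", "DM", "non-DM"]

tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=1).fit(X, y)
print(tree.predict([[0, 0, 0.28, 0.40, 0.15]]))

In the experiments reported below, such a classifier is evaluated with 10-fold cross-validation rather than on its own training data.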
This long-pause pattern matches our intuitions about the prosodic behaviour of like as a DM. The next features in the tree have quite a low precision, and may not generalize to other corpora. Tentatively, it appears that a very short like (shorter than 120 ms) is not a DM.</Paragraph> <Paragraph position="2"> The best classifier tends to show that, apart from the collocation filters, the other features do not play an important role. A classifier based only on the collocation filters achieves 0.96 recall and 0.67 precision for DM identification (80% correctly classified instances and k = 0.62), which is only slightly below the best classifier. Does this mean that the time-based features are totally irrelevant? An experiment without the two collocation filters shows that temporal features are relevant: the best classifier achieves 67% correct classification, with k = 0.23, that is, somewhat above chance. Again, among the first nodes of the tree are the interval before like and its duration (Figure 2 in the Appendix). Also, a pause after like seems to signal a DM. Temporal features are therefore relevant to DM detection, but they are in reality correlated with collocation-based features, which supersede them when they can be detected.</Paragraph> <Paragraph position="3"> The conclusions of this experiment with like are that the simple features designed until now, though particularly efficient given their simplicity, do not allow for more than 70% precision (at 100% recall) for the detection of like as a DM. Time-based features do not outperform collocation-based filters - though the former could generalize better to other DMs. This result is also particularly interesting considering the fact that human annotators performed significantly better when allowed to use sound files. The results suggest that prosodic features other than duration are relevant for the disambiguation of like. Further work on the prosody of like (e.g. pitch) should enable us to refine this criterion.</Paragraph> <Paragraph position="4"> 12 The Waikato Environment for Knowledge Analysis (WEKA) is made available by Ian H. Witten and Eibe Frank at http://www.cs.waikato.ac.nz/ml/weka.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 8.3 The Classification of Well </SectionTitle> <Paragraph position="0"> Using a similar procedure, we have applied C4.5 classification to the detection of well as a DM. On the same dialogs as above, we annotated the occurrences of well as a DM (579) among all occurrences of well (873).</Paragraph> <Paragraph position="1"> About 66% of all occurrences are DMs, which gives a baseline classification score (all occurrences considered to be DMs).</Paragraph> <Paragraph position="2"> The features defined for well are similar to those used for like: collocation-based filters (with different content) and time-based features. In addition, we defined a collocation-based feature that is supposed to ascertain the presence of a DM, namely collocations such as oh well or OK well. We also consider the occurrence of well at the end of an interrupted or abandoned utterance (ending on transcriptions by '= =='), a feature we hypothesize to indicate a DM.</Paragraph> <Paragraph position="3"> The highest accuracy, 91% and k = 0.8, is obtained by a classifier combining the collocation filters and the duration of the pause after well (cf.
Figure 3 in the Appendix). This corresponds to 91% precision and 97% recall for the detection of DMs.</Paragraph> <Paragraph position="4"> The use of the collocation-based filter alone - the one that rules out DM occurrences based on the previous word, e.g. as well - yields only slightly lower performance (90% with k = 0.79). Again, this does not mean that all the other features are irrelevant. Indeed, the time-based filter based on the duration of the pause after well, which includes the detection of well at the end of completed or interrupted utterances, produces a classification accuracy of 75% (and a low kappa, 0.45), with 77% precision and 96% recall on the identification of DMs only.</Paragraph> <Paragraph position="5"> These results suggest that time-based features could generalize to a whole class of DMs, but for individual DMs, such features are outperformed by collocation filters based on patterns of occurrences. The definition of collocation filters for a set of DMs seems feasible, albeit somewhat tedious.</Paragraph> </Section> </Section> </Paper>