<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1035"> <Title>Classifying Ellipsis in Dialogue: A Machine Learning Approach</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Corpus Study 2.1 The Corpus </SectionTitle> <Paragraph position="0"> Our corpus-based investigation of bare sluices has been performed using the 10 million word dialogue transcripts of the BNC. The corpus of bare sluices has been constructed using SCoRE (Purver, 2001), a tool that allows one to search the BNC using regular expressions.</Paragraph> <Paragraph position="1"> The dialogue transcripts of the BNC contain 5183 bare sluices (i.e. 5183 sentences consisting of just a wh-word). We distinguish between the following classes of bare sluices: what, who, when, where, why, how and which. Given that only 15 bare which were found, we have also considered sluices of the form which N. Including which N, the corpus contains a total of 5343 sluices, whose distribution is shown in Table 1.</Paragraph> <Paragraph position="2"> The annotation was performed on two different samples of sluices extracted from the total found in the dialogue transcripts of the BNC. The samples were created by arbitrarily selecting 50 sluices of each class (15 in the case of which). The first sample included all instances of bare how and bare which found, making up a total of 365 sluices. The second sample contained 50 instances of the remaining classes, making up a total of 300 sluices.</Paragraph>
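To make the extraction step concrete, here is a minimal sketch of a regular-expression search for bare sluices over plain-text dialogue turns. It is an illustration only: SCoRE is a dedicated tool, and the turn format and tokenisation assumed here are simplifications.

    import re

    # A "bare sluice" turn: just a wh-word, optionally followed by punctuation.
    # Assumes one dialogue turn per string; which N would need a richer pattern.
    BARE_SLUICE = re.compile(
        r"^\s*(what|who|when|where|why|how|which)\s*[?!.]*\s*$", re.IGNORECASE)

    def find_bare_sluices(turns):
        """Return (turn index, wh-word) pairs for turns that are bare sluices."""
        hits = []
        for i, turn in enumerate(turns):
            m = BARE_SLUICE.match(turn)
            if m:
                hits.append((i, m.group(1).lower()))
        return hits

    turns = ["I'm leaving this school.", "When?",
             "Only wanted a couple weeks.", "What?"]
    print(find_bare_sluices(turns))  # [(1, 'when'), (3, 'what')]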
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 The Annotation Procedure </SectionTitle> <Paragraph position="0"> To classify the sluices in the first sample of our sub-corpus we used the categories described below. The classification was done by 3 expert annotators (the authors) independently.</Paragraph> <Paragraph position="1"> Direct The utterer of the sluice understands the antecedent of the sluice without difficulty.</Paragraph> <Paragraph position="2"> The sluice queries for additional information that was explicitly or implicitly quantified away in the previous utterance.</Paragraph> <Paragraph position="3"> (2) Caroline: I'm leaving this school.</Paragraph> <Paragraph position="4"> Lyne: When? [KP3, 538] Reprise The utterer of the sluice cannot understand some aspect of the previous utterance which the previous (or possibly not directly previous) speaker assumed as presupposed (typically a contextual parameter, except for why, where the relevant "parameter" is something like speaker intention or speaker justification). (3) Geoffrey: What a useless fairy he was.</Paragraph> <Paragraph position="5"> Susan: Who? [KCT, 1753] Clarification The sluice is used to ask for clarification about the previous utterance as a whole.</Paragraph> <Paragraph position="6"> (4) June: Only wanted a couple weeks.</Paragraph> <Paragraph position="7"> Ada: What? [KB1, 3312] Unclear It is difficult to understand what content the sluice conveys, possibly because the input is too poor to make a decision as to its resolution, as in the following example: (5) Unknown: <unclear> <pause> Josephine: Why? [KCN, 5007] After annotating the first sample, we decided to add a new category to the above set. The sluices in the second sample were classified according to a set of five categories, including the following: Wh-anaphor The antecedent of the sluice is a wh-phrase.</Paragraph> <Paragraph position="8"> (6) Larna: We're gonna find poison apple and I know where that one is.</Paragraph> <Paragraph position="9"> Charlotte: Where? [KD1, 2371]</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Reliability </SectionTitle> <Paragraph position="0"> To evaluate the reliability of the annotation, we use the kappa coefficient (K) (Carletta, 1996), which measures pairwise agreement between a set of coders making category judgements, correcting for expected chance agreement.2 The agreement on the coding of the first sample of sluices was moderate (K = 52).3 There were important differences amongst sluice classes: the lowest agreement was on the annotation for why (K = 29), what (K = 32) and how (K = 32), which suggests that these categories are highly ambiguous. Examination of the coincidence matrices shows that the largest confusions were between reprise and clarification in the case of what, and between direct and reprise for why and how.</Paragraph> <Paragraph position="1"> On the other hand, the agreement on classifying who was substantially higher (K = 71), with some disagreements between direct and reprise.</Paragraph> <Paragraph position="2"> Agreement on the annotation of the 2nd sample was considerably higher although still not entirely convincing (K = 61). Overall agreement was improved in all classes, except for where and who. Agreement on what improved slightly (K = 39), and it was substantially higher on why (K = 52), when (K = 62) and which N (K = 64).</Paragraph> <Paragraph position="3"> 2 K is computed as (P(A) - P(E)) / (1 - P(E)), where P(A) is the proportion of actual agreements and P(E) is the proportion of expected agreement by chance, which depends on the number and relative frequencies of the categories under test. The denominator is the total proportion less the proportion of chance expectation.</Paragraph> <Paragraph position="4"> 3 All values are shown as percentages.</Paragraph> <Paragraph position="5"> Discussion Although the three coders may be considered experts, their training and familiarity with the data were not equal. This resulted in systematic differences in their annotations. Two of the coders (coder 1 and coder 2) had worked more extensively with the BNC dialogue transcripts and, crucially, with the definition of the categories to be applied. Leaving coder 3 out of the coder pool increases agreement very significantly: K = 70 in the first sample, and K = 71 in the second one. The agreement reached by the more expert pair of coders was high and stable. It provides a solid foundation for the current classification. It also indicates that annotation agreement can be increased by relatively light training of coders.</Paragraph> </Section> </Section>
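For concreteness, a minimal sketch of the kappa computation reported above, reduced to the two-coder case and run on hypothetical labels rather than the actual annotation data:

    from collections import Counter

    def kappa(labels_a, labels_b):
        """Two-coder kappa: K = (P(A) - P(E)) / (1 - P(E))."""
        n = len(labels_a)
        p_agree = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement from each coder's marginal label frequencies.
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
        return (p_agree - p_chance) / (1 - p_chance)

    # Hypothetical codings of six sluices (not the paper's data).
    coder1 = ["direct", "reprise", "reprise", "clarification", "direct", "reprise"]
    coder2 = ["direct", "reprise", "direct", "clarification", "direct", "reprise"]
    print(round(100 * kappa(coder1, coder2)))  # K as a percentage, here 74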
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Results: Distribution Patterns </SectionTitle> <Paragraph position="0"> In this section we report the results obtained from the corpus study described in Section 2.</Paragraph> <Paragraph position="1"> The study shows that the distribution of readings is significantly different for each class of sluice. Subsection 3.2 outlines a possible explanation of this distribution.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Sluice/Interpretation Correlations </SectionTitle> <Paragraph position="0"> The distribution of interpretations for each class of sluice is shown in Table 2. The distributions are presented as percentages of pairwise agreement (i.e. agreement between pairs of coders), leaving aside the unclear cases. This allows us to see the proportion made up by each interpretation for each sluice class, together with any correlations between sluice and interpretation. Distributions are similar over both samples, suggesting that the corpus size is large enough to permit the identification of repeatable patterns.</Paragraph> <Paragraph position="1"> Table 2 reveals interesting correlations between sluice classes and preferred interpretations. The most common interpretation for what is clarification, making up 69% in the first sample and 66% in the second one. Why sluices have a tendency to be direct (57%, 83%). The sluices with the highest probability of being reprise are who (76%, 95%), which (96%), which N (88%, 80%) and where (75%, 69%). On the other hand, when (67%, 65%) and how (87%) have a clear preference for direct interpretations.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Explaining the Frequency Hierarchy </SectionTitle> <Paragraph position="0"> In order to gain a complete perspective on sluice distribution in the BNC, it is appropriate to combine the (averaged) percentages in Table 2 with the absolute number of sluices contained in the BNC (see Table 1), as displayed in Table 3:

    what_cla     2040    whichN_rep    135
    why_dir       775    when_dir       90
    what_rep      670    who_dir        70
    who_rep       410    where_dir      70
    why_rep       345    how_dir        45
    where_rep     250    when_rep       35
    what_dir      240    whichN_dir     24

For instance, although more than 70% of why sluices are direct, the absolute number of why sluices that are reprise exceeds the total number of when sluices by almost 3 to 1. Explicating the distribution in Table 3 is important in order to be able to understand, among other issues, whether we would expect a similar distribution to occur in a Spanish or Mandarin dialogue corpus; similarly, whether one would expect this distribution to be replicated across different domains. Here we restrict ourselves to sketching an explanation of a couple of striking patterns exhibited in Table 3.</Paragraph>
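The "almost 3 to 1" comparison above can be re-derived directly from the counts in Table 3; a minimal sketch (the dictionary simply transcribes the table):

    # Table 3, transcribed: sluice_reading -> estimated count in the BNC.
    table3 = {
        "what_cla": 2040, "what_rep": 670, "what_dir": 240,
        "why_dir": 775, "why_rep": 345, "who_rep": 410, "who_dir": 70,
        "where_rep": 250, "where_dir": 70, "when_dir": 90, "when_rep": 35,
        "whichN_rep": 135, "whichN_dir": 24, "how_dir": 45,
    }

    when_total = table3["when_dir"] + table3["when_rep"]  # 125 when sluices overall
    ratio = table3["why_rep"] / when_total                # ~2.8, i.e. almost 3 to 1
    print(f"why reprise: {table3['why_rep']}, all when: {when_total}, "
          f"ratio: {ratio:.1f}")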
<Paragraph position="1"> One such pattern is the low frequency of when sluices, particularly by comparison with what one might expect to be its close cousin, where; indeed the direct/reprise splits are almost mirror images for when v. where. Another very notable pattern, alluded to above, is the high frequency of why sluices.4 The when v. where contrast provides one argument against (7), which is probably the null hypothesis with respect to the distribution of reprise sluices: (7) Frequency of antecedent hypothesis: The frequency of a class of reprise sluices is directly correlated with the frequency of the class of its possible antecedents.</Paragraph> <Paragraph position="2"> 4 As we pointed out above, sluices are a common means of asking wh-interrogatives; in the case of why-interrogatives, this is even stronger: close to 50% of all such interrogatives in the BNC are sluices.</Paragraph> <Paragraph position="3"> Clearly locative expressions do not outnumber temporal ones, and certainly not by the proportion the data in Table 3 would require to maintain (7).5 Purver (2004) provides additional data related to this: clarification requests of all types in the BNC that pertain to nominal antecedents outnumber such CRs that relate to verbal antecedents by 40:1, which does not correlate with the relative frequency of nominal v. verbal antecedents (about 1.3:1).</Paragraph> <Paragraph position="4"> 5 A rough estimate concerning the BNC can be extracted by counting the words that occur more than 1000 times. Of these, approx. 35k tokens are locative in nature and could serve as antecedents of where; the corresponding number for temporal expressions and when yields approx. 80k tokens. These numbers are derived from a frequency list (Kilgarriff, 1998) of the demographic portion of the BNC.</Paragraph> <Paragraph position="5"> A more refined hypothesis, which at present we can only state quite informally, is (8): (8) Ease of grounding of antecedent hypothesis: The frequency of a class of reprise sluices is directly correlated with the ease with which the class of its possible antecedents can be grounded (in the sense of (Clark, 1996; Traum, 1994)).</Paragraph> <Paragraph position="6"> This latter hypothesis offers a route towards explaining the when v. where contrast. There are at least two factors which make grounding a temporal parameter significantly easier on the whole than grounding a locative parameter. The first factor is that conversationalists typically share a temporal ontology based on a clock and/or calendar. Although well structured locative ontologies do exist (e.g. grid points in a map), they are far less likely to be common currency. The natural ordering of clock/calendar-based ontologies, reflected in grammatical devices such as sequence of tense, is a second factor that favours temporal parameters over locatives.</Paragraph> <Paragraph position="7"> From this perspective, the high frequency of why reprises is not surprising. Such reprises query either the justification for an antecedent assertion or the goal of an antecedent query. Speakers usually do not specify these explicitly. In fact, what requires explanation is why such reprises do not occur even more frequently than they actually do. To account for this, one has to appeal to considerations of the importance of anchoring a contextual parameter.6</Paragraph> <Paragraph position="8"> 6 Another factor is the existence of default strategies for resolving such parameters, e.g. assuming that the question asked transparently expresses the querier's primary goal.</Paragraph> <Paragraph position="9"> A detailed explication of the distribution shown in Table 3 requires a detailed model of dialogue interaction.
We have limited ourselves to suggesting that the distribution can be explicated on the basis of some quite general principles that regulate grounding.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Heuristics for sluice disambiguation </SectionTitle> <Paragraph position="0"> In this section we informally describe a set of heuristics for assigning an interpretation to bare sluices. In subsection 4.2, we show how our heuristics can be formalised as probabilistic sluice typing constraints.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Description of the heuristics </SectionTitle> <Paragraph position="0"> To maximise accuracy, we have restricted ourselves to cases of three-way agreement among the three coders when considering the distribution patterns from which we obtained our heuristics. Looking at these patterns, we have arrived at the following general principles for resolving bare sluice types.</Paragraph> <Paragraph position="1"> What The most likely interpretation is clarification. This seems to be the case when the antecedent utterance is a fragment, or when there is no linguistic antecedent. Reprise interpretations also make up a significant proportion (about 23%). If there is a pronoun (matching the appropriate semantic constraints) in the antecedent utterance, then the preferred interpretation is reprise: (9) Andy: I don't know how to do it.</Paragraph> <Paragraph position="2"> Nick: What? Garlic bread? [KPR, 1763] Why The interpretation of why sluices tends to be direct. However, if the antecedent is a non-declarative utterance, or a negative declarative, the sluice is likely to be a reprise. (10) Vicki: Were you buying this erm newspaper last week by any chance? Frederick: Why? [KC3, 3388] Who Sluices of this form show a very strong preference for reprise interpretations. In the majority of cases, the antecedent is either a proper name (11), or a personal pronoun.</Paragraph> <Paragraph position="3"> (11) Patrick: [...] then I realised that it was Fennite Katherine: Who? [KCV, 4694] Which/Which N Both sorts of sluices exhibit a strong tendency to reprise. In the overwhelming majority of reprise cases for both which and which N, the antecedent is a definite description like 'the button' in (12).</Paragraph> <Paragraph position="4"> (12) Arthur: You press the button.</Paragraph> <Paragraph position="5"> June: Which one? [KSS, 144] Where The most likely interpretation of where sluices is reprise. In about 70% of the reprise cases, the antecedent of the sluice is a deictic locative pronoun like 'there' or 'here'. Direct interpretations are preferred when the antecedent utterance is a declarative with no overt spatial location expression.</Paragraph> <Paragraph position="6"> (13) Pat: You may find something in there actually. Carole: Where? [KBH, 1817] When If the antecedent utterance is a declarative and there is no time-denoting expression other than tense, the sluice will be interpreted as direct, as in example (14). On the other hand, deictic temporal expressions like 'then' trigger reprise interpretations.</Paragraph> <Paragraph position="7"> (14) Caroline: I'm leaving this school.</Paragraph> <Paragraph position="8"> Lyne: When? [KP3, 538] How This class of sluice exhibits a very strong tendency to direct (87%). It appears that most of the antecedent utterances contain an accomplishment verb.</Paragraph> <Paragraph position="9"> (15) Anthony: I've lost the, the whole work itself Arthur: How? [KP1, 631]</Paragraph> </Section>
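Read procedurally, the heuristics above amount to condition-action rules over properties of the antecedent utterance. Below is a minimal sketch of one way to encode them; the predicate names and the flat antecedent representation are illustrative assumptions, not the paper's implementation:

    def classify_sluice(wh, ant):
        """Toy rendering of the Section 4.1 heuristics.

        `ant` describes the antecedent utterance, e.g.
        {"mood": "decl", "polarity": "pos", "has": {"pronoun"}}.
        """
        has = ant.get("has", set())
        if wh == "what":
            return "reprise" if "pronoun" in has else "clarification"
        if wh == "why":
            non_decl = ant["mood"] != "decl"
            return "reprise" if non_decl or ant["polarity"] == "neg" else "direct"
        if wh in ("who", "which", "whichN"):
            return "reprise"
        if wh == "where":
            return "reprise" if "deictic_loc" in has else "direct"
        if wh == "when":
            return "reprise" if "deictic_temp" in has else "direct"
        return "direct"  # how

    # Example (13): "You may find something in there actually." -- "Where?"
    print(classify_sluice("where", {"mood": "decl", "polarity": "pos",
                                    "has": {"deictic_loc"}}))  # reprise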
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Probabilistic Constraints </SectionTitle> <Paragraph position="0"> The problem we are addressing is the typing of bare sluice tokens in dialogue. This problem is analogous to part-of-speech tagging, or to dialogue act classification.</Paragraph> <Paragraph position="1"> We formulate our typing constraints as Horn clauses to achieve the most general and declarative expression of these conditions. The antecedent of a constraint uses predicates corresponding to dialogue relations, syntactic properties, and lexical content. The predicate of the consequent represents a sluice typing tag, which corresponds to a maximal type in the HPSG grammar that we used in implementing our dialogue system. Note that these constraints cannot be formulated at the level of the lexical entries of the wh-words, since these distributions are specific to sluicing and not to non-elliptical wh-interrogatives.7 As a first example, consider the following rule:

    sluice(x), where(x), ant_utt(y,x), contains(y,'there') → reprise(x) [.78]

This rule states that if x is a sluice construction with lexical head where, and its antecedent utterance (identified with the latest move in the dialogue) contains the word 'there', then x is a reprise sluice. Note that, as in a probabilistic context-free grammar (Booth, 1969), the rule is assigned a conditional probability. In the example above, .78 is the probability that the context described in the antecedent of the clause produces the interpretation specified in the consequent.8</Paragraph> <Paragraph position="2"> The following three rules are concerned with the disambiguation of why sluice readings. The structure of the rules is the same as before. In this case, however, the disambiguation is based on syntactic and semantic properties of the antecedent utterance as a whole (like polarity or mood), instead of focusing on a particular lexical item contained in that utterance.

    sluice(x), why(x), ant_utt(y,x), non_decl(y) → reprise(x) [.93]
    sluice(x), why(x), ant_utt(y,x), pos_decl(y) → direct(x) [.95]
    sluice(x), why(x), ant_utt(y,x), neg_decl(y) → reprise(x) [.40]</Paragraph> <Paragraph position="3"> 7 Thus, whereas Table 2 shows that approx. 70% of who-sluices are reprise, this is clearly not the case for non-elliptical who-interrogatives. For instance, the KB7 block in the BNC has 33 non-elliptical who-interrogatives. Of these, at most 3 serve as reprise utterances.</Paragraph> <Paragraph position="4"> 8 These probabilities have been extracted manually from the three-way agreement data.</Paragraph> </Section> </Section>
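A minimal sketch of how such weighted constraints might be applied at runtime, choosing the highest-probability consequent among the rules whose antecedent conditions hold. The encoding of rules and utterances here is an illustrative assumption:

    # Each rule: (sluice class, test on the antecedent utterance y, reading, p).
    RULES = [
        ("where", lambda y: "there" in y["words"], "reprise", 0.78),
        ("why", lambda y: y["mood"] == "non_decl", "reprise", 0.93),
        ("why", lambda y: y["mood"] == "decl" and y["polarity"] == "pos",
         "direct", 0.95),
        ("why", lambda y: y["mood"] == "decl" and y["polarity"] == "neg",
         "reprise", 0.40),
    ]

    def type_sluice(wh, ant_utt):
        """Return the most probable reading licensed by the matching rules."""
        matches = [(p, reading) for s, test, reading, p in RULES
                   if s == wh and test(ant_utt)]
        return max(matches)[1] if matches else None

    ant = {"words": {"you", "may", "find", "something", "in", "there"},
           "mood": "decl", "polarity": "pos"}
    print(type_sluice("where", ant))  # reprise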
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Applying Machine Learning </SectionTitle> <Paragraph position="0"> To evaluate our heuristics, we applied machine learning techniques to our corpus data. Our aim was to evaluate the predictive power of the features observed and to test whether the intuitive constraints formulated in the form of Horn clause rules could be learnt automatically from these features.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 SLIPPER </SectionTitle> <Paragraph position="0"> We use a rule-based learning algorithm called SLIPPER (for Simple Learner with Iterative Pruning to Produce Error Reduction). SLIPPER (Cohen and Singer, 1999) combines the separate-and-conquer approach used by most rule learners with confidence-rated boosting to create a compact rule set.</Paragraph> <Paragraph position="1"> The output of SLIPPER is a weighted rule set, in which each rule is associated with a confidence level. The rule builder is used to find a rule set that separates each class from the remaining classes using growing and pruning techniques. To classify an instance x, one computes the sum of the confidences of the rules that cover x: if the sum is greater than zero, the positive class is predicted. For each class, the only rule with a negative confidence rating is a single default rule, which predicts membership in the remaining classes.</Paragraph> <Paragraph position="2"> We decided to use SLIPPER for two main reasons: (1) it generates transparent, relatively compact rule sets that can provide interesting insights into the data, and (2) its if-then rules closely resemble our Horn clause constraints.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Experimental Setup </SectionTitle> <Paragraph position="0"> To generate the input data we took all three-way agreement instances plus those instances where there is agreement between coder 1 and coder 2, leaving out cases classified as unclear. We reclassified 9 instances in the first sample as wh-anaphor, and also included these data.9 The total data set includes 351 datapoints. These were annotated according to the set of features shown in Table 4:

    sluice      type of sluice
    mood        mood of the antecedent utterance
    polarity    polarity of the antecedent utterance
    frag        whether the antecedent utterance is a fragment
    quant       presence of a quantified expression
    deictic     presence of a deictic pronoun
    proper_n    presence of a proper name
    pro         presence of a pronoun
    def_desc    presence of a definite description
    wh          presence of a wh word
    overt       presence of any other potential antecedent expression

We use a total of 11 features. All features are nominal. Except for the sluice feature, which indicates the sluice type, they are all boolean, i.e. they can take as value either yes or no. The features mood, polarity and frag refer to syntactic and semantic properties of the antecedent utterance as a whole. The remaining features, on the other hand, focus on a particular lexical item or construction contained in that utterance. They take yes as a value if this element or construction exists and matches the semantic restrictions imposed by the sluice type. The feature wh takes a yes value only if there is a wh-word that is identical to the sluice type. Unknown or irrelevant values are indicated by a question mark. This allows us to express, for instance, that the presence of a proper name is irrelevant for determining the interpretation of a where sluice, while it is crucial when the sluice type is who. The feature overt takes no as a value when there is no overt antecedent expression. It takes yes when there is an antecedent expression not captured by any other feature, and it is considered irrelevant (question mark value) when there is an antecedent expression defined by another feature.</Paragraph> <Paragraph position="1"> 9 We reclassified those instances that had motivated the introduction of the wh-anaphor category for the second sample. Given that there were no disagreements involving this category, such reclassification was straightforward.</Paragraph> </Section>
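Putting the Table 4 encoding together with SLIPPER's decision rule from Section 5.1 (sum the confidences of the rules covering an instance; a positive sum predicts the class), a minimal sketch with hypothetical rules and weights, not SLIPPER's actual output:

    # A datapoint in the Table 4 encoding; "?" marks unknown/irrelevant values.
    x = {"sluice": "where", "mood": "decl", "polarity": "pos", "frag": "no",
         "quant": "no", "deictic": "yes", "proper_n": "?", "pro": "no",
         "def_desc": "no", "wh": "no", "overt": "?"}

    # Hypothetical weighted rules for the class reprise. Each rule is a pair
    # (feature conditions, confidence); the unconditional default rule is the
    # one rule with negative confidence, voting for the remaining classes.
    reprise_rules = [
        ({"deictic": "yes"}, +3.32),
        ({"mood": "non_decl", "sluice": "why"}, +1.66),
        ({}, -1.20),
    ]

    def covers(conditions, instance):
        return all(instance.get(f) == v for f, v in conditions.items())

    def predict(rules, instance):
        """Sum confidences of covering rules; positive sum predicts the class."""
        score = sum(conf for conds, conf in rules if covers(conds, instance))
        return score > 0, score

    print(predict(reprise_rules, x))  # (True, ~2.12): x is predicted reprise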
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Accuracy Results </SectionTitle> <Paragraph position="0"> We performed a 10-fold cross-validation on the total data set, obtaining an average success rate of 90.32%. Using leave-one-out cross-validation we obtained an average success rate of 84.05%. For the holdout method, we held out 100 instances as testing data, and used the remainder (251 datapoints) for training. This yielded a success rate of 90%. Recall, precision and f-measure values are reported in Table 5.</Paragraph> <Paragraph position="1"> [Table 5: per-category recall, precision and f-measure]</Paragraph> <Paragraph position="2"> Using the holdout procedure, SLIPPER generated a set of 23 rules: 4 for direct, 13 for reprise, 1 for clarification and 1 for wh-anaphor, plus 4 default rules, one for each class. All features are used except for frag, which indicates that this feature does not play a significant role in determining the correct reading. The following rules are part of the rule set generated by SLIPPER:

    direct, not reprise|clarification|wh_anaphor :- overt=no, polarity=pos (+1.06296)
    reprise, not direct|clarification|wh_anaphor :- deictic=yes (+3.31703)
    reprise, not direct|clarification|wh_anaphor :- mood=non_decl, sluice=why (+1.66429)</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.4 Comparing SLIPPER and TiMBL </SectionTitle> <Paragraph position="0"> Although SLIPPER seems to be especially well suited for the task at hand, we decided to run a different learning algorithm on the same training and testing data sets and compare the results obtained. For this experiment we used TiMBL, a memory-based learning algorithm developed at Tilburg University (Daelemans et al., 2003). As with all memory-based machine learners, TiMBL stores representations of instances from the training set explicitly in memory. In the prediction phase, the similarity between a new test instance and all examples in memory is computed using some distance metric. The system assigns the most frequent category within the set of most similar examples (the k-nearest neighbours). As a distance metric we used information-gain feature weighting, which weights each feature according to the amount of information it contributes to the correct class label.</Paragraph> <Paragraph position="1"> The results obtained are very similar to the previous ones. TiMBL yields a success rate of 89%. Recall, precision and f-measure values are shown in Table 6. As expected, the feature that received the lowest weighting was frag.</Paragraph> <Paragraph position="2"> [Table 6: per-category recall, precision and f-measure]</Paragraph> </Section> </Section> </Paper>
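Finally, a minimal sketch of the memory-based approach described in Section 5.4: overlap distance with information-gain feature weighting and a k-nearest-neighbour vote. This is a toy stand-in for TiMBL with made-up data, not the system itself:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(train, labels, feat):
        """How much knowing `feat` reduces label entropy on the training set."""
        values = Counter(x[feat] for x in train)
        remainder = sum(
            cnt / len(train) *
            entropy([l for x, l in zip(train, labels) if x[feat] == v])
            for v, cnt in values.items())
        return entropy(labels) - remainder

    def knn_predict(train, labels, weights, x, k=3):
        """IG-weighted overlap distance; majority vote among the k nearest."""
        def dist(a):
            return sum(w for f, w in weights.items() if a[f] != x[f])
        ranked = sorted(zip(train, labels), key=lambda tl: dist(tl[0]))
        return Counter(l for _, l in ranked[:k]).most_common(1)[0][0]

    # Toy memory over two Table-4-style features (illustrative only).
    train = [{"deictic": "yes", "mood": "decl"}, {"deictic": "no", "mood": "decl"},
             {"deictic": "yes", "mood": "non_decl"}, {"deictic": "no", "mood": "decl"}]
    labels = ["reprise", "direct", "reprise", "direct"]
    weights = {f: info_gain(train, labels, f) for f in ("deictic", "mood")}
    print(knn_predict(train, labels, weights, {"deictic": "yes", "mood": "decl"}))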