<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3407">
  <Title>Topic Segmentation of Dialogue</Title>
  <Section position="4" start_page="42" end_page="42" type="metho">
    <SectionTitle>
3 Previous Work
</SectionTitle>
    <Paragraph position="0"> Existing topic segmentation approaches can be loosely classified into two types: (1) lexical cohesion models, and (2) content-oriented models. The underlying assumption in lexical cohesion models is that a shift in term distribution signals a shift in topic (Halliday and Hasan, 1976). The best-known algorithm based on this idea is TextTiling (Hearst, 1997). In TextTiling, a sliding window is passed over the vector-space representation of the text. At each position, the cosine correlation between the upper and lower regions of the sliding window is compared with the peak cosine correlation values to the left and right of the window. A segment boundary is predicted when the magnitude of the difference exceeds a threshold.</Paragraph>
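The sliding-window comparison can be sketched as follows (window size, threshold, and whitespace tokenization are simplified illustrative assumptions, not Hearst's exact parameters):

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine correlation between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def texttiling_scores(sentences: list[str], w: int = 2) -> list[float]:
    """Gap score at each position: cosine of the w units before vs. after the gap."""
    toks = [Counter(s.lower().split()) for s in sentences]
    scores = []
    for i in range(1, len(toks)):
        left = sum((c for c in toks[max(0, i - w):i]), Counter())
        right = sum((c for c in toks[i:i + w]), Counter())
        scores.append(cosine(left, right))
    return scores

def boundaries(scores: list[float], threshold: float = 0.1) -> list[int]:
    """Predict a boundary where the drop from the neighboring peaks exceeds a threshold."""
    out = []
    for i, s in enumerate(scores):
        lpeak = max(scores[:i + 1])   # highest score at or left of this gap
        rpeak = max(scores[i:])       # highest score at or right of this gap
        if (lpeak - s) + (rpeak - s) > threshold:
            out.append(i + 1)         # boundary before unit i+1
    return out
```

A dip in the gap-score sequence relative to the surrounding peaks, rather than a low absolute score, is what triggers a boundary prediction.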
    <Paragraph position="1"> One drawback to relying on term co-occurrence to signal topic continuity is that synonyms or related terms are treated as thematically-unrelated.</Paragraph>
    <Paragraph position="2"> One proposed solution to this problem is Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997). Two LSA-based algorithms for segmentation are described in (Foltz, 1998) and (Olney and Cai, 2005). Foltz's approach differs from TextTiling mainly in its use of an LSA-based vector space model. Olney and Cai address a problem not addressed by TextTiling or Foltz's approach, which is that cohesion is not just a function of the repetition of thematically-related terms, but also a function of the presentation of new information in reference to information already presented. Their orthonormal basis approach allows for segmentation based on relevance and informativity.</Paragraph>
    <Paragraph position="3"> Content-oriented models, such as (Barzilay and Lee, 2004), rely on the recurrence of patterns of topics over multiple realizations of thematically similar discourses, such as a series of newspaper articles about similar events. Their approach utilizes a hidden Markov model where states correspond to topics and state transition probabilities correspond to topic shifts. To obtain the desired number of topics (states), text spans of uniform length (individual contributions, in our case) are clustered. Then, state emission probabilities are induced using smoothed cluster-specific language models. Transition probabilities are induced by considering the proportion of documents in which a contribution assigned to the source cluster (state) immediately precedes a contribution assigned to the target cluster (state). Following an EM-like approach, contributions are reassigned to states until the algorithm converges.</Paragraph>
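The transition-probability induction described above can be sketched as follows (the add-delta smoothing constant, function names, and data layout are illustrative assumptions):

```python
def transition_probs(docs: list[list[int]], n_states: int, delta: float = 0.1) -> list[list[float]]:
    """Estimate P(target | source) from the proportion of documents in which a
    span assigned to the source cluster immediately precedes a span assigned
    to the target cluster, with add-delta smoothing."""
    counts = [[0.0] * n_states for _ in range(n_states)]
    for labels in docs:                              # one cluster-label sequence per document
        # document-level indicator: count each (source, target) pair once per document
        for src, tgt in {(s, t) for s, t in zip(labels, labels[1:])}:
            counts[src][tgt] += 1
    probs = []
    for row in counts:
        total = sum(row) + delta * n_states          # add-delta smoothing, then row-normalize
        probs.append([(c + delta) / total for c in row])
    return probs
```

Smoothing keeps unseen state-to-state transitions possible, which matters when the EM-like reassignment moves contributions between clusters.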
  </Section>
  <Section position="5" start_page="42" end_page="43" type="metho">
    <SectionTitle>
4 Overview of Museli Approach
</SectionTitle>
    <Paragraph position="0"> We cast the segmentation problem as a binary classification problem where each contribution is classified as NEW_TOPIC if it introduces a new topic and SAME_TOPIC otherwise. In our hybrid Museli approach, we combined lexical cohesion with features that have the potential to capture something about the linguistic style that marks shifts in topic. Table 1 lists our features.</Paragraph>
    <Section position="1" start_page="43" end_page="43" type="sub_section">
      <SectionTitle>
Table 1: Feature Descriptions
</SectionTitle>
      <Paragraph position="0"> The features include: (1) lexical cohesion, measured as the cosine correlation of adjacent regions in the discourse, where the term vectors of adjacent regions are stemmed and stopwords are removed; (2) a normalized time-difference feature, where X corresponds to the time difference and MIN and MAX are computed with respect to the whole corpus; and (3) a binary-valued feature indicating whether the speaker of the previous contribution was the student or the tutor. We found that using a Naive Bayes classifier with an attribute-selection wrapper using the chi-square test for ranking attributes performed better than other state-of-the-art machine learning algorithms on our task, perhaps because of the evidence-integration-oriented nature of the problem. We conducted our evaluation using 10-fold cross-validation, being careful not to include instances from the same dialogue in both the training and test sets on any fold, to avoid biasing the trained model with idiosyncratic communicative patterns associated with individual dialogue participants. To capitalize on differences in conversational behavior between participants assigned to different roles in the conversation (i.e., student and tutor), we learn separate models for each role; the current contribution's agent is thus implicit in the fact that we learn separate models for each agent-role (student and tutor). This decision is motivated by observations that participants with different speaker-roles, each with different goals in the conversation, introduce topics with different frequency, introduce different types of topics, and may introduce topics in a different style that displays their status in the conversation. For instance, a tutor may be more likely to introduce new topics with a contribution that ends with an imperative, while a student may be more likely to introduce new topics with a contribution that ends with a wh-question. Dissimilar agent-roles also occur in other domains, such as Travel Agent and Customer in flight-booking scenarios.</Paragraph>
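The chi-square attribute ranking inside the selection wrapper can be sketched for binary features as follows; this is a simplified pure-Python stand-in (the function names and the binary-feature restriction are illustrative assumptions, not the wrapper actually used):

```python
from collections import Counter

def chi2_score(feature: list[int], labels: list[int]) -> float:
    """Chi-square statistic for a binary feature against a binary label
    (NEW_TOPIC = 1, SAME_TOPIC = 0), computed from the 2x2 contingency table."""
    n = len(labels)
    obs = Counter(zip(feature, labels))        # observed cell counts
    f1, l1 = sum(feature), sum(labels)
    expected = {(f, l): ((f1 if f else n - f1) * (l1 if l else n - l1)) / n
                for f in (0, 1) for l in (0, 1)}   # counts expected under independence
    return sum((obs[cell] - e) ** 2 / e for cell, e in expected.items() if e > 0)

def top_k_features(X: list[list[int]], y: list[int], k: int) -> list[int]:
    """Rank feature columns by chi-square and keep the k highest-scoring ones."""
    scores = [(chi2_score([row[j] for row in X], y), j) for j in range(len(X[0]))]
    return [j for _, j in sorted(scores, reverse=True)[:k]]
```

In the setting above this ranking would be rerun on the training data of each cross-validation fold, keeping the top 1000 features, with a separate model per speaker-role.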
      <Paragraph position="1"> Using the complete set of features enumerated above, we perform feature selection on the training data for each fold of the cross-validation separately, training a model with the top 1000 features and applying that trained model to the test data. Examples of high-ranking features output by our chi-square feature selection wrapper confirm our intuition that initial and final contributions of a segment are marked differently. Moreover, the highest-ranked features differ between our two speaker-roles. Some features highly correlated with student-initiated segments are am_trying, should, what_is, and PUNCT_question, which relate to student questions and requests for information. Some features highly correlated with tutor-initiated segments include ok_lets, do, see_what, and BEGIN_VERB (the POS of the first word in the contribution is VERB), which characterize imperatives, and features such as now, next, and first, which characterize instructional task ordering.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="43" end_page="44" type="metho">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> We evaluate Museli in comparison to the best-performing state-of-the-art approaches, demonstrating that our hybrid Museli approach outperforms all of these approaches on two different dialogue corpora by a statistically significant margin (p &lt; .01), in one case reducing the probability of error, as measured by P_k (Beeferman et al., 1999), to about 10%.</Paragraph>
    <Section position="1" start_page="43" end_page="44" type="sub_section">
      <SectionTitle>
5.1 Experimental Corpora
</SectionTitle>
      <Paragraph position="0"> We used two different dialogue corpora from the educational domain for our evaluation. Both corpora consist of dialogues between a student and a tutor (speakers with asymmetric roles), and both were collected via chat software. The first corpus, which we call the Olney &amp; Cai corpus, is a set of dialogues selected randomly from the same corpus used by Olney and Cai (2005). The dialogues discuss problems related to Newton's Three Laws of Motion. The second corpus, the Thermo corpus, is a locally collected corpus of thermodynamics tutoring dialogues, in which tutor-student pairs work together to solve an optimization task. Table 2 shows corpus statistics from both corpora.</Paragraph>
      <Paragraph position="1"> Both corpora seem adequate for attempting to harness systematic differences in how speakers with asymmetric roles may initiate or close topic segments. The Thermo corpus is particularly appropriate for addressing the research question of how to automatically segment natural, spontaneous dialogue. The exploratory task is more loosely structured than many task-oriented domains investigated in the dialogue community, such as flight reservation or meeting scheduling. Students can interrupt with questions and tutors can digress in any way they feel may benefit the completion of the task. In the Olney and Cai corpus, the same 10 physics problems are addressed in each session and the interaction is almost exclusively a tutor initiation followed by student response, evident from the nearly equal number of student and tutor contributions.</Paragraph>
    </Section>
    <Section position="2" start_page="44" end_page="44" type="sub_section">
      <SectionTitle>
5.2 Baseline Approaches
</SectionTitle>
      <Paragraph position="0"> We evaluate Museli against the following four algorithms: (1) Olney and Cai (Ortho), (2) Barzilay and Lee (B&amp;L), (3) TextTiling (TT), and (4) Foltz.</Paragraph>
      <Paragraph position="1"> Unlike the other baseline algorithms, (Olney and Cai, 2005) applied their orthonormal basis approach specifically to dialogue and, prior to this work, reported the highest numbers for topic segmentation of dialogue. Barzilay and Lee's approach is the state of the art in modeling topic shifts in monologue text. Our application of B&amp;L to dialogue attempts to harness any existing and recognizable redundancy in topic-flow across our dialogues for the purpose of topic segmentation.</Paragraph>
      <Paragraph position="2"> We chose TextTiling for its seminal contribution to monologue segmentation. TextTiling and Foltz consider lexical cohesion as their only evidence of topic shifts. Applying these approaches to dialogue segmentation sheds light on how term distribution in dialogue differs from that of expository monologue text (e.g. news articles). The Foltz and Ortho approaches require a trained LSA space, which we prepared the same way as described in (Olney and Cai, 2005). Any parameter tuning for approaches other than our Museli was computed over the entire test set, giving baseline algorithms the maximum advantage.</Paragraph>
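Preparing an LSA space amounts to a truncated SVD of a term-document (here, term-contribution) matrix. A minimal sketch follows; the matrix layout, the choice of k, and the function names are illustrative assumptions, not the exact preparation in (Olney and Cai, 2005):

```python
import numpy as np

def lsa_space(term_doc: np.ndarray, k: int) -> np.ndarray:
    """Reduce a term-document count matrix (terms in rows, documents in
    columns) to a k-dimensional latent space via truncated SVD; returns
    one k-dimensional vector per document."""
    U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return (np.diag(S[:k]) @ Vt[:k]).T   # document coordinates in the latent space

def lsa_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two latent document vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```

Comparing contributions in the latent space, rather than the raw term space, is what lets related-but-non-identical vocabulary count toward cohesion.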
      <Paragraph position="3"> In addition to these approaches, we include segmentation results from three degenerate approaches: (1) classifying all contributions as NEW_TOPIC (ALL), (2) classifying no contributions as NEW_TOPIC (NONE), and (3) classifying contributions as NEW_TOPIC at uniform intervals (EVEN), separated by the average reference topic length (see Table 2).</Paragraph>
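The three degenerate baselines can be sketched directly (function and argument names are illustrative):

```python
def degenerate_boundaries(n_contribs: int, avg_topic_len: int, mode: str) -> list[int]:
    """Indices of contributions predicted as NEW_TOPIC by each degenerate baseline."""
    if mode == "ALL":     # every contribution starts a new topic
        return list(range(n_contribs))
    if mode == "NONE":    # no contribution starts a new topic
        return []
    if mode == "EVEN":    # boundaries at uniform intervals of the average reference length
        return list(range(0, n_contribs, avg_topic_len))
    raise ValueError(f"unknown mode: {mode}")
```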
      <Paragraph position="4"> As a means for comparison, we adopt two evaluation metrics: P_k and f-measure. An extensive argument in support of P_k's robustness (if k is set to 1/2 the average reference topic length) is presented in (Beeferman et al., 1999). P_k measures the probability of misclassifying two contributions a distance of k contributions apart, where the classification question is: are the two contributions part of the same topic segment or not?</Paragraph>
      <Paragraph position="6"> Lower P_k values are preferred over higher ones. P_k equally captures the effect of false negatives and false positives and favors predictions that are closer to the reference boundaries. F-measure punishes false positives equally, regardless of their distance to reference boundaries.</Paragraph>
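Under the formulation above, P_k can be sketched as follows over per-contribution segment labels (windowing conventions vary slightly across implementations; this is an illustrative version, not the evaluation code used here):

```python
def p_k(reference: list[int], hypothesis: list[int], k: int) -> float:
    """P_k (Beeferman et al., 1999): the probability that two contributions
    k apart are judged inconsistently (same segment vs. different segments)
    by the hypothesis relative to the reference. Inputs are segment labels,
    one per contribution; lower values are better."""
    n = len(reference)
    errors = 0
    for i in range(n - k):
        ref_same = reference[i] == reference[i + k]
        hyp_same = hypothesis[i] == hypothesis[i + k]
        errors += ref_same != hyp_same
    return errors / (n - k)
```

A perfect segmentation scores 0, and predicting one giant segment against a multi-segment reference drives the score toward 1.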
    </Section>
  </Section>
  <Section position="7" start_page="44" end_page="45" type="metho">
    <SectionTitle>
5.3 Results
</SectionTitle>
    <Paragraph position="0"> Table 3 shows our evaluation results. Note that lower values of P_k are preferred over higher ones. The opposite is true of F-measure. In both corpora, the Museli approach performed significantly better than all other approaches (p &lt; .01).</Paragraph>
    <Section position="1" start_page="45" end_page="45" type="sub_section">
      <SectionTitle>
5.4 Error Analysis
</SectionTitle>
      <Paragraph position="0"> Results for all approaches are better on the Olney and Cai corpus than the Thermo corpus. The Thermo corpus differs profoundly from the Olney and Cai corpus in ways that very likely influenced the performance. For instance, in the Thermo corpus each dialogue contribution is on average 5 words long, whereas in the Olney and Cai corpus each dialogue contribution contains an average of 28 words. Thus, the vector space representation of the dialogue contributions is more sparse in the Thermo corpus, which makes shifts in lexical coherence less reliable as topic shift indicators.</Paragraph>
      <Paragraph position="1"> In terms of P k , TextTiling (TT) performed worse than the degenerate algorithms. TextTiling measures the term overlap between adjacent regions in the discourse. However, dialogue contributions are often terse or even contentless. This produces many islands of contribution-sequences for which the local lexical coherence is zero. TextTiling wrongly classifies all of these as starts of new topics. A heuristic improvement to prevent TextTiling from placing topic boundaries at every point along a sequence of contributions failed to produce a statistically significant improvement.</Paragraph>
      <Paragraph position="2"> The Foltz and the Ortho approaches rely on LSA to provide strategic semantic generalizations capable of detecting shifts in topic. Following (Olney and Cai, 2005), we built our LSA space using dialogue contributions as the atomic text unit. In corpora such as the Thermo corpus, however, this may not be effective due to the brevity of contributions. Barzilay and Lee's algorithm (B&amp;L) did not generalize well to either dialogue corpus. One reason could be that probabilistic methods, such as their approach, require that reference topics have significantly different language models, which was not true in either of our evaluation corpora. We also noticed a number of instances in the dialogue corpora where participants referred to information from previous topic segments, which consequently may have blurred the distinction between the language models assigned to different topics.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="45" end_page="46" type="metho">
    <SectionTitle>
6 Dialogue Exchanges
</SectionTitle>
    <Paragraph position="0"> Although results are reliably better than our baseline algorithms in both corpora, there is much room for improvement, especially in the more spontaneous Thermo corpus. We believe that an improvement can come from a multi-layer segmentation approach, where a first pass segments a dialogue into dialogue exchanges and a second classifier assigns topic shifts based on exchange initial contributions. Dialogue is hierarchical in nature.</Paragraph>
    <Paragraph position="1"> Topic and topic shift comprise only one of the many lenses through which dialogue behaves in seemingly structured ways. Thus, it seems logical that exploiting more fine-grained sub-parts of dialogue than our definition of topic might help us do better at predicting shifts in topic. One such sub-part of dialogue is the notion of a dialogue exchange, typically comprising 2-3 contributions.</Paragraph>
    <Paragraph position="2"> Stubbs (1983) motivates the definition of an exchange with the following observation. In theory, there is no limit to the number of possible responses to the clause &quot;Is Harry at home?&quot;. However, constraints are imposed on the interpretation of the contribution that follows it: yes or no. Such a constraint is central to the concept of a dialogue exchange. Informally, an exchange is made up of an initiation, for which the possibilities are open-ended, followed by dialogue contributions that are pre-classified and thus increasingly restricted. A contribution is part of the next exchange when the constraint on its communicative act is lifted.</Paragraph>
    <Paragraph position="3"> Sinclair and Coulthard (1975) introduce a more formal definition of exchange with their Initiation-Response-Feedback, or IRF, structure. An initiation produces a response, and a response happens as a direct consequence of an initiation. Feedback serves to close an exchange. Sinclair and Coulthard posit that if exchanges constitute the minimal unit of interaction, IRF is a primary structure of interactive discourse in general.</Paragraph>
    <Paragraph position="4"> To measure the benefits of exchange boundaries in detecting topic shift in dialogue, we coded the Thermo corpus with exchanges following Sinclair  and Coulthard's IRF structure. The coder who labeled dialogue exchanges had no knowledge of our definition of topic or our intention to do topicanalyses of the corpus. Any correlation between exchange boundaries and topic boundaries is not a bias introduced during the hand-labeling process.</Paragraph>
  </Section>
  <Section position="9" start_page="46" end_page="47" type="metho">
    <SectionTitle>
7 Topic Segmentation with Exchanges
</SectionTitle>
    <Paragraph position="0"> In our corpus, as we believe is true in domain-general dialogue, knowledge of an exchange boundary increases the probability of a topic boundary significantly. One way to quantify this relation is with the following observation. In our experimental Thermo corpus, there are 4794 dialogue contributions, 360 topic shifts, and 1074 exchange shifts. Using maximum likelihood estimation, the likelihood of being correct if we say that a randomly chosen contribution is a topic shift is 0.075 (# topic shifts / # contributions). However, the likelihood of being correct if we have prior knowledge that an exchange shift also occurs in that contribution is 0.25. Thus, knowledge that the contribution introduces a new exchange increases our confidence that it also introduces a new topic.</Paragraph>
    <Paragraph position="1"> More importantly, the probability that a contribution does not mark a topic shift, given that it does not mark an exchange-shift, is 0.98. Thus, exchanges show great promise in narrowing the search-space of tentative topic shifts.</Paragraph>
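The estimates above can be checked with a few lines of arithmetic; note that the count of topic shifts coinciding with exchange shifts is inferred here from the reported conditional probability, not stated directly in the text:

```python
# Corpus counts reported for the Thermo corpus.
contribs, topic_shifts, exchange_shifts = 4794, 360, 1074

# P(topic shift) = # topic shifts / # contributions ~= 0.075.
p_topic = topic_shifts / contribs

# The reported P(topic shift | exchange shift) = 0.25 implies roughly
# 0.25 * 1074 ~= 269 topic shifts coincide with exchange shifts (inferred).
topic_at_exchange = round(0.25 * exchange_shifts)

# P(no topic shift | no exchange shift) ~= 0.98.
p_no_topic_given_no_exchange = (
    1 - (topic_shifts - topic_at_exchange) / (contribs - exchange_shifts)
)
```

The reported figures are mutually consistent: the remaining ~92 topic shifts spread over the 3720 non-exchange-initial contributions give the stated 0.98.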
    <Paragraph position="2"> In addition to possibly narrowing the space of tentative topic boundaries, exchanges are helpful in that they provide more coarse-grained building blocks for segmentation algorithms that rely on term distribution as a proxy for dialogue coherence, such as TextTiling (Hearst, 1994, 1997), the Foltz algorithm (Foltz, 1998), Orthonormal Basis (Olney and Cai, 2005), and Barzilay and Lee's content modeling approach (Barzilay and Lee, 2004). At the heart of all these approaches is the assumption that a change in term distribution signals a shift in topic. When applied to dialogue, the major weakness of these approaches is that contributions are often contentless: terse and devoid of thematically meaningful terms. Thus, a more coarse-grained discourse unit is needed.</Paragraph>
    <Paragraph position="3">
8 Barzilay and Lee with Exchanges
Barzilay and Lee (2004) offer an attractive framework for constructing a context-specific Hidden Markov Model (HMM) of topic drift. In our initial evaluation, we used dialogue contributions as the atomic discourse unit. Using contributions, our application of Barzilay and Lee's algorithm for segmenting dialogue fails at least in part because the model learns states that are not thematically meaningful, but instead relate to other systematic phenomena in dialogue, such as fixed expressions and discourse cues. Figure 1 shows the cluster (state) size distribution in terms of the percentage of the total discourse units (exchanges vs. contributions) in the Thermo corpus assigned to each cluster.</Paragraph>
    <Paragraph position="4"> On the horizontal axis, clusters (states) are sorted by size from largest to smallest.</Paragraph>
    <Paragraph position="5"> The largest cluster contains 70% of all contributions in the corpus. The second largest cluster generates only 10% of the contributions. In contrast, when using exchanges as the atomic unit, the cluster size distribution is less skewed and corresponds more closely to a topic analysis performed by a domain expert. In this analysis, the number of desired clusters (states), which is an input to the algorithm, was set to 16, the same number identified in a domain expert's analysis of the Thermo corpus.</Paragraph>
    <Paragraph position="6"> Examples of such topics include high-level ones such as greeting, setup initialization, and general thermo concepts, as well as task-specific ones like sensitivity analysis and regeneration.</Paragraph>
    <Paragraph position="7"> A closer examination of the clusters (states) confirms our intuition that systematic topic-independent phenomena in dialogue, coupled with the terse nature of contributions in spontaneous dialogue, lead to an overly skewed cluster size distribution. Examining the terms with the highest emission probabilities, we find that the largest states contain topical terms like cycle, efficiency, increase, quality, plot, and turbine intermixed with terms like think, you, right, make, yeah, fine, and ok. Also, the sets of topical terms in these larger states do not seem coherent with respect to the expert-induced topics. This suggests that thematically ambiguous fixed expressions blur the distinction between the different topic-centered language models, producing an overly heavy-tailed cluster size distribution. One might argue that a possible solution to this problem would be to remove these fixed expressions as part of pre-processing. However, that requires knowledge of the particular domain and knowledge of the interaction style characteristic to the context. We believe that a more robust solution is to use exchanges as the atomic unit of discourse.</Paragraph>
  </Section>
</Paper>