<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1071">
  <Title>Discourse Segmentation of Multi-Party Conversation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The ICSI Meeting Corpus
</SectionTitle>
    <Paragraph position="0"> We have evaluated our segmenter on the ICSI Meeting corpus (Janin et al., 2003). This corpus is one of a growing number of corpora with human-to-human multi-party conversations. In this corpus, recordings of meetings ranged primarily over three different recurring meeting types, all of which concerned speech or language research.1 The average duration is 60 minutes, with an average of 6.5 participants.</Paragraph>
    <Paragraph position="1"> They were transcribed, and each conversation turn was marked with the speaker, start time, end time, and word content.</Paragraph>
    <Paragraph position="2"> From the corpus, we selected 25 meetings to be segmented, each by at least three subjects. We opted for a linear representation of discourse, since finer-grained discourse structures (e.g. (Grosz and Sidner, 1986)) are generally considered to be difficult to mark reliably. Subjects were asked to mark each speaker change (potential boundary) as either boundary or non-boundary. In the resulting annotation, the agreed segmentation based on majority 1While it would be desirable to have a broader variety of meetings, we hope that experiments on this corpus will still carry some generality.</Paragraph>
    <Paragraph position="3"> opinion contained 7.5 segments per meeting on average, while the average number of potential boundaries is 770. We used Cochran's Q (1950) to evaluate the agreement among annotators. Cochran's test evaluates the null hypothesis that the number of subjects assigning a boundary at any position is randomly distributed. The test shows that the inter-judge reliability is significant to the 0.05 level for 19 of the meetings, which seems to indicate that segment identification is a feasible task.2</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Segmentation based on Lexical Cohesion
</SectionTitle>
    <Paragraph position="0"> Previous work on discourse segmentation of written texts indicates that lexical cohesion is a strong indicator of discourse structure. Lexical cohesion is a linguistic property that pertains to speech as well, and is a linguistic phenomenon that can also be exploited in our case: while our data does not have the same kind of syntactic and rhetorical structure as written text, we nonetheless expect that information from the written transcription alone should provide indications about topic boundaries. In this section, we describe our work on LCseg, a topic segmenter based on lexical cohesion that can handle both speech and text, but that is especially designed to generate the lexical cohesion feature used in the feature-based segmentation described in Section 5.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Algorithm Description
</SectionTitle>
      <Paragraph position="0"> LCseg computes lexical chains, which are thought to mirror the discourse structure of the underlying text (Morris and Hirst, 1991). We ignore synonymy and other semantic relations, building a restricted model of lexical chains consisting of simple term repetitions, hypothesizing that major topic shifts are likely to occur where strong term repetitions start and end. While other relations between lexical items also work as cohesive factors (e.g. between a term and its super-ordinate), the work on linear topic segmentation reporting the most promising results account for term repetitions alone (Choi, 2000; Utiyama and Isahara, 2001).</Paragraph>
      <Paragraph position="1"> The preprocessing steps of LCseg are common to many segmentation algorithms. The input document is first tokenized, non-content words are removed, 2Four other meetings failed short the significance test, while there was little agreement on the two last ones (p &gt; 0.1). and remaining words are stemmed using an extension of Porter's stemming algorithm (Xu and Croft, 1998) that conflates stems using corpus statistics.</Paragraph>
      <Paragraph position="2"> Stemming will allow our algorithm to more accurately relate terms that are semantically close.</Paragraph>
      <Paragraph position="3"> The core algorithm of LCseg has two main parts: a method to identify and weight strong term repetitions using lexical chains, and a method to hypothesize topic boundaries given the knowledge of multiple, simultaneous chains of term repetitions.</Paragraph>
      <Paragraph position="4"> A term is any stemmed content word within the text. A lexical chain is constructed to consist of all repetitions ranging from the first to the last appearance of the term in the text. The chain is divided into subchains when there is a long hiatus of h consecutive sentences with no occurrence of the term, where h is determined experimentally. For each hiatus, a new division is made and thus, we avoid creating weakly linked chains.</Paragraph>
      <Paragraph position="5"> For all chains that have been identified, we use a weighting scheme that we believe is appropriate to the task of inducing the topical or sub-topical structure of text. The weighting scheme depends on two factors: Frequency: chains containing more repeated terms receive a higher score.</Paragraph>
      <Paragraph position="6"> Compactness: shorter chains receive a higher weight than longer ones. If two chains of different lengths contain the same number of terms, we assign a higher score to the shortest one. Our assumption is that the shorter one, being more compact, seems to be a better indicator of lexical cohesion.3 We apply a variant of a metric commonly used in information retrieval, TF.IDF (Salton and Buckley, 1988), to score term repetitions. If R1 ...Rn is the set of all term repetitions collected in the text, t1 ...tn the corresponding terms, L1 ...Ln their respective lengths,4 and L the length of the text, the adapted metric is expressed as follows, combining frequency (freq(ti)) of a term ti and the compactness of its underlying chain:</Paragraph>
      <Paragraph position="8"> one might assume that longer chains should receive a higher score. However we point out that in a linear model of discourse, chains that almost span the entire text are barely indicative of any structure (assuming boundaries are only hypothesized where chains start and end).</Paragraph>
      <Paragraph position="9"> 4All lengths are expressed in number of sentences.</Paragraph>
      <Paragraph position="10"> In the second part of the algorithm, we combine information from all term repetitions to compute a lexical cohesion score at each sentence break (or, in the case of spoken conversations, speaker turn break). This step of our algorithm is very similar in spirit to TextTiling (Hearst, 1994). The idea is to work with two adjacent analysis windows, each of fixed size k. For each sentence break, we determine a lexical cohesion function by computing the cosine similarity at the transition between the two windows.</Paragraph>
      <Paragraph position="11"> Instead of using word counts to compute similarity, we analyze lexical chains that overlap with the two windows. The similarity between windows (A and</Paragraph>
      <Paragraph position="13"> The similarity computed at each sentence break produces a plot that shows how lexical cohesion changes over time; an example is shown in Figure 1.</Paragraph>
      <Paragraph position="14"> The lexical cohesion function is then smoothed using a moving average filter, and minima become potential segment boundaries. Then, in a manner quite similar to (Hearst, 1994), the algorithm determines for every local minimum mi how sharp of a change there is in the lexical cohesion function. The algorithm looks on each side of mi for maxima of cohesion, and once it eventually finds one on each side (l and r), it computes the hypothesized segmentation probability:</Paragraph>
      <Paragraph position="16"> where LCF(x) is the value of the lexical cohesion function at x.</Paragraph>
      <Paragraph position="17"> This score is supposed to capture the sharpness of the change in lexical cohesion, and give probabilities close to 1 for breaks like sentence 179 in Figure 1. Finally, the algorithm selects the hypothesized boundaries with the highest computed probabilities. If the number of reference boundaries is unknown, the algorithm has to make a guess. It computes the 5Normalizing anything in these windows has little effect, since the cosine similarity is scale invariant, that is  x-axis represent sentence indices, and y-axis represents the lexical cohesion function. The representative example presented here is segmented by LCseg with an error of Pk = 15.79, while the average performance of the algorithm is Pk = 15.31 on the WSJ test corpus (unknown number of segments). mean and the variance of the hypothesized probabilities of all potential boundaries (local minima). As we can see in Figure 1, there are many local minima that do not correspond to actual boundaries. Thus, we ignore all potential boundaries with a probability lower than plimit. For the remaining points, we compute the threshold using the average (u) and standard deviation (s) of the p(mi) values, and each potential boundary mi above the threshold u[?]a*s is hypothesized as a real boundary.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> We evaluate LCseg against two state-of-the-art segmentation algorithms based on lexical cohesion (Choi, 2000; Utiyama and Isahara, 2001). We use the error metric Pk proposed by Beeferman et al.</Paragraph>
      <Paragraph position="1"> (1999) to evaluate segmentation accuracy. It computes the probability that sentences k units (e.g. sentences) apart are incorrectly determined as being either in different segments or in the same one. Since it has been argued in (Pevzner and Hearst, 2002) that Pk has some weaknesses, we also include results according to the WindowDiff (WD) metric (which is described in the same work).</Paragraph>
      <Paragraph position="2"> A test corpus of concatenated6 texts extracted from the Brown corpus was built by Choi (2000) to evaluate several domain-independent segmentation algorithms. We reuse the same test corpus for our evaluation, in addition to two other test corpora we constructed to test how segmenters scale across genres and how they perform with texts with various  ments.</Paragraph>
      <Paragraph position="3"> number of segments.7 We designed two test corpora, each of 500 documents, using concatenated texts extracted from the TDT and WSJ corpora, ranging from 4 to 22 in number of segments.</Paragraph>
      <Paragraph position="4"> LCseg depends on several parameters. Parameter tuning was performed on three tuning corpora of one thousand texts each.8 We performed searches for the optimal settings of the four tunable parameters introduced above; the best performance was achieved with h = 11 (hiatus length for dividing a chain into parts), k = 2 (analysis window size), plimit = 0.1 and a = 12 (thresholding limits for the hypothesized boundaries).</Paragraph>
      <Paragraph position="5"> As shown in Table 1, our algorithm is significantly better than (Choi, 2000) (labeled C99) on all three test corpora, according to a one-sided t-test of the null hypothesis of equal mean at the 0.01 level. It is not clear whether our algorithm is better than (Utiyama and Isahara, 2001) (U00). When the number of segments is provided to the algorithms, our algorithm is significantly better than Utiyama's on WSJ, better on Brown (but not significant), and significantly worse on TDT. When the number of boundaries is unknown, our algorithm is insignificantly worse on Brown, but significantly better on WSJ and TDT - the two corpora designed to have a varying number of segments per document. In the case of the Meeting corpus, none of the algorithms are significantly different than the others, due to the  the table are the results of significance tests between U00 and LCseg. Bold-faced values are scores that are statistically significant.</Paragraph>
      <Paragraph position="6"> small test set size.</Paragraph>
      <Paragraph position="7"> In conclusion, LCseg has a performance comparable to state-of-the-art text segmentation algorithms, with the added advantage of computing a segmentation probability at each potential boundary. This information can be effectively used in the feature-based segmenter to account for lexical cohesion, as described in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Feature-based Segmentation
</SectionTitle>
    <Paragraph position="0"> In the previous section, we have concentrated exclusively on the consideration of content (through lexical cohesion) to determine the structure of texts, neglecting any influence of form. In this section, we explore formal devices that are indicative of topic shifts, and explain how we use these cues to build a segmenter targeting conversational speech.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Probabilistic Classifiers
</SectionTitle>
      <Paragraph position="0"> Topic segmentation is reduced here to a classification problem, where each utterance break Bi is either considered a topic boundary or not. We use statistical modeling techniques to build a classifier that uses local features (e.g. cue phrases, pauses) to determine if an utterance break corresponds to a topic boundary. We chose C4.5 and C4.5rules (Quinlan, 1993), two programs to induce classification rules in the form of decision trees and production rules (respectively). C4.5 generates an unpruned decision tree, which is then analyzed by C4.5rules to generate a set of pruned production rules (it tries to find the most useful subset of them).</Paragraph>
      <Paragraph position="1"> The advantage of pruned rules over decision trees is that they are easier to analyze, and allow combination of features in the same rule (feature interactions are explicit).</Paragraph>
      <Paragraph position="2"> The greedy nature of decision rule learning algorithms implies that a large set of features can lead to bad performance and generalization capability. It is desirable to remove redundant and irrelevant features, especially in our case since we have little data labeled with topic shifts; with a large set of features, we would risk overfitting the data. We tried to restrict ourselves to features whose inclusion is motivated by previous work (pauses, speech rate) and added features that are specific to multi-speaker speech (overlap, changes in speaker activity).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Features
</SectionTitle>
      <Paragraph position="0"> Cue phrases: previous work on segmentation has found that discourse particles like now, well provide valuable information about the structure of texts (Grosz and Sidner, 1986; Hirschberg and Litman, 1994; Passonneau and Litman, 1997). We analyzed the correlation between words in the meeting corpus and labeled topic boundaries, and automatically extracted utterance-initial cue phrases9 that are statistically correlated with boundaries. For every word in the meeting corpus, we counted the number of its occurrences near any topic boundary, and its number of appearances overall. Then, we performed kh2 significance tests (e.g. figure 2 for okay) under the null hypothesis that no correlation exists. We selected terms whose kh2 value rejected the hypothesis under a 0.01-level confidence (the rejection criterion is kh2 [?] 6.635). Finally, induced cue phrases whose usage has never been described in other work were removed (marked with [?] in Table 3). Indeed, there is a risk that the automatically derived list of cue phrases could be too specific to the word usage in 9As in (Litman and Passonneau, 1995), we restrict ourselves to the first lexical item of any utterance, plus the second one if the first item is also a cue word.</Paragraph>
      <Paragraph position="1">  Silences: previous work has found that major shifts in topic typically show longer silences (Passonneau and Litman, 1993; Hirschberg and Nakatani, 1996). We investigated the presence of silences in meetings and their correlation with topic boundaries, and found it necessary to make a distinction between pauses and gaps (Levinson, 1983). A pause is a silence that is attributable to a given party, for example in the middle of an adjacency pair, or when a speaker pauses in the middle of her speech.</Paragraph>
      <Paragraph position="2"> Gaps are silences not attributable to any party, and last until a speaker takes the initiative of continuing the discussion. As an approximation of this distinction, we classified a silence that follows a question or in the middle of somebody's speech as a pause, and any other silences as a gap. While the correlation between long silences and discourse boundaries seem to be less pervasive in meetings than in other speech corpora, we have noticed that some topic boundaries are preceded (within some window) by numerous gaps. However, we found little correlation between pauses and topic boundaries.</Paragraph>
      <Paragraph position="3"> Overlaps: we also analyzed the distribution of overlapping speech by counting the average overlap rate within some window. We noticed that, many times, the beginning of segments are characterized by having little overlapping speech.</Paragraph>
      <Paragraph position="4"> Speaker change: we sometimes noticed a correlation between topic boundaries and sudden changes in speaker activity. For example, in Figure 2, it is clear that the contribution of individual speakers to the discussion can greatly change from one discourse unit to the next. We try to capture significant changes in speakership by measuring the dissimilarity between two analysis windows. For each potential boundary, we count for each speaker i the number of words that are uttered before (Li) and after (Ri) the potential boundary (we limit our analysis to a window of fixed size). The two distributions are normalized to form two probability distributions l and r, and significant changes of speakership are detected by computing their Jensen-Shannon divergence: null</Paragraph>
      <Paragraph position="6"> where D(l||r) is the KL-divergence between the two distributions.</Paragraph>
      <Paragraph position="7"> Lexical cohesion: we also incorporated the lexical cohesion function computed by LCseg as a feature of the multi-source segmenter in a manner similar to the knowledge source combination performed by (Beeferman et al., 1999) and (T&amp;quot;ur et al., 2001). Note that we use both the posterior estimate computed by LCseg and the raw lexical cohesion function as features of the system.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Features: Selection and Combination
</SectionTitle>
      <Paragraph position="0"> For every potential boundary Bi, the classifier analyzes features in a window surrounding Bi to decide whether it is a topic boundary or not. It is generally unclear what is the optimal window size and how features should be analyzed. Windows of various sizes can lead to different levels of prediction, and in some cases, it might be more appropriate to only extract features preceding or following Bi.</Paragraph>
      <Paragraph position="1"> We avoided making arbitrary choices of parameters; instead, for any feature F and a set F1,...,Fn of possible ways to measure the feature (different window sizes, different directions), we picked the Fi that is in isolation the best predictor of topic boundaries (among F1,...,Fn). Table 4 presents for each feature the analysis mode that is the most useful on the training data.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Evaluation
</SectionTitle>
      <Paragraph position="0"> We performed 25-fold cross-validation for evaluating the induced probabilistic classifier, computing the average of Pk and WD on the held-out meetings. Feature selection and decision rule learning  is always performed on sets of 24 meetings, while the held-out data is used for testing. Table 5 gives some examples of the type of rules that are learned. The first rule states that if the value for the lexical cohesion (LC) function is low at the current sentence break, there is at least one CUE phrase, there is less than three seconds of silence to the left of the break,10 and a single speaker holds the floor for a longer period of time than usual to the right of the break, then we have a topic break. In general, we found that the derived rules show that lexical cohesion plays a stronger role than most other features in determining topic breaks. Nonetheless, the quantitative results summarized in table 6, which correspond to the average performance on the held-out sets, show that the integration of conversational features with the text-based segmenter outperforms either alone.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>