An Analysis of Quantitative Aspects in the Evaluation of Thematic Segmentation Algorithms

1 Introduction

The goal of thematic segmentation is to identify the boundaries of topically coherent segments in text documents. Giving a rigorous definition of the notion of topic is difficult, and the task of discourse/dialogue segmentation into thematic episodes is usually described by invoking an "intuitive notion of topic" (Brown and Yule, 1998). Thematic segmentation also relates to several notions such as speaker's intention, topic flow and cohesion.

Since it is unclear what mental representations humans use in order to recognize a coherent text, various surface markers (Hirschberg and Nakatani, 1996; Passonneau and Litman, 1997) and external knowledge sources (Kozima and Furugori, 1994) have been exploited for the purpose of automatic thematic segmentation. Halliday and Hasan (1976) claim that text meaning is realised through certain language resources, which they refer to collectively as cohesion. The major classes of such text-forming resources identified in (Halliday and Hasan, 1976) are: substitution, ellipsis, conjunction, reiteration and collocation. In this paper, we examine one form of lexical cohesion, namely lexical reiteration.

Following some of the most prominent discourse theories in the literature (Grosz and Sidner, 1986; Marcu, 2000), a hierarchical representation of thematic episodes can be proposed, based on the idea that topics can be recursively divided into subtopics. Real texts exhibit a more intricate structure, including 'semantic returns' by which a topic is suspended at one point and resumed later in the discourse. However, we focus here on a reduced segmentation problem, which involves identifying non-overlapping and non-hierarchical segments at a coarse level of granularity.

Thematic segmentation is a valuable initial tool in information retrieval and natural language processing. For instance, in information access systems, retrieving smaller, coherent passages is more convenient to the user than whole-document retrieval, and thematic segmentation has been shown to improve passage-retrieval performance (Hearst and Plaunt, 1993). Collections such as transcripts contain no headers or paragraph markers, so a clear separation of the text into thematic episodes can be used together with highlighted keywords as a kind of 'quick read guide' that helps users quickly navigate through and understand the text. Moreover, automatic thematic segmentation has been shown to play an important role in automatic summarization (Mani, 2001), anaphora resolution and discourse/dialogue understanding.

In this paper, we concern ourselves with the task of linear thematic segmentation and are interested in finding out whether different segmentation systems can perform well on artificial and real data sets without specific parameter tuning. In addition, we examine the implications of the choice of a particular error metric for the evaluation results.

This paper is organized as follows. Section 2 and Section 3 describe the systems and, respectively, the input data selected for our evaluation. Section 4 presents several existing evaluation metrics and their weaknesses, as well as a new evaluation metric that we propose. Section 5 presents our experimental set-up and compares the performance of the different systems. Finally, conclusions are drawn in Section 6.

2 Comparison of Systems

Combinations of different features (derived, for example, from linguistic and prosodic information) have been explored in previous studies such as (Galley et al., 2003) and (Kauchak and Chen, 2005). In this paper, we selected for comparison three systems based solely on the lexical reiteration feature: TextTiling (Hearst, 1997), C99 (Choi, 2000) and TextSeg (Utiyama and Isahara, 2001). In the following, we briefly review these approaches.

2.1 TextTiling Algorithm

The TextTiling algorithm was initially developed by Hearst (1997) for the segmentation of expository texts into multi-paragraph thematic episodes having a linear, non-overlapping structure (as reflected by the name of the algorithm). TextTiling is widely used as a de facto standard in the evaluation of alternative segmentation systems, e.g. (Reynar, 1998; Ferret, 2002; Galley et al., 2003). The algorithm can briefly be described by the following steps.

Step 1 includes stop-word removal, lemmatization and division of the text into 'token-sequences' (i.e. text blocks having a fixed number of words).

Step 2 determines a score for each gap between two consecutive token-sequences, by computing the cosine similarity (Manning and Schütze, 1999) between the two vectors representing the frequencies of the words in the two blocks.

Step 3 computes a 'depth score' for each token-sequence gap, based on the local minima of the scores computed in step 2.

Step 4 consists in smoothing the scores.

Step 5 chooses from the potential boundaries those whose scores are smaller than a certain 'cutoff function', based on the average and standard deviation of the score distribution.
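To make the block comparison concrete, the following sketch implements the core of steps 2 and 3 in Python. It is a minimal illustration, not Hearst's reference implementation: the token-sequence size w, the treatment of the final partial block, and the depth computation (which approximates the climb to the nearest peak by the highest score on each side) are all simplifications.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def gap_scores(tokens, w=20):
    """Step 2: score each gap between consecutive token-sequences of w words
    (the last block may be shorter than w)."""
    blocks = [Counter(tokens[i:i + w]) for i in range(0, len(tokens), w)]
    return [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]

def depth_scores(scores):
    """Step 3: the depth of each valley is the total rise to the highest
    score on its left plus the rise to the highest score on its right."""
    depths = []
    for i, s in enumerate(scores):
        left = max(scores[:i + 1])
        right = max(scores[i:])
        depths.append((left - s) + (right - s))
    return depths
```

In step 5, boundaries would then be placed at gaps whose depth score exceeds a cutoff derived from the mean and standard deviation of the depth scores.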
2.2 C99 Algorithm

The C99 algorithm (Choi, 2000) produces a linear segmentation based on a divisive clustering strategy and the cosine similarity measure between any two minimal units. More precisely, the algorithm consists of the following steps.

Step 1: after the division of the text into minimal units (in our experiments, the minimal unit is an utterance; throughout this paper we employ the term utterance to denote either a sentence or an utterance in its proper sense), stop words are removed and a stemmer is applied.

Step 2 consists of constructing a similarity matrix S of size m × m, where m is the number of utterances and an element s_ij of the matrix corresponds to the cosine similarity between the vectors representing the frequencies of the words in the i-th and the j-th utterances.

Step 3: a 'rank matrix' R of size m × m is computed by determining, for each pair of utterances, the number of neighbours in S with a lower similarity value.

In the final step, the locations of the thematic boundaries are determined by a divisive top-down clustering procedure. The criterion for dividing the current segment B into subsegments b_1, ..., b_m is the maximisation of a 'density' D, computed for each potential placement of boundaries as

$$D = \frac{\sum_{k} \mathrm{sum}_k}{\sum_{k} \mathrm{area}_k},$$

where sum_k and area_k refer to the sum of ranks and the area of the k-th segment in B, respectively.
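A compact sketch of steps 3 and 4 is given below, under the assumption that the similarity matrix from step 2 is available as a NumPy array. The 11 × 11 ranking mask mirrors the local ranking used by Choi (2000), but the divisive search itself is only indicated in a comment rather than reproduced in full.

```python
import numpy as np

def rank_matrix(S: np.ndarray, mask: int = 11) -> np.ndarray:
    """Step 3: replace each similarity value by the proportion of its
    neighbours (within a mask x mask window) with a strictly lower value."""
    m = S.shape[0]
    R = np.zeros_like(S, dtype=float)
    r = mask // 2
    for i in range(m):
        for j in range(m):
            win = S[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1]
            R[i, j] = (win < S[i, j]).sum() / max(win.size - 1, 1)
    return R

def density(R: np.ndarray, boundaries) -> float:
    """The criterion D = (sum of ranks inside segments) / (sum of areas).
    `boundaries` is a non-empty ascending list of segment end indices
    (exclusive), the last of which is m."""
    total_rank = total_area = 0.0
    start = 0
    for end in boundaries:
        seg = R[start:end, start:end]
        total_rank += seg.sum()
        total_area += seg.size
        start = end
    return total_rank / total_area

# Final step (sketch): repeatedly insert the single boundary that maximises
# D, stopping when the gain in D levels off.
```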
2.3 TextSeg Algorithm

The TextSeg algorithm (Utiyama and Isahara, 2001) implements a probabilistic approach to determining the most likely segmentation, as briefly described below.

The segmentation task is modeled as the problem of finding the minimum cost C(S) of a segmentation S. The segmentation cost is defined as

$$C(S) \equiv -\log \Pr(W \mid S)\,\Pr(S),$$

where W = w_1 w_2 ... w_n represents the text consisting of n words (after applying stop-word removal and stemming) and S = S_1 S_2 ... S_m is a potential segmentation of W into m segments. The probability Pr(W|S) is defined using Laplace's law, while the definition of the probability Pr(S) is chosen in a manner inspired by information theory.

A directed graph G is defined such that a path in G corresponds to a possible segmentation of W. The thematic segmentation proposed by the system is therefore obtained by applying a dynamic programming algorithm to determine the minimum cost path in G.
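The minimum-cost path search can be written as a straightforward dynamic program over candidate boundary positions, as in the sketch below. The sketch is generic: segment_cost stands in for Utiyama and Isahara's −log Pr(W|S) Pr(S) terms (a Laplace-smoothed word model plus a description-length prior for the segmentation), which are not reproduced here.

```python
def min_cost_segmentation(n, segment_cost):
    """Dynamic program over the boundary graph G: best[j] is the minimum
    cost of segmenting units 0..j-1; segment_cost(i, j) prices the segment
    spanning units i..j-1 (e.g. the negative log-probability of its words
    under a Laplace-smoothed unigram model, plus the prior term)."""
    INF = float("inf")
    best = [INF] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + segment_cost(i, j)
            if c < best[j]:
                best[j], back[j] = c, i
    # Recover the segment end positions by walking the back-pointers.
    cuts, j = [], n
    while j > 0:
        cuts.append(j)
        j = back[j]
    return sorted(cuts)  # includes n itself; the internal cuts are boundaries
```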
3 Input Data

When evaluating a thematic segmentation system for an application, human annotators should provide the gold standard. The problem is that building such a reference corpus is expensive: the typical setting involves an experiment with several human subjects, who are asked to mark thematic segment boundaries based on specific guidelines and their intuition, and the inter-annotator agreement provides the reference segmentation. This expense can be avoided by constructing a synthetic reference corpus through the concatenation of segments from different documents. The use of artificial data for evaluation is therefore a common practice in many studies, e.g. (Ferret, 2002; Choi, 2000; Utiyama and Isahara, 2001).

In our experiments, we used both artificial and real data, i.e. the algorithms have been tested on the following data sets containing English texts.

3.1 Artificially Generated Data

Choi (2000) designed an artificial dataset, built by concatenating short pieces of text extracted from the Brown corpus. Each test sample from this dataset consists of ten segments, where each segment contains the first n sentences (3 ≤ n ≤ 11) of a randomly selected document from the Brown corpus. From this dataset, we randomly chose 100 test samples for our evaluation, where the length of a segment varies between 3 and 11 sentences. A construction of this kind is sketched below.
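The following is an illustrative sketch of how such a sample can be generated, not Choi's original generator: documents is assumed to be a list of documents, each given as a list of sentences, e.g. drawn from the Brown corpus.

```python
import random

def make_sample(documents, n_segments=10, min_len=3, max_len=11, seed=0):
    """Concatenate the first n sentences (min_len <= n <= max_len) of
    randomly chosen documents; returns the synthetic text (a sentence list)
    and the reference boundary positions. Documents shorter than n simply
    contribute all of their sentences."""
    rng = random.Random(seed)
    sample, boundaries = [], []
    for _ in range(n_segments):
        doc = rng.choice(documents)
        n = rng.randint(min_len, max_len)
        sample.extend(doc[:n])
        boundaries.append(len(sample))  # boundary after this segment
    return sample, boundaries[:-1]      # the last position is the text end
```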
3.2 TDT Data

One of the commonly used data sets for topic segmentation emerged from the Topic Detection and Tracking (TDT) project, which includes the task of story segmentation, i.e. segmenting a stream of news data into topically cohesive stories. As part of the TDT initiative, several datasets of news stories have been created. In our evaluation, we used a subset of 28 documents randomly selected from the TDT Phase 2 (TDT2) collection, where a document contains an average of 24.67 segments.

3.3 Meeting Transcripts

The third dataset used in our evaluation contains 25 meeting transcripts from the ICSI-MR corpus (Janin et al., 2004). The entire corpus contains high-quality close-talking microphone recordings of multi-party dialogues. Word-level transcriptions with utterance-level segmentations are also available. The gold standard for thematic segmentation has been kindly provided by Galley et al. (2003) and was obtained by considering the agreement between at least three human annotations. Each meeting is thus divided into contiguous major topic segments and contains an average of 7.32 segments.

Note that thematic segmentation of meeting data is a more challenging task, as the thematic transitions are subtler than those in TDT data.

4 Evaluation Metrics

In this section, we look in detail at the error metrics that have been proposed in previous studies and examine their inadequacies. In addition, we propose a new evaluation metric that we consider more appropriate.

4.1 Pk Metric

Passonneau and Litman (1996) and Beeferman et al. (1999) underlined that the standard evaluation metrics of precision and recall are inadequate for thematic segmentation, namely because they do not account for how far a hypothesized boundary (i.e. a boundary found by the automatic procedure) is from a reference boundary (i.e. a boundary found in the reference data). It is desirable that an algorithm that places a boundary, for instance, just one utterance away from the reference boundary be penalized less than an algorithm that places a boundary two (or more) utterances away. Hence Beeferman et al. (1999) proposed a new metric, called P_D, that allows for a slight vagueness in where boundaries lie. More specifically, they define P_D as follows (here ref denotes a correct segmentation and hyp a segmentation proposed by a text segmentation system; we keep this notation in the equations introduced below):

$$P_D(\mathrm{ref}, \mathrm{hyp}) = \sum_{1 \le i \le j \le N} D(i,j)\,\left[\delta_{\mathrm{ref}}(i,j) \,\bar{\oplus}\, \delta_{\mathrm{hyp}}(i,j)\right].$$

N is the number of words in the reference data. The function δref(i,j) evaluates to one if the two reference corpus indices specified by its parameters i and j belong to the same segment, and zero otherwise. Similarly, the function δhyp(i,j) evaluates to one if the two indices are hypothesized by the automatic procedure to belong to the same segment, and zero otherwise. The ⊕̄ operator is the XNOR function, 'both or neither'. D(i,j) is a "distance probability distribution over the set of possible distances between sentences chosen randomly from the corpus". In practice, a distribution D having "all its probability mass at a fixed distance k" (Beeferman et al., 1999) was adopted and the metric P_D was thus renamed P_k.

In the framework of the TDT initiative, Allan et al. (1998) give the following formal definition of P_k and its components:

$$P_k = P_{\mathrm{Miss}} \cdot P_{\mathrm{seg}} + P_{\mathrm{FalseAlarm}} \cdot (1 - P_{\mathrm{seg}}),$$

$$P_{\mathrm{Miss}} = \frac{\sum_{i=1}^{N-k} \delta_{\mathrm{hyp}}(i, i+k)\,\bigl(1 - \delta_{\mathrm{ref}}(i, i+k)\bigr)}{\sum_{i=1}^{N-k} \bigl(1 - \delta_{\mathrm{ref}}(i, i+k)\bigr)}, \qquad
P_{\mathrm{FalseAlarm}} = \frac{\sum_{i=1}^{N-k} \bigl(1 - \delta_{\mathrm{hyp}}(i, i+k)\bigr)\,\delta_{\mathrm{ref}}(i, i+k)}{\sum_{i=1}^{N-k} \delta_{\mathrm{ref}}(i, i+k)},$$

and Pseg is the a priori probability that in the reference data a boundary occurs within an interval of k words. P_k is therefore calculated by moving a window of a certain width k, where k is usually set to half of the average number of words per segment in the gold standard.

Pevzner and Hearst (2002) highlighted several problems of the P_k metric. We illustrate below what we consider its main problems, based on two examples. Let r(i,k) be the number of boundaries between positions i and i + k in the gold standard segmentation and h(i,k) be the number of boundaries between positions i and i + k in the automatically hypothesized segmentation.

* Example 1: If r(i,k) = 2 and h(i,k) = 1, then obviously a missing boundary should be counted in P_k, i.e. PMiss should be increased.

* Example 2: If r(i,k) = 1 and h(i,k) = 2, then obviously PFalseAlarm should be increased.

However, considering the first example, we obtain δref(i,i+k) = 0 and δhyp(i,i+k) = 0, and consequently PMiss is not increased. Taking the case from the second example, we again obtain δref(i,i+k) = 0 and δhyp(i,i+k) = 0, involving no increase of PFalseAlarm.

In (TDT, 1998), a slightly different definition is given for the P_k metric: the definition of the miss and false alarm probabilities is replaced with

$$P'_{\mathrm{Miss}} = \frac{\sum_{i=1}^{N-k} \bigl(1 - \Omega_{\mathrm{hyp}}(i, i+k)\bigr)\,\bigl(1 - \delta_{\mathrm{ref}}(i, i+k)\bigr)}{\sum_{i=1}^{N-k} \bigl(1 - \delta_{\mathrm{ref}}(i, i+k)\bigr)}, \qquad
P'_{\mathrm{FalseAlarm}} = \frac{\sum_{i=1}^{N-k} \bigl(1 - \Omega_{\mathrm{hyp}}(i, i+k)\bigr)\,\delta_{\mathrm{ref}}(i, i+k)}{\sum_{i=1}^{N-k} \delta_{\mathrm{ref}}(i, i+k)},$$

where

$$\Omega_{\mathrm{hyp}}(i,j) = \begin{cases} 1, & \text{if ref and hyp contain the same number of boundaries between } i \text{ and } j,\\ 0, & \text{otherwise.} \end{cases}$$

We refer to this new definition of P_k as P'_k. With the definition of P'_k and the first example above, we obtain δref(i,i+k) = 0 and Ωhyp(i,i+k) = 0, and thus P'Miss is correctly increased. However, for the case of example 2, we obtain δref(i,i+k) = 0 and Ωhyp(i,i+k) = 0, involving no increase of P'FalseAlarm and an erroneous increase of P'Miss.
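For reference, the fixed-k form of the metric can be implemented in a few lines. The sketch below computes the commonly used unweighted variant, which simply counts disagreeing windows and treats misses and false alarms symmetrically, rather than the Pseg-weighted TDT cost; ref and hyp are assumed to be per-unit segment labels, with n > k.

```python
def p_k(ref, hyp, k=None):
    """Unweighted P_k via the moving-window formulation: count the windows
    (i, i+k) on which the two segmentations disagree about whether the two
    window ends lie in the same segment."""
    n = len(ref)
    if k is None:
        # half the average reference segment length, as is conventional
        k = max(1, n // (2 * len(set(ref))))
    errors = sum(1 for i in range(n - k)
                 if (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k]))
    return errors / (n - k)
```

Running this on the two examples above confirms the behaviour described: a window in which both segmentations contain at least one boundary contributes no error, no matter how many boundaries each actually contains.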
4.2 WindowDiff Metric

Pevzner and Hearst (2002) propose an alternative metric called WindowDiff. Keeping the notations r(i,k) and h(i,k) introduced in subsection 4.1, WindowDiff is defined as:

$$\mathrm{WindowDiff}(\mathrm{ref}, \mathrm{hyp}) = \frac{1}{N-k} \sum_{i=1}^{N-k} \mathbf{1}\bigl[\,|r(i,k) - h(i,k)| > 0\,\bigr].$$

Like P_k and P'_k, WindowDiff is computed by moving a window of fixed size across the test set and penalizing the algorithm for missed or erroneously detected boundaries. However, unlike P_k and P'_k, WindowDiff takes into account how many boundaries fall within the window: it penalizes according to "how many discrepancies occur between the reference and the system results" rather than "determining how often two units of text are incorrectly labeled as being in different segments" (Pevzner and Hearst, 2002).

Our critique of WindowDiff is that misses are penalised less than false alarms, which we argue as follows. WindowDiff can be rewritten as:

$$\mathrm{WindowDiff} = \frac{1}{N-k} \sum_{i=1}^{N-k} \mathbf{1}\bigl[r(i,k) > h(i,k)\bigr] + \frac{1}{N-k} \sum_{i=1}^{N-k} \mathbf{1}\bigl[r(i,k) < h(i,k)\bigr].$$

Hence both misses and false alarms are weighted by 1/(N−k). Note that, on the one hand, there are indeed (N−k) equiprobable possibilities for a false alarm in an interval of k units. On the other hand, the total number of equiprobable possibilities for a miss in an interval of k units is smaller than (N−k), since it depends on the number of reference boundaries (i.e. we can have a miss in an interval of k units only if the reference corpus contains at least one boundary in that interval). Therefore misses, being weighted by 1/(N−k), are penalised less than false alarms.

Let Bref be the number of thematic boundaries in the reference data, and suppose the reference data contains about 20% boundaries and 80% non-boundaries out of the total number of potential boundary positions. Since there are relatively few boundaries compared with non-boundaries, a strategy introducing no false alarms but a maximum number of misses (i.e. k · Bref misses) can be judged as about 80% correct by the WindowDiff measure. On the other hand, a segmentation with no misses but a maximum number of false alarms (i.e. (N − k) false alarms) is judged as 100% erroneous by the WindowDiff measure. That is, misses and false alarms are not equally penalised.

Another issue regarding WindowDiff is that it is not clear "how does one interpret the values produced by the metric" (Pevzner and Hearst, 2002).
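WindowDiff itself is equally compact. The sketch below assumes segmentations are given as sets of boundary positions, where a boundary at position p separates units p−1 and p:

```python
def window_diff(ref_bounds, hyp_bounds, n, k):
    """WindowDiff: the fraction of length-k windows in which the reference
    and hypothesized boundary counts differ (n is the number of units)."""
    def count(bounds, i):
        # boundaries strictly between unit i and unit i+k
        return sum(1 for b in bounds if i < b <= i + k)
    errors = sum(1 for i in range(n - k)
                 if count(ref_bounds, i) != count(hyp_bounds, i))
    return errors / (n - k)
```

The asymmetry discussed above is visible directly in the code: the normalisation by (n − k) is applied uniformly, even though only the windows containing a reference boundary can ever contribute a miss.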
4.3 Proposal for a New Metric

In order to address the inadequacies of P_k and WindowDiff, we propose a new evaluation metric, defined as follows:

$$\Pr{}_{\mathrm{error}} = C_{\mathrm{miss}} \cdot \Pr{}_{\mathrm{miss}} + C_{\mathrm{fa}} \cdot \Pr{}_{\mathrm{fa}},$$

$$\Pr{}_{\mathrm{miss}} = \frac{\sum_{i=1}^{N-k} \mathbf{1}\bigl[h(i,k) < r(i,k)\bigr]}{\sum_{i=1}^{N-k} \mathbf{1}\bigl[r(i,k) \ge 1\bigr]}, \qquad
\Pr{}_{\mathrm{fa}} = \frac{\sum_{i=1}^{N-k} \mathbf{1}\bigl[h(i,k) > r(i,k)\bigr]}{N-k}.$$

Prmiss can be interpreted as the probability that the hypothesized segmentation contains fewer boundaries than the reference segmentation in an interval of k units, conditioned on the fact that the reference segmentation contains at least one boundary in that interval. Analogously, Prfa is the probability that the hypothesized segmentation contains more boundaries than the reference segmentation in an interval of k units.

For applications where misses are more important than false alarms, or vice versa, Prerror can be adjusted to handle this trade-off via the Cfa and Cmiss parameters. In order to have Prerror ∈ [0,1], we suggest that Cfa and Cmiss be chosen such that Cfa + Cmiss = 1. By choosing Cfa = Cmiss = 1/2, the penalization of misses and false alarms is balanced. In consequence, a strategy that places no boundaries at all is penalized as much as a strategy proposing boundaries everywhere (i.e. after every unit): both such degenerate algorithms have an error rate Prerror of about 50%. The worst algorithm, penalised with an error rate Prerror of 100% when k = 2, is the one that places boundaries everywhere except where reference boundaries exist.
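A sketch consistent with the definitions above is given below; the normalisations follow the probabilistic interpretations stated in the text, and boundary positions are represented as in the WindowDiff sketch.

```python
def pr_error(ref_bounds, hyp_bounds, n, k, c_miss=0.5, c_fa=0.5):
    """Cost-weighted error: Pr_miss is conditioned on windows containing at
    least one reference boundary; Pr_fa is taken over all windows.
    c_miss + c_fa should equal 1 so that the result lies in [0, 1]."""
    def count(bounds, i):
        return sum(1 for b in bounds if i < b <= i + k)
    windows = range(n - k)
    with_ref = [i for i in windows if count(ref_bounds, i) >= 1]
    misses = sum(1 for i in with_ref
                 if count(hyp_bounds, i) < count(ref_bounds, i))
    false_alarms = sum(1 for i in windows
                       if count(hyp_bounds, i) > count(ref_bounds, i))
    pr_miss = misses / len(with_ref) if with_ref else 0.0
    pr_fa = false_alarms / (n - k)
    return c_miss * pr_miss + c_fa * pr_fa

# Degenerate strategies with c_miss = c_fa = 0.5:
#   hyp_bounds = set()                  -> Pr_miss = 1, Pr_fa = 0, error = 0.5
#   hyp_bounds = {1, ..., n-1} (all)    -> Pr_miss = 0, Pr_fa ~ 1, error ~ 0.5
```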