<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1004">
<Title>Minimum Cut Model for Spoken Lecture Segmentation</Title>
<Section position="5" start_page="25" end_page="26" type="metho">
<SectionTitle> 3 Minimum Cut Framework </SectionTitle>
<Paragraph position="0"> Linguistic research has shown that word repetition within a particular section of a text is a device for creating thematic cohesion (Halliday and Hasan, 1976), and that changes in lexical distribution usually signal topic transitions.</Paragraph>
<Paragraph position="1"> [Figure 1: Sentence similarity plot for a Physics lecture, with vertical lines indicating true segment boundaries.]</Paragraph>
<Paragraph position="2"> Figure 1 illustrates these properties in a lecture transcript from an undergraduate Physics class. We use the Dotplotting representation of Church (1993) and plot the cosine similarity scores between every pair of sentences in the text. The intensity of a point (i,j) on the plot indicates the degree to which the i-th sentence in the text is similar to the j-th sentence. The true segment boundaries are denoted by vertical lines. This similarity plot reveals a block structure in which true boundaries delimit blocks of text with high inter-sentential similarity. Sentences found in different blocks, on the other hand, tend to exhibit low similarity.</Paragraph>
<Paragraph position="4"> Formalizing the Objective: Whereas previous unsupervised approaches to segmentation rested on intuitive notions of similarity density, we formalize the objective of text segmentation through cuts on graphs. We aim to jointly maximize the intra-segmental similarity and minimize the similarity between different segments. In other words, we want to find the segmentation with a maximally homogeneous set of segments that are also maximally different from each other.</Paragraph>
<Paragraph position="5"> Let G = {V, E} be an undirected, weighted graph, where V is the set of nodes corresponding to sentences in the text and E is the set of weighted edges (see Figure 2). The edge weights, w(u,v), define a measure of similarity between pairs of nodes u and v, where higher scores indicate higher similarity. Section 4 provides more details on graph construction.</Paragraph>
<Paragraph position="6"> We consider the problem of partitioning the graph into two disjoint sets of nodes, A and B. We aim to minimize the cut, defined as the total weight of the edges crossing between the two sets of nodes. In other words, we want to split the sentences into two maximally dissimilar classes by choosing A and B to minimize:</Paragraph>
<Paragraph position="7"> cut(A, B) = \sum_{u \in A,\, v \in B} w(u, v)</Paragraph>
<Paragraph position="8"> However, we need to ensure not only that the two partitions are maximally different from each other, but also that they are themselves homogeneous, by accounting for intra-partition node similarity. We formulate this requirement in the framework of normalized cuts (Shi and Malik, 2000), where the cut value is normalized by the volume of the corresponding partitions. The volume of a partition is the sum of the weights of its edges to the whole graph:</Paragraph>
<Paragraph position="9"> vol(A) = \sum_{u \in A,\, v \in V} w(u, v)</Paragraph>
<Paragraph position="10"> The normalized cut criterion (Ncut) is then defined as follows:</Paragraph>
<Paragraph position="11"> Ncut(A, B) = \frac{cut(A, B)}{vol(A)} + \frac{cut(A, B)}{vol(B)}</Paragraph>
<Paragraph position="12"> By minimizing this objective we simultaneously minimize the similarity across partitions and maximize the similarity within partitions.</Paragraph>
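To make the objective concrete, here is a minimal sketch of the two-way Ncut computation, assuming a precomputed symmetric similarity matrix W; the function and variable names are ours, not part of the paper's implementation:

```python
import numpy as np

def ncut(W, A, B):
    """Normalized cut value for a two-way partition (A, B) of the nodes.

    W is a symmetric matrix of edge weights w(u, v); A and B are
    disjoint index arrays that together cover all nodes.
    """
    cut = W[np.ix_(A, B)].sum()   # total weight of edges crossing the partition
    vol_A = W[A, :].sum()         # weight of A's edges to the whole graph
    vol_B = W[B, :].sum()
    return cut / vol_A + cut / vol_B

# Toy example: six sentences forming two homogeneous blocks.
W = np.array([[1.0, 0.9, 0.8, 0.1, 0.1, 0.2],
              [0.9, 1.0, 0.7, 0.2, 0.1, 0.1],
              [0.8, 0.7, 1.0, 0.1, 0.2, 0.1],
              [0.1, 0.2, 0.1, 1.0, 0.8, 0.9],
              [0.1, 0.1, 0.2, 0.8, 1.0, 0.7],
              [0.2, 0.1, 0.1, 0.9, 0.7, 1.0]])
print(ncut(W, np.arange(0, 3), np.arange(3, 6)))  # ~0.27: the block boundary
print(ncut(W, np.arange(0, 2), np.arange(2, 6)))  # ~0.57: an off-boundary split
```

On this toy matrix, splitting at the block boundary yields a much lower Ncut value than an off-boundary split, which is exactly the behavior the criterion rewards.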
<Paragraph position="13"> This formulation also allows us to decompose the objective into a sum of individual terms and to formulate a dynamic programming solution to the multiway cut problem. The criterion extends naturally to a k-way normalized cut:</Paragraph>
<Paragraph position="14"> Ncut_k(V) = \sum_{i=1}^{k} \frac{cut(A_i, V \setminus A_i)}{vol(A_i)}</Paragraph>
<Paragraph position="15"> where A_1, ..., A_k form a partition of the graph, and V \setminus A_k is the set difference between the entire graph and partition k.</Paragraph>
<Paragraph position="16"> Decoding: Papadimitriou proved that the problem of minimizing normalized cuts on graphs is NP-complete (Shi and Malik, 2000). In our case, however, the multiway cut is constrained to preserve the linearity of the segmentation. By segmentation linearity, we mean that all of the nodes between the leftmost and the rightmost nodes of a particular partition have to belong to that partition. With this constraint, we formulate a dynamic programming algorithm for exactly finding the minimum normalized multiway cut in polynomial time:</Paragraph>
<Paragraph position="17">
C[i, k] = \min_{i \le j \le k} \left[ C[i-1, j-1] + \frac{cut(A_{j,k}, V \setminus A_{j,k})}{vol(A_{j,k})} \right] \quad (1)
B[i, k] = \arg\min_{i \le j \le k} \left[ C[i-1, j-1] + \frac{cut(A_{j,k}, V \setminus A_{j,k})}{vol(A_{j,k})} \right] \quad (2)
C[0, 0] = 0 \quad (3)
C[0, k] = \infty \quad \text{for } k > 0 \quad (4)
</Paragraph>
<Paragraph position="18"> C[i,k] is the normalized cut value of the optimal segmentation of the first k sentences into i segments. The i-th segment, A_{j,k}, begins at node u_j and ends at node u_k. B[i,k] is the back-pointer table from which we recover the optimal sequence of segment boundaries. Equations 3 and 4 capture, respectively, the condition that the normalized cut value of the trivial segmentation of an empty text is zero, and the constraint that the first segment must start with the first node.</Paragraph>
<Paragraph position="19"> The time complexity of the dynamic programming algorithm is O(KN^2), where K is the number of partitions and N is the number of nodes in the graph, i.e., sentences in the transcript.</Paragraph>
</Section>
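The following sketch implements the dynamic program under the linearity constraint, again assuming a precomputed symmetric similarity matrix W. Prefix sums make each cut and volume lookup O(1), keeping the whole procedure within the O(KN^2) bound; the helper names and exact indexing scheme are our own choices, not the paper's code:

```python
import numpy as np

def min_ncut_segmentation(W, K):
    """Exact minimum normalized multiway cut for a linear segmentation.

    W: symmetric (N x N) matrix of edge weights w(u, v).
    K: number of segments.
    Returns the 0-based start indices of the K segments.
    """
    N = len(W)
    # row_pref[k] = total weight of rows 0..k-1; box[a, b] = sum of W[0:a, 0:b].
    row_pref = np.concatenate([[0.0], W.sum(axis=1).cumsum()])
    box = np.zeros((N + 1, N + 1))
    box[1:, 1:] = W.cumsum(axis=0).cumsum(axis=1)

    def term(lo, hi):
        # Normalized-cut term of the segment spanning sentences lo..hi (inclusive).
        vol = row_pref[hi + 1] - row_pref[lo]
        internal = box[hi + 1, hi + 1] - box[lo, hi + 1] - box[hi + 1, lo] + box[lo, lo]
        return (vol - internal) / vol if vol > 0 else 0.0  # cut = vol - internal

    C = np.full((K + 1, N + 1), np.inf)   # C[i, k]: first k sentences, i segments
    B = np.zeros((K + 1, N + 1), dtype=int)
    C[0, 0] = 0.0                         # base cases (Eqs. 3 and 4 above)
    for i in range(1, K + 1):
        for k in range(i, N + 1):
            for j in range(i, k + 1):     # i-th segment covers sentences j-1..k-1
                cand = C[i - 1, j - 1] + term(j - 1, k - 1)
                if cand < C[i, k]:
                    C[i, k], B[i, k] = cand, j - 1
    # Recover segment start indices from the back-pointer table.
    starts, k = [], N
    for i in range(K, 0, -1):
        starts.append(int(B[i, k]))
        k = starts[-1]
    return starts[::-1]
```

Applied to the toy matrix from the previous sketch, `min_ncut_segmentation(W, 2)` returns `[0, 3]`, recovering the block boundary.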
<Section position="6" start_page="26" end_page="27" type="metho">
<SectionTitle> 4 Building the Graph </SectionTitle>
<Paragraph position="0"> Clearly, the performance of our model depends on the underlying representation, the definition of the pairwise similarity function, and various other model parameters. In this section we provide further details on the graph construction process.</Paragraph>
<Paragraph position="1"> Preprocessing: Before building the graph, we apply standard preprocessing techniques to the text. We stem words with the Porter stemmer (Porter, 1980) to alleviate the sparsity of word counts through stem equivalence classes, and we remove words matching a prespecified list of stop words.</Paragraph>
<Paragraph position="2"> Graph Topology: As noted in the previous section, the normalized cut criterion considers long-term similarity relationships between nodes. This effect is achieved by constructing a fully connected graph. However, considering all pairwise relations in a long text may be detrimental to segmentation accuracy, so we discard edges between sentences separated by more than a threshold distance. This reduction in graph size also yields computational savings.</Paragraph>
<Paragraph position="3"> Similarity Computation: In computing pairwise sentence similarities, sentences are represented as vectors of word counts. Cosine similarity is commonly used in text segmentation (Hearst, 1994). To avoid numerical precision issues when summing a series of very small scores, we compute exponentiated cosine similarity scores between pairs of sentence vectors:</Paragraph>
<Paragraph position="4"> w(s_i, s_j) = e^{cos(s_i, s_j)}, \quad cos(s_i, s_j) = \frac{s_i \cdot s_j}{\|s_i\|\,\|s_j\|}</Paragraph>
<Paragraph position="5"> We further refine our analysis by smoothing the similarity metric. When comparing two sentences, we also take into account the similarity between their immediate neighborhoods. The smoothing is achieved by adding the counts of words that occur in adjoining sentences to the current sentence's feature vector, weighted in accordance with their distance from the current sentence:</Paragraph>
<Paragraph position="6"> \tilde{s}_i = \sum_{j=i}^{i+k} e^{-\alpha (j - i)} s_j</Paragraph>
<Paragraph position="7"> where the s_j are vectors of word counts, k is the length of the smoothing window, and \alpha is a parameter that controls the degree of smoothing.</Paragraph>
<Paragraph position="8"> In the formulation above we use sentences as our nodes. However, we can also represent graph nodes with non-overlapping blocks of words of fixed length. This is desirable, since the lecture transcripts lack sentence boundary markers, and short utterances can skew the cosine similarity scores. The optimal block length is tuned on a heldout development set.</Paragraph>
<Paragraph position="9"> Lexical Weighting: Previous research has shown that weighting schemes play an important role in segmentation performance (Ji and Zha, 2003; Choi et al., 2001). Of particular concern are words that may not be common in general English discourse but that occur throughout the text of a particular lecture or subject. For example, in a lecture about support vector machines, occurrences of the term "SVM" convey little information about the distribution of sub-topics, even though "SVM" is a fairly rare term in general English and bears much semantic content. The same words can convey varying degrees of information across different lectures, so term weighting specific to individual lectures becomes important in the similarity computation.</Paragraph>
<Paragraph position="10"> To address this issue, we introduce a variation on the tf-idf scoring scheme used in the information retrieval literature (Salton and Buckley, 1988). A transcript is split uniformly into N chunks, and each chunk serves as the equivalent of a document in the tf-idf computation. The weights are computed separately for each transcript, since topic and word distributions vary across lectures.</Paragraph>
</Section>
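Putting the graph-construction steps together, here is a sketch of the edge-weight computation: exponentially decayed smoothing of the count vectors (one-sided, matching our reading of the formula above), exponentiated cosine similarity, and removal of edges beyond a threshold distance. The parameter values are illustrative defaults, not the tuned settings described in Section 5.1:

```python
import numpy as np

def smooth(counts, alpha=0.5, window=2):
    """Add exponentially down-weighted counts from the following
    `window` sentences to each sentence's count vector."""
    N = len(counts)
    smoothed = np.zeros_like(counts, dtype=float)
    for i in range(N):
        for j in range(i, min(i + window + 1, N)):
            smoothed[i] += np.exp(-alpha * (j - i)) * counts[j]
    return smoothed

def edge_weights(counts, alpha=0.5, window=2, max_dist=25):
    """Exponentiated cosine similarities between smoothed sentence
    vectors; edges between sentences more than max_dist apart are
    discarded (weight 0), sparsifying the graph."""
    s = smooth(counts, alpha, window)
    norms = np.maximum(np.linalg.norm(s, axis=1, keepdims=True), 1e-12)
    cos = (s @ s.T) / (norms * norms.T)
    W = np.exp(cos)
    idx = np.arange(len(s))
    W[np.abs(idx[:, None] - idx[None, :]) > max_dist] = 0.0
    return W
```

Here `counts` would be an (N x V) matrix of per-sentence word counts produced after stemming and stop-word removal; the lecture-internal tf-idf weights from the Lexical Weighting paragraph could be folded in by rescaling the columns of `counts` before smoothing.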
<Section position="7" start_page="27" end_page="29" type="metho">
<SectionTitle> 5 Evaluation Set-Up </SectionTitle>
<Paragraph position="0"> In this section we present the different corpora used to evaluate our model and provide a brief overview of the evaluation metrics. We then describe our human segmentation study on the corpus of spoken lecture data.</Paragraph>
<Section position="1" start_page="27" end_page="27" type="sub_section">
<SectionTitle> 5.1 Parameter Estimation </SectionTitle>
<Paragraph position="0"> A heldout development set of three lectures is used for estimating the optimal word block length for representing nodes, the threshold distance for discarding node edges, the number of uniform chunks for estimating tf-idf lexical weights, the \alpha parameter for smoothing, and the length of the smoothing window. We use a simple greedy search procedure for optimizing the parameters.</Paragraph>
</Section>
<Section position="2" start_page="27" end_page="28" type="sub_section">
<SectionTitle> 5.2 Corpora </SectionTitle>
<Paragraph position="0"> We evaluate our segmentation algorithm on three sets of data. Two of the datasets are new segmentation collections that we have compiled for this study,[1] and the remaining set is a standard collection previously used for the evaluation of segmentation algorithms. Corpus statistics for the new datasets are presented in Table 1. Below we briefly describe each corpus.</Paragraph>
<Paragraph position="1"> Physics Lectures: Our first corpus consists of spoken lecture transcripts from an undergraduate Physics class. In contrast to other segmentation datasets, our corpus contains much longer texts: a typical 90-minute lecture has 500 to 700 sentences and about 8,500 words, corresponding to roughly 15 pages of raw text. We have access to both manual transcriptions of these lectures and the output of an automatic speech recognition system. The word error rate of the latter is 19.4%,[2] which is representative of state-of-the-art performance on lecture material (Leeuwis et al., 2003). The Physics lecture transcript segmentations were produced by the teaching staff of the introductory Physics course at the Massachusetts Institute of Technology. Their objective was to facilitate access to lecture recordings available on the class website. This segmentation conveys the high-level topical structure of the lectures: on average, a lecture was annotated with six segments, and a typical segment corresponds to two pages of a transcript.</Paragraph>
<Paragraph position="2"> Artificial Intelligence Lectures: Our second lecture corpus differs in subject matter, lecturing style, and segmentation granularity. The graduate Artificial Intelligence class has, on average, twelve segments per lecture, and a typical segment is about half a page; one segment roughly corresponds to the content of a slide. In this case, the segmentation was obtained from the lecturer herself, who went through the transcripts of the lecture recordings and segmented them so that the segments correspond to her presentation slides.</Paragraph>
<Paragraph position="3"> Due to the low recording quality, we were unable to obtain ASR transcripts for this class; we therefore use only manual transcriptions of these lectures.</Paragraph>
<Paragraph position="4"> Synthetic Corpus: As part of our analysis, we also used the synthetic corpus created by Choi (2000), which is commonly used in the evaluation of segmentation algorithms. This corpus consists of concatenated segments randomly sampled from the Brown corpus; the segments range from three to eleven sentences in length. It is important to note that the lexical transitions in these concatenated texts are very sharp, since the segments come from texts written in widely varying language styles on completely different topics.</Paragraph>
<Paragraph position="5"> [2] A speaker-dependent model of the lecturer was trained on 38 hours of lectures from other courses using the SUMMIT segment-based speech recognizer (Glass, 2003).</Paragraph>
</Section>
<Section position="3" start_page="28" end_page="28" type="sub_section">
<SectionTitle> 5.3 Evaluation Metric </SectionTitle>
<Paragraph position="0"> We use the Pk and WindowDiff measures to evaluate our system (Beeferman et al., 1999; Pevzner and Hearst, 2002). The Pk measure estimates the probability that a randomly chosen pair of words within a window of length k words is inconsistently classified. The WindowDiff metric is a variant of the Pk measure that penalizes false positives on an equal basis with near misses. Both metrics are defined with respect to the average segment length of the texts and exhibit high variability on real data. We follow Choi (2000) and compute the mean segment length used in determining the parameter k on each reference text separately.</Paragraph>
<Paragraph position="1"> We also plot the Receiver Operating Characteristic (ROC) curve to gauge performance at a finer level of discrimination (Swets, 1988). The ROC plot shows the true positive rate against the false positive rate for various settings of a decision criterion. In our case, the true positive rate is the fraction of boundaries correctly classified, and the false positive rate is the fraction of non-boundary positions incorrectly classified as boundaries. In computing these rates, we vary the threshold distance to the true boundary within which a hypothesized boundary is considered correct. A larger area under the ROC curve indicates better discriminative performance.</Paragraph>
</Section>
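For reference, here is a sketch of the Pk computation as standardly defined (Beeferman et al., 1999); segmentations are encoded as boundary indicator sequences, and k defaults to half the mean reference segment length, following the convention described above. The encoding and names are our own choices:

```python
def pk(reference, hypothesis, k=None):
    """Pk: probability that two positions k apart are inconsistently
    classified (same segment vs. different segments) by the reference
    and the hypothesis.

    reference, hypothesis: equal-length sequences of 0/1 indicators,
    where 1 marks the last position of a segment.
    """
    n = len(reference)
    if k is None:
        num_segs = max(1, sum(reference))
        k = max(1, round(n / num_segs / 2))  # half the mean segment length

    def same_segment(seq, i, j):
        # True if no boundary falls strictly between positions i and j.
        return sum(seq[i:j]) == 0

    errors = sum(
        same_segment(reference, i, i + k) != same_segment(hypothesis, i, i + k)
        for i in range(n - k)
    )
    return errors / (n - k)
```

For a 500-sentence lecture with six reference segments, k would be roughly 42 positions. WindowDiff differs in that it compares the number of boundaries inside each window rather than a binary same-segment indicator, which is why it penalizes false positives as heavily as near misses.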
<Section position="4" start_page="28" end_page="29" type="sub_section">
<SectionTitle> 5.4 Human Segmentation Study </SectionTitle>
<Paragraph position="0"> Spoken lectures are very different in style from other corpora used in human segmentation studies (Hearst, 1994; Galley et al., 2003). We are interested in analyzing human performance on a corpus of lecture transcripts with much longer texts and a less clear-cut concept of a sub-topic. We define a segment to be a sub-topic that signals a prominent shift in subject matter; disregarding this sub-topic change would impair the high-level understanding of the structure and the content of the lecture.</Paragraph>
<Paragraph position="1"> As part of our human segmentation analysis, we asked three annotators to segment the Physics lecture corpus. These annotators had taken the class in the past and were familiar with the subject matter under consideration. We wrote a detailed instruction manual for the task, with annotation guidelines for the most part following the model used by Gruenstein et al. (2005). The annotators were instructed to segment at a level of granularity that would identify most of the prominent topical transitions necessary for a summary of the lecture. The annotators used the NOMOS annotation software toolkit, developed for meeting segmentation (Gruenstein et al., 2005), and were provided with recorded audio of the lectures and the corresponding text transcriptions. We intentionally did not provide the subjects with the target number of boundaries, since we wanted to see whether the annotators would converge on a common segmentation granularity.</Paragraph>
<Paragraph position="2"> Table 2 presents the annotator segmentation statistics. We see two classes of segmentation granularity. The original reference (O) and annotator A segmented at a coarse level, with an average of 6.6 and 8.9 segments per lecture, respectively. Annotators B and C operated at much finer levels of discrimination, with 18.4 and 13.8 segments per lecture on average. We conclude that multiple levels of granularity are acceptable in spoken lecture segmentation. This is expected given the length of the lectures and varying human judgments in selecting relevant topical content.</Paragraph>
<Paragraph position="3"> Following previous studies, we quantify the level of annotator agreement with the Pk measure (Gruenstein et al., 2005).[3] Table 3 shows the agreement scores between different pairs of annotators; Pk values range from 0.24 to 0.42. We observe greater consistency between annotators operating at similar levels of granularity, and less consistency across the two granularity classes. Note that annotator A operated at a level of granularity consistent with the original reference segmentation. Hence, the 0.24 Pk score serves as a benchmark against which we can compare the results attained by segmentation algorithms on the Physics lecture data.</Paragraph>
<Paragraph position="4"> As an additional point of reference, the uniform and random baseline segmentations attain Pk scores of 0.469 and 0.493, respectively, on the Physics lecture set.</Paragraph>
</Section>
</Section>
</Paper>