<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1003"> <Title>Unsupervised Topic Modelling for Multi-Party Spoken Discourse</Title>
<Section position="5" start_page="17" end_page="18" type="metho"> <SectionTitle> 3 Learning topics and segments </SectionTitle>
<Paragraph position="0"> We specify our model to address the problem of topic segmentation: attempting to break the discourse into discrete segments in which a particular set of topics is discussed. Assume we have a corpus of U utterances, ordered in sequence. The uth utterance consists of Nu words, chosen from a vocabulary of size W. The set of words associated with the uth utterance is denoted wu, and indexed as wu,i. The entire corpus is represented by w.</Paragraph>
<Paragraph position="1"> Following previous work on probabilistic topic models (Hofmann, 1999; Blei et al., 2003; Griffiths and Steyvers, 2004), we model each utterance as being generated from a particular distribution over topics, where each topic is a probability distribution over words. The utterances are ordered sequentially, and we assume a Markov structure on the distribution over topics: with high probability, the distribution for utterance u is the same as for utterance u-1; otherwise, we sample a new distribution over topics. This pattern of dependency is produced by associating a binary switching variable with each utterance, indicating whether its topic is the same as that of the previous utterance.</Paragraph>
<Paragraph position="2"> The joint states of all the switching variables define segments that should be semantically coherent, because their words are generated by the same topic vector. We will first describe this generative model in more detail, and then discuss inference in this model.</Paragraph>
<Section position="1" start_page="17" end_page="18" type="sub_section"> <SectionTitle> 3.1 A hierarchical Bayesian model </SectionTitle>
<Paragraph position="0"> We are interested in where changes occur in the set of topics discussed in these utterances. To this end, let cu indicate whether a change in the distribution over topics occurs at the uth utterance and let P(cu = 1) = π (where π thus defines the expected number of segments). The distribution over topics associated with the uth utterance will be denoted θ(u), and is a multinomial distribution over T topics, with the probability of topic t being θt(u).</Paragraph>
<Paragraph position="1"> If cu = 0, then θ(u) = θ(u-1). Otherwise, θ(u) is drawn from a symmetric Dirichlet distribution with parameter α. The distribution is thus:

P(\theta^{(u)} \mid c_u, \theta^{(u-1)}) = \begin{cases} \delta(\theta^{(u)}, \theta^{(u-1)}) & c_u = 0 \\ \dfrac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}} \prod_{t=1}^{T} \bigl(\theta^{(u)}_{t}\bigr)^{\alpha-1} & c_u = 1 \end{cases}

where δ(·,·) is the Dirac delta function, and Γ(·) is the generalized factorial function. This distribution is not well-defined when u = 1, so we set c1 = 1 and draw θ(1) from a symmetric Dirichlet(α) distribution accordingly.</Paragraph>
<Paragraph position="2"> [Figure 1: (a) the topic segmentation model and (b) the hidden Markov model used as a comparison.]</Paragraph>
<Paragraph position="3"> As in (Hofmann, 1999; Blei et al., 2003; Griffiths and Steyvers, 2004), each topic j is a multinomial distribution φ(j) over words, and the probability of the word w under that topic is φw(j). The uth utterance is generated by sampling a topic assignment zu,i for each word i in that utterance with P(zu,i = t) = θt(u), and then sampling the word wu,i from that topic, with P(wu,i = w | zu,i = j) = φw(j). A minimal code sketch of this generative process follows.</Paragraph>
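To make the generative story concrete, here is a minimal sketch in Python (our illustration, not the authors' code; the dimensions are illustrative, and the Beta(γ) and Dirichlet(β) priors on π and φ are those introduced in the next paragraph):

```python
import numpy as np

rng = np.random.default_rng(0)
T, W, U, N_u = 10, 1000, 200, 15          # illustrative sizes, not from the paper
alpha, beta, gamma = 0.01, 0.01, 0.01     # hyperparameters as fixed in Section 3.2

pi = rng.beta(gamma, gamma)                     # P(c_u = 1), from a symmetric Beta(gamma)
phi = rng.dirichlet(np.full(W, beta), size=T)   # topic-word distributions phi^(j)

utterances, c, theta = [], [], None
for u in range(U):
    c_u = 1 if u == 0 else int(rng.random() < pi)   # c_1 is fixed to 1
    if c_u:
        theta = rng.dirichlet(np.full(T, alpha))    # new distribution over topics
    c.append(c_u)
    z = rng.choice(T, size=N_u, p=theta)            # topic assignment z_{u,i} per word
    utterances.append([rng.choice(W, p=phi[t]) for t in z])
```

Runs of utterances with c_u = 0 share one θ and so form a semantically coherent segment.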
<Paragraph position="4"> If we assume that π is generated from a symmetric Beta(γ) distribution, and each φ(j) is generated from a symmetric Dirichlet(β) distribution, we obtain a joint distribution over all of these variables with the dependency structure shown in Figure 1A.</Paragraph>
</Section>
<Section position="2" start_page="18" end_page="18" type="sub_section"> <SectionTitle> 3.2 Inference </SectionTitle>
<Paragraph position="0"> Assessing the posterior probability distribution over topic changes c given a corpus w can be simplified by integrating out the parameters θ, φ, and π. According to Bayes' rule we have:

P(c \mid w) = \frac{P(w \mid c)\, P(c)}{\sum_{c'} P(w \mid c')\, P(c')}   (1)

Evaluating P(c) requires integrating over π. Specifically, we have:

P(c) = \int_{0}^{1} P(c \mid \pi)\, P(\pi)\, d\pi = \frac{\Gamma(2\gamma)}{\Gamma(\gamma)^{2}} \, \frac{\Gamma(n_{0}+\gamma)\,\Gamma(n_{1}+\gamma)}{\Gamma(n_{0}+n_{1}+2\gamma)}   (2)

where n0 and n1 are the numbers of utterances for which cu = 0 and cu = 1 respectively.</Paragraph>
<Paragraph position="1"> Evaluating P(w|c) requires summing over all possible topic assignments z, with P(w|c) = Σz P(w|z) P(z|c). Integrating over the topic-word distributions φ gives:

P(w \mid z) = \int_{\Delta_{W}^{T}} P(w \mid z, \phi)\, P(\phi)\, d\phi = \left(\frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}}\right)^{T} \prod_{t=1}^{T} \frac{\prod_{w} \Gamma(n^{(t)}_{w}+\beta)}{\Gamma(n^{(t)}_{\cdot}+W\beta)}   (3)

where Δ_W^T is the T-dimensional cross-product of the multinomial simplex on W points, n_w(t) is the number of times word w is assigned to topic t in z, and n_·(t) is the total number of words assigned to topic t in z. To evaluate P(z|c) we have:

P(z \mid c) = \int P(z \mid \theta)\, P(\theta \mid c)\, d\theta   (4)

The fact that the cu variables effectively divide the sequence of utterances into segments that use the same distribution over topics simplifies solving the integral, and we obtain:

P(z \mid c) = \prod_{u\,:\,c_u = 1} \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}} \, \frac{\prod_{t} \Gamma(n^{(S_u)}_{t}+\alpha)}{\Gamma(n^{(S_u)}_{\cdot}+T\alpha)}   (5)

where Su denotes the set of utterances that share the same topic distribution (i.e. belong to the same segment) as u, and n_t(Su) is the number of times topic t appears in the segment Su (i.e. in the values of zu' corresponding to u' ∈ Su).</Paragraph>
<Paragraph position="2"> Equations 2, 3, and 5 allow us to evaluate the numerator of the expression in Equation 1. However, computing the denominator is intractable. Consequently, we sample from the posterior distribution P(z,c|w) using Markov chain Monte Carlo (MCMC) (Gilks et al., 1996). We use Gibbs sampling, drawing the topic assignment for each word, zu,i, conditioned on all other topic assignments, z-(u,i), all topic change indicators, c, and all words, w; and then drawing the topic change indicator for each utterance, cu, conditioned on all other topic change indicators, c-u, all topic assignments z, and all words w.</Paragraph>
<Paragraph position="3"> The conditional probabilities we need can be derived directly from Equations 2, 3, and 5. The conditional probability of zu,i indicates the probability that wu,i should be assigned to a particular topic, given other assignments, the current segmentation, and the words in the utterances. Cancelling constant terms, we obtain:

P(z_{u,i} = t \mid z_{-(u,i)}, c, w) \propto \bigl(n^{(S_u)}_{t}+\alpha\bigr)\, \frac{n^{(t)}_{w_{u,i}}+\beta}{n^{(t)}_{\cdot}+W\beta}   (6)

where all counts (i.e. the n terms) exclude zu,i.</Paragraph>
<Paragraph position="4"> The conditional probability of cu indicates the probability that a new segment should start at u. In sampling cu from this distribution, we are splitting or merging segments. Similarly we obtain the expression in (7), where S1u is Su for the segmentation when cu = 1, S0u is Su for the segmentation when cu = 0, and all counts (e.g. n1) exclude cu:

P(c_u \mid c_{-u}, z, w) \propto \begin{cases} (n_{0}+\gamma)\, \dfrac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}}\, \dfrac{\prod_{t} \Gamma(n^{(S^{0}_{u})}_{t}+\alpha)}{\Gamma(n^{(S^{0}_{u})}_{\cdot}+T\alpha)} & c_u = 0 \\[2ex] (n_{1}+\gamma)\, \dfrac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}}\, \dfrac{\prod_{t} \Gamma(n^{(S^{1}_{u-1})}_{t}+\alpha)}{\Gamma(n^{(S^{1}_{u-1})}_{\cdot}+T\alpha)}\, \dfrac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}}\, \dfrac{\prod_{t} \Gamma(n^{(S^{1}_{u})}_{t}+\alpha)}{\Gamma(n^{(S^{1}_{u})}_{\cdot}+T\alpha)} & c_u = 1 \end{cases}   (7)

</Paragraph>
<Paragraph position="5"> For this paper, we fixed α, β and γ at 0.01.</Paragraph>
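As an illustration of the sampler's inner loop, the conditional in Equation 6 might be implemented along the following lines (a sketch under the assumption that count arrays are maintained incrementally; the names and data layout are ours, not the authors'):

```python
import numpy as np

def sample_z(w_ui, u, i, z, seg_topic_counts, topic_word_counts, topic_totals,
             seg_of, alpha, beta, W, rng):
    """Resample z_{u,i} from Eq. 6:
    P(z = t) is proportional to (n_t^{S_u}+alpha) * (n_w^{(t)}+beta)/(n_.^{(t)}+W*beta),
    with all counts excluding the current assignment."""
    s, old = seg_of[u], z[u][i]
    # remove the current assignment from the counts
    seg_topic_counts[s, old] -= 1
    topic_word_counts[old, w_ui] -= 1
    topic_totals[old] -= 1
    # unnormalized conditional over all T topics at once
    p = (seg_topic_counts[s] + alpha) * \
        (topic_word_counts[:, w_ui] + beta) / (topic_totals + W * beta)
    new = rng.choice(len(p), p=p / p.sum())
    # add the new assignment back in
    seg_topic_counts[s, new] += 1
    topic_word_counts[new, w_ui] += 1
    topic_totals[new] += 1
    z[u][i] = new
    return new
```

Resampling cu (Equation 7) follows the same pattern, comparing the merged-segment and split-segment count terms.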
<Paragraph position="6"> Our algorithm is related to (Barzilay and Lee, 2004)'s approach to text segmentation, which uses a hidden Markov model (HMM) to model segmentation and topic inference for text using a bigram representation in restricted domains. Due to the adaptive combination of different topics, our algorithm can be expected to generalize well to larger domains. It also relates to earlier work by (Blei and Moreno, 2001) that uses a topic representation but likewise does not allow adaptively combining different topics. While HMM approaches allow a segmentation of the data by topic, they do not allow adaptively combining different topics into segments: while a new segment can be modelled as being identical to a topic that has already been observed, it cannot be modelled as a combination of the previously observed topics.1 Note that while (Imai et al., 1997)'s HMM approach allows topic mixtures, it requires supervision with hand-labelled topics.</Paragraph>
<Paragraph position="7"> 1 Say that a particular corpus leads us to infer topics corresponding to "speech recognition" and "discourse understanding". A single discussion concerning speech recognition for discourse understanding could be modelled by our algorithm as a single segment with a suitable weighted mixture of the two topics; a HMM approach would tend to split it into multiple segments (or require a specific topic for this segment).</Paragraph>
<Paragraph position="8"> In our experiments we therefore compared our results with those obtained by a similar but simpler 10-state HMM, using a similar Gibbs sampling algorithm. The key difference between the two models is shown in Figure 1. In the HMM, all variation in the content of utterances is modelled at a single level, with each segment having a distribution over words corresponding to a single state. The hierarchical structure of our topic segmentation model allows variation in content to be expressed at two levels, with each segment being produced from a linear combination of the distributions associated with each topic. Consequently, our model can often capture the content of a sequence of words by postulating a single segment with a novel distribution over topics, while the HMM has to switch frequently between states.</Paragraph>
</Section> </Section>
<Section position="6" start_page="18" end_page="22" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="18" end_page="20" type="sub_section"> <SectionTitle> 4.1 Experiment 0: Simulated data </SectionTitle>
<Paragraph position="0"> To analyze the properties of this algorithm we first applied it to a simulated dataset: a sequence of 10,000 words chosen from a vocabulary of 25. Each segment of 100 successive words had a constant topic distribution (with distributions for different segments drawn from a Dirichlet distribution with β = 0.1), and each subsequence of 10 words was taken to be one utterance. The topic-word assignments were chosen such that when the vocabulary is aligned in a 5x5 grid the topics were binary bars. The inference algorithm was then run for 200,000 iterations, with samples collected after every 1,000 iterations to minimize autocorrelation.</Paragraph>
<Paragraph position="1"> Figure 2 shows the inferred topic-word distributions and segment boundaries, which correspond well with those used to generate the data. A sketch of the data-generation setup is given below.</Paragraph>
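The paper does not give the bars construction in code, but a plausible reconstruction is as follows (assuming, as in similar bars datasets, 10 topics: 5 row bars and 5 column bars on the 5x5 vocabulary grid):

```python
import numpy as np

rng = np.random.default_rng(1)
# 10 'bar' topics on a 5x5 vocabulary grid: 5 row bars plus 5 column bars
grid = np.arange(25).reshape(5, 5)
topics = np.zeros((10, 25))
for k in range(5):
    topics[k, grid[k, :]] = 1 / 5       # row bar: uniform over one row
    topics[5 + k, grid[:, k]] = 1 / 5   # column bar: uniform over one column

words = []
for seg in range(100):                       # 100 segments x 100 words = 10,000 words
    theta = rng.dirichlet(np.full(10, 0.1))  # constant topic mix within a segment
    z = rng.choice(10, size=100, p=theta)
    words.extend(rng.choice(25, p=topics[t]) for t in z)
# every 10 successive words form one utterance
utterances = np.array(words).reshape(1000, 10)
```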
</Section>
<Section position="2" start_page="20" end_page="22" type="sub_section"> <SectionTitle> 4.2 Experiment 1: The ICSI corpus </SectionTitle>
<Paragraph position="0"> We applied the algorithm to the ICSI meeting corpus transcripts (Janin et al., 2003), consisting of manual transcriptions of 75 meetings. For evaluation, we use (Galley et al., 2003)'s set of human-annotated segmentations, which covers a sub-portion of 25 meetings and takes a relatively coarse-grained approach to topic, with an average of 5-6 topic segments per meeting. Note that these segmentations were not used in training the model: topic inference and segmentation were unsupervised, with the human annotations used only to provide some knowledge of the overall segmentation density and to evaluate performance.</Paragraph>
<Paragraph position="1"> The transcripts from all 75 meetings were linearized by utterance start time and merged into a single dataset that contained 607,263 word tokens. We sampled for 200,000 iterations of MCMC, taking samples every 1,000 iterations, and then averaged the sampled cu variables over the last 100 samples to derive an estimate for the posterior probability of a segmentation boundary at each utterance start. This probability was then thresholded to derive a final segmentation which was compared to the manual annotations. More precisely, we apply a small amount of smoothing (Gaussian kernel convolution) and take the midpoints of any areas above a set threshold to be the segment boundaries, as sketched in code below. Varying this threshold allows us to segment the discourse in a more or less fine-grained way (and we anticipate that this could be user-settable in a meeting browsing application).</Paragraph>
<Paragraph position="2"> If the correct number of segments is known for a meeting, this can be used directly to determine the optimum threshold, increasing performance; if not, we must set it at a level which corresponds to the desired general level of granularity. For each set of annotations, we therefore performed two sets of segmentations: one in which the threshold was set for each meeting to give the known gold-standard number of segments, and one in which the threshold was set on a separate development set to give the overall corpus-wide average number of segments, and held constant for all test meetings.2 This also allows us to compare our results with those of (Galley et al., 2003), who apply a similar threshold to their lexical cohesion function and give corresponding results produced with known/unknown numbers of segments.</Paragraph>
<Paragraph position="3"> Segmentation. We assessed segmentation performance using the Pk and WindowDiff (WD) error measures proposed by (Beeferman et al., 1999) and (Pevzner and Hearst, 2002) respectively; both intuitively provide a measure of the probability that two points drawn from the meeting will be incorrectly separated by a hypothesized segment boundary; thus, lower Pk and WD figures indicate better agreement with the human-annotated results.3 For the numbers of segments we are dealing with, a baseline of segmenting the discourse into equal-length segments gives both Pk and WD of about 50%. In order to investigate the effect of the number of underlying topics T, we tested models using 2, 5, 10 and 20 topics. We then compared performance with (Galley et al., 2003)'s LCSeg tool, and with a 10-state HMM model as described above. Results are shown in Table 1, averaged over the 25 test meetings.</Paragraph>
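Two pieces of this pipeline can be sketched in code: extracting boundaries from the averaged cu samples (the kernel width and threshold here are our illustrative assumptions, not reported settings), and the Pk measure itself:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def extract_boundaries(p_boundary, sigma=2.0, threshold=0.25):
    """Smooth per-utterance boundary probabilities (Gaussian kernel convolution)
    and take the midpoint of each region above threshold as a segment boundary."""
    smoothed = gaussian_filter1d(np.asarray(p_boundary, dtype=float), sigma=sigma)
    boundaries, start = [], None
    for i, above in enumerate(smoothed > threshold):
        if above and start is None:
            start = i
        elif not above and start is not None:
            boundaries.append((start + i - 1) // 2)   # midpoint of the region
            start = None
    if start is not None:
        boundaries.append((start + len(smoothed) - 1) // 2)
    return boundaries

def pk(ref_ids, hyp_ids, k=None):
    """Pk (Beeferman et al., 1999): probability that two positions k apart are
    classified inconsistently (same vs. different segment) by the hypothesis
    and the reference; ref_ids/hyp_ids give a segment id per utterance."""
    n = len(ref_ids)
    if k is None:  # conventional choice: half the mean reference segment length
        k = max(1, round(n / (2 * len(set(ref_ids)))))
    errors = sum((ref_ids[i] == ref_ids[i + k]) != (hyp_ids[i] == hyp_ids[i + k])
                 for i in range(n - k))
    return errors / (n - k)
```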
<Paragraph position="4"> Results show that our model significantly outperforms the HMM equivalent: because the HMM cannot combine different topics, it places a large number of segmentation boundaries, resulting in inferior performance. Using stemming and a bigram representation, however, might improve its performance (Barzilay and Lee, 2004), although similar benefits might equally apply to our model. Our model also performs comparably to (Galley et al., 2003)'s unsupervised performance (exceeding it for some settings of T). It does not perform as well as their hybrid supervised system, which combined LCSeg with supervised learning over discourse features (Pk = .23); but we expect that a similar approach would be possible here, combining our segmentation probabilities with other discourse-based features in a supervised way for improved performance. Interestingly, segmentation quality, at least at this relatively coarse-grained level, seems hardly affected by the overall number of topics T.</Paragraph>
<Paragraph position="5"> [Figure 3: B) probability of a segment boundary, compared with human segmentation, for an arbitrary subset of the data; C) receiver-operator characteristic (ROC) curves for predicting human segmentation, and conditional probabilities of placing a boundary at an offset from a human boundary; D) subjective topic coherence ratings.]</Paragraph>
<Paragraph position="6"> Figure 3B shows an example for one meeting of how the inferred topic segmentation probabilities at each utterance compare with the gold-standard segment boundaries. Figure 3C illustrates the performance difference between our model and the HMM equivalent at an example segment boundary: for this example, the HMM model gives almost no discrimination.</Paragraph>
<Paragraph position="7"> Identification. Figure 3A shows the most indicative words for a subset of the topics inferred at the last iteration. Encouragingly, most topics seem intuitively to reflect the subjects we know were discussed in the ICSI meetings. The majority of the meetings (67) are taken from the weekly meetings of 3 distinct research groups, where discussions centered around speech recognition techniques (topics 2, 5), meeting recording, annotation and hardware setup (topics 6, 3, 1, 8), and robust language processing (topic 7). Others reflect general classes of words which are independent of subject matter (topic 4). A sketch of how such indicative word lists can be extracted is given below.</Paragraph>
<Paragraph position="8"> To compare the quality of these inferred topics, we performed an experiment in which 7 human observers rated (on a scale of 1 to 9) the semantic coherence of 50 lists of 10 words each. Of these lists, 40 contained the most indicative words for each of the 10 topics from different models: the topic segmentation model; a topic model that had the same number of segments but with fixed, evenly spread segmentation boundaries; an equivalent with randomly placed segmentation boundaries; and the HMM. The other 10 lists contained random samples of 10 words from the other 40 lists. Results are shown in Figure 3D, with the topic segmentation model producing the most coherent topics and the HMM model and random words scoring less well. Interestingly, the model with evenly spread boundaries performs nearly as well as the full model, while the version with randomly placed boundaries does badly: topic quality is thus not very susceptible to the precise segmentation of the text, but does require some reasonable approximation (on ICSI data, an even segmentation gives a Pk of about 50%, while random segmentations can do much worse). However, note that the full topic segmentation model is able to identify meaningful segmentation boundaries at the same time as inferring topics.</Paragraph>
</Section>
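Given an estimated topic-word matrix φ, indicative word lists like those rated here can be read off directly; a minimal sketch (taking "most indicative" simply as highest-probability, which is one plausible reading of the paper's figure):

```python
import numpy as np

def top_words(phi, vocab, n=10):
    """Return the n highest-probability words for each topic in phi (T x W)."""
    return [[vocab[i] for i in np.argsort(row)[::-1][:n]] for row in phi]

# e.g. lists = top_words(phi, vocab); each 10-word list can then be rated for coherence
```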
<Section position="3" start_page="22" end_page="22" type="sub_section"> <SectionTitle> 4.3 Experiment 2: Dialogue robustness </SectionTitle>
<Paragraph position="0"> Meetings often include off-topic dialogue, in particular at the beginning and end, where informal chat and meta-dialogue are common. Galley et al. (2003) annotated these sections explicitly, together with the ICSI "digit-task" sections (in which participants read sequences of digits to provide data for speech recognition experiments), and removed them from their data, as did we in Experiment 1 above. While this seems reasonable for the purposes of investigating ideal algorithm performance, in real situations we will be faced with such off-topic dialogue, and would obviously prefer segmentation performance not to be badly affected (ideally, we would like to be able to segment the off-topic sections away from the meeting proper).</Paragraph>
<Paragraph position="1"> One might suspect that an unsupervised generative model such as ours would not be robust in the presence of numerous off-topic words, as spurious topics might be inferred and used in the mixture model throughout. To investigate this, we also tested on the full dataset without removing these sections (806,026 word tokens in total), and added the section boundaries as further desired gold-standard segmentation boundaries. Table 2 shows the results: performance is not significantly affected, and again is very similar for both our model and LCSeg.</Paragraph>
</Section>
<Section position="4" start_page="22" end_page="22" type="sub_section"> <SectionTitle> 4.4 Experiment 3: Speech recognition </SectionTitle>
<Paragraph position="0"> The experiments so far have all used manual word transcriptions. Of course, in real meeting processing systems, we will have to deal with speech recognition (ASR) errors. We therefore also tested on 1-best ASR output provided by ICSI, and results are shown in Table 2. The "off-topic" and "digits" sections were removed in this test, so results are comparable with Experiment 1. Segmentation accuracy seems extremely robust; interestingly, LCSeg's results are less robust (the drop in performance is higher), especially when the number of segments in a meeting is unknown.</Paragraph>
<Paragraph position="1"> [Table 2: robustness to off-topic and ASR data.]</Paragraph>
<Paragraph position="2"> It is surprising to note that the segmentation accuracy in this experiment was actually slightly higher than that achieved in Experiment 1 (especially given that ASR word error rates were generally above 20%). This may simply be a smoothing effect: differences in vocabulary and its distribution can effectively change the sparsity prior instantiated in the Dirichlet distributions.</Paragraph>
</Section> </Section> </Paper>