<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1643"> <Title>A Skip-Chain Conditional Random Field for Ranking Meeting Utterances by Importance</Title>
<Section position="4" start_page="0" end_page="364" type="metho"> <SectionTitle> 2 Corpus </SectionTitle>
<Paragraph position="0"> The work presented here was applied to the ICSI Meeting Corpus (Janin et al., 2003), a corpus of &quot;naturally-occurring&quot; meetings, i.e. meetings that would have taken place anyway. Their style is quite informal, and topics are primarily concerned with speech, natural language, artificial intelligence, and networking research. The corpus contains 75 meetings, which are 60 minutes long on average, and involve a number of participants ranging from 3 to 10 (6 on average). The total number of unique speakers is 60, including 26 non-native English speakers. Experiments in this paper are based either on human orthographic transcriptions or automatic speech recognition output, which were available for all meetings. For automatic recognition, we used the ICSI-SRI-UW speech recognition system (Mirghafori et al., 2004), a state-of-the-art conversational telephone speech (CTS) recognizer whose language and acoustic models were adapted to the meeting domain. It achieves 34.8% WER on the ICSI corpus, which is indicative of the difficulty involved in processing meetings automatically.</Paragraph>
<Paragraph position="1"> We also used additional annotation that has been developed to support higher-level analyses of meeting structure, in particular the ICSI Meeting Recorder Dialog Act (MRDA) corpus (Shriberg et al., 2004). Dialog act (DA) labels describe the pragmatic function of utterances, e.g. a STATEMENT or a BACKCHANNEL. This auxiliary corpus consists of over 180,000 human-annotated dialog act labels (κ = .8), for which so-called adjacency pair (AP) relations (e.g., APOLOGY-DOWNPLAY) were also labeled. This latter annotation was used to train an AP classifier that is instrumental in automatically determining the structure of our sequence models. Note that, in the case of three or more speakers, adjacency pair is admittedly an unfortunate term, since labeled APs are generally not adjacent (e.g., see Table 1), but we will nevertheless use the same terminology to maintain consistency with previous work.</Paragraph>
<Paragraph position="2"> To train and evaluate our summarizer, we used a corpus of extractive summaries produced at the University of Edinburgh (Murray et al., 2005). For each of the 75 meetings, human judges were asked to select transcription utterances, segmented by DA, to include in summaries, resulting in an average compression ratio of 6.26% (though no strict limit was imposed). Inter-labeler agreement was measured using six meetings that were summarized by multiple coders (average κ = .323). While this level of agreement is quite low, this situation is not uncommon in summarization, since there may be many good summaries for a given document; a main challenge lies in using evaluation schemes that properly account for this diversity.</Paragraph> </Section>
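To make the agreement figure above concrete, here is a minimal sketch (ours, not from the paper or the Edinburgh annotation effort) of average pairwise Cohen's kappa over binary utterance-inclusion vectors; the data layout and helper names are illustrative assumptions, and the original study may have scored agreement differently.

```python
from itertools import combinations

def cohens_kappa(a, b):
    """Cohen's kappa for two binary inclusion vectors of equal length."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each coder's marginal inclusion rate.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)

def average_pairwise_kappa(coder_labels):
    """Mean kappa over all coder pairs for one meeting."""
    pairs = list(combinations(coder_labels, 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)

# Toy example: three coders labeling ten DA units for inclusion.
coders = [
    [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 1, 0, 0],
]
print(average_pairwise_kappa(coders))
```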
<Section position="5" start_page="364" end_page="365" type="metho"> <SectionTitle> 3 Content selection </SectionTitle>
<Paragraph position="0"> State sequence Markov models such as hidden Markov models (Rabiner, 1989) have been highly successful in many speech and natural language processing applications, including summarization.</Paragraph>
<Paragraph position="1"> Following the intuition that the probability of a given sentence may be locally conditioned on the previous one, Conroy (2004) built an HMM-based summarizer that consistently ranked among the top systems in recent Document Understanding Conference (DUC) evaluations.</Paragraph>
<Paragraph position="2"> Inter-sentential influences become more complex in the case of dialogues or correspondences, especially when they involve multiple parties.</Paragraph>
<Paragraph position="3"> In the case of summarization of conversational speech, Zechner (2002) found, for instance, that a simple technique consisting of linking together questions and answers in summaries--and thus preventing the selection of orphan questions or answers--significantly improved their readability according to various human summary evaluations.</Paragraph>
<Paragraph position="4"> In email summarization (Rambow et al., 2004), Shrestha and McKeown (2004) obtained good performance in the automatic detection of questions and answers, which can help produce summaries that highlight or focus on the question and answer exchange. In a combined chat and email summarization task, a technique (Zhou and Hovy, 2005) consisting of identifying APs and appending any relevant responses to topic-initiating messages was instrumental in outperforming two competitive summarization baselines.</Paragraph>
<Paragraph position="5"> The need to model pragmatic influences, such as between a question and an answer, is also prevalent in meeting summarization. In fact, question-answer pairs are not the only discourse relations that we need to preserve in order to create coherent summaries, and, as we will see, most instances of APs would need to be preserved together, either inside or outside the summary. Table 1 displays an AP construction with one statement (A part) and three respondents (B parts). (Utterances in italic in Table 1 are not present in the reference summary.)</Paragraph>
<Paragraph position="6"> This example illustrates that the number of turns between constituents of APs is variable and thus difficult to model with standard sequence models.</Paragraph>
<Paragraph position="7"> This example also illustrates some of the predictors investigated in this paper. First, many speakers respond to A's utterance, which is generally a strong indicator that the A utterance should be included. Secondly, while APs are generally characterized in terms of pre-defined dialog acts, such as OFFER-ACCEPT, we found that the type of dialog act has much less importance than the existence of the AP connection itself (APs in the data represent a great variety of DA pairs, including many that are not characterized as APs in the literature--e.g., STATEMENT-STATEMENT in the table). Since DAs seem to matter less than adjacency pairs, the aim will be to build techniques to automatically identify such relations and exploit them in utterance selection, as sketched below.</Paragraph>
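As an illustration of the linking idea attributed to Zechner (2002) above, the following sketch (ours, not a reimplementation of any cited system) closes an extractive summary under AP links so that no element of a labeled pair is orphaned; the data structures and function name are hypothetical.

```python
def close_under_adjacency_pairs(selected, ap_links):
    """Grow a summary so no AP member is orphaned: if either element
    of a labeled (s, d) pair is selected, pull in its partner as
    well, transitively, until the selection is stable."""
    summary = set(selected)
    changed = True
    while changed:
        changed = False
        for s, d in ap_links:
            if s in summary and d not in summary:
                summary.add(d); changed = True
            elif d in summary and s not in summary:
                summary.add(s); changed = True
    return sorted(summary)

# Utterance 3 answers question 1; selecting 1 drags in 3.
print(close_under_adjacency_pairs({1, 5}, [(1, 3), (2, 8)]))  # [1, 3, 5]
```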
<Paragraph position="9"> In the current work, we use skip-chain sequence models (Sutton and McCallum, 2004) to represent dependencies between both contiguous utterances and paired utterances appearing in the same AP constructions. The graphical representations of skip-chain models, such as the CRF represented in Figure 1, are composed of two types of edges: linear-chain and skip-chain edges. The latter edges model AP links, which we represent as a set of (s,d) index pairs (note that no more than one AP may share the same second element d).</Paragraph>
<Paragraph position="10"> The intuition that the summarization labels (−1 or 1) are highly correlated with APs is confirmed in Table 2. While contiguous labels yt−1 and yt seem to seldom influence each other, the correlation between AP elements ys and yd is particularly strong, and they have a tendency to be either both included or both excluded. Note that the second table is not symmetric, because the data allows an A part to be linked to multiple B parts, but not vice-versa. While counts in Table 2 reflect human labels, we only use automatically predicted (s,d) pairs in the experiments in the remainder of this paper. To find these pairs automatically, we trained a non-sequential log-linear model that achieves a .902 accuracy (Galley et al., 2004).</Paragraph> </Section>
<Section position="6" start_page="365" end_page="367" type="metho"> <SectionTitle> 4 Skip-Chain Sequence Models </SectionTitle>
(Table 2 caption: while the correlation between adjacent labels yt−1 and yt is not significant (χ2 = 2.3, p > .05), empirical evidence clearly shows that ys and yd influence each other (χ2 = 78948, p < .001).)
<Paragraph position="0"> In this paper, we investigate conditional models for paired sequences of observations and labels. In the case of utterance selection, the observation sequence x = x1:T = (x1,...,xT) represents summarization predictors (see Section 6), and the binary sequence y = y1:T = (y1,...,yT) (where yt ∈ {−1,1}) determines which utterances must be included in the summary. In a discriminative framework, we concentrate our modeling effort on estimating p(y|x) from data, and do not explicitly model the prior probability p(x), since x is fixed during testing anyway.</Paragraph>
<Paragraph position="5"> Many probabilistic approaches to modeling sequences have relied on directed graphical models, also known as Bayesian networks (BN), in particular hidden Markov models (Rabiner, 1989) and conditional Markov models (McCallum et al., 2000). However, prominent recent approaches have focused on undirected graphical models, in particular conditional random fields (CRF) (Lafferty et al., 2001), and provided state-of-the-art performance in many NLP tasks. In our work, we will provide empirical results for state sequence models of both semantics, and we will now describe skip-chain models for both BNs and CRFs. (Footnote 1: in the existing literature, sequence models that satisfy the Markovian condition--i.e., the state of the system at time t depends only on its immediate past t−k:t−1 (typically just t−1)--are generally termed dynamic Bayesian networks (DBN). Since the particular models under investigation, i.e. skip-chain models, do not have this property, we will simply refer to them as Bayesian networks.)</Paragraph>
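Concretely, the structure of such a skip-chain model can be viewed as two edge sets: one linear-chain edge per pair of adjacent labels, plus one skip edge per automatically predicted AP. The sketch below is a minimal illustration under our own naming assumptions; it also enforces the constraint noted above that no more than one AP may share the same second element d.

```python
def build_skip_chain_edges(num_utterances, predicted_aps):
    """Edge structure of a skip-chain sequence model: one linear-chain
    edge per pair of consecutive labels, plus one skip edge per
    predicted AP pair (s, d) with s < d.  At most one skip edge may
    end at any destination d, mirroring the constraint in the text."""
    linear_edges = [(t - 1, t) for t in range(1, num_utterances)]
    skip_edges, taken = [], set()
    for s, d in predicted_aps:
        if 0 <= s < d < num_utterances and d not in taken:
            skip_edges.append((s, d))
            taken.add(d)
    return linear_edges, skip_edges

# One A part (utterance 0) answered by two B parts (3 and 4);
# a conflicting pair also ending at 4 is dropped.
linear, skip = build_skip_chain_edges(6, [(0, 3), (0, 4), (2, 4)])
print(linear)  # [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(skip)    # [(0, 3), (0, 4)]
```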
<Paragraph position="7"> In a BN, the probability of the sequence y factorizes as a product of probabilities of local predictions yt conditioned on their parents π(yt) (Equation 1). In a CRF, the probability of the sequence y factorizes according to a set of clique potentials {Φc}c∈C, where C represents the cliques of the underlying graphical model (Equation 2):

p(y|x) = ∏t p(yt | π(yt), x)   (1)

p(y|x) = (1/Z(x)) ∏c∈C Φc(yc, x)   (2)

where Z(x) is a normalizing constant.</Paragraph>
<Paragraph position="8"> We parameterize these BNs and CRFs as log-linear models, and factorize both the BN's local prediction probabilities and the CRF's clique potentials using two types of feature functions. Linear-chain feature functions fj(yt−k:t,x,t) represent local dependencies that are consistent with an order-k Markov assumption. For instance, one such function could be a predicate that is true if and only if yt−1 = 1, yt = 1, and both utterances are produced by the same speaker. Given a set of skip edges S = {(st,t)} specifying source and destination indices, skip-chain feature functions gj(yst,yt,x,st,t) exploit dependencies between variables that are arbitrarily distant in the chain. For instance, the finding that OFFER-REJECT pairs are often linked in summaries might be encoded as a skip-chain feature predicate that is true if and only if yst = 1, yt = 1, and the first word of the t-th utterance is &quot;no&quot;.</Paragraph>
<Paragraph position="14"> Log-linear models for skip-chain sequence models are defined in terms of weights {λk} and {μk}, one for each feature function. In the case of BNs, we write:

p(yt | π(yt), x) ∝ exp( Σj λj fj(yt−k:t,x,t) + Σj μj gj(yst,yt,x,st,t) )

We can reduce a particular skip-chain CRF to represent only the set of cliques along (yt−1,yt) adjacency edges and (yst,yt) skip edges, resulting in only two potential functions:

Φ(yt−1,yt,x,t) = exp( Σj λj fj(yt−1:t,x,t) )

Ψ(yst,yt,x,st,t) = exp( Σj μj gj(yst,yt,x,st,t) )</Paragraph>
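To make the two-potential factorization concrete, here is a minimal sketch of the unnormalized score a reduced skip-chain CRF assigns to a labeling; the toy feature functions, weights, and utterances are illustrative assumptions on our part, not the paper's actual feature set.

```python
def f_feats(y_prev, y_t, x, t):
    """Toy linear-chain feature: both adjacent utterances selected."""
    return [1.0 if y_prev == y_t == 1 else 0.0]

def g_feats(y_s, y_t, x, s, t):
    """Toy skip-chain feature of the kind described in the text: both
    AP elements selected and the destination utterance starts with "no"."""
    return [1.0 if y_s == y_t == 1 and x[t].startswith("no") else 0.0]

def sequence_score(y, x, skip_edges, lam, mu):
    """Unnormalized log-score of a labeling under the reduced skip-chain
    CRF: linear-chain potentials over (y[t-1], y[t]) plus skip-chain
    potentials over (y[s], y[t]).  Dividing exp(score) by Z(x) would
    give p(y | x)."""
    score = 0.0
    for t in range(1, len(y)):                     # adjacency edges
        score += sum(l * f for l, f in zip(lam, f_feats(y[t - 1], y[t], x, t)))
    for s, t in skip_edges:                        # skip (AP) edges
        score += sum(u * g for u, g in zip(mu, g_feats(y[s], y[t], x, s, t)))
    return score

x = ["shall we wrap up?", "yeah", "no, one more item"]
print(sequence_score([1, -1, 1], x, [(0, 2)], lam=[0.5], mu=[1.2]))  # 1.2
```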
<Section position="1" start_page="366" end_page="367" type="sub_section"> <SectionTitle> 4.1 Inference and Parameter Estimation </SectionTitle>
<Paragraph position="0"> Our CRF and BN models were designed using MALLET (McCallum, 2002), which provides tools for training log-linear models with L-BFGS optimization to maximize the log-likelihood of our training data D = {(x(i),y(i))}, i = 1,...,N, and provides probabilistic inference algorithms for linear-chain BNs and CRFs.</Paragraph>
<Paragraph position="1"> Most previous work with CRFs containing non-local dependencies used approximate probabilistic inference techniques, including TRP (Sutton and McCallum, 2004) and Gibbs sampling (Finkel et al., 2005). Approximation is needed when the junction tree of a graphical model is associated with prohibitively large cliques. For example, the worst case reported in (Sutton and McCallum, 2004) is a clique of 61 nodes. In the case of skip-chain models representing APs, the inference problem is somewhat simpler: loops in the graph are relatively short, 98% of AP edges span no more than 5 time slices, and the maximum clique size in the entire data is 5. While exact inference might be possible in our case, we used the simpler approach of adapting standard inference algorithms for linear-chain models.</Paragraph>
<Paragraph position="2"> Specifically, to account for skip-edges, we used a technique inspired by (Sha and Pereira, 2003), in which multiple state dependencies, such as an order-2 Markov model, are encoded using auxiliary tags. For instance, an order-2 Markov model is parameterized using state triples yt−2:t, and each possible triple is converted to a label zt = yt−2:t. Using these auxiliary labels only, we can then use the standard forward-backward algorithm for computing marginal distributions in linear-chain CRFs, and Viterbi decoding in linear-chain CRFs and BNs. The only requirement is to ensure that a transition between zt and zt+1 is forbidden, i.e., assigned an infinite cost, if the sub-states yt−1:t common to both states differ. This approach can be extended to the case of skip-chain transitions. For instance, an order-1 Markov model with skip edges can be constructed using triples zt = (yst,yt−1,yt), where the first element yst represents the label at the source of the skip-edge. Similarly to the case of order-2 Markov models, we need to ensure that only valid sequences of labels are considered, which is trivial to enforce if we assume that no skip edge ranges more than a predefined threshold of k time slices.</Paragraph>
<Paragraph position="3"> While this approach is not exact, it still provides competitive performance as we will see in Section 8. In future work, we plan to explore more accurate probabilistic inference techniques.</Paragraph> </Section> </Section>
<Section position="7" start_page="367" end_page="367" type="metho"> <SectionTitle> 5 Ranking Utterances by Importance </SectionTitle>
<Paragraph position="0"> As we will see in Section 8, using the actual {−1,1} label predictions of our BNs and CRFs leads to significantly sub-optimal results, which might be explained by the following reasons. First, our models are optimized to maximize the conditional log-likelihood of the training data, a measure that does not correlate well with the utility measures generally used in retrieval-oriented tasks such as summarization, especially when faced with a significant class imbalance (only 6.26% of reference instances are positive). Second, the MAP decision rule does not give us the freedom to select an arbitrary number of sentences in order to satisfy any constraint on length. Instead of using the actual predictions, it seems more reasonable to compute the posterior probability of each local prediction yt, and extract the N most probable summary sentences (yr1,...,yrN), where N may depend on a length expressed in number of words, as is the case in our evaluation in Section 7.</Paragraph>
<Paragraph position="1"> BNs assign probability distributions over entire sequences by estimating the probability of each individual instance yt in the sequence (Equation 1), and thus seem particularly suited for ranking utterances. A first approach is then to rank utterances according to the cost of predicting yt = 1 at each time step on the Viterbi path. While these costs are well-formed (negative log) probabilities in the case of BNs, they cannot be interpreted as such in the case of CRFs, and turn out to produce poor results with CRFs. Indeed, the set of CRF potentials associated with each time step has no immediate probabilistic interpretation, and cannot be used directly to rank sentences. Since BNs and CRFs are here parameterized as log-linear models and rely on the same set of feature functions, a second approach is to use CRF-trained model parameters to build a BN classifier that assigns a probability to each yt. Specifically, the CRF model is first used to generate label predictions ŷ, from which the locally-normalized model estimates the cost of predicting ŷt = 1 given a label history ŷ1:t−1.</Paragraph>
<Paragraph position="2"> This ensures that we have a well-formed probability distribution at each time slice, while capitalizing on the good performance of CRF models.</Paragraph>
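A minimal sketch of the ranking step described above: utterances are sorted by the posterior probability that yt = 1 and greedily extracted under a word budget. The greedy budget mechanics and all names here are our assumptions; the text specifies only that N may depend on a length expressed in words.

```python
def rank_and_extract(utterances, posteriors, word_budget):
    """Rank utterances by the posterior probability that yt = 1 and
    greedily fill a summary up to a word budget, as in the
    length-constrained evaluation described in the text."""
    order = sorted(range(len(utterances)),
                   key=lambda t: posteriors[t], reverse=True)
    chosen, words = [], 0
    for t in order:
        cost = len(utterances[t].split())
        if words + cost <= word_budget:
            chosen.append(t)
            words += cost
    return sorted(chosen)   # restore chronological order

utts = ["we should ship on friday", "uh-huh",
        "the decoder still crashes on long inputs", "okay"]
post = [0.62, 0.05, 0.81, 0.08]
print(rank_and_extract(utts, post, word_budget=12))  # [0, 2]
```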
</Section>
<Section position="8" start_page="367" end_page="368" type="metho"> <SectionTitle> 6 Features for extractive summarization </SectionTitle>
(Table 3 caption: unless otherwise mentioned, we refer to features of utterance t whose label yt we are trying to predict.)
<Paragraph position="0"> We started our analyses with a large collection of features found to be good predictors in either speech (Inoue et al., 2004; Maskey and Hirschberg, 2005; Murray et al., 2005) or text summarization (Mani and Maybury, 1999). Our goal is to build a very competitive feature set that capitalizes on recent advances in summarization of both genres. Table 3 lists some important features.</Paragraph>
<Paragraph position="1"> There is strong evidence that lexical cues such as &quot;significant&quot; and &quot;great&quot; are strong predictors in many summarization tasks (Edmundson, 1968).</Paragraph>
<Paragraph position="2"> Such cues are admittedly quite genre-specific, so rather than committing to any specific list, which may not carry over well to our speech domain, we automatically selected a list of n-grams (n ≤ 3) using cross-validation on the training data. More specifically, we computed the mutual information of each n-gram with the class variable, and selected for each n the 200 best-scoring n-grams (see the sketch at the end of this section). Other lexical features include: the number of digits, which is helpful for identifying sections of the meetings where participants collect data by recording digits; and the number of repeats, which may indicate the kind of hesitations and disfluencies that negatively correlate with what is included in the summary.</Paragraph>
<Paragraph position="3"> The information retrieval feature set contains many features that are generally found helpful in summarization, in particular tf*idf and scores derived from centroid methods. In particular, we used the latent semantic analysis (LSA) feature discussed in (Murray et al., 2005), which attempts to determine sentence importance through singular value decomposition, and whose resulting singular values and singular vectors can be exploited to associate with each utterance a degree of relevance to one of the top-n concepts of the meetings (where n represents the number of dimensions in the LSA). We used the same scoring mechanism as (Murray et al., 2005), though we extracted features for many different values of n.</Paragraph>
<Paragraph position="4"> Acoustic features extracted with Praat (Boersma and Weenink, 2006) were normalized by channel and speaker, and include many raw features such as f0 and energy. Structural features listed in the table are those computed from the sequence model before decoding, e.g., the duration that separates the two elements of an AP. Finally, discourse features represent predictors that may substitute for DA labels. While DA tagging is not directly our concern, it is presumably helpful to capitalize on discourse characteristics of utterances involved in adjacency pairs, since different types of dialog acts may be unequally likely to appear in a summary.</Paragraph> </Section>
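As a sketch of the n-gram selection step referenced above, the following scores each n-gram by the mutual information between a binary presence variable and the in-summary class label, keeping the k best per order; the data representation and helper names are our assumptions, and the paper's exact estimation details may differ.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams_by_mi(examples, n, k=200):
    """Rank n-grams by mutual information between their presence in an
    utterance and the binary in-summary label, keeping the k best.
    `examples` is a list of (tokens, label) pairs with label in {0, 1}."""
    total = len(examples)
    class_count = Counter(label for _, label in examples)
    present = Counter()     # (gram, label) -> #utterances containing gram
    gram_count = Counter()  # gram -> #utterances containing gram
    for tokens, label in examples:
        for gram in set(ngrams(tokens, n)):
            present[(gram, label)] += 1
            gram_count[gram] += 1

    def mi(gram):
        score = 0.0
        for has_gram in (True, False):
            p_g = gram_count[gram] / total
            if not has_gram:
                p_g = 1.0 - p_g
            for label in (0, 1):
                if has_gram:
                    p_joint = present[(gram, label)] / total
                else:
                    p_joint = (class_count[label] - present[(gram, label)]) / total
                p_c = class_count[label] / total
                if p_joint > 0 and p_g > 0 and p_c > 0:
                    score += p_joint * math.log(p_joint / (p_g * p_c))
        return score

    return sorted(gram_count, key=mi, reverse=True)[:k]

# Toy usage: unigrams most associated with inclusion.
data = [("we should definitely ship this".split(), 1),
        ("uh huh".split(), 0),
        ("this is a significant result".split(), 1),
        ("okay".split(), 0)]
print(top_ngrams_by_mi(data, n=1, k=3))
```
</Paper>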