<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1032"> <Title>Bayesian Learning in Text Summarization</Title> <Section position="3" start_page="0" end_page="250" type="metho"> <SectionTitle> 2 Bayesian Model of Summaries </SectionTitle> <Paragraph position="0"> Since the business of extractive summarization, such as the one we are concerned with here, is to rank sentences according to how useful or important they are as part of a summary, we will consider a particular ranking scheme based on the probability of a sentence being part of a summary under a given DOV, i.e.,

P(y | v), (1)

where y denotes a given sentence, and v = (v1, ..., vn) stands for a DOV, an array of observed vote counts for sentences in the text; v1 refers to the count of votes for a sentence at the text-initial position, v2 to that for a sentence occurring in second place, and so on. Thus, given a four-sentence text, if we have three votes for the lead sentence, two for the second, one for the third, and none for the fourth, then v = (3, 2, 1, 0).</Paragraph> <Paragraph position="4"> Now suppose that each sentence yi (i.e., a sentence at the i-th place in the order of appearance) is associated with what we might call a prior preference factor θi, representing how much a sentence at that position is favored as part of a summary in general. Then the probability that yi finds itself in a summary is given as:

φ(yi | θi) P(θi), (2)

where φ denotes some likelihood function and P(θi) a prior probability of θi.</Paragraph> <Paragraph position="5"> Since the DOV is something we can actually observe about θi, we might as well couple θi with v by conditioning the probability of θi on v. Formally, this would be written as:

φ(yi | θi) P(θi | v). (3)

The problem, however, is that we know nothing about what each θi looks like, except that it should somehow be informed by v. A typical Bayesian solution is to 'erase' θi by marginalizing (summing) over it, which brings us to:

P(yi | v) = ∫ φ(yi | θi) P(θi | v) dθi. (4)

Note that equation 4 no longer talks about the probability of yi under a particular θi; rather, it talks about the expected probability of yi with respect to a preference factor dictated by v. All we need to know about P(θi | v) to compute the expectation is v and a probability distribution P, and not the θi's anymore.</Paragraph> <Paragraph position="9"> We know something about v, which leaves us with P. So what is it? In principle it could be any probability distribution. Largely for the sake of technical convenience, however, we assume it is one component of a multivariate distribution known as the Dirichlet distribution. In particular, we talk about Dirichlet(θ | v), namely a Dirichlet posterior of θ given observations v, where θ = (θ1, ..., θi, ..., θn) and Σi θi = 1 (θi > 0). (Remarkably, if P(θ) is a Dirichlet, so is P(θ | v).) Here θ represents a vector of preference factors for the n sentences which constitute the text. (Footnote 2: Since texts generally vary in length, we may set n to a sufficiently large number so that no text of interest exceeds it in length. For texts shorter than n, we simply add empty sentences to make them as long as n.) Accordingly, equation 4 can be rewritten as:

P(yi | v) = ∫ φ(yi | θ) Dirichlet(θ | v) dθ. (5)</Paragraph>
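To make the DOV-to-posterior step concrete, the following is a minimal sketch (our illustration, not part of the paper; it assumes numpy is available) that builds a Dirichlet posterior from the four-sentence example above and draws preference vectors θ from it. The small smoothing constant added to the raw counts is our assumption, introduced only because a Dirichlet parameter must be strictly positive.

    import numpy as np

    # Observed vote counts (DOV) for a four-sentence text: v = (3, 2, 1, 0).
    v = np.array([3.0, 2.0, 1.0, 0.0])

    # Dirichlet parameters must be positive, so we add a small smoothing
    # constant to the raw counts (an assumption of this sketch only).
    alpha = v + 0.01

    # Draw preference vectors theta from the Dirichlet posterior.
    rng = np.random.default_rng(0)
    thetas = rng.dirichlet(alpha, size=10000)   # shape: (10000, 4)

    # The sample mean of each theta_i matches the closed-form expectation
    # alpha_i / sum(alpha), which is essentially v_i / sum(v): sentences
    # with more votes receive larger preference factors.
    print(thetas.mean(axis=0))
    print(alpha / alpha.sum())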
<Paragraph position="12"> An interesting way to look at the model is by way of a graphical model (GM), which gives some intuitive idea of what the model looks like. In a GM perspective, our model is represented as a simple tripartite structure (figure 2), in which each node corresponds to a variable (parameter) and arcs represent dependencies among them; x → y reads 'y depends on x.' The arc linkage between v and yi is meant to represent marginalization over θ.</Paragraph> <Paragraph position="14"> Moreover, we will make use of a scale parameter λ ≥ 1 to gain some control over the shape of the distribution, so we will be working with Dirichlet(θ | λv) rather than Dirichlet(θ | v). Intuitively, we may take λ as representing the degree of confidence we have in the set of empirical observations we call v, since increasing the value of λ has the effect of reducing the variance of each θi in θ.</Paragraph> <Paragraph position="15"> The expectation and variance of Dirichlet(θ | λv) are given as follows:

E[θi] = vi / Σk vk,

Var[θi] = E[θi](1 − E[θi]) / (λ Σk vk + 1).

See how λ is stuck in the denominator of the variance. Another obvious fact about the scaling is that it does not affect the expectation, which remains the same.</Paragraph> <Paragraph position="18"> To get a feel for the significance of λ, consider figure 3; the left panel shows a histogram of 50,000 variates of p1 randomly drawn from Dirichlet(p1, p2 | λc1, λc2), with λ = 1 and both c1 and c2 set to 1. The graph shows only the p1 part, but things are no different for p2. (The x-dimension represents the value p1 takes, which ranges between 0 and 1, and the y-dimension records the number of times p1 takes that value.) We see that points are spread rather evenly over the probability space. The right panel shows what happens if we increase λ by a factor of 1,000 (giving Dirichlet(p1, p2 | 1000, 1000)): points take a bell-shaped form, concentrating in a small region around the expectation of p1. In the experiments section, we will return to the issue of λ and discuss how it affects the performance of summarization.</Paragraph> <Paragraph position="19"> Let us turn to the question of how to solve the integral in equation 5. We will be concerned here with two standard approaches: one based on MAP (maximum a posteriori) and the other on numerical integration. We start off with the MAP-based approach known as the Bayesian Information Criterion, or BIC. For a given model m, BIC seeks an analytical approximation to equation 4, which looks like the following:

ln P(yi | v, m) ≈ ln φ(yi | θ̂) − (k/2) ln N,

where k denotes the number of free parameters in m and N the number of observations. θ̂ is a MAP estimate of θ under m, which is E[θ]. It is interesting to note that BIC makes no reference to the prior. Also worthy of note is that the negative of BIC equals MDL (Minimum Description Length).</Paragraph> <Paragraph position="23"> Alternatively, one might take a more straightforward (and fully Bayesian) approach known as Monte Carlo integration (MacKay, 1998) (MC, hereafter), where the integral is approximated by:

∫ φ(yi | θ) P(θ | v) dθ ≈ (1/n) Σj φ(yi | x(j)),

where each sample x(j) is drawn at random from the distribution P(θ | v), and n is the number of samples so collected. Note that MC gives P(yi | v) as an expectation of φ(yi | θ) with respect to P(θ | v).</Paragraph>
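The role of λ can be checked numerically. The short sketch below (our illustration, assuming numpy) reproduces the kind of comparison shown in figure 3: samples of p1 drawn from Dirichlet(p1, p2 | λc1, λc2) with c1 = c2 = 1, once with λ = 1 and once with λ = 1000. The empirical mean stays put while the empirical variance shrinks roughly by the factor predicted by the closed-form expression above.

    import numpy as np

    rng = np.random.default_rng(0)
    c = np.array([1.0, 1.0])          # base counts c1 = c2 = 1

    for lam in (1.0, 1000.0):
        samples = rng.dirichlet(lam * c, size=50000)[:, 0]   # p1 only
        # Closed-form moments of a Dirichlet component:
        #   E[p1]   = c1 / (c1 + c2)                       (unaffected by lam)
        #   Var[p1] = E[p1] * (1 - E[p1]) / (lam * (c1 + c2) + 1)
        a = lam * c
        mean = a[0] / a.sum()
        var = mean * (1.0 - mean) / (a.sum() + 1.0)
        print(f"lambda={lam:>6}: empirical mean={samples.mean():.3f} "
              f"var={samples.var():.6f} (theory: {mean:.3f}, {var:.6f})")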
Furthermore, φ can be any probabilistic function; indeed, any discriminative classifier (such as C4.5) will do, as long as it generates some kind of probability. Given φ, what remains to be done is essentially to train it on samples bootstrapped (i.e., resampled) from the training data according to θ, which we draw from Dirichlet(θ | v). To be more specific, suppose that we have a four-sentence text and an array of probabilities θ = (0.4, 0.3, 0.2, 0.1) drawn from a Dirichlet distribution; that is, we have a preference factor of 0.4 for the lead sentence, 0.3 for the second sentence, and so on. Then we resample lead sentences with replacement from the training data with probability 0.4, second sentences with probability 0.3, and so forth. Obviously, a high preference factor causes the associated sentence to be chosen more often than those with a low preference.</Paragraph> <Paragraph position="28"> Thus, given a text T = (a, b, c, d) with θ = (0.4, 0.3, 0.2, 0.1), we could end up with a data set dominated by a few sentence types, such as T′ = (a, a, a, b), on which we proceed to train a classifier in place of T. Intuitively, this amounts to inducing the classifier to attend to, or focus on, a particular region of a text and dismiss the rest. Note an interesting parallel to boosting (Freund and Schapire, 1996) and the alternating decision tree (Freund and Mason, 1999).</Paragraph> <Paragraph position="29"> In MC, for each θ(k) drawn from Dirichlet(θ | v), we resample sentences from the training data using the probabilities specified by θ(k), use them to train a classifier, and run it on a test document d to find, for each sentence in d, its probability of being a 'pick' (summary-worthy) sentence, i.e., P(yi | θ(k)), which we then average across the θ's. In the experiments described later, we apply the procedure for 20,000 runs (meaning we run a classifier on each of 20,000 θ's we draw) and average over them to find an estimate for P(yi | v).</Paragraph> <Paragraph position="30"> As for BIC, we operate along the same lines as MC, except that we bootstrap sentences using only E[θ], and the model complexity term, namely −(k/2) ln N, is dropped, as it has no effect on the ranking of sentences. As with MC, we train a classifier on the bootstrapped samples and run it on a test document. Though we work with a fixed set of parameters, bootstrapping based on them still fluctuates, producing a slightly different set of samples each time we run the operation. To get reasonable convergence in the experiments, we took the procedure to 5,000 iterations and averaged over the results.</Paragraph> <Paragraph position="32"> With either BIC or MC, building a summarizer is a fairly straightforward matter. Given a document d and a compression rate r, the summarizer simply ranks the sentences in d based on P(yi | v) and picks the r portion of highest-ranking sentences.</Paragraph> </Section>
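The pieces above fit together as in the following sketch, which is a hypothetical rendering rather than the authors' implementation: numpy supplies the Dirichlet draws, scikit-learn's DecisionTreeClassifier stands in for the decision-tree learner (the paper itself works with Weka's C4.5), feature extraction is abstracted behind a feature matrix X, and the number of draws is kept small purely for illustration.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier  # stand-in for C4.5

    def mc_rank(X_train, y_train, pos_train, X_test, v, lam=1.0,
                n_draws=200, seed=0):
        """Monte Carlo estimate of P(yi | v) for every sentence of a test document.

        X_train   : feature matrix of training sentences (numpy array)
        y_train   : 1 if a training sentence is a 'pick' sentence, else 0
        pos_train : integer textual position (0, 1, 2, ...) of each training sentence
        v         : observed vote counts per position (the DOV)
        lam       : the scale parameter lambda
        """
        rng = np.random.default_rng(seed)
        # Smoothing keeps every Dirichlet parameter positive
        # (the constant 0.01 is an assumption of this sketch, not of the paper).
        alpha = lam * (np.asarray(v, dtype=float) + 0.01)
        scores = np.zeros(len(X_test))

        for _ in range(n_draws):
            theta = rng.dirichlet(alpha)
            # Resample training sentences with replacement; a sentence at
            # position i is drawn with probability proportional to theta[i].
            p = theta[pos_train]
            p = p / p.sum()
            idx = rng.choice(len(X_train), size=len(X_train), replace=True, p=p)
            clf = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
            proba = clf.predict_proba(X_test)
            if 1 in clf.classes_:              # guard: a resample may lack positives
                scores += proba[:, list(clf.classes_).index(1)]

        return scores / n_draws

    # Summarizing at compression rate r then amounts to keeping the top-ranked
    # r portion of sentences:
    #   summary_idx = np.argsort(-scores)[: int(r * len(scores))]

The BIC variant described above would differ only in replacing the per-draw theta with its expectation alpha / alpha.sum() and averaging over repeated bootstraps from that single fixed vector.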
<Section position="4" start_page="250" end_page="252" type="metho"> <SectionTitle> 3 Working with Bayesian Summarist </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="250" end_page="252" type="sub_section"> <SectionTitle> 3.1 C4.5 </SectionTitle> <Paragraph position="0"> In what follows, we will look at whether and how the Bayesian approach, when applied to the C4.5 decision tree learner (Quinlan, 1993), leverages its performance on real-world data. This means our model takes the form of equation 5 (with Dirichlet(θ | λv) in place of Dirichlet(θ | v)), with the likelihood function φ filled out by C4.5. Moreover, we compare two versions of the classifier: one with BIC/MC and one without. We used the Weka implementation of the algorithm (with default settings) in the experiments described below (Witten and Frank, 2000).</Paragraph> <Paragraph position="4"> While C4.5 is configured here to work in a binary (positive/negative) classification scheme, we run it in a 'distributional' mode and use a particular class membership probability it produces, namely the probability of a sentence being positive, i.e., a pick (summary-worthy) sentence, instead of a category label.</Paragraph> <Paragraph position="5"> Attributes for C4.5 are broadly intended to represent aspects of a sentence in a document, the object of interest here. Thus, for each sentence ψ, the encoding involves the following set of attributes or features. 'LocSen' gives a normalized location of ψ in the text, i.e., a normalized distance from the top of the text; likewise, 'LocPar' gives a normalized location of the paragraph in which ψ occurs, and 'LocWithinPar' records its normalized location within that paragraph. Also included are a few length-related features, such as the length of the text and of the sentence. Furthermore, we brought in a language-specific feature which we call 'EndCue'; it records the morphology of the linguistic element that ends ψ, such as inflection, part of speech, etc.</Paragraph> <Paragraph position="6"> In addition, we make use of a weight feature ('Weight') to record the importance of ψ based on tf.idf. Let ψ = w1, ..., wn, for words wi. Then the weight W(ψ) is given as:

W(ψ) = Σi tf(wi) · log(N / df(wi)),

where tf(w) denotes the frequency of word w in the given document, df(w) denotes the 'document frequency' of w, i.e., the number of documents which contain an occurrence of w, and N represents the total number of documents. (Footnote 5: Although one could reasonably argue for normalizing W(ψ) by sentence length, it is not entirely clear at the moment whether doing so improves performance.) Also among the features used here is 'Pos,' a feature intended to record the position or textual order of ψ, given by how many sentences away it occurs from the top of the text, starting with 0.</Paragraph> <Paragraph position="10"> While we do believe that the attributes discussed above have a lot to do with the likelihood that a given sentence becomes part of a summary, we choose not to treat them as parameters of the Bayesian model, simply to keep it from getting unduly complex. Recall the graphical model in figure 2.</Paragraph> </Section>
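As a concrete reading of the weight feature, the sketch below (our illustration; whitespace tokenization and the df guard are assumptions made only here) computes W(ψ) for a sentence given its document and a toy document collection.

    import math
    from collections import Counter

    def weight(sentence_tokens, doc_tokens, collection):
        """W = sum over words w in the sentence of tf(w) * log(N / df(w)),
        with tf counted in the sentence's document and df over the collection."""
        N = len(collection)
        tf = Counter(doc_tokens)
        total = 0.0
        for w in sentence_tokens:
            df = max(sum(1 for d in collection if w in d), 1)  # guard against df = 0
            total += tf[w] * math.log(N / df)
        return total

    # Toy usage with three single-document 'texts'.
    collection = [set("markets rallied on strong earnings".split()),
                  set("the central bank cut rates".split()),
                  set("earnings beat expectations again".split())]
    doc = "markets rallied on strong earnings".split()
    print(weight("markets rallied".split(), doc, collection))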
<Section position="2" start_page="252" end_page="252" type="sub_section"> <SectionTitle> 3.2 Test Data </SectionTitle> <Paragraph position="0"> Here is how we created the test data. We collected three pools of texts from different genres, namely columns, editorials and news stories, from a Japanese financial paper (Nihon Keizai Shinbun) published in 1995, each with 25 articles. We then asked 112 Japanese students to go over each article and identify the 10% of sentences they found most important for creating a summary of that article. For each sentence, we recorded how many of the subjects were in favor of its inclusion in a summary. On average, about seven people worked on each text. In the following, we say a sentence is 'positive' if three or more people would like to see it in a summary, and 'negative' otherwise. For convenience, let us call the corpus of columns G1K3, that of editorials G2K3, and that of news stories G3K3. Additional details are found in table 1.</Paragraph> </Section> </Section> </Paper>