<?xml version="1.0" standalone="yes"?> <Paper uid="P00-1041"> <Title>Headline Generation Based on Statistical Translation</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The System </SectionTitle> <Paragraph position="0"> As in any language generation task, summarization can be conceptually modeled as consisting of two major sub-tasks: (1) content selection, and (2) surface realization. Parameters for statistical models of both of these tasks were estimated from a training corpus of approximately 25,000 Reuters news-wire articles from 1997 on politics, technology, health, sports and business. The target documents, the summaries to which the system needed to learn a translation mapping, were the headlines accompanying the news stories.</Paragraph> <Paragraph position="1"> The documents were preprocessed before training: formatting and mark-up information, such as font changes and SGML/HTML tags, was removed; punctuation, except apostrophes, was also removed. Apart from these two steps, no other normalization was performed. Further processing, such as lemmatization, might be useful, producing smaller and better language models, but this was not evaluated for this paper.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Content Selection </SectionTitle> <Paragraph position="0"> Content selection requires that the system learn a model of the relationship between the appearance of some features in a document and the appearance of corresponding features in the summary.</Paragraph> <Paragraph position="1"> This can be modeled by estimating the likelihood of some token appearing in a summary given that some tokens (one or more, possibly different tokens) appeared in the document to be summarized. 
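To make this estimation step concrete, the token-level likelihood described above can be computed by simple counting over document/headline pairs. The sketch below uses invented toy data and illustrative function names; it shows the idea, not the paper's implementation:

```python
from collections import Counter

def estimate_selection_probs(pairs):
    """Estimate P(token appears in the headline | token appears in the
    document) from (document_tokens, headline_tokens) training pairs."""
    doc_counts = Counter()    # number of documents containing each token
    joint_counts = Counter()  # of those, how many headlines also contain it
    for doc_tokens, headline_tokens in pairs:
        doc_set, headline_set = set(doc_tokens), set(headline_tokens)
        for token in doc_set:
            doc_counts[token] += 1
            if token in headline_set:
                joint_counts[token] += 1
    return {t: joint_counts[t] / doc_counts[t] for t in doc_counts}

# Toy document/headline pairs, invented purely for illustration.
pairs = [
    (["apple", "plans", "direct", "online", "sales"], ["apple", "sales"]),
    (["apple", "stock", "rises"], ["apple"]),
    (["new", "online", "store", "opens"], ["online", "store"]),
]
probs = estimate_selection_probs(pairs)
```

A token that appears in the headline of every document containing it receives probability 1.0; one that never survives into a headline receives probability 0.0.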
The very simplest, &quot;zero-level&quot; model for this relationship is the case when the two tokens in the document and the summary are identical.</Paragraph> <Paragraph position="2"> This can be computed as the conditional probability of a word occurring in the summary given that the word appeared in the document:</Paragraph> <Paragraph position="3"> P(w ∈ H | w ∈ D) </Paragraph> <Paragraph position="4"> where H and D represent the bags of words that the headline and the document contain.</Paragraph> <Paragraph position="5"> Once the parameters of a content selection model have been estimated from a suitable document/summary corpus, the model can be used to compute selection scores for candidate summary terms, given the terms occurring in a particular source document. Specific subsets of terms, representing the core summary content of an article, can then be compared for suitability in generating a summary. This can be done at two levels: (1) the likelihood of the length of resulting summaries, given the source document, and (2) the likelihood of forming a coherently ordered summary from the content selected.</Paragraph> <Paragraph position="6"> The length of the summary can also be learned as a function of the source document. The simplest model for document length is a fixed length based on document genre. For the discussions in this paper, this will be the model chosen. Figure 2 shows the distribution of headline length. As can be seen, a Gaussian distribution could also model the likely lengths quite accurately.</Paragraph> <Paragraph position="7"> Finally, to simplify parameter estimation for the content selection model, we can assume that the likelihood of a word in the summary is independent of other words in the summary. In this case, the probability of any particular summary-content candidate can be calculated simply as the product of the probabilities of the terms in the candidate set. 
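Under this independence assumption, and with a Gaussian model of headline length of the kind suggested by Figure 2, a candidate content set could be scored as in the following sketch (the mean, standard deviation, and selection probabilities below are invented for illustration):

```python
import math

def candidate_log_prob(candidate, selection_probs, mean_len=8.0, std_len=3.0):
    """Zero-level log-probability of a candidate content set:
    independent term selection plus a Gaussian length likelihood."""
    # Independence assumption: sum the log selection probabilities,
    # with a tiny floor for terms unseen in training.
    score = sum(math.log(selection_probs.get(w, 1e-9)) for w in candidate)
    # Gaussian log-density of the candidate's length.
    n = len(candidate)
    score += -0.5 * ((n - mean_len) / std_len) ** 2
    score -= math.log(std_len * math.sqrt(2 * math.pi))
    return score

# Invented selection probabilities for illustration.
selection_probs = {"apple": 0.6, "sales": 0.4, "online": 0.3}
two_word = candidate_log_prob(["apple", "sales"], selection_probs)
one_word = candidate_log_prob(["online"], selection_probs)
```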
Therefore, the overall probability of a candidate summary, H, consisting of words (w1, w2, ..., wn), under the simplest, zero-level summary model based on the previous assumptions, can be computed as the product of the likelihood of (i) the terms selected for the summary, (ii) the length of the resulting summary, and (iii) the most likely sequencing of the terms in the content set.</Paragraph> <Paragraph position="8"> In general, the probability of a word appearing in a summary cannot be considered to be independent of the structure of the summary, but the independence assumption is an initial modeling choice.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Surface Realization </SectionTitle> <Paragraph position="0"> The probability of any particular surface ordering as a headline candidate can be computed by modeling the probability of word sequences. The simplest model is a bigram language model, where the probability of a word sequence is approximated by the product of the probabilities of seeing each term given its immediate left context. Probabilities for sequences that have not been seen in the training data are estimated using back-off weights (Katz, 1987). 
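A minimal version of such a model is sketched below on invented toy sentences. Note that full Katz back-off uses discounted counts and context-specific back-off weights; this sketch substitutes a single constant back-off factor purely for illustration:

```python
from collections import Counter

class BigramModel:
    """Bigram language model with a simplified back-off: unseen bigrams
    fall back to a scaled unigram probability (a stand-in for Katz
    back-off, which uses discounted counts and per-context weights)."""

    def __init__(self, sentences, backoff=0.4):
        self.backoff = backoff
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in sentences:
            tokens = ["BOS"] + sent + ["EOS"]  # sentence boundary markers
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.total = sum(self.unigrams.values())

    def prob(self, prev, word):
        # Relative frequency for seen bigrams, scaled unigram otherwise.
        if self.bigrams[(prev, word)] > 0:
            return self.bigrams[(prev, word)] / self.unigrams[prev]
        return self.backoff * self.unigrams[word] / self.total

model = BigramModel([["apple", "cuts", "prices"],
                     ["apple", "opens", "store"]])
```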
As mentioned earlier, in principle, surface linearization calculations can be carried out with respect to any textual spans, from characters on up, and could take into account additional information at the phrase level.</Paragraph> <Paragraph position="1"> They could also, of course, be extended to use higher-order n-grams, provided that sufficient numbers of training headlines were available to estimate the probabilities.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Search </SectionTitle> <Paragraph position="0"> Even though content selection and summary structure generation have been presented separately, there is no reason for them to be carried out independently; in our current implementation, they are used simultaneously to contribute to an overall weighting scheme that ranks possible summary candidates against each other. Thus, the overall score used in ranking can be obtained as a weighted combination of the content and structure model log probabilities. Cross-validation is used to learn the weights α, β, and γ for a particular document genre.</Paragraph> <Paragraph position="1"> To generate a summary, it is necessary to find a sequence of words that maximizes the probability, under the content selection and summary structure models, that it was generated from the document to be summarized. In the simplest, zero-level model that we have discussed, since each summary term is selected independently and the summary structure model is first-order Markov, it is possible to use Viterbi beam search (Forney, 1973) to efficiently find a near-optimal summary.2</Paragraph> <Paragraph position="2"> Other statistical models might require the use of a different heuristic search algorithm. An example of the results of a search for candidates of various lengths is shown in Figure 1. 
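A sketch of such a beam-search decoder is given below. The weights, toy scores, and the repeated-term filter are invented for illustration; the actual system learns its weights by cross-validation, as described above:

```python
import math

def beam_search(vocab, length, sel_logp, bigram_logp,
                alpha=1.0, beta=1.0, beam=3):
    """Beam search for a high-scoring word sequence of a fixed length.
    Each extension is scored by a weighted sum of the content-selection
    log-probability and the bigram log-probability of the transition."""
    beams = [(0.0, ["BOS"])]  # (score, partial sequence)
    for _ in range(length):
        candidates = []
        for score, seq in beams:
            for word in vocab:
                if word in seq:  # crude way to discourage repeated terms
                    continue
                new_score = (score + alpha * sel_logp[word]
                             + beta * bigram_logp(seq[-1], word))
                candidates.append((new_score, seq + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam]  # keep only the best few hypotheses
    return beams[0][1][1:]  # best sequence, minus the BOS marker

# Invented toy scores for illustration.
sel_logp = {"apple": math.log(0.6), "cuts": math.log(0.5),
            "prices": math.log(0.5), "banana": math.log(0.1)}
fluent = {("BOS", "apple"), ("apple", "cuts"), ("cuts", "prices")}
def bigram_logp(prev, word):
    return 0.0 if (prev, word) in fluent else math.log(0.1)

headline = beam_search(["apple", "cuts", "prices", "banana"], 3,
                       sel_logp, bigram_logp)
```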
The figure shows the set of headlines generated by the system when run against a real news story discussing Apple Computer's decision to start direct internet sales and comparing it to the strategy of other computer makers.</Paragraph> <Paragraph position="3"> 2 In the experiments discussed in the following section, a beam width of three and a minimum beam size of twenty states were used. In other experiments, we also tried to strongly discourage paths that repeated terms, by reweighting after backtracking at every state, since otherwise bigrams that start repeating often seem to pathologically overwhelm the search; this reweighting violates the first-order Markov assumptions, but seems to do more good than harm.</Paragraph> </Section> </Section> </Paper>