File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/w00-0403_intro.xml
Size: 4,069 bytes
Last Modified: 2025-10-06 14:00:52
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0403"> <Title>Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies</Title> <Section position="3" start_page="0" end_page="21" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> On October 12, 1999, a relatively small number of news sources mentioned in passing that Pakistani Defense Minister Gen. Pervaiz Musharraf was away visiting Sri Lanka. However, all world agencies would be actively reporting on the major events that were to happen in Pakistan in the following days: Prime Minister Nawaz Sharif announced that in Gen.</Paragraph> <Paragraph position="1"> Musharraf's absence, the Defense Minister had been sacked and replaced by General Zia Addin. Large numbers of messages from various sources started to inundate the newswire: about the army's occupation of the capital, the Prime Minister's ouster and his subsequent placement under house arrest, Gen.</Paragraph> <Paragraph position="2"> Musharraf's return to his country, his ascendancy to power, and the imposition of military control over Pakistan.</Paragraph> <Paragraph position="3"> The paragraph above summarizes a large amount of news from different sources. While it was not automatically generated, one can imagine the use of such automatically generated summaries. In this paper we will describe how multi-document summaries are built and evaluated.</Paragraph> <Section position="1" start_page="0" end_page="21" type="sub_section"> <SectionTitle> 1.1 Topic detection and multi-document summarization </SectionTitle> <Paragraph position="0"> The process of identifying all articles on an emerging event is called Topic Detection and Tracking (TDT). A large body of research in TDT has been created over the past two years [Allan et al., 98]. 
We will present an extension of our own research on TDT [Radev et al., 1999] to cover summarization of multi-document clusters.</Paragraph> <Paragraph position="1"> Our entry in the official TDT evaluation, called CIDR [Radev et al., 1999], uses modified TF*IDF to produce clusters of news articles on the same event. We developed a new technique for multi-document summarization (or MDS), called centroid-based summarization (CBS), which uses as input the centroids of the clusters produced by CIDR to identify which sentences are central to the topic of the cluster as a whole, rather than to the individual articles. We have implemented CBS in a system named MEAD.</Paragraph> <Paragraph position="2"> The main contributions of this paper are: the development of a centroid-based multi-document summarizer, the use of cluster-based sentence utility (CBSU) and cross-sentence informational subsumption (CSIS) for evaluation of single- and multi-document summaries, two user studies that support our findings, and an evaluation of MEAD.</Paragraph> <Paragraph position="3"> An event cluster, produced by a TDT system, consists of chronologically ordered news articles from multiple sources, which describe an event as it develops over time. Event clusters range from 2 to 10 documents, from which MEAD produces summaries in the form of sentence extracts.</Paragraph> <Paragraph position="4"> A key feature of MEAD is its use of cluster centroids, which consist of words that are central not only to one article in a cluster, but to all the articles.</Paragraph> <Paragraph position="5"> MEAD is significantly different from previous work on multi-document summarization [Radev &amp; McKeown, 1998; Carbonell and Goldstein, 1998; Mani and Bloedorn, 1999; McKeown et al., 1999], which uses techniques such as graph matching, maximal marginal relevance, or language generation. Finally, evaluation of multi-document summaries is a difficult problem. There is not yet a widely accepted evaluation scheme. 
We propose a utility-based evaluation scheme, which can be used to evaluate both single-document and multi-document summaries.</Paragraph> </Section> </Section> </Paper>