<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1059"> <Title>Producing Biographical Summaries: Combining Linguistic Knowledge with Corpus Statistics</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The explosion of the World Wide Web has brought with it a vast hoard of information, most of it relatively unstructured. This has created a demand for new ways of managing this often unwieldy body of dynamically changing information. The goal of automatic text summarization is to take a partially-structured source text, extract information content from it, and present the most important content in a condensed form in a manner sensitive to the needs of the user and task (Mani and Maybury 1999). Summaries can be 'generic', i.e., aimed at a broad audience, or topic-focused, i.e., tailored to the requirements of a particular user or group of users. Multi-Document Summarization (MDS) is, by definition, the extension of single-document summarization to collections of related documents. MDS can potentially help the user to see at a glance what a collection is about, or to examine similarities and differences in the information content in the collection.</Paragraph> <Paragraph position="1"> Specialized multi-document summarization systems can be constructed for various applications; here we discuss a biographical summarizer. Biographies can, of course, be long, as in book-length biographies, or short, as in an author's description on a book jacket. The nature of descriptions in the biography can vary, from physical characteristics (e.g., for criminal suspects) to scientific or other achievements (e.g., a speaker's biography). 
The crucial point here is that facts about a person's life are selected, organized, and presented so as to meet the compression and task requirements.</Paragraph> <Paragraph position="2"> While book-quality biographies are out of reach of computers, many other kinds can be synthesized by sifting through large quantities of on-line information, a task that is tedious for humans to carry out. We report here on the development of a biographical MDS summarizer that summarizes information about people described in the news. Such a summarizer is of interest, for example, to analysts who want to automatically construct a dossier about a person over time.</Paragraph> <Paragraph position="3"> Rather than determining in advance what sort of information should go into a biography, our approach is more data-driven, relying on discovering how people are actually described in news reports in a collection. We use corpus statistics from a background corpus along with linguistic knowledge to select and merge descriptions from a document collection, removing redundant descriptions. The focus here is on synthesizing succinct descriptions. The problem of assembling these descriptions into a coherent narrative is not a focus of our paper; the system currently uses canned text methods to produce output text containing these descriptions. Obviously, the merging of descriptions should take temporal information into account; this very challenging issue is also not addressed here.</Paragraph> <Paragraph position="4"> To give a clearer idea of the system's output, here are some examples of biographies produced by our system (the descriptions themselves are underlined, the rest is canned text). The biographies contain descriptions of the salient attributes and activities of people in the corpus, along with lists of their associates. These short summaries illustrate the extent of compression provided. 
The first two summaries are of a collection of 1,300 wire service news documents on the Clinton impeachment proceedings (707,000 words in all, called the 'Clinton' corpus). In this corpus, there are 607 sentences mentioning Vernon Jordan by name, from which the system extracted 82 descriptions expressed as appositives (78) and relative clauses (4), along with 65 descriptions consisting of sentences whose deep subject is Jordan. The 4 relative clauses are duplicates of one another: who helped Lewinsky find a job. The 78 appositives fall into just 2 groups: friend (or equivalent descriptions, such as confidant), adviser (or equivalent such as lawyer). The sentential descriptions are filtered in part based on the presence of verbs like testify, plead, or greet that are strongly associated with the head noun of the appositive, namely friend.</Paragraph> <Paragraph position="5"> The target length can be varied to produce longer summaries.</Paragraph> <Paragraph position="6"> Vernon Jordan is a presidential friend and a Clinton adviser. He is 63 years old. He helped Ms. Lewinsky find a job. He testified that Ms.</Paragraph> <Paragraph position="7"> Monica Lewinsky said that she had conversations with the president, that she talked to the president. He has numerous acquaintances, including Susan Collins, Betty Currie, Pete Domenici, Bob Graham, James Jeffords and Linda Tripp.</Paragraph> <Paragraph position="8"> 1,300 docs, 707,000 words (Clinton corpus) 607 Jordan sentences, 78 extracted appositives, 2 groups: friend, adviser.</Paragraph> <Paragraph position="9"> Henry Hyde is a Republican chairman of House Judiciary Committee and a prosecutor in Senate impeachment trial. He will lead the Judiciary Committee's impeachment review. Hyde urged his colleagues to heed their consciences, the voice that whispers in our ear, 'duty, duty, duty.' 
Clinton corpus, 503 Hyde sentences, 108 extracted appositives, 2 groups: chairman, impeachment prosecutor.</Paragraph> <Paragraph position="10"> Victor Polay is the Tupac Amaru rebels' top leader, founder and the organization's commander-and-chief. He was arrested again in 1992 and is serving a life sentence. His associates include Alberto Fujimori, Tupac Amaru Revolutionary, and Nestor Cerpa.</Paragraph> <Paragraph position="11"> 73 docs, 38,000 words, 24 Polay sentences, 10 extracted appositives, 3 groups: leader, founder and commander-in-chief.</Paragraph> </Section> </Paper>