File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/98/x98-1028_abstr.xml

Size: 13,309 bytes

Last Modified: 2025-10-06 13:49:38

<?xml version="1.0" standalone="yes"?>
<Paper uid="X98-1028">
  <Title>A Text-Extraction Based Summarizer</Title>
  <Section position="1" start_page="0" end_page="224" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> We present an automated method of generating human-readable summaries from a variety of text documents, including newspaper articles, business reports, government documents, and even broadcast news transcripts. Our approach exploits the empirical observation that much written text displays certain regularities of organization and style, which we call the Discourse Macro Structure (DMS). A summary is therefore created to reflect the components of a given DMS. In order to produce a coherent and readable summary we select continuous, well-formed passages from the source document and assemble them into a mini-document within a DMS template. In this paper we describe an automated summarizer that can generate both short indicative abstracts, useful for quick scanning of a list of documents, and longer informative digests that can serve as surrogates for the full text. The summarizer can assist the users of an information retrieval system in assessing the quality of the results returned from a search, preparing reports and memos for their customers, and even building more effective search queries.</Paragraph>
    <Paragraph position="1"> Introduction A good summarization tool can be of enormous help to those who have to process large numbers of documents. In information retrieval one would benefit greatly from having content-indicative quick-read summaries supplied along with the titles returned from a search. Similarly, application areas like routing, news on demand, market intelligence and topic tracking would benefit from a good summarization tool.</Paragraph>
    <Paragraph position="2"> Perhaps the most difficult problem in designing an automatic text summarizer is to define what a summary is, and how to tell a summary from a non-summary, or a good summary from a bad one.</Paragraph>
    <Paragraph position="3"> The answer depends in part upon who the summary is intended for, and in part upon what it is meant to achieve, which in large measure precludes any objective evaluation. A good summary should at least be a good reflection of the original document while being considerably shorter than the original, thus saving the reader valuable reading time.</Paragraph>
    <Paragraph position="4"> In this paper we describe an automatic way to generate summaries from text-only documents. The summarizer we developed can create general and topical indicative summaries, and also topical informative summaries. Our approach is domain-independent and takes advantage of certain organizational regularities that were observed in news-type documents. The system participated in a third-party evaluation program and turned out to be one of the top-performing summarizers. The quality/length ratio was especially good, since our summaries tend to be very short (10% of the original length).</Paragraph>
    <Paragraph position="5"> The summarizer is still undergoing improvement and expansion in order to be able to summarize a wide variety of documents. It is also used successfully as a tool to solve different problems, like information retrieval and topic tracking.</Paragraph>
    <Paragraph position="6"> Task Description and Related Work For most of us, a summary is a brief synopsis of the content of a larger document, an abstract recounting the main points while suppressing most details. One purpose of having a summary is to quickly learn some facts, and decide what you want to do with the entire story. Depending on how they are meant to be used one can distinguish between two kinds of summaries. Indicative summaries are not a replacement for the original text but are meant to be a good reflection of the kind of information that can be found in the original document. Informative summaries can be used as a replacement of the original document and should contain the main facts of the document. Independent of their usage summaries can be classified as general summaries or topical summaries.</Paragraph>
    <Paragraph position="7"> A general summary addresses the main points of the document ignoring unrelated issues. A topical summary will report the main issues relevant to a certain topic, which might have little to do with the main topic of the document. Both summaries might give very different impressions of the same document. In this paper we describe a summarizer that summarizes one document, text only, at a time. It is capable of producing both topical and generic indicative summaries, and topical informative summaries.</Paragraph>
    <Paragraph position="8"> Our early inspiration, and a benchmark, have been the Quick Read Summaries, posted daily off the front page of the New York Times on-line edition (http://www.nytimes.com). These summaries, produced manually by NYT staff, are assembled out of passages, sentences, and sometimes sentence fragments taken from the main article with very few, if any, editorial adjustments. The effect is a collection of perfectly coherent tidbits of news: the who, the what, and when, but perhaps not why. Indeed, these summaries leave out most of the details, and cannot serve as surrogates for the full article. Yet, they allow the reader to learn some basic facts, and then to choose which stories to open.</Paragraph>
    <Paragraph position="9"> This kind of summarization, where appropriate passages are extracted from the original text, is very efficient, and arguably effective, because it doesn't require generation of any new text, and thus lowers the risk of misinterpretation. It is also relatively easy to automate, because we only need to identify the suitable passages among the other text, a task that can be accomplished via shallow NLP and statistical techniques. Nonetheless, there are a number of serious problems to overcome before an acceptable quality summarizer can be built. For one, quantitative methods alone are generally too weak to deal adequately with the complexities of natural language text. For example, one popular approach to automated abstract generation has been to select key sentences from the original text using statistical and linguistic cues, perform some cosmetic adjustments in order to restore cohesiveness, and then output the result as a single passage, e.g., (Luhn 1958) (Paice 1990) (Brandow, Mitze, &amp; Rau 1995) (Kupiec, Pedersen, &amp; Chen 1995). The main advantage of this approach is that it can be applied to almost any kind of text. The main problem is that it hardly ever produces an intelligible summary: the resulting passage often lacks coherence, is hard to understand, is sometimes misleading, and may be just plain incomprehensible. In fact, some studies show (cf. (Brandow, Mitze, &amp; Rau 1995)) that simply selecting the first paragraph from a document tends to produce better summaries than a sentence-based algorithm.</Paragraph>
    <Paragraph position="10"> A far more difficult, but arguably more &amp;quot;human-like&amp;quot; method to summarize text (with the possible exception of editorial staff of some well-known dailies) is to comprehend it in its entirety, and then write a summary &amp;quot;in your own words.&amp;quot; What this amounts to, computationally, is a full linguistic analysis to extract key text components from which a summary could be built. One previously explored approach, e.g., (Ono, Sumita, &amp; Miike 1994) (McKeown &amp; Radev 1995), was to extract discourse structure elements and then generate the summary within this structure. In another approach, e.g., (DeJong 1982) (Lehnert 1981), pre-defined summary templates were filled with text elements obtained using information extraction techniques. Marcu (Marcu 1997a) uses rhetorical structure analysis to guide the selection of text segments for the summary; similarly, Teufel and Moens (Teufel &amp; Moens 1997) analyze the argumentative structure of discourse to extract appropriate sentences. While these approaches can produce very good results, they are yet to be demonstrated in a practical system applied to a reasonable size domain. The main difficulty is the lack of an efficient and reliable method of computing the required discourse structure.</Paragraph>
    <Section position="1" start_page="223" end_page="223" type="sub_section">
      <SectionTitle>
Our Approach
</SectionTitle>
      <Paragraph position="0"> The approach we adopted in our work falls somewhere between simple sentence extraction and text understanding, although philosophically we are closer to NYT cut-and-paste editors. We overcome the shortcomings of sentence-based summarization by working at the paragraph level instead. Our summarizer takes advantage of paragraph segmentation and the underlying Discourse Macro Structure of news texts. Both will be discussed below.</Paragraph>
    </Section>
    <Section position="2" start_page="223" end_page="224" type="sub_section">
      <SectionTitle>
Paragraphs
</SectionTitle>
      <Paragraph position="0"> Paragraphs are generally self-contained units, more so than single sentences. They usually address a single thought or issue, and their relationships with the surrounding text are somewhat easier to trace.</Paragraph>
      <Paragraph position="1"> This notion has been explored by Cornell's group (Salton et al. 1994) to design a summarizer that traces inter-paragraph relationships and selects the &amp;quot;best connected&amp;quot; paragraphs for the summary. As in Cornell's system, our summaries are made up of paragraphs taken out of the original text. In addition, in order to obtain more coherent summaries, we impose some fundamental discourse constraints on the generation process, but avoid a full discourse analysis.</Paragraph>
      <Paragraph position="2"> We would like to note at this point that the summarization algorithm, as described in detail later, does not explicitly depend on, nor indeed require, input text that is pre-segmented into paragraphs. In general, passages of any length can be used, although this choice will impact the complexity of the solution. Lifting well-defined paragraphs from a document and then recombining them into a summary is relatively more straightforward than recombining other text units. For texts where there is no structure at all, as in a closed-captioned stream in broadcast television, there are several ways to create artificial segments. The simplest would be to use fixed word-count passages. Alternatively, content-based segmentation techniques may be applicable, e.g., Hearst's Text-Tiling (Hearst 1997).</Paragraph>
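The fixed word-count option mentioned above could be sketched roughly as follows. This is only an illustrative sketch, not the segmenter used in the system described here: the function name fixed_length_passages, the parameter words_per_passage, and the whitespace tokenization are all assumptions introduced for the example.

```python
def fixed_length_passages(text, words_per_passage=100):
    """Split unstructured text into consecutive passages of a fixed word count.

    Hypothetical helper for illustration: words are taken to be
    whitespace-separated tokens, and the last passage may be shorter.
    """
    if not words_per_passage >= 1:
        raise ValueError("words_per_passage must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


# Example: an unpunctuated caption-like stream cut into 5-word passages.
stream = "the senate voted today on the budget bill after a long debate"
passages = fixed_length_passages(stream, words_per_passage=5)
for passage in passages:
    print(passage)
```

Passage boundaries produced this way ignore content entirely, which is why a content-based technique may be preferable when boundaries matter.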
      <Paragraph position="3"> On the other hand, we may argue that essentially any length segments of text can be used so long as one could figure out a way to reconnect them into paragraph-like passages even if their boundaries were somewhat off. This is actually not unlike dealing with the texts with very fine grained paragraphs, as is often the case with news-wire articles. For such texts, in order to obtain an appropriate level of chunking, some paragraphs need to be reconnected into longer passages. This may be achieved by tracking co-references and other text cohesiveness devices, and their choice will depend upon the initial segmentation we work up from.</Paragraph>
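As a rough illustration of reconnecting fine-grained paragraphs into longer passages, the sketch below merges adjacent paragraphs until a minimum word count is reached. Note that it substitutes a plain length criterion for the co-reference and cohesion tracking suggested above, so it is a crude stand-in rather than the method described in this paper; the name merge_short_paragraphs and the min_words threshold are assumptions made for the example.

```python
def merge_short_paragraphs(paragraphs, min_words=50):
    """Greedily join adjacent paragraphs until each passage has min_words words.

    Hypothetical helper: uses length only, not co-reference or other
    cohesion cues, to decide where passages end.
    """
    passages = []
    buffer = []
    count = 0
    for paragraph in paragraphs:
        buffer.append(paragraph)
        count += len(paragraph.split())
        if count >= min_words:
            passages.append(" ".join(buffer))
            buffer, count = [], 0
    if buffer:
        # A short trailing remainder is attached to the previous passage, if any.
        if passages:
            passages[-1] = passages[-1] + " " + " ".join(buffer)
        else:
            passages.append(" ".join(buffer))
    return passages


# Example: very short news-wire style paragraphs, merged to at least 4 words each.
wire = ["a b", "c d e", "f", "g h i j", "k"]
merged = merge_short_paragraphs(wire, min_words=4)
print(merged)
```

A cohesion-aware version would replace the length test with a check on whether the next paragraph continues the references of the current passage.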
      <Paragraph position="4"> Discourse Macro Structure of a Text It has been observed, e.g., (Rino &amp; Scott 1994), (Weissberg &amp; Buker 1990), that certain types of texts, such as news articles, technical reports, research papers, etc., conform to a set of style and organization constraints, called the Discourse Macro Structure (DMS), which help the author to achieve a desired communication effect. For instance, both physics papers and abstracts align closely with the Introduction-Methodology-Results-Discussion-Conclusion macro structure. It is likely that other scientific and technical texts will also conform to this or a similar structure, since this is exactly the structure suggested in technical writing guidebooks, e.g. (Weissberg &amp; Buker 1990). One observation to make here is that perhaps a proper summary or an abstract should reflect the DMS of the original document. On the other hand, we need to note that a summary can be given a different DMS, and this choice would reflect our interpretation of the original text. A scientific paper, for example, can be treated as a piece of news, and serve as a basis of an un-scientific summary.</Paragraph>
      <Paragraph position="5"> News reports tend to be built hierarchically out of components which fall roughly into one of two categories: the What-Is-The-News category, and the optional Background category. The Background, if present, supplies the context necessary to understand the central story, or to make a follow-up story self-contained. The Background section is optional: when the background is common knowledge or is implied in the main news section, it can be, and usually is, omitted. The What-Is-The-News section covers the new developments and the new facts that make the news. This organization is often reflected in the summary, as illustrated in the example below from NYT 10/15/97, where the highlighted portion provides the background for the main news:</Paragraph>
    </Section>
  </Section>
</Paper>