<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0404">
  <Title>Extracting Key Paragraph based on Topic and Event Detection -- Towards Multi-Document Summarization</Title>
  <Section position="2" start_page="0" end_page="31" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> As the volume of online documents has drastically increased, summarization techniques have become very important in IR and NLP studies. Most summarization work has focused on a single document. This paper focuses on multi-document summarization: broadcast news documents about the same topic. One of the major problems in the multi-document summarization task is how to identify differences and similarities across documents. This can be interpreted as a question of how to make a clear distinction between an event and a topic in documents. Here, an event is the subject of a document itself, i.e. what a writer wants to express, in other words, the notions of who, what, where, when, why and how in a document. On the other hand, a topic in this paper is some unique thing that happens at some specific time and place, together with its unavoidable consequences.</Paragraph>
    <Paragraph position="1"> It becomes background among documents. For example, in the documents of 'Kobe Japan quake', the event includes early reports of damage, location and nature of the quake, rescue efforts, consequences of the quake, and on-site reports, while the topic is Kobe Japan quake. The well-known past experience from IR, that notions of who, what, where, when, why and how may not make a great contribution to the topic detection and tracking task (Allan and Papka, 1998), reflects this fact, i.e. a topic and an event are different from each other (footnote 1).</Paragraph>
    <Paragraph position="2"> 1 Some topic words can also be an event. For instance, in the document shown in Figure 1, 'Japan' and 'quake' are topic words and also event words in the document. However, we regarded these words as a topic, i.e. not as an event. In this paper, we propose a method for extracting key paragraphs for multi-document summarization based on the distinction between a topic and an event. We use a simple criterion called domain dependency of words as a solution and present how the idea of domain dependency of words can be utilized effectively to identify a topic and an event, and thus allow multi-document summarization.</Paragraph>
    <Paragraph position="3"> The basic idea of our approach is that whether a word appearing in a document is a topic (or an event) depends on the domain to which the document belongs. Let us take a look at the following document from the TDT1 corpus.</Paragraph>
    <Paragraph position="4"> (1-2) Two Americans known dead in Japan quake 1. The number of [Americans] known to have been killed in Tuesday's earthquake in Japan has risen to two, the [State] [Department] said Thursday.</Paragraph>
    <Paragraph position="5"> 2. The first was named Wednesday as Voni Lynn Wong, a teacher from California. [State] [Department] spokeswoman Christine Shelly declined to name the second, saying formalities of notifying ... mation about private [American] citizens in Japan had received over 6,000 calls, more than half of them seeking direct assistance.</Paragraph>
    <Paragraph position="6"> 6. The Pentagon has agreed to send 57,000 blankets to Japan and [U.S.] ambassador to Tokyo Walter Mondale has donated a $25,000 discretionary fund for emergencies to the Japanese Red Cross, Shelly said. 7. Japan has also agreed to a visit by a team of [U.S.] experts headed by Richard Witt, national director of the Federal Emergency Management Agency.</Paragraph>
    <Paragraph position="7">  words) is 'Two Americans known dead in Japan quake'. Underlined words denote a topic, and the words marked with '[ ]' are events. The numbers '1...7' in Figure 1 are paragraph ids. Like Luhn's technique of keyword extraction, our method assumes that an event associated with a document appears throughout paragraphs (Luhn, 1958), but a topic does not. This is because an event is the subject of a document itself, while a topic is an event, along with all directly related events. In Figure 1, event words such as 'Americans' and 'U.S.' appear across paragraphs, while a topic word such as 'Kobe' appears only in the third paragraph. Let us consider further a broad coverage domain which consists of a small number of sample news documents about the same topic, 'Kobe Japan quake'. Figures 2 and 3 are documents about 'Kobe Japan quake'.</Paragraph>
    <Paragraph position="8">  (1-1) Quake collapses buildings in central Japan 1. At least two people died and dozens were injured when a powerful earthquake rolled through central Japan Tuesday morning, collapsing buildings and setting off fires in the cities of Kobe and Osaka. 2. The Japan Meteorological Agency said the earthquake, which measured 7.2 on the open-ended Richter scale, rumbled across Honshu Island from the Pacific Ocean to the Japan Sea.</Paragraph>
    <Paragraph position="9"> Figure 2: The document titled 'Quake collapses buildings in central Japan' (1-3) Kobe quake leaves questions about medical system 1. The earthquake that devastated Kobe in January  raised serious questions about the efficiency of Japan's emergency medical system, a government report released on Tuesday said. 2. 'The earthquake exposed many issues in terms of quantity, quality, promptness and efficiency of Japan's medical care in time of disaster,' the report  Underlined words in Figures 2 and 3 show the topic of these documents. In these two documents, 'Kobe', which is a topic, appears in every document, while 'Americans' and 'U.S.', which are events of the document shown in Figure 1, do not appear. Our technique for making the distinction between a topic and an event explicitly exploits this feature of the domain dependency of words: how strongly a word features a given set of data.</Paragraph>
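The contrast described above, between words that spread across the paragraphs of one document and words that spread across the documents of one topic, can be illustrated with a minimal Python sketch. This is not the domain dependency measure developed later in the paper: the functions paragraph_spread and document_spread and the hand-picked token sets are assumptions introduced only for this illustration, and simple containment ratios stand in for whatever statistic the method actually uses.

```python
# Illustrative sketch only. The function names and the toy token sets are
# assumptions made for this example; this is not the paper's measure of
# domain dependency.

def paragraph_spread(word, paragraphs):
    """Fraction of the paragraphs of one document that contain the word."""
    hits = sum(1 for paragraph in paragraphs if word in paragraph)
    return hits / len(paragraphs)


def document_spread(word, documents):
    """Fraction of the documents that contain the word in any paragraph."""
    hits = sum(
        1 for paragraphs in documents
        if any(word in paragraph for paragraph in paragraphs)
    )
    return hits / len(documents)


# Each document is a list of paragraphs; each paragraph is a set of tokens.
# The tokens loosely echo Figures 1 to 3 but are hypothetical, hand-picked data.
doc_1 = [
    {"americans", "earthquake", "japan"},
    {"state", "department", "teacher"},
    {"kobe", "consulate"},
    {"americans", "japan", "calls"},
]
doc_2 = [{"earthquake", "japan", "kobe", "osaka"}, {"earthquake", "honshu"}]
doc_3 = [{"earthquake", "kobe", "japan"}, {"earthquake", "japan", "medical"}]
documents = [doc_1, doc_2, doc_3]

for word in ("americans", "kobe"):
    print(word, paragraph_spread(word, doc_1), document_spread(word, documents))
```

In this toy data, 'americans' occurs in half of the paragraphs of the first document but in only one of the three documents, which is the pattern expected of an event word, whereas 'kobe' occurs in a single paragraph of that document and in every document, which is the pattern expected of a topic word.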
    <Paragraph position="10"> The rest of the paper is organized as follows.</Paragraph>
    <Paragraph position="11"> The next section describes the domain dependency of words, which is used to identify a topic and an event for broadcast news documents. We then present a method for extracting topic and event words, and describe a paragraph-based summarization algorithm using the result of topic and event extraction. Finally, we report some experiments using the TDT1 corpus, which has been developed by the TDT (Topic Detection and Tracking) Pilot Study (Allan and Carbonell, 1998), with a discussion of evaluation.</Paragraph>
  </Section>
</Paper>