<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2129">
  <Title>Multi-Topic Multi-Document Summarization</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Multi-Topic Multi-Document Summarization
UTIYAMA Masao
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
HASIDA Kôiti
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Summarization of multiple documents featuring multiple topics is discussed. The example treated here consists of fifty articles about the Peru hostage incident from December 1996 through April 1997. They cover many topics, such as the opening, negotiation, ending, and so on. The method proposed in this paper is based on spreading activation over documents syntactically and semantically annotated with GDA (Global Document Annotation) tags. The method extracts important documents and important parts therein, and creates a network consisting of important entities and relations among them. It also identifies cross-document coreferences to replace expressions with more concrete ones. The method is essentially multilingual due to the language-independence of the GDA tagset. This tagset can provide a standard format for the study of the transformation and/or generation stage of the summarization process, among other natural language processing tasks.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="892" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> A large event consists of a number of smaller events. These component events are usually related, but such relations may not be strong enough to define larger topics. For example, a war may consist of an opening, battles, negotiations, and so on. These relatively independent events are considered to be topics by themselves and would accordingly be reported in multiple news articles.</Paragraph>
    <Paragraph position="1"> Summarization of such a large event, or multiple documents about multiple topics, is the concern of this paper. Summarization of multiple documents containing multiple topics is an unexplored research issue. Some previous studies on summarization (McKeown and Radev, 1995; Barzilay et al., 1999; Mani and Bloedorn, 1999) deal with multiple documents about a single topic, but not about multiple topics.</Paragraph>
    <Paragraph position="2"> In order to summarize multiple documents with multiple topics, one needs a general, semantics-oriented method for evaluating importance. Summarization of a single document may largely exploit the document structure. As an extreme example, the first paragraph of a newspaper article often serves as a summary of the entire article. On the other hand, summarization of multiple documents in general must be based more on their semantic structures, because there is no overall consistent document structure across them.</Paragraph>
    <Paragraph position="3"> Selection of multiple important topics (not keywords) for multiple-topic summarization has not yet been really addressed in the previous literature. The present paper proposes a method, based on spreading activation, for extracting important topics and important documents. Another proposed method, useful for grasping an overview of multiple documents, is visualization of the important entities mentioned and the relationships among them. Visualization of relationships among keywords has been studied in the context of information retrieval (Niwa et al., 1997; Sanderson and Croft, 1999), but to the authors' knowledge the present study is the first to address such visualization in the context of summarization. Of course, a concise summary of the entire set of multiple documents can be obtained by recovering sentences from important entities and their relationships, as demonstrated in section 3.3.</Paragraph>
    <Paragraph position="4"> The present study assumes documents annotated with GDA (Global Document Annotation) tags (Hasida, 1997; Nagao and Hasida, 1998). Since the GDA tagset is designed to be independent of any particular natural language, the proposed method is essentially multilingual. Another merit of using annotated documents is that we can separate the analysis phase from the whole process of summarization so that we can focus on the latter, generation phase of the summarization process. Annotated documents can also be useful as a common input format for the study of summarization, among other natural language processing tasks.</Paragraph>
  </Section>
  <Section position="5" start_page="892" end_page="893" type="metho">
    <SectionTitle>
2 The GDA Tagset
</SectionTitle>
    <Paragraph position="0"> GDA is a project to make on-line documents machine-understandable on the basis of a linguistic tagset, while developing and spreading technologies of content-based presentation, retrieval, question-answering, summarization, and translation, among others, with much higher quality than before. GDA thus proposes an integrated global platform for electronic content authoring, presentation, and reuse. The GDA tagset is an XML (eXtensible Markup Language) instance which allows machines to automatically infer the semantic and pragmatic structures underlying the raw documents.</Paragraph>
    <Paragraph position="1"> Under the current state of the art, GDA-tagging is semiautomatic and calls for manual correction by human annotators; otherwise annotation would make no sense. The cost involved here pays, because annotated documents are generic information contents from which to render diverse types of presentations, potentially involving summarization, narration, visualization, translation, information retrieval, information extraction, and so forth. The present paper concerns summarization only, but the merit of GDA-tagging is not at all restricted to summarization, and that is why it is considered reasonable to assume GDA-tagged input here.</Paragraph>
    <Paragraph position="2"> 2.1 Syntactic Structure. An example of a GDA-tagged sentence is shown in Figure 1. &lt;su&gt; means sentential unit. &lt;np&gt;, &lt;v&gt;, and &lt;adp&gt; stand for noun phrase, verb, and adnominal or adverbial phrase, respectively.</Paragraph>
    <Paragraph position="3"> &lt;su&gt; and the tags whose names end with 'p' (such as &lt;adp&gt; and &lt;vp&gt;) are called phrasal tags. In a sentence, an element (a text span from a begin tag to the corresponding end tag) is usually a syntactic constituent. The elements enclosed in phrasal tags are called phrasal elements, which cannot be the head of larger elements. So in Figure 1, 'flies' is specified to be the head of the &lt;su&gt; element and 'like' the head of the &lt;adp&gt; element.</Paragraph>
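    <Paragraph position="4"><![CDATA[ The head rule described above can be illustrated with a minimal sketch. It is not the authors' implementation. Figure 1 is not reproduced in this file, so the SAMPLE markup below is a hypothetical stand-in, and the function names is_phrasal and head_candidates are invented for illustration. The sketch treats bare text and non-phrasal child elements as head candidates and skips phrasal children.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the Figure 1 sentence (not taken from the source).
SAMPLE = "<su><np>Time</np><v>flies</v><adp>like <np>an arrow</np></adp></su>"


def is_phrasal(tag):
    """Phrasal tags are <su> and any tag whose name ends with 'p'."""
    return tag == "su" or tag.endswith("p")


def head_candidates(element):
    """Collect the parts of an element that may serve as its head.

    Phrasal child elements are skipped because they cannot head a larger
    element; bare text and non-phrasal children are kept.
    """
    candidates = []
    if element.text and element.text.strip():
        candidates.append(element.text.strip())
    for child in element:
        if not is_phrasal(child.tag):
            candidates.append("".join(child.itertext()).strip())
        if child.tail and child.tail.strip():
            candidates.append(child.tail.strip())
    return candidates


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(head_candidates(root))
    print(head_candidates(root.find("adp")))
```

The function returns a list rather than a single word because the text above does not say how to choose among several non-phrasal parts; that choice is left open here.]]></Paragraph>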
    <Section position="1" start_page="892" end_page="893" type="sub_section">
      <SectionTitle>
2.2 Coreferences and Anaphora
</SectionTitle>
      <Paragraph position="0"> Each element may have an identifier as the value for the id attribute. Coreferences, including identity anaphora, are annotated by the eq attribute, as follows: &lt;np id=&quot;j0&quot;&gt;John&lt;/np&gt; beats &lt;adp eq=&quot;j0&quot;&gt;his&lt;/adp&gt; dog.</Paragraph>
      <Paragraph position="1"> When the shared semantic content is not the referent but the type (kind, set, etc.) of the referents, the eq.ab attribute is used like the following: You bought a &lt;np id=&quot;c1&quot;&gt;car&lt;/np&gt;.</Paragraph>
      <Paragraph position="2"> I bought &lt;np eq.ab=&quot;c1&quot;&gt;one&lt;/np&gt;, too.</Paragraph>
      <Paragraph position="3"> A zero anaphora is encoded as follows: Tom visited &lt;np id=&quot;m1&quot;&gt;Mary&lt;/np&gt;. He had &lt;v iob=&quot;m1&quot;&gt;brought&lt;/v&gt; a present.</Paragraph>
      <Paragraph position="4"> iob=&quot;m1&quot; means that the indirect object of brought is the element whose id value is m1, that is, Mary.</Paragraph>
      <Paragraph position="5"> Other relations, such as sub and sup, can also be encoded. sub represents subset, part, or element. An example follows: She has &lt;np id=&quot;b1&quot;&gt;many books&lt;/np&gt;.</Paragraph>
      <Paragraph position="6"> &lt;namep sub=&quot;b1&quot;&gt;&quot;Alice's Adventures in Wonderland&quot;&lt;/namep&gt; is her favorite.</Paragraph>
      <Paragraph position="7"> sup is the inverse of sub, i.e., an includer of any sort: a superset as to a subset, a whole as to a part, or a set as to an element.</Paragraph>
      <Paragraph position="8"> Syntactic structures and coreferences are essential for the summarization method described in section 3. Further details such as semantics, coordination, scoping, illocutionary act, and so on, are omitted here.</Paragraph>
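      <Paragraph position="9"><![CDATA[ As a rough illustration of how the id attribute and the relation attributes (eq, eq.ab, iob, sub, sup) might be consumed by a program, the following sketch indexes elements by id and lists each relation as a (source text, attribute, target text) triple. It is only a sketch under stated assumptions: the function name resolve_relations and the <doc> wrapper element are hypothetical and are not part of the GDA tagset described here.

```python
import xml.etree.ElementTree as ET

# Relation attributes mentioned in this section.
RELATION_ATTRS = ("eq", "eq.ab", "iob", "sub", "sup")

# The <doc> wrapper is a hypothetical root added so the fragment parses.
SAMPLE = """<doc>
<np id="j0">John</np> beats <adp eq="j0">his</adp> dog.
Tom visited <np id="m1">Mary</np>. He had <v iob="m1">brought</v> a present.
</doc>"""


def resolve_relations(root):
    """Return (source text, attribute, target text) for each resolvable link."""
    # Index every element that carries an id attribute.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    links = []
    for el in root.iter():
        for attr in RELATION_ATTRS:
            target_id = el.get(attr)
            if target_id in by_id:
                source = "".join(el.itertext()).strip()
                target = "".join(by_id[target_id].itertext()).strip()
                links.append((source, attr, target))
    return links


if __name__ == "__main__":
    for link in resolve_relations(ET.fromstring(SAMPLE)):
        print(link)
```

Links whose target id is absent from the document are silently ignored in this sketch; a real tool would probably report them.]]></Paragraph>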
    </Section>
  </Section>
</Paper>