<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1009">
  <Title>A Common Theory of Information Fusion from Multiple Text Sources Step One: Cross-Document Structure</Title>
  <Section position="4" start_page="75" end_page="76" type="metho">
    <SectionTitle>
SUMMONS [Radev &amp; McKeown 98] is a
</SectionTitle>
    <Paragraph position="0"> knowledge-based multi-document summarization system, which produces summaries of a small number of news articles within the domain of terrorism. SUMMONS uses as input a set of semantic templates extracted by a message understanding system [Fisher et al. 96] and identifies some patterns in them such as change of perspective, contradiction, refinement, agreement, and elaboration. The techniques used in SUMMONS involve a large amount of knowledge engineering even for a relatively small domain of text (such as accounts of terrorist events) and are not directly suitable for domain-independent text analysis. The planning operators used in it present, however, the ideal first step towards CST.</Paragraph>
    <Paragraph position="1"> [Mani &amp; Bloedorn 99] use similarities and differences among related news articles for MDS. They measure the effectiveness of their method in two scenarios: paragraph alignment across two articles and query-based information retrieval. Neither of these scenarios evaluates the generation of query-independent summaries of multiple articles in open domains.</Paragraph>
    <Paragraph position="2"> The Stimulate projects at Columbia University [Barzilay &amp; al. 99], [McKeown &amp; al. 99] have been using natural language generation to produce multi-document summaries. Their technique is called theme intersection: paragraphs are aligned across news stories with the help of a semantic network to identify phrases which convey the same meaning; new sentences are then generated from each theme and ordered chronologically to produce a summary.</Paragraph>
    <Paragraph position="3"> We should note here that RST has been used to produce single-document summaries [Marcu 97]. For multi-document summaries, CST can present a reasonable equivalent to RST.</Paragraph>
    <Section position="1" start_page="75" end_page="76" type="sub_section">
      <SectionTitle>
2.3 Time-dependent documents
</SectionTitle>
      <Paragraph position="0"> Time-dependent documents are related to the observation that perception of an event changes over time and include (a) evolving summaries (summaries of new documents related to an ongoing event that are presented to the user assuming that he or she has read earlier summaries of related documents) [Radev 99] and (b) chronological briefings [Radev &amp; McKeown 98]. [Carbonell et al. 98] discuss the motivation behind the use of time-dependent documents and [Berger &amp; Miller 98] describe a language model for time-dependent corpora.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="76" end_page="77" type="metho">
    <SectionTitle>
3 Representing cross-document structure
</SectionTitle>
    <Paragraph position="0"> We will introduce two complementary data structures to represent multi-document clusters: the multi-document cube (Section 3.1) and the multi-document graph (Section 3.2).</Paragraph>
    <Section position="1" start_page="76" end_page="76" type="sub_section">
      <SectionTitle>
3.1 Multi-document cubes
</SectionTitle>
      <Paragraph position="0"> Definition A multi-document cube C (see Figure 3 (a)) is a three-dimensional structure that represents related documents. The three dimensions are t (time), s (source) and p (position within the document).</Paragraph>
      <Paragraph position="1"> Definition A document unit U is a tuple (t, s, p); see Figure 3 (b). Document units can be defined at different levels of granularity, e.g., paragraphs, sentences, or words.</Paragraph>
      <Paragraph position="2"> Definition A document D is a sequence of document units U1 U2 ... Un which corresponds to a one-dimensional projection of a multi-document cube along the source and time dimensions.</Paragraph>
      <Paragraph position="3"> Some additional concepts can be defined based on the above definitions.</Paragraph>
      <Paragraph position="4"> Definition A snapshot is a slice of the multi-document cube over a period of time Δt; see Figure 3 (c).</Paragraph>
      <Paragraph position="5"> Definition An evolving document is a slice of the multi-document cube in which the source is fixed and time and position may vary.</Paragraph>
      <Paragraph position="6"> Definition An extractive summary S of a cube C is a set of document units, S ⊂ C; see Figure 3</Paragraph>
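      <Paragraph> The definitions above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the class and function names (DocumentUnit, document, snapshot, evolving_document, is_extractive_summary), the integer encoding of time, and the sample data are all hypothetical and do not come from the formalism itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentUnit:
    """A document unit U = (t, s, p), with its text attached for convenience."""
    t: int        # time (hypothetical integer encoding)
    s: str        # source identifier
    p: int        # position within the document
    text: str = ""


def document(cube, s, t):
    """Fix source and time; return the units ordered by position."""
    units = [u for u in cube if u.s == s and u.t == t]
    return sorted(units, key=lambda u: u.p)


def snapshot(cube, t_start, t_end):
    """Slice of the cube over the time period from t_start to t_end (inclusive)."""
    return {u for u in cube if t_end >= u.t >= t_start}


def evolving_document(cube, s):
    """Slice with the source fixed; time and position may vary."""
    return {u for u in cube if u.s == s}


def is_extractive_summary(summary, cube):
    """An extractive summary is a set of units drawn from the cube."""
    return set(summary).issubset(cube)


# Hypothetical cube with two sources and two time stamps.
cube = {
    DocumentUnit(1, "CNN", 0, "first sentence"),
    DocumentUnit(1, "CNN", 1, "second sentence"),
    DocumentUnit(2, "CNN", 0, "updated sentence"),
    DocumentUnit(2, "ABC", 0, "another source"),
}
doc = document(cube, "CNN", 1)
```

Representing the cube as a plain set of units keeps every slice a simple filter; a dense three-dimensional array would work equally well when the units are known in advance.</Paragraph>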
    </Section>
    <Section position="2" start_page="76" end_page="77" type="sub_section">
      <SectionTitle>
3.2 Multi-document graphs
</SectionTitle>
      <Paragraph position="0"> While multi-document cubes are a useful abstraction, they cannot easily represent text simultaneously at different levels of granularity (words, phrases, sentences, paragraphs, and documents). The second formalism that we introduce is the multi-document graph. Each graph consists of smaller subgraphs for each individual document (Figure 4). We use two types of links. The first type represents inheritance relationships among elements within a single document. These links are drawn using thicker lines. The second type represents semantic relationships among textual units. The example illustrates sample links among documents, sentences, and phrases.</Paragraph>
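      <Paragraph> One possible in-memory form of a multi-document graph is sketched below. The class name, the method names, the node identifiers and the relation label are illustrative assumptions; the only properties taken from the description above are the per-document subgraphs and the two link types.

```python
from dataclasses import dataclass, field


@dataclass
class MultiDocumentGraph:
    # node id -> (document id, level), level being "D", "S", "P" or "W"
    nodes: dict = field(default_factory=dict)
    # (parent, child) pairs: inheritance links inside a single document
    inheritance: set = field(default_factory=set)
    # (source node, target node, relation name): semantic links
    semantic: set = field(default_factory=set)

    def add_node(self, node, doc, level):
        self.nodes[node] = (doc, level)

    def add_inheritance(self, parent, child):
        # Inheritance links never leave the document they belong to.
        if self.nodes[parent][0] != self.nodes[child][0]:
            raise ValueError("inheritance links must stay inside one document")
        self.inheritance.add((parent, child))

    def add_semantic(self, src, dst, relation):
        self.semantic.add((src, dst, relation))

    def cross_document_links(self):
        """Semantic links whose endpoints lie in different documents."""
        return {(a, b, r) for (a, b, r) in self.semantic
                if self.nodes[a][0] != self.nodes[b][0]}


# Hypothetical two-document example.
g = MultiDocumentGraph()
g.add_node("d1", "d1", "D")
g.add_node("d1.s1", "d1", "S")
g.add_node("d2", "d2", "D")
g.add_node("d2.s1", "d2", "S")
g.add_inheritance("d1", "d1.s1")
g.add_inheritance("d2", "d2.s1")
g.add_semantic("d1.s1", "d2.s1", "subsumption")
```

Keeping the two link types in separate sets mirrors the distinction drawn in the text and makes it cheap to query cross-document links alone.</Paragraph>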
    </Section>
  </Section>
  <Section position="6" start_page="77" end_page="80" type="metho">
    <SectionTitle>
4 A taxonomy of cross-document relationships
</SectionTitle>
    <Paragraph position="0"> Figure 5 presents a proposed taxonomy of cross-document relationships. The Level column indicates whether the relation applies to words (W), phrases (P), sentences or paragraphs (S), or entire documents (D).</Paragraph>
    <Paragraph position="1"> The examples are from our MDS corpus (built from TDT and Web-based sources).</Paragraph>
    <Paragraph position="3"> Figure 5 (relationship descriptions): the same text appears in more than one location; two text spans have the same information content; same information content in different languages; one sentence contains more information than another; conflicting information; information that puts current information in context; the same entity is mentioned; one sentence cites another document; qualified version of a sentence; one sentence repeats the information of another while adding an attribution; similar to Summary in RST: one textual unit summarizes another; additional information which reflects facts that have happened since the last account.
One example of a cross-document relationship is the cross-sentence informational subsumption (CSIS, or subsumption), which reflects that certain sentences repeat some of the information present in other sentences and may, under certain circumstances, be omitted during summarization. In the following example, sentence (2) subsumes (1) because the crucial information in (1) is also included in (2), which presents additional content: &quot;the court&quot;, &quot;last August&quot;, and &quot;sentenced him to life&quot;.
(1) John Doe was found guilty of the murder. (2) The court found John Doe guilty of the murder of Jane Doe last August and sentenced him to life.</Paragraph>
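    <Paragraph> A crude way to approximate subsumption between sentences (1) and (2) is to compare their content words. The sketch below is an assumption made for illustration only: the word-overlap test, the tiny stop list and the function names are not the method proposed here, and real subsumption would need deeper semantic analysis.

```python
import re

# Small illustrative stop list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "and", "the", "of", "was", "to", "him", "in"}


def content_words(sentence):
    """Lower-cased alphabetic tokens that are not on the stop list."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {tok for tok in tokens if tok not in STOPWORDS}


def subsumes(longer, shorter):
    """True if every content word of `shorter` occurs in `longer`
    and `longer` carries additional content words."""
    a, b = content_words(longer), content_words(shorter)
    return b.issubset(a) and len(a) > len(b)


s1 = "John Doe was found guilty of the murder."
s2 = ("The court found John Doe guilty of the murder of Jane Doe "
      "last August and sentenced him to life.")
print(subsumes(s2, s1), subsumes(s1, s2))
```

On this pair the heuristic agrees with the analysis above, since sentence (2) contains every content word of sentence (1) and adds others, but it would be fooled by paraphrases that use different vocabulary.</Paragraph>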
    <Paragraph position="4"> Paraphrase (3) Ford's program will be launched in the United States in April and globally within 12 months.</Paragraph>
    <Paragraph position="5"> (4) Ford plans to introduce the program first for its employees in the United States, then expand it for workers abroad.</Paragraph>
    <Section position="1" start_page="77" end_page="77" type="sub_section">
      <SectionTitle>
Modality
</SectionTitle>
      <Paragraph position="0"> News stories are often written in a way that makes misattribution of information difficult, e.g., by referring to a person arrested at a crime scene as an &quot;alleged&quot; or &quot;suspected&quot; perpetrator. (5) Adams reportedly called for an emergency meeting with Trimble to try to salvage the assembly. (6) Sinn Fein leader Gerry Adams appealed for an urgent meeting with Trimble. (7) The GIA is the most hardline of the Islamic militant groups which have fought the Algerian authorities since 1992. (8) The GIA is seen as most hardline of the Islamic militant groups which have fought the Algerian government during the past seven years.</Paragraph>
      <Paragraph position="1"> Attribution (9) In the strongest sign yet that Russia's era of space glory is coming to an end, space officials announced today that cosmonauts will leave the Mir space station in August and it will remain unmanned. (10) The crew aboard the Mir space station will leave in August, and the craft will orbit the Earth unmanned until early next year.</Paragraph>
    </Section>
    <Section position="2" start_page="77" end_page="80" type="sub_section">
      <SectionTitle>
Indirect Speech
</SectionTitle>
      <Paragraph position="0"> (11) An anonymous caller told the Interfax news agency that the Moscow explosion and a Saturday night bomb blast in southern Russia were in response to Russia's military campaign against Islamic rebels in the southern territory of Dagestan.</Paragraph>
      <Paragraph position="1"> (12) An anonymous caller to Interfax said the blast and a car-bomb earlier this week at a military apartment building in Dagestan were &quot;our response to the bombing of villages in Chechnya and Dagestan.&quot; Followup (13) Denmark's largest industrial unions have rejected a wage proposal, setting the stage for a nationwide general strike, officials announced Friday.</Paragraph>
      <Paragraph position="2"> (14) A national strike entered its second week Monday, paralyzing Denmark's main airport and leaving most gasoline stations out of fuel and groceries short of frozen and canned foods.</Paragraph>
      <Paragraph position="3"> Judgment (15) Hardline militants of Algeria's Armed Islamic Group (GIA) threatened Sunday to create a &quot;bloodbath&quot; in Belgium if the authorities there do not release several of its leaders jailed last month.</Paragraph>
      <Paragraph position="4"> (16) The GIA is demanding that Belgium release several of its leaders jailed in Belgium last month.</Paragraph>
      <Paragraph position="5"> Fulfillment (17) WASHINGTON, May 31 The Federal Bureau of Investigation plans to put suspected terrorist Osama bin Laden, sought in connection with the bombings of the US embassy bombings in Africa, on its &quot;Ten Most Wanted&quot; list, CNN reported Saturday. (18) WASHINGTON, June 7 The Federal Bureau of Investigation added Saudi fugitive Osama Bin Laden, sought for his part in the 1998 bombings of US embassies in Africa, to its &quot;Ten Most Wanted List&quot; Monday. Elaboration (19) Fugitive Saudi national bin Laden is believed to be the mastermind behind last year's bloody attacks against US embassies in Kenya and Tanzania.</Paragraph>
      <Paragraph position="6"> (20) Bin Laden, 41, is believed to be the mastermind behind last year's bloody attacks against US embassies in Kenya and Tanzania. Update (21) The confirmed death toll has already reached 49, while over 50 people are still unaccounted for, many presumed dead and buried in the ruins.</Paragraph>
      <Paragraph position="7"> (22) The confirmed death toll has already reached 60, and another 40 people are still unaccounted for, most presumed dead and buried in the ruins.</Paragraph>
      <Paragraph position="8"> Definition (23) Yeltsin said the security forces must unite to fight terrorists, adding that he had appointed Interior Minister Vladimir Rushailo to head a special team coordinating anti-terrorist activities.</Paragraph>
      <Paragraph position="9"> (24) Yeltsin said the security forces must unite to fight terrorists, adding that he had named Rushailo to head a special team coordinating anti-terrorist activities.</Paragraph>
      <Paragraph position="10"> Contrast (25) Agriculture Minister Loyola de Palacio estimated the loss at dlrs 10 million. (26) Agriculture Minister Loyola de Palacio has estimated losses from ruined produce at 1.5 billion pesetas (dlrs 10 million), although farmers groups earlier claimed total damages of nearly eight times that amount.</Paragraph>
      <Paragraph position="11"> Historical background (27) Elian's mother and 10 others died when their boat sank as they tried to reach the United States from Cuba.</Paragraph>
      <Paragraph position="12"> 5 Using CST for information fusion
In this section we describe how CST can be used to generate personalized multi-document summaries from clusters of related articles in four steps: clustering, document structure analysis, link analysis, and personalized graph-based summarization (Figure 6).</Paragraph>
      <Paragraph position="13"> The first stage, clustering, can be either query-independent (e.g., based on pure document similarity [Allan et al. 98]) or based on a user query (in which case clusters will be the sets of documents returned by a search engine). The second stage, document analysis, includes the generation of document trees representing the sentential and phrasal structure of the document [Hearst 94, Kan et al. 98].</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="80" end_page="81" type="metho">
    <SectionTitle>
4. Summarization
</SectionTitle>
    <Paragraph position="0"> The third stage is the automatic creation and typing of links among textual spans across documents. Four techniques for identifying related textual units across documents can be used: lexical distance, lexical chains, information extraction, and linguistic template matching. Lexical distance (see e.g., [Allan 96]) uses cosine similarity across pairs of sentences. Lexical chains [Barzilay &amp; Elhadad 97] are more robust than lexical matching as they take into account linguistic phenomena such as synonymy and hypernymy. The third technique, information extraction [Radev &amp; McKeown 98], identifies salient semantic roles in text (e.g., the place, perpetrator, and effect of a terrorist event) and converts them to semantic templates. Two textual units are considered related whenever their semantic templates are related. Finally, a technique that will be used to identify some relationships such as citation, contradiction, and attribution is template matching, which takes into account transformational grammar (e.g., relative clause insertion). For link type analysis, machine learning using lexical metrics and cue words is most appropriate (see [Kupiec et al. 95], [Cohen &amp; Singer 96]).</Paragraph>
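    <Paragraph> The first of these techniques, lexical distance, can be illustrated with a short sketch of cosine similarity over pairs of sentences. The raw term-frequency vectors, the tokenizer, the threshold value and the function names are assumptions chosen for brevity; a weighted scheme such as tf-idf could be substituted without changing the structure.

```python
import math
import re
from collections import Counter


def term_vector(sentence):
    """Raw term-frequency vector over lower-cased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_similarity(s1, s2):
    v1, v2 = term_vector(s1), term_vector(s2)
    dot = sum(v1[t] * v2[t] for t in v1.keys() if t in v2)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def related_pairs(doc_a, doc_b, threshold=0.5):
    """Candidate cross-document links: index pairs of sentences whose
    similarity reaches the (hypothetical) threshold."""
    return [(i, j)
            for i, a in enumerate(doc_a)
            for j, b in enumerate(doc_b)
            if cosine_similarity(a, b) >= threshold]


doc_a = ["The GIA is demanding that Belgium release several of its leaders.",
         "A national strike entered its second week Monday."]
doc_b = ["The GIA demanded that Belgium release its leaders."]
pairs = related_pairs(doc_a, doc_b)
```

Sentence pairs returned by related_pairs are only candidates for links; typing each link (subsumption, contradiction, and so on) is a separate step.</Paragraph>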
    <Paragraph position="1"> The final step is summary extraction, based on the user-specified constraints on the summarizer. A graph-based operator defines a transformation on a multi-document graph (MDG) G which preserves some of its properties while reducing the number of nodes. An example of such an operator is the link-preserving graph cover operator (Figure 7). Its effect is to preserve only those nodes from the source MDG that are associated with the preferred cross-document links. In the example, the shaded area represents the summary subgraph G' of G that contains all four cross-document links and only those nodes and edges of G which are necessary to preserve the textual structure of G'.</Paragraph>
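    <Paragraph> A link-preserving graph cover can be sketched as follows. The sketch assumes that each document is a tree given by a parent map and that preserving the textual structure means keeping every ancestor of a linked node up to its document root; both assumptions, like the function and parameter names, are ours and are not fixed by the description above.

```python
def link_preserving_cover(parent, links, preferred):
    """Return (nodes, edges, kept_links) of the summary subgraph.

    parent:    dict mapping each node to its parent in its document tree
               (document roots map to None)
    links:     iterable of (src, dst, relation) cross-document links
    preferred: set of relation names the user wants preserved
    """
    kept_links = [(a, b, r) for (a, b, r) in links if r in preferred]
    nodes, edges = set(), set()
    for a, b, _ in kept_links:
        for node in (a, b):
            # Walk up to the document root so that the structure around
            # each linked unit is kept; stop early at nodes already seen.
            while node is not None and node not in nodes:
                nodes.add(node)
                up = parent.get(node)
                if up is not None:
                    edges.add((up, node))
                node = up
    return nodes, edges, kept_links


# Hypothetical graph: two documents with two sentences each.
parent = {"d1": None, "d1.s1": "d1", "d1.s2": "d1",
          "d2": None, "d2.s1": "d2", "d2.s2": "d2"}
links = [("d1.s1", "d2.s1", "subsumption"),
         ("d1.s2", "d2.s2", "identity")]
nodes, edges, kept = link_preserving_cover(parent, links, {"subsumption"})
```

Changing the set of preferred relations is how user preferences would select a different summary subgraph from the same source graph.</Paragraph>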
    <Paragraph position="2"> Summary 1 The father of Elian Gonzalez arrived Thursday in the United States saying he wanted U.S.</Paragraph>
    <Paragraph position="3"> authorities to hand over his son as soon as possible so he could hug Elian and take him back to Cuba.</Paragraph>
    <Paragraph position="4"> Three others who were granted visas to travel to the United States with the Gonzalez family -- Elian's pediatrician, kindergarten teacher and a male cousin -- were not on the plane.</Paragraph>
    <Paragraph position="5"> Summary 2 The father of Elian Gonzalez arrived Thursday in the United States saying he wanted U.S.</Paragraph>
    <Paragraph position="6"> authorities to hand over his son as soon as possible so he could hug Elian and take him back to Cuba.</Paragraph>
    <Paragraph position="7"> Three others who were granted visas to travel to the United States with the Gonzalez family -- Elian's pediatrician, kindergarten teacher and a male cousin -- were not on the plane.</Paragraph>
    <Paragraph position="8"> The U.S. government proved itself intransigent on April 5, on the issue of the visas [...] by Cuba for a delegation composed of children, [...] and psychologists that would accompany Elian's father to that country to receive custody of the child, reports Prensa Latina from Washington.</Paragraph>
    <Paragraph position="9"> The child's mother and 10 others were killed when the boat sank as they tried to flee Cuba for the United States. Elian and two adults survived.</Paragraph>
    <Section position="1" start_page="81" end_page="81" type="sub_section">
      <SectionTitle>
5.1 Example
</SectionTitle>
      <Paragraph position="0"> The example in Figure 8 shows two summaries based on different user preferences. Summary (b) is based on &quot;longer extract&quot;, &quot;report background information&quot;, and &quot;include all sources&quot;. Summary (a) is generated from two CNN articles, while (b) is generated from two CNN articles plus one from the Granma of Havana, and one from ABC News.</Paragraph>
    </Section>
  </Section>
</Paper>