<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1008"> <Title>Document Fusion for Comprehensive Event Description</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Conventional text retrieval systems respond to a user's query by providing a (ranked) list of documents which potentially satisfy the information need. After having identified a number of documents which are actually relevant, the user reads some of those documents to get the information requested. To get a comprehensive account of a particular topic, one may have to read a rather long list of documents containing a considerable amount of redundancy, i.e., documents that partially convey the same information.</Paragraph> <Paragraph position="1"> Although this problem holds for essentially any text retrieval situation where comprehensiveness is relevant, it becomes particularly evident in the retrieval of news texts.</Paragraph> <Paragraph position="2"> News agencies, such as AP, BBC, CNN, or Reuters, often describe the same event differently. For instance, they provide different background information to help the reader situate the story, interview different people to comment on an event, and provide additional, conflicting, or more accurate information, depending on their sources.</Paragraph> <Paragraph position="3"> To get a description of an event which is as comprehensive as possible and also as short as possible, a user has to compile his or her own description by taking parts of the original news stories and ignoring duplicate information. Typical users include journalists and intelligence analysts, for whom compiling and fusing information is an integral part of their work (Carbonell et al., 2000). 
If done manually, this process can be rather laborious, as it involves numerous comparisons, the exact number depending on the number and length of the documents.</Paragraph> <Paragraph position="4"> The aim of this paper is to describe an approach that automates this process by fusing information from different documents into a single comprehensive document, which contains the information of all original documents without repeating information conveyed by two or more of them.</Paragraph> <Paragraph position="5"> The work described in this paper is closely related to the area of multi-document summarization (Barzilay et al., 1999; Mani and Bloedorn, 1999; McKeown and Radev, 1995; Radev, 2000), where related documents are analyzed to use frequently occurring segments for identifying relevant information that has to be included in the summary. Our work differs from work on multi-document summarization in that we focus on document fusion and disregard summarization. That is, we are not aiming for the shortest description containing the most relevant information, but for the shortest description containing all information. For instance, even historical background information is included, as long as it allows the reader to get a more comprehensive description of an event.</Paragraph> <Paragraph position="6"> Although the techniques that are used for multi-document fusion and multi-document summarization are similar, the task of fusion is complementary to the summarization task. 
They differ in that, roughly speaking, multi-document summarization is the intersection of information within a topic, whereas multi-document fusion is the union of information.</Paragraph> <Paragraph position="7"> They are similar to the extent that in both cases nearly equivalent information stemming from different documents within the topic has to be identified as such.</Paragraph> <Paragraph position="8"> The remainder of this paper is structured as follows: Section 2 introduces the main components and challenges of implementing a document fusion system. Issues of evaluating document fusion and some preliminary evaluation of our system are presented in Section 3. Section 4 gives some conclusions and an outlook on future work.</Paragraph> </Section> </Paper>