<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0211"> <Title>Using Coreference Chains for Text Summarization</Title> <Section position="3" start_page="0" end_page="78" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In this paper we report preliminary work which explores the use of coreference chains to construct text summaries. Sparck Jones (1993) has described summarization as a two stage process of (1) building a representation of the source text and (2) generating a summary representation from the source representation and producing an output text from this summary representation. Our source representation is a set of coreference chains - specifically those chains of referring expressions produced by an information extraction system designed to participate in the MUC-7 coreference task (DARPA, 1998).</Paragraph> <Paragraph position="1"> Our summary representation is a 'best chain', selected from the set of coreference chains by the application of one or more heuristics. The output summary is simply the concatenation of (some subset of) sentences from the source text which contain one or more expressions occurring in the selected coreference chain.</Paragraph> <Paragraph position="2"> The intuition underlying this approach is that texts are in large measure 'about' some central entity, which is effectively the topic, or focus of the discourse. This intuition may be false there may be more than one entity of central concern, or events or relations between entities may be the principal topic of the text. However, it is at the very least an interesting experiment to see to what extent a principal coreference chain can be used to generate a summary. Further, this approach, which we have implemented and preliminarily evaluated, could easily be extended to allow summaries to be generated from (parts of) the best n coreference chains, or from event, as well as object , coreference chains.</Paragraph> <Paragraph position="3"> The use of document extracts formed from coreference chains is not novel. Bagga and Baldwin (1998) describe a technique for cross-document coreference which involves extracting the set of all sentences containing expressions in a coreference chain for a specific entity (e.g.</Paragraph> <Paragraph position="4"> John Smith) from each of several documents.</Paragraph> <Paragraph position="5"> They then employ a thresholded vector space similarity measure between these document extracts to decide whether the documents are discussing the same entity (i.e. the same John Smith). Baldwin and Morton (1998) describe a query-sensitive (i.e. user-focused) summarization technique that involves extracting sentences from a document which contain phrases that corefer with expressions in the query. The resulting extract is used to support relevancy judgments with respect to the query.</Paragraph> <Paragraph position="6"> The use of chains of related expressions in documents to select sentences for inclusion in a generic (i.e. non-user-focused) summary is also not novel. Barzilay and Elhadad (1997) describe a technique for text summarization based on lexical chains. 
<Paragraph position="3"> The use of document extracts formed from coreference chains is not novel. Bagga and Baldwin (1998) describe a technique for cross-document coreference which involves extracting the set of all sentences containing expressions in a coreference chain for a specific entity (e.g. John Smith) from each of several documents.</Paragraph>
<Paragraph position="5"> They then employ a thresholded vector space similarity measure between these document extracts to decide whether the documents are discussing the same entity (i.e. the same John Smith). Baldwin and Morton (1998) describe a query-sensitive (i.e. user-focused) summarization technique that involves extracting from a document those sentences which contain phrases that corefer with expressions in the query. The resulting extract is used to support relevancy judgments with respect to the query.</Paragraph>
<Paragraph position="6"> The use of chains of related expressions in documents to select sentences for inclusion in a generic (i.e. non-user-focused) summary is also not novel. Barzilay and Elhadad (1997) describe a technique for text summarization based on lexical chains. Their technique, which builds on work of Morris and Hirst (1991), and ultimately Halliday and Hasan (1976), who stressed the role of lexical cohesion in text coherence, is to form chains of lexical items across a text based on the items' semantic relatedness as indicated by a thesaurus (WordNet in their case).</Paragraph>
<Paragraph position="7"> These lexical chains serve as their source representation, from which a summary representation is produced using heuristics for choosing the 'best' lexical chains. From these, the summary is produced by employing a further heuristic to select the 'best' sentences from each of the selected lexical chains.</Paragraph>
<Paragraph position="8"> The novelty in our work is to combine the idea of a document extract based on coreference chains with the idea of chains of related expressions serving to indicate sentences for inclusion in a generic summary (though we also explore the use of coreference between query and text as a technique for generating user-focused summaries).</Paragraph>
<Paragraph position="9"> Returning to Halliday and Hasan, one can see how this idea has merit within their framework. They identify four principal mechanisms by which text coherence is achieved: reference, substitution and ellipsis, conjunction, and lexical cohesion. If lexical cohesion is a useful relation to explore for getting at the 'aboutness' of a text, and hence for generating summaries, then so too may reference be (separately, or in conjunction with lexical cohesion). Indeed, identifying chains of coreferential expressions in a text has certain strengths over identifying chains of expressions related merely on lexical semantic grounds. There is no doubt that common reference, correctly identified, directly ties different parts of a text together - they are literally 'about' the same thing; lexical semantic relatedness, as indicated by an external resource, can never conclusively establish this degree of relatedness, nor indeed can the resource guarantee that semantic relatedness will be found where it exists. Further, lexical cohesion techniques ignore pronominal anaphora, and hence their frequency counts of key terms, used both for identifying the best chains and the best sentences within those chains, may often be inaccurate, as focal referents will often be pronominalised.</Paragraph>
<Paragraph position="10"> Of course there are drawbacks to a coreference-based approach. Lexical cohesion relations are relatively easy to compute and do not rely on full text processing, which makes summarisation techniques based on them rapid and robust. Coreference relations tend to require more complex techniques to compute.</Paragraph>
<Paragraph position="11"> Our view, however, is that summarisation research is still at an early stage and that we need to explore many techniques in order to understand their strengths and weaknesses in terms of the type and quality of the summaries they produce.</Paragraph>
<Paragraph position="12"> If coreference-based techniques can yield good summaries, this will provide impetus to make coreference technologies better and faster.</Paragraph>
<Paragraph position="13"> The basic coreference chain technique we describe in this paper yields generic summaries as opposed to user-focused summaries, as these terms have been used in relation to the TIPSTER SUMMAC text summarization evaluation exercise (Mani et al., 1998). That is, the summaries aim to satisfy a wide readership by supplying information about the 'most important' entity in the text. But of course this technique could also be used to generate summaries tailored to a user or user group, through use of a preprocessor that analyses a user-supplied topic description and selects one or more entities from it with which to filter the coreference chains found in the full source document.</Paragraph>
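<Paragraph> As an illustration of the user-focused variant just described, the sketch below filters chains against entity strings drawn from a topic description. The case-insensitive substring match is a placeholder assumption for this example, not the coreference-based matching such a preprocessor would actually require.</Paragraph>

```python
from typing import List, Tuple

Chain = List[Tuple[int, str]]  # as in the sketch above

def filter_chains(chains: List[Chain],
                  query_entities: List[str]) -> List[Chain]:
    """Keep only the chains containing at least one expression that
    matches an entity taken from the user-supplied topic description.
    Substring overlap is a crude stand-in for query-text coreference."""
    def matches(expression: str) -> bool:
        e = expression.lower()
        return any(q.lower() in e or e in q.lower() for q in query_entities)
    return [chain for chain in chains
            if any(matches(expr) for _, expr in chain)]
```

<Paragraph> The generic best-chain extraction sketched earlier can then be applied to the filtered set.</Paragraph>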
<Paragraph> The rest of this paper is organised as follows. In Section 2 we briefly describe the system we use for computing coreference relations. Section 3 describes various heuristics we have implemented for extracting a 'best' coreference chain from the set of coreference chains computed for a text, and discusses how we select the 'best' sentences to include in the summary from those source text sentences containing referring expressions in the 'best' chain. Section 4 presents a simple example and shows the different summaries that different heuristics produce. Section 5 describes the limited evaluation we have been able to carry out to date, but, more importantly, introduces what we believe to be a novel and interesting way of reusing some of the MUC materials for assessing summaries.</Paragraph>
2 Coreference in the LaSIE system
<Paragraph position="14"> The LaSIE system (Gaizauskas et al., 1995) has been designed as a general purpose IE system which can conform to the MUC task specifications for named entity identification, coreference resolution, IE template element and relation identification, and the construction of scenario-specific IE templates. The system has a pipeline architecture which processes a text one sentence at a time and consists of three principal processing stages: lexical preprocessing, parsing plus semantic interpretation, and discourse interpretation. The overall contributions of these stages may be briefly described as follows (see Gaizauskas et al. (1995) for further details):
lexical preprocessing reads and tokenises the raw input text, performs phrasal matching against lists of proper names, identifies sentence boundaries, tags the tokens with parts of speech, and performs morphological analysis;
parsing and semantic interpretation builds lexical and phrasal chart edges in a feature-based formalism, then performs two-pass chart parsing - pass one with a special named entity grammar, pass two with a general grammar - and, after selecting a 'best parse', which may have only partial coverage, constructs a predicate-argument representation of each sentence;
discourse interpretation adds the information from the predicate-argument representation to a hierarchically structured semantic net which encodes the system's world and domain model, adds additional information presupposed by the input, performs coreference resolution between new and existing instances in the world model, and adds any information consequent upon the new input.</Paragraph>
<Paragraph position="15"> The domain model is encoded as a hierarchy of domain-relevant concept nodes, each with an associated attribute-value structure describing properties of the concept. As a text is processed, instances of concepts mentioned in the text are added to the domain model, populating it to become a text-specific, or discourse-specific, model.</Paragraph>
<Paragraph position="16"> Coreference resolution is carried out by attempting to merge each newly added instance with instances already present in the discourse model. The basic mechanism, detailed in Gaizauskas and Humphreys (1997), is to examine, for each pair of newly added and existing instances: semantic type consistency/similarity in the concept hierarchy; attribute value consistency/similarity; and a set of heuristic rules, some specific to particular types of anaphora such as pronouns, which can act to rule out a proposed merge. These rules can refer to various lexical, syntactic, semantic, and positional information about instances, and have mainly been developed through the analysis of training data.</Paragraph>
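<Paragraph> The following schematic rendering of the merge test is illustrative only: it assumes instances carry a flat semantic type and an attribute dictionary, whereas the system proper consults a concept hierarchy, similarity measures, and heuristic veto rules (Gaizauskas and Humphreys, 1997).</Paragraph>

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    sem_type: str                                  # node in the concept hierarchy
    attrs: dict = field(default_factory=dict)      # attribute-value structure
    positions: list = field(default_factory=list)  # character spans in the text

def compatible(new: Instance, old: Instance) -> bool:
    """Crude stand-in for the consistency checks: exact type match in
    place of hierarchy-based similarity, and no conflicting attributes."""
    if new.sem_type != old.sem_type:
        return False
    return all(old.attrs.get(k, v) == v for k, v in new.attrs.items())

def try_merge(new: Instance, old: Instance) -> bool:
    """Merge the new instance into an existing one if compatible
    (the anaphora-specific veto rules are omitted here)."""
    if not compatible(new, old):
        return False
    old.attrs.update(new.attrs)
    old.positions.extend(new.positions)  # merged instance accumulates positions
    return True
```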
<Paragraph position="17"> A recent addition, however, has been the integration of a more theoretically motivated focus-based algorithm for the resolution of pronominal anaphora (Azzam et al., 1998).</Paragraph>
<Paragraph position="18"> This includes the maintenance of a set of focus registers within the discourse interpreter, to model changes of focus through a text and to provide additional information for the selection of antecedents.</Paragraph>
<Paragraph position="19"> The discourse interpreter maintains an explicit representation of the coreference chains created as a result of instances being merged in the discourse model. Each instance has an associated attribute recording its position in the original text in terms of character positions. When instances are merged, the result is a single instance with multiple positions which, taken together, represent a coreference chain.</Paragraph>
</Section> </Paper>