<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1012"> <Title>The effects of analysing cohesion on document summarisation</Title> <Section position="4" start_page="76" end_page="78" type="metho"> <SectionTitle> 2 Technology base </SectionTitle> <Paragraph position="0"> As an integral component of an infrastructure for document analysis with a number of intercoimccted and mutually enabling linguistic filters, the summarization systeln discussed here makes use of'shallow' linguistic functions. The infrastructure is designed from the ground up to perform a variety of linguistic feature extraction functions, ranging fl'om single pass tokenisation, lexical tookup and morphological analysis, to coinplex aggregation of representative (salient) phrasal units across multidoculnent collections. Given such a document processing environment, the design of our summarizer is based on sentence selection mechanisms utlilizing salience ranking of phrasal traits in individual documents, when viewed against a background of the distribution of phrasal vocabulary across a large multi-document collectkm.</Paragraph> <Section position="1" start_page="76" end_page="77" type="sub_section"> <SectionTitle> 2.1 Linguistic filters </SectionTitle> <Paragraph position="0"> In essence, we have a robust text analysis system for identification of proper nanms and technical terms, since these are most likely to carry the bulk of the semantic load in a document. Howeveb in addition to simple identification of certain phrasal types, capabilities also exist for identifying their variants (contractions, abbreviations, colloquial uses, etc.) in individual documents in a multi-document collection. A collection vocabulary of canonical forms and variants, with statistical information about their distribution behaviom; are used in the summarizer's salience calculation. Salience, in turn, is a major component of the sentence-level score that selects the sentences for extraction (see 2.2 below).</Paragraph> <Paragraph position="1"> As a frequency-based system, our summarizer is ideally positioned to exploit linguistic analysis, filtering, and normalization functions. Morphological processing allows us to link multiple variants of the same word, by normalizing to lemma forms. Proper name identification is enhanced with context disambiguation, named entity typing, and variant normalisation; as a result the system's frequency analysis is more precise, and less sensitive to noise; ultimately, this leads to more robust salience calculation. Normalisation of different variants of the same concept to a canonical form is further facilitated by processes of abbreviations unscrambling, resolution of definite noun phrase anaphora, and aggregation across the entire document collection. The set of potentially salient phrases is enriched by the identification and extraction of technical terms; this enables the recognition of certain multi-word concepts mentioned in the document, with discourse properties indicative of high topicality value, which is also directly relevant to salience determination.</Paragraph> <Paragraph position="2"> Each document in a collection is analyzed individually.</Paragraph> <Paragraph position="3"> All 'content' (non-stop) words, as well as all phrasal units identified by the linguistic filters, are deemed to be vocabulary items, indexed via their canonical forms. 
<Paragraph position="2"> Each document in a collection is analyzed individually.</Paragraph> <Paragraph position="3"> All 'content' (non-stop) words, as well as all phrasal units identified by the linguistic filters, are deemed to be vocabulary items, indexed via their canonical forms. With a view to future extensions of the base summarization function (see Section 5), these retain complete contextual information about the variants they have been encountered in, as well as the local context of each occurrence. The vocabulary items are counted and aggregated across documents to form the collection vocabulary. In addition to all the canonical forms and variants, the collection vocabulary contains the composite frequency of each canonical form, and its information quotient, a statistical measure of the distribution of a vocabulary item in the collection.</Paragraph> <Paragraph position="4"> Aggregating together similar items from different documents (cross-document co-reference) is far from straightforward for multi-word items; however, being able to carry out a process of cross-document coreference resolution is clearly a further enabling capability for obtaining more precise collection statistics. A pronominal anaphora resolution function further contributes to the quality of the collection statistics.</Paragraph> <Paragraph position="5"> In addition to the domain vocabulary, the summarizer also has access to document structure information. A hierarchical representation of the document separates content and layout metadata, and makes the latter explicit in a document structure tree. Encoded are data including: appearance and layout tags; document title; abstract and other front matter; (sub-)section and other headings; paragraphs, themselves composed of sentences; 'floating' objects like tables, figures, and captions; side-bars and other text extraneous to the main document narrative; and so forth. Document structure is constructed by 'shadowing' markup parsing, as markup tags are used to construct the document structure tree; for documents without markup, structure determination is carried out on the basis of page layout cues. The document structure records additional discourse-level annotations, such as cue phrases marking rhetorical relations, quoted speech, and so forth. All of these elements both contribute directly to the summarizer's set of heuristics and inform the discourse segmentation process.</Paragraph> </Section> <Section position="2" start_page="77" end_page="78" type="sub_section"> <SectionTitle> 2.2 Salience-driven summarization </SectionTitle> <Paragraph position="0"> With its set of linguistic filters, our frequency-based summarizer can exploit linguistic dimensions beyond single-word analysis; this is not unlike the approach of (Aone et al., 1997). Due to the sophistication and integration of the filters (see Section 2.1), we are able to exploit a richer source of domain knowledge than most other frequency-based systems.</Paragraph> <Paragraph position="1"> Frequency alone is a poor indicator of salience, even when ignoring stop words. Unlike early frequency-based techniques for sentence selection, we utilize the more indicative inverse document frequency measure, adapted from information retrieval, in which the relative frequency of an item in a document is compared with its relative frequency in a background collection. The trade-off, however, for more precise term salience is the summarizer's dependence on background collection statistics; we return to this issue below.</Paragraph>
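A sketch, under assumed data shapes, of how such background collection statistics might be aggregated follows; since this excerpt does not define the information quotient precisely, the IDF-style measure below is one plausible reading, not the system's actual formula.

```python
# Sketch of collection-vocabulary aggregation (assumed shapes; the
# "information quotient" here is one plausible IDF-style reading).
import math
from collections import Counter

def build_collection_vocabulary(documents):
    """documents: one list of canonical vocabulary items per document."""
    composite_freq = Counter()  # composite frequency across the collection
    doc_freq = Counter()        # number of documents containing each item
    for items in documents:
        composite_freq.update(items)
        doc_freq.update(set(items))
    n_docs = len(documents)
    return {
        item: {
            "composite_frequency": composite_freq[item],
            # a statistical measure of the item's distribution:
            "information_quotient": math.log(n_docs / doc_freq[item]),
        }
        for item in composite_freq
    }
```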
<Paragraph position="2"> Sentence selection is driven by the notion of salience; the summary is constructed by extracting the most salient sentences in the full document. The salience score of a sentence is derived partly from the salience of vocabulary items in the document and partly from its position in the document structure (e.g. section-initial, paragraph-internal, and so forth) and the salience of the surrounding sentences. The calculation of inverse document frequency for a vocabulary item t compares its relative frequency in the document with its relative frequency in the collection. We define the item's salience score to be this inverse document frequency measure (in the formula below, N_{Coll} and N_{Doc} refer, respectively, to the number of items in the collection and in the document):</Paragraph> <Paragraph position="3"> salience(t) = \frac{f_{Doc}(t) / N_{Doc}}{f_{Coll}(t) / N_{Coll}} </Paragraph> <Paragraph position="4"> Salient items are items occurring more than once in the document, whose salience score is above an experimentally determined cutoff, or items appearing in a strategic position in the document structure (e.g. title, headings, etc.; see Section 2.1). All others are assigned zero salience.</Paragraph> <Paragraph position="5"> The score for a sentence is made up of two components.</Paragraph> <Paragraph position="6"> The salience component is the sum of the salience scores of the items in the sentence. The structure component reflects the sentence's proximity to the beginning of the paragraph, and its paragraph's proximity to the beginning and/or end of the document. Structure score is secondary to salience score; sentences with no salient items get no structure score.</Paragraph> <Paragraph position="7"> A set of heuristics addresses some of the coherence-related problems discussed earlier (see 1). For example, under certain conditions, a sentence might be selected for inclusion in the summary even if it has a low, or even zero, score: sentences immediately preceding higher-scoring ones in a paragraph may get promoted by virtue of an 'agglomeration rule'. Agglomeration is an inexpensive way of preventing dangling anaphors without having to resolve them. Another problem for sentence-based summarizers, that of thematic under-representation (or, loosely speaking, coverage; see 1), is addressed by an 'empty section' rule, which is of particular interest for this paper. Longer documents with multiple sections, or news digests containing several stories, may be unevenly represented in a sentence-extracted summary. The 'empty section' rule aims to ensure that each section is represented in the summary by forcing inclusion of its highest scoring sentences, or, if all sentence scores are zero, its first sentence.</Paragraph>
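The selection machinery just described can be summarized in the following sketch; the cutoff value, the structure-score weighting, and the agglomeration boost are placeholders rather than the system's actual values, and the strategic-position clause for salient items is simplified away.

```python
# Illustrative sketch of salience-driven sentence selection; CUTOFF,
# the structure weighting and the agglomeration boost are placeholders.
CUTOFF = 2.0  # stands in for the experimentally determined cutoff

def item_salience(f_doc, n_doc, f_coll, n_coll):
    """Relative in-document frequency compared with relative
    collection frequency, per the formula above."""
    return (f_doc / n_doc) / (f_coll / n_coll)

def sentence_score(items, doc_freq, coll_freq, n_doc, n_coll,
                   structure_bonus=0.0):
    """Salience component plus a secondary structure component.
    (The strategic-position clause for salient items is omitted.)"""
    salience = 0.0
    for item in items:
        s = item_salience(doc_freq[item], n_doc, coll_freq[item], n_coll)
        if doc_freq[item] > 1 and s > CUTOFF:   # salient items only
            salience += s
    # Sentences with no salient items get no structure score.
    return salience + (structure_bonus if salience > 0 else 0.0)

def agglomerate(scores, boost=0.5):
    """'Agglomeration rule': promote a sentence immediately preceding
    a higher-scoring one, to avoid dangling anaphors."""
    promoted = list(scores)
    for i in range(len(scores) - 1):
        if scores[i + 1] > scores[i]:
            promoted[i] += boost * scores[i + 1]
    return promoted

def empty_section_rule(section_scores):
    """Return the index of the sentence representing a section: its
    highest-scoring sentence, or its first one if all scores are zero."""
    best = max(range(len(section_scores)),
               key=section_scores.__getitem__)
    return best if section_scores[best] > 0 else 0
```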
<Paragraph position="8"> As a general-purpose summarizer, ours makes extensive use of small-scale linguistic information (term phrasal patterns) and large-scale statistical information (term distribution patterns). With the exception of the heuristic rules outlined earlier in this section, the summarizer operates without any focused analysis of cohesion factors in the input text. Hence the departure point for this work, as already discussed (in Section 1): can the summarizer's performance be improved if we take into account lexical cohesion in the source? We address this question by making the summarizer aware of certain discourse-level features of the document, and in particular, by leveraging the topic shifts in it; to this end, the infrastructure has been augmented with a function for linear discourse segmentation.</Paragraph> </Section> <Section position="3" start_page="78" end_page="78" type="sub_section"> <SectionTitle> 2.3 Linear discourse segmentation </SectionTitle> <Paragraph position="0"> Segmentation is a document analysis function which directly exploits one of the core text cohesion factors, patterns of lexical repetition (see Section 1.1), for identifying some baseline data concerning the distribution of topics in a text. In particular, discourse segmentation is driven by the determination of points in the narrative where perceptible discontinuities in text cohesion are detected.</Paragraph> <Paragraph position="1"> Such discontinuities are indicative of topic shifts. Following the original idea of lexical chains (Morris and Hirst, 1991), subsequently developed specifically for the purposes of segmentation of expository text (Hearst, 1994), we have adapted an algorithm for discourse segmentation to our document processing environment. In particular, while remaining sensitive to the distribution of &quot;terms&quot; across the document, and calculating similarity between adjacent text blocks by a cosine measure, our procedure differs from that in (Hearst, 1994) in several ways.</Paragraph> <Paragraph position="2"> We only take into account content words (as opposed to all terms yielded by a tokenization step). These are normalized to lemma forms. &quot;Termhood&quot; is additionally refined to take into account multi-word sequences (proper names, technical terms, and so forth, as discussed in Section 2.1 above), as well as a notion of co-reference, where different name variants get &quot;aggregated&quot; into the same canonical form. The cohesion calculation function is biased towards different types of possible break points: thus certain cue phrases (&quot;However&quot;, &quot;On the other hand&quot;) unambiguously signal a topic shift; document structure elements--such as sentence beginnings, paragraph openers, and section heads--are exploited for their 'predisposition' to act as likely segment boundaries; and so forth (see Section 2.1). The function is also adjusted to reduce the noise from block comparisons where the block boundary--and thus a potential topic shift--falls at unnatural break points (such as the middle of a sentence).</Paragraph> <Paragraph position="3"> By making segmentation another component within our document processing environment, we are able to use, transparently, the results of processes such as lexical and morphological lookup, document structure identification, and cue phrase detection. Likewise, segmentation results are naturally incorporated in an annotation superstructure which records the various levels of document analysis: discourse segments are just another type of 'span' (annotation) over a number of sentences, logically akin to a paragraph (Bird and Liberman, 1999).</Paragraph> <Paragraph position="4"> Apart from the adjustments and modifications outlined above, we use essentially Hearst's formula for computing lexical similarity between adjacent blocks of text b_1 and b_2 (t ranges over the discourse-element terms identified as such by prior processing within the text span of the currently analyzed block; w_{t,b_N} is the normalized frequency of occurrence of the term in block b_N):</Paragraph> <Paragraph position="5"> sim(b_1, b_2) = \frac{\sum_{t} w_{t,b_1} \, w_{t,b_2}}{\sqrt{\sum_{t} w_{t,b_1}^2 \; \sum_{t} w_{t,b_2}^2}} </Paragraph>
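As a sketch, the formula transcribes directly into code, assuming each block is represented as a mapping from terms to normalized frequencies:

```python
# Direct transcription of the block-similarity formula above; blocks
# are assumed to map each term t to its normalized frequency w[t, b].
import math

def block_similarity(b1, b2):
    """Cosine measure between adjacent text blocks b1 and b2."""
    numerator = sum(w * b2[t] for t, w in b1.items() if t in b2)
    denominator = math.sqrt(sum(w * w for w in b1.values())
                            * sum(w * w for w in b2.values()))
    return numerator / denominator if denominator else 0.0
```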
<Paragraph position="6"> Unlike most applications of segmentation to date, which are concerned with the identification of segment boundaries, we are primarily interested in leveraging the content of the segments, to the extent that it is indicative of the focus of attention, and (indirectly, at least) points at the topical shifts to be utilized for summary generation. We use the segmentation results (together with the name and term identification and salience calculation delivered by other functions) in order to ensure that all the base data for inferring the topic stamps, and topic shifts, is available to the user.</Paragraph> </Section> </Section> <Section position="5" start_page="78" end_page="79" type="metho"> <SectionTitle> 3 Segmentation-assisted summaries </SectionTitle> <Paragraph position="0"> What is the relationship between segmentation and summarization: is segmentation a strictly &quot;under the covers&quot; function for the summarizer, or might segmentation results be of any interest, and use, to the end user? We focus on some strategies for incorporating segmentation results in the summary generation process. However, unlike (Kan et al., 1998) (whose work also seeks to leverage linear segmentation for the explicit purposes of document summarization), we further take the view that--with an appropriate interface metaphor where the user has an overview of the relationships between a summary sentence, the key salient phrases within it, and its enclosing discourse segment--a sequence of visually demarcated segments can impart a lot of information directly leading to in-depth perception of the summary, as it relates to the full document (Boguraev and Neff, 2000).</Paragraph> <Section position="1" start_page="78" end_page="78" type="sub_section"> <SectionTitle> 3.1 Strategies for utilizing segments </SectionTitle> <Paragraph position="0"> Common intuitions suggest a number of strategies for leveraging the results of linear discourse segmentation for enhancing summarization. As topic shift points in the text are 'published' into the document structure (see Section 2.3), by defining a segment as an additional type of document span (akin to sentence, paragraph, section, and so forth), the summarizer transparently, and immediately, becomes aware of the segmentation results. We also provide a mechanism whereby certain strategies for incorporating segmentation results into the summarization process are easy to cast in summarizer terms.</Paragraph> <Paragraph position="1"> Thus, for instance, a heuristic requiring that each segment is represented in the summary can be naturally expressed by treating segments as sections, and strictly enforcing the 'empty section' rule (see 2.2). The selection of a segment-initial sentence for the summary can be enforced simply by boosting the salience score for that sentence above a known threshold. A decision to drop an anecdotal (or otherwise peripheral; see below) segment from consideration in summary generation would be realised by setting, as a last step prior to summary generation, the sentence salience scores for all sentences in the segment to zero.</Paragraph>
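Cast in summarizer terms, these strategies reduce to simple score manipulations, as in the following sketch; the threshold value and the mutable score-list interface are assumptions for illustration, not the summarizer's actual API.

```python
# Sketch of segment strategies as score manipulations; THRESHOLD and
# the list-of-scores interface are illustrative assumptions.
THRESHOLD = 100.0  # stands in for the summarizer's known threshold

def force_segment_initial(scores, segment_start):
    """Guarantee selection of a segment-initial sentence by boosting
    its salience score above the known threshold."""
    scores[segment_start] = max(scores[segment_start], THRESHOLD + 1.0)

def drop_segment(scores, segment_start, segment_end):
    """Drop an anecdotal or otherwise peripheral segment: zero the
    salience scores of all its sentences prior to summary generation."""
    for i in range(segment_start, segment_end):
        scores[i] = 0.0
```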
</Section> <Section position="2" start_page="78" end_page="79" type="sub_section"> <SectionTitle> 3.2 Other benefits of segmentation </SectionTitle> <Paragraph position="0"> Such strategies are discussed in more detail later, as they naturally belong with their evaluation. Here we highlight a few observations concerning the overall benefits that segmentation brings to summarization. Thus, in addition to facilitating sentence-based summaries with certain discourse and rhetorical properties, it turns out that under certain conditions the summarizer can operate very effectively without a need for background corpus statistics. This is a better solution than the highly genre-sensitive approach of supplying a 'generic' background collection, against which summaries could be generated even for documents which are not a priori part of the collection.</Paragraph> <Paragraph position="1"> Note that the derivation of a background collection and statistics for it might be impractical for a variety of reasons: lack of access to a sufficiently large and representative data sample; no time for processing; sparse storage resources; and so forth. Clearly, being able to operate without such statistics is an operational bonus for the summarizer.</Paragraph> <Paragraph position="2"> Another use for segmentation is in optimising the use of the source input, as well as possibly maximising its re-use. Occasionally, the document contains 'noise'--possibly in the form of anecdotal leads, closing remarks tangential to the main points of the story, side-bars, and so forth--which are inappropriate sources for summary sentences. Linear segmentation sensitive to topic shifts and document structure would identify such source fragments and remove them from consideration by the summarizer. Conversely, in certain news reporting genres a whole document fragment (typically towards the beginning or the end of the document) functions as a summary of the story: we would like to be able to use this fragment; clearly identifying it as a segment would help.</Paragraph> <Paragraph position="3"> We also use segmentation to handle long documents more effectively. While the collection-based salience determination works reasonably well for the average-length news story, it has some disadvantages. For longer documents, with requisite longer summaries, the notion of salience degenerates, and the summary becomes just an incoherent collection of sentences. (Even if paragraphs, rather than sentences, are used to compose the summary--see e.g. (Mitra et al., 1997)--the same problems of coherence degradation and topical under-representation remain.) We use segmentation to identify contiguous sub-stories in long documents, which are then individually passed on to the summarizer; the resulting sub-story summaries are 'glued' together.</Paragraph> </Section> </Section> </Paper>