<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0712"> <Title>T. A. van Dijk and W. Kintsch. Cognitive Psychology and Discourse: Recalling and Summarizing Stories. In W. U. Dressler, editor, Current Trends</Title> <Section position="4" start_page="74" end_page="75" type="metho"> <SectionTitle> 2 Intelligent Content Selection Criteria </SectionTitle> <Paragraph position="0"> In order to identify generic content selection features that can be used by COSY-MATS in any application context, an extensive corpus analysis was carried out on a variety of real-world texts. Three main types of text have been analysed: newspaper articles, scientific papers and (scientific) author abstracts. The subcorpus of newspaper articles (160) is extremely diverse in both its content and form. The topics range from business news and legal reports to social commentary, medical issues and politics. Similarly, the other two subcorpora consist of 170 articles and abstracts, respectively, that pertain to scientific fields such as computer science, the natural sciences, as well as philosophy and linguistics. In addition, the texts are of varying length: from half a page in the case of the abstracts and most news agency reports, to four or more pages, when scientific papers and newspaper special reports are involved. Consequently, apart from covering a range of subject domains, the corpus used in designing the content selection processes in COSY-MATS also represents more than two text types. The corpus was analysed both on the surface and on more abstract levels. Given the diversity of the types of text and the writing styles exhibited in the corpus, regularities regarding the rhetorical development of the texts and the central informational units therein could not be easily established. Only in the case of the scientific papers and their abstracts could any statements be made on the logical progression of the presentation of the content, from the purpose of the research, to the methodology, the experimental set-up and the evaluation
of any results (cf. (Gopnik, 1972; Jordan, 1993; Lucas et al., 1993; Maizell et al., 1971)). The newspaper articles were mainly studied in terms of groups of adjacent sentences and the rhetorical relationship between them (cf. (Ono et al., 1994)). No generalisations could be made regarding their top-level organisation. A number of theories of pragmatics, discourse analysis and text development have provided useful concepts for this study of the corpus at a higher level: a) theories which are preoccupied with the communicating agents, their goals, plans and beliefs, such as Speech Act Theory (Austin, 1962; Searle, 1969), Rhetorical Structure Theory (RST) (Mann and Thompson, 1987), or AI research on scripts (Lehnert, 1981; Schank and Abelson, 1977) and belief ascription (Wilks and Ballim, 1987); b) theories on the tracking of the discourse history by means of identifying the focused items therein, e.g. (Grosz, 1986; Hobbs, 1978; Reichman, 1985; Sidner, 1983; Webber, 1983); c) theories of cohesion and coherence and how these are manifested on the surface of the text, e.g. Systemic-Functional Linguistics (Halliday and Hasan, 1976) and the Problem-Solution information metastructure (Hoey, 1994; Jordan, 1984) (cf. (Paice, 1981)). The diversity of the subject matter covered in the corpus has meant that specialised keywords were ignored in its analysis. Instead, the emphasis was placed on function words and regular general-language content words which are associated with the instantiation of the semantic, rhetorical and pragmatic functions considered. Such lexical items can be employed as markers, not only of the development of the discourse but also of the focused and central points therein. In this process, the various cohesion and coherence theoretical frameworks were very influential, as were the computational approaches to focus prediction and identification. As a result of this corpus analysis at the</Paragraph> <Paragraph position="1"> surface and more abstract levels, 87 features have been identified as relevant
to content selection and importance determination across domains and, largely, text types (Aretoulaki, 1996). Three descriptive levels are used for their classification: the pragmatic, the intermediary and the surface, in decreasing order of abstraction. The three levels reflect, in a sense, the three main trends in discourse theory identified. Thus, the 24 pragmatic features</Paragraph> <Paragraph position="3"> [...] communicating agents. Pragmatic features such as Plan and Goal, for instance, are reminiscent of AI work on scripts (Schank and Abelson, 1977); Elaboration and Explanation can be parallelled to RST relations (Mann and Thompson, 1987). [Figure 1: Clusters of Pragmatic Features] The intermediary features (Fig. 2) represent rhetorical semantic criteria often employed in the processing of focus information and in anaphor disambiguation. For example, Topicalisation, Focus Change, Cardinality and Ellipsis have all been used in computational contexts such as (Hobbs, 1978; Reichman, 1985; Sidner, 1983; Webber, 1983). Finally, the surface features (Fig. 3) coincide mostly with explicit cues in the text which denote cohesive and coherence relations among sentences (cf. (Luhn, 1958; Paice, 1981)). The Function Word and the Common Content Word Pools, for instance, consist of lexical items with a semantic/rhetorical load extensively discussed in a Systemic-Functional (Coulthard, 1994) and Problem-Solution context (Jordan, 1984; Jordan, 1995). Consequently, by using features such as these in COSY-MATS, all three levels of language, from the low-level surface to the high-level pragmatic, can be collectively considered in order to 'holistically' determine the importance of individual propositions in a text. Apart from this grouping of the features into different levels, the surface and the intermediary features
proposed in this scheme have also been used to objectify the abstract pragmatic features. This was in order to facilitate the automatic evaluation of the latter during the actual operation of the fully developed COSY-MATS (cf. Section 3). To this effect, a number of interlevel mappings were identified both between the pragmatic and the lower levels, and between the intermediary and the surface levels. These mappings were compiled in a manual which was used by 5 subjects in encoding texts from this corpus (Aretoulaki, 1996). The encoded texts were then employed for the empirical testing of a prototype of the content selection module, reported in Section 4. Example mappings are given below. * The pragmatic feature Repetition is correlated to the surface features Personal and Possessive. It is also associated with the intermediary Focus Change (Sidner, 1983; Webber, 1983) and Ellipsis (Hovy, 1987). This is because the central topics in a text are often resumed by means of anaphora, both in the same sentence and later on in other important sentences. * The presence of impersonal phrases in the Passive on the surface level is extensively used to express a Generalisation on the pragmatic level. The latter denotes a central text unit by definition (Gopnik, 1972; Lehnert, 1981; van Dijk and Kintsch, 1978). * The surface Negation is correlated to the intermediary Contrast (Jordan, 1984). * Modals such as &quot;should&quot; are also extensively used on the surface of discourse, when proposing, evaluating, or making tentative claims in general. Thus, this feature is also related to the pragmatic Belief/Doubt, Volition/Fear and Plan (cf. (Fukumoto and Tsujii, 1994)). Evidence for the usefulness of the interlevel mappings proposed in the context of the COSY-MATS</Paragraph> <Paragraph position="5"> content selection feature scheme was provided by validation tests regarding the uniformity of the feature evaluation practices among the human encoders (Aretoulaki, 1996). The encoding of an identical part of the corpus by means of all the pragmatic features showed that there was a total of 79.6% agreement among the encoders on the evaluation of the pragmatic features, using the above-mentioned manual.
Consequently, the identified surface and other less subjective features can be fully exploited later on for the automation of the encoding of the abstract pragmatic features. The validation tests also indicated that there was 96% agreement on which of the corpus sentences were important and which unimportant for the corresponding texts.</Paragraph> </Section> <Section position="5" start_page="75" end_page="77" type="metho"> <SectionTitle> 3 A Scalable Architecture for </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="75" end_page="77" type="sub_section"> <SectionTitle> Intelligent Summarisation </SectionTitle> <Paragraph position="0"> Having identified 'universal' content selection features, as well as some of the ways these interact with each other, the following architecture was designed for a full-scale implementation of the COSY-MATS summarisation shell (Fig. 4) (Aretoulaki, 1996). Every sentence in the text to be summarised (footnote: which is assumed to be integral and coherent, rather [...]) is first processed by a cluster of standard symbolic analysers: morphological, syntactic, semantic and pragmatic. The result of this processing is the evaluation of a set of basic linguistic and extralinguistic features that provide the input for a cascade of low- and higher-level Artificial Neural Networks (ANNs), each responsible for specific subtasks. The low-level ANNs map linguistic features (surface and intermediary) into extralinguistic features (intermediary and pragmatic). The pragmatic features provide the input to the highest-level content selection ANN that ultimately determines the relative degree of importance of each sentence. This latter ANN is also the only component of COSY-MATS that has been implemented to date. Finally, the sentences selected as important during the content selection phase will be used as the basis for generating either a comprehensive summary or a more concise abstract (Aretoulaki, 1996). This processing will take place in another cluster of symbolic
processors, almost symmetric to that used for text analysis and interpretation. It is here that the planning and the actual synthesis of the summary/abstract will be realised. However, it is important to note that the output list of the best-scoring sentences produced by the content selection ANN can also be used to provide a draft summary, i.e. a concatenation of already-existing sentences instead of an original text (cf. (Kupiec et al., 1995)). This is also the only type of generation that is currently preoccupying this research (cf. Section 4.1). Despite the dominance of the generic modules therein, COSY-MATS does provide for the incorporation of application-specific information. First of all, the architecture is highly modular, so that new specialised processors can be, in principle, simply plugged in. The simplicity of the interface between the various modules means that new modules that are either symbolic or connectionist can equally well be accommodated. For example, in addition to the existing lower-level ANNs, other ANNs can be easily incorporated which have been trained to recognise specific keywords and structural phrases that differentiate one domain or text type from the other in expressing the same rhetorical and pragmatic functions. Hence, COSY-MATS can function as a shell for the building of specialised summarisers. As regards the front-end symbolic analysers, the processing that will take place therein will be dictated by the type of data that needs to be computed in the ANNs. The latter computation, in turn, will be based on the identified generic and application-specific mappings across the three levels of description: the pragmatic, the intermediary and the surface (Section 2). In addition, it is the implementation of the content selection ANN that will determine the eventual type and number of pragmatic features required for the whole process of summarisation (Section 4). As a result, only a partial analysis and interpretation of the input text need to be performed in COSY-MATS. The common problem in NLU-based
systems of combinatorial explosion and inefficient computation in the search for a solution will thus be largely avoided. At the same time, this pragmatism in the analysis and interpretation processes does not decrease the amount of deep processing (semantic, discourse and pragmatic) carried out in the system. High-level processing is salient in the pragmatic features identified. These are, nonetheless, 'grounded' by means of the generic lower-level features, as well as other surface and semantic characteristics of texts pertaining to the specific application of interest. In summary, the proposed architecture is both modular and hybrid. The complex task of content selection is systematically decomposed into much more manageable computations. In addition, the</Paragraph> <Paragraph position="1"> strong points of both symbolic and connectionist processing are combined in a complementary way (cf. (Aretoulaki, 1996)). The symbolic analysers can work with structured data of arbitrary length laden with variables. They also have powerful symbol-matching facilities (as is appropriate for lower-level text analysis). In contrast, the ANNs are able to deal with fuzzy and inexact processing (as is involved in importance determination and interlevel feature mappings) (McClelland and Rumelhart, 1986; Rumelhart and McClelland, 1986).</Paragraph> </Section> </Section> <Section position="6" start_page="77" end_page="78" type="metho"> <SectionTitle> 4 Empirical Evidence </SectionTitle> <Paragraph position="0"> As the first and most crucial step in implementing COSY-MATS, a prototype of its content selection ANN was developed. This is a standard feed-forward back-propagation network (Rumelhart et al., 1986). This ANN receives individual text sentences from the text to be summarised, hand-coded [2] by means of the identified pragmatic features, and assigns to them degrees of importance. It has been a major assumption behind this work that it is feature combinations rather than individual features that characterise sentence
importance (Sections 1 and 2). An ANN learns such interactions naturally, which is why the connectionist paradigm was adopted for the content selection task. The training corpus consisted of 1,880 sentences in total, taken from the real-world text collection described in Section 2. 1,100 of them are sentences largely out of their context, while the remaining 780 sentences make up 29 full texts. In contrast to the diversity of the former subcorpus, each of the latter texts is approximately 23 sentences long and was fully encoded. The encoding was carried out by 5 individuals on the basis of the above-mentioned manual which exemplifies the correlations between the surface and the more abstract features in the proposed scheme. The manual was used in order to standardise the encoding process as much as possible, as well as to validate the proposed ways in which the evaluation of the abstract pragmatic features can be objectified and fully automated later on in the completed system. Experiments to date (cf. (Aretoulaki, 1996)) have demonstrated the superiority of the pragmatic features over input to the ANN from across the three levels of abstraction (58.1% vs 56.1% success on average, where 'success' coincides with agreement with the judgement made by the human encoder regarding the level of importance of the corresponding sentence). The simultaneous use of control experiments with noisy data has ensured the validity of these results (50.1% success). In addition, the testing on whole texts has provided comparable results to those acquired with isolated sentences, namely 56.8% success on average; this suggests that the pragmatic features are sufficiently abstract to capture hierarchical and structural aspects of the corresponding discourses. (Footnote 2: given that the remaining components of COSY-MATS have not been implemented as yet.) The diversity of the corpus in terms of subject matter, text type and length provides sufficient evidence for the appropriateness of the pragmatic features for the high-level representation
of texts from any domain or text type. Moreover, the portability of these pragmatic content selection features has also been partly proved with experiments on whole texts (Aretoulaki, 1996). These indicated that only a small amount of retraining is required for the ANN to deal with new text types, which involves a limited number of representative texts. Thus, what is predicted to differ between text types is the relative influence of each of the identified features in the final weighting of the corresponding sentence.</Paragraph> <Section position="1" start_page="78" end_page="78" type="sub_section"> <SectionTitle> 4.1 Generating Draft Summaries </SectionTitle> <Paragraph position="0"> The 'draft' summaries that result after concatenating the sentences of the input text that were selected by the ANN as important are, on the whole, adequate for current awareness purposes. (See (Aretoulaki, to appear) for a detailed evaluation of this and other draft output.) The ANN receives a single, coherent and largely cohesive, text each time, rather than a collection of unrelated texts. Sentence selection was based on the 24 pragmatic features used for their encoding and the statistical correlations among them, as indicated in the training corpus. Most importantly, by filtering out the sentences for which the ANN did not have a clear decision, i.e. by adapting the corresponding threshold on-line, content selection can be more fine-grained and the output summaries more brief. An example draft summary for a newspaper article after the application of this type of filtering is shown below. In this case, there was 8~ 6~ agreement between the ANN decision and the corresponding human judgement regarding the importance of individual sentences in this article [4]. (1) Moscow editors feel the old-fashioned grip of the state (Headline) (~) Intense party pressure for the dismissal of a prominent liberal editor and a new campaign to discredit the radical politician Boris Yeltsin - both apparently with the backing of President
Gorbachev - have raised fears among reformers of a conservative swing by the Soviet leadership. (5) On Monday evening, he was summoned to the Central Committee to be told in so many words by Vadim Medvedev, the Politburo member in charge of ideology, that he should leave his post. (6) The move follows a harsh talk delivered last week by Mr Gorbachev to a group of senior Soviet editors, in which he gave several a dressing down. (12) Some journalists are talking of a protest strike.</Paragraph> <Paragraph position="1"> (13) 'The press is quite simply now facing bans on what it can write about, we're going back to the situation of years ago,' one complained yesterday. (16) The motion, which could prefigure a head-on clash between the party and a steadily more assertive parliament, attacks the Central Committee ideology department for its 'unacceptable attempts' to cow a newspaper. (22) Backing for Mr Yeltsin is not universal. (23) But the fact that the parliamentary exchanges were broadcast on prime time television leaves no doubt that a campaign is under way to smear a man whose huge following makes him Mr Gorbachev's only real rival. (Footnote 4: The 5 subjects were free as to the number of sentences they could pick out from any text as important. Importance, in turn, was defined as the relative indispensability from the final summary of the propositions expressed in the corresponding sentence. This was determined on the basis of the whole text the sentence belongs to.) Despite the coincidental cohesiveness therein, this draft output comprises the majority of the semantically substantial sentences in the input text. The concatenation of sentences from the original is undoubtedly a much simpler task than the generation of an extended summary or a concise abstract. Novel text synthesis in the fully-developed COSY-MATS will also benefit from the proposed mappings between the surface and the more abstract content selection features. Since the corresponding modules, however, have not been implemented yet, the
processes involved will not be exemplified here.</Paragraph> </Section> </Section> <Section position="7" start_page="78" end_page="79" type="metho"> <SectionTitle> 5 Conclusion: COSY-MATS is not a </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="78" end_page="79" type="sub_section"> <SectionTitle> Utopia </SectionTitle> <Paragraph position="0"> All experimental results to date indicate that content selection in the completed COSY-MATS environment can be robust and efficient, even in the absence of any customisation to the specific application (domain or text type) or the user requirements. This is due to the adoption of the connectionist paradigm for this fuzzy task and the proven generic nature of the pragmatic and lower-level features used therein. In the context of further implementing this summarisation shell, current work includes the testing of alternative learning algorithms for the prototype content selection ANN in order to improve its success rate. In addition, the more rigorous specification of the mappings between the surface cues and the</Paragraph> <Paragraph position="1"> intermediary and pragmatic features is attempted for the subsequent development of specialised processors that compute them. Thus, the encoding of the pragmatic features will be fully automated and it will also be possible to measure the precise effect that this will have on the training of the whole cascade of ANNs, given the current practice of hand-coding. Moreover, the impact on the content selection ANN of incorporating application-dependent information in the system will also be studied (cf. (Aretoulaki, 1996)). What is important is that research to date has proved that the realisation of the COSY-MATS intelligent and scalable summarisation shell is by no means a utopia.</Paragraph> </Section> </Section> </Paper>