File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/97/w97-0702_metho.xml

Size: 39,030 bytes

Last Modified: 2025-10-06 14:14:42

<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0702">
  <Title>[6] Defense Advanced Research Projects Agency, Fourth Message Understanding Conference (MUC-4), McLean, Virginia, 1992, Software and Intelligent Systems</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Capsule overviews
</SectionTitle>
    <Paragraph position="0"> The majority of techniques for &quot;summarisation&quot;, as applied to average-length documents, fall within two broad categories: those that rely on template instantiation and those that rely on passage extraction. Work in the former framework traces its roots to some pioneering research by DeJong [7] and Tait [29]; more recently, the DARPA-sponsored TIPSTER programme ([2]), and in particular the message understanding conferences (MUC; e.g. [6] and [1]), have provided fertile ground for such work, by placing the emphasis of document analysis on the identification and extraction of certain core entities and facts in a document, which are &quot;packaged&quot; together in a template. There are shared intuitions among researchers that generation of smooth prose from this template would yield a summary of the document's core content; recent work, most notably by McKeown and colleagues (cf. [21]), focuses on making these intuitions more concrete. While providing a rich context for research in generation, this framework requires an analysis front end capable of instantiating a template to a suitable level of detail. Given the current state of the art in text analysis in general, and of semantic and discourse processing in particular (Sparck Jones, [27] and [28], discusses the depth of understanding required for constructing true summaries), work on template-driven, knowledge-based summarisation to date is hardly domain- or genre-independent.
    The alternative framework largely escapes this constraint, by viewing the task as one of identifying certain passages (typically sentences) which, by some metric, are deemed to be the most representative of the document's content. The technique dates back at least to the 50's (Luhn, [17]), but it is relatively recently that these ideas have been filtered through research with strongly pragmatic constraints, for instance: what kinds of documents are optimally suited for being &quot;abstracted&quot; in such a way (e.g. Preston and Williams [23], Rau et al. [25]); how to derive more representative scoring functions (e.g. for complex documents, such as multi-topic ones, Salton et al. [26], or where training from professionally prepared abstracts is possible, Kupiec et al. [15]); what heuristics might be developed for improving readability and coherence of &quot;narratives&quot; made up of discontiguous source document chunks (Paice [22]); or how to arrive at optimal presentations of such passage extracts, aimed at retaining some sense of larger and/or global context (Mahesh [18]). The cost of avoiding the requirement for a language-aware front end is the complete lack of intelligence, or even context-awareness, at the back end: the validity, and utility, of sentence- or paragraph-sized extracts as representations for the document content is still an open question (Rau [24]), especially with the recent wave of commercial products announcing built-in &quot;summarisation&quot; (by extraction) features (Caruso [4]).
    In this work, we take an approach which might be construed as striving for the best of both worlds. We use linguistically-intensive techniques to identify highly salient phrasal units across the entire span of the document, capable of functioning as topic stamps. The set of topic stamps, presented in ways which both retain local and reflect global context, is what we call salience-based content characterisation, or a capsule overview, of the document. A capsule overview is not a summary, in that it does not attempt to convey document content as a sequence of sentences. It is, however, a semi-formal (normalised) representation of the document, derived after a process</Paragraph>
    <Paragraph position="2"> of data reduction over the original text. Indeed, by adopting a finer granularity of representation (below that of the sentence), we consciously trade in &quot;readability&quot; (or narrative coherence) for tracking of detail (see footnote 2). In particular, we seek to characterise a document's content in a way which is representative of the full flow of the narrative; this is in contrast to passage extraction methods, which typically highlight only certain fragments (an unavoidable consequence of the compromises necessary when the passages are sentence-sized). A capsule overview is not a fully instantiated meaning template either. A primary consideration in our work is that content characterisation methods apply to any document source or type. This emphasis on domain independence translates into a processing model which stops short of a fully instantiated semantic representation. Similarly, the requirement for efficient, and scalable, technology necessitates operating from a shallow syntactic base; thus our procedures are designed to circumvent the need for a comprehensive parsing engine. Not having to rely upon parsing components that seek to deliver in-depth, full, syntactic analysis of text makes it possible to generate capsule summaries for a variety of documents, up to and including real data from unfamiliar domains or novel genres. For us, a capsule overview is instead a coherently presented list of those linguistic expressions which refer to the most prominent objects mentioned in the discourse (its topic stamps) and provide further specification of the relational contexts in which they appear. The intuitions underlying our approach can be illustrated with the following news article.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
PRIEST IS CHARGED WITH POPE ATTACK
</SectionTitle>
    <Paragraph position="0"> A Spanish priest was charged here today with attempting to murder the Pope. Juan Fernandez Krohn, aged 32, was arrested after a man armed with a bayonet approached the Pope while he was saying prayers at Fatima on Wednesday night. According to the police, Fernandez told the investigators today that he trained for the past six months for the assault. He was alleged to have claimed the Pope 'looked furious' on hearing the priest's criticism of his handling of the church's affairs. If found guilty, the Spaniard faces a prison sentence of</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="3" type="metho">
    <SectionTitle>
15-20 years
</SectionTitle>
    <Paragraph position="0"> There are a number of reasons why the title, &quot;Priest Is Charged with Pope Attack&quot;, is a highly representative abstraction of the core content of the article. It encapsulates the essence of what the story is about: there are two actors, identified by their most prominent characteristics; one of them has been attacked by the other; the perpetrator has been charged; there is an implication of malice to the act. The title brings the complete set of salient facts together, in a thoughtfully composed statement, designed to be brief yet informative. Whether a present day natural language analysis program can derive, without being primed of a domain and genre, the information required to generate such a summary is arguable. (This is assuming, of course, that generation techniques could, in their own right, do the planning and delivery of such a concise and information-packed message.) However, part of the task of delivering accurate content characterisation is being able to identify the components of this abstraction (e.g., &quot;priest&quot;, &quot;pope&quot;, &quot;attack&quot;, &quot;charged with&quot;). It is from these components that, eventually, a message template would begin to be constructed. It is also precisely these components, viewed as phrasal units with certain discourse properties, that a capsule overview should locate and present as a characterisation of the content of a text document. Our strategy is to mine a document for the most salient (and, by hypothesis, the most representative) phrasal units, as well as the relational expressions they are associated with, with the goal of establishing the kind of core content specification that is captured by the title of this example.
    The remainder of this paper is organised as follows. Given the importance we assign to phrasal identification, we outline in Section 2 the starting point for this work: research on terminology identification and extending this to non-technical domains. In particular, we focus on the problems that base-line terminology identification encounters when applied to an open-ended range of text documents, and outline a set of extensions required for adapting it to the goal of core content identification. Essentially, these boil down to formalising and implementing an operational notion of salience which can be used to impose an ordering on phrasal units according to the topical prominence of the objects they refer to; this is discussed in Section 3. Section 4 illustrates the processes involved in topic identification and construction of capsule overviews by example. We close by positioning this work within the space of summarisation techniques.
    2 Phrasal identification for content characterisation
    The identification and extraction of technical terminology is, arguably, one of the better understood and most robust NLP technologies within the current state of the art of phrasal analysis. What is particularly interesting for us is the fact that the linguistic properties of technical terms lead to the definition of computational procedures, capable of term identification across a wide range of technical prose, while maintaining their quality regardless of document domain and type. Since topic stamps are essentially phrasal units with certain discourse properties (they manifest a high degree of salience within contiguous discourse segments), we define the task of content characterisation as one of identifying phrasal units with lexico-syntactic properties similar to those of technical terms and with discourse properties which signify their status as &quot;most prominent&quot;. In Section 3, we show how these discourse properties are computable as a function of the grammatical distribution of the phrase. Below we discuss the potential of terminology identification for content characterisation.
    Footnote 2: A list of topic stamps is, by itself, not a coherent summary; however, by employing appropriately designed presentation metaphors (aiming, overall, to retain contextual cues associated with topic stamps in context), our topic stamps are more contentful than just a list of (noun or verb) phrases. This paper focuses on the linguistic processes underlying the automatic identification and extraction of topic stamps and their organisation within capsule overviews. The issues of the right presentation metaphor and</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
2.1 Technical terminology: strengths
and limitations
</SectionTitle>
      <Paragraph position="0"> One of the best defined procedures for technical terminology identification is that developed by Justeson and Katz [10], who focus on multi-word noun phrases occurring in continuous texts. A study of the linguistic properties of these constituents (preferred phrase structures, behaviour towards lexicalisation, contraction patterns, and certain discourse properties) leads to the formulation of a robust and domain-independent algorithm for term identification. Justeson and Katz's TERMS algorithm accomplishes high levels of coverage; it can be implemented within a range of underlying NLP technologies (e.g. morphologically enhanced lexical look-up [10], part-of-speech tagging [5], or syntactic parsing [20]); and it has strong cross-linguistic application (see, for instance, [3]). Most importantly for our purposes, the algorithm is particularly useful for generating a &quot;first cut&quot; towards a broad characterisation of the content of the document.
      Conventional uses of technical terminology are most commonly identified with text indexing, computational lexicology, and machine-assisted translation. Less common is the use of technical terms as a representation of the topical content of a document. This is to a large extent an artifact of the accepted view, at least in an information retrieval context, stipulating that terms of interest are the ones that distinguish documents from each other; almost by definition, these are not the terms which are representative of the &quot;aboutness&quot; of a document. Still, it is clear that a program like TERMS is a good starting point for distilling representative lists. For example, [10, appendix] presents several term sets: &quot;stochastic neural net&quot;, &quot;joint distribution&quot;, &quot;feature vector&quot;, &quot;covariance matrix&quot;, &quot;training algorithm&quot;, and so forth, accurately characterise a document as belonging to the statistical pattern classification domain; &quot;word sense&quot;, &quot;lexical knowledge&quot;, &quot;lexical ambiguity resolution&quot;, &quot;word meaning&quot;, &quot;semantic interpretation&quot;, &quot;syntactic realization&quot;, and so forth assign, equally reliably, a document to the lexical semantics domain.
      Such lists are representative; unfortunately, they can easily become overwhelming. Conventionally, volume is controlled by promoting terms with higher frequencies. This, however, is a very weak metric for our purposes; it also does not scale down well for texts which are smaller than typical instances of technical prose or scientific articles, such as news stories, press releases, or web pages. The notion of technical term needs appropriate extensions, so that it applies not just to scientific prose, but to an open-ended set of document types and genres. Below we address this issue by discussing how a basic term set can be enriched in order to convey a more refined picture of content.</Paragraph>
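      <Paragraph> The kind of term identification discussed above can be sketched in a few lines. This is a minimal illustration only, not the TERMS implementation: it assumes the input is already part-of-speech tagged with single-letter tags (A for adjective, N for noun, P for preposition, anything else for other words), and it uses the noun-phrase filter and the "occurs at least twice" criterion as they are commonly cited for Justeson and Katz's algorithm; neither is spelled out in this paper. The names TERM_PATTERN and candidate_terms are hypothetical.

```python
import re
from collections import Counter

# Assumption: one character per tag. A = adjective, N = noun, P = preposition.
# Pattern as commonly cited for Justeson and Katz: ((A|N)+ | (A|N)*(N P)?(A|N)*) N
TERM_PATTERN = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")


def candidate_terms(tagged_tokens, min_count=2):
    """Return multi-word phrases whose tag sequence matches the pattern
    and which occur at least min_count times in the token stream."""
    tags = "".join(tag for _, tag in tagged_tokens)
    counts = Counter()
    n = len(tagged_tokens)
    for start in range(n):
        # end - start >= 2, so single words are never proposed as terms
        for end in range(start + 2, n + 1):
            if TERM_PATTERN.fullmatch(tags[start:end]):
                words = [word.lower() for word, _ in tagged_tokens[start:end]]
                counts[" ".join(words)] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


tokens = [
    ("the", "D"), ("feature", "N"), ("vector", "N"), ("is", "V"),
    ("large", "A"), ("and", "C"), ("the", "D"), ("feature", "N"),
    ("vector", "N"), ("helps", "V"),
]
print(candidate_terms(tokens))
```

      Note that this sketch also counts sub-phrases nested inside longer matches, and that frequency is the only control on volume, which is exactly the weakness for short texts described above.</Paragraph>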
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
2.2 Extended phrasal analysis
</SectionTitle>
      <Paragraph position="0"> As noted above, without the closed nature of technical domains and documentation, it is not clear what use can be made of term sets derived from arbitrary texts. Certainly we cannot even talk of &quot;technical terms&quot; in the narrower sense assumed by the TERMS algorithm. The question is whether similar phrase identification technology generates phrase sets which can be construed as broadly characteristic of the topical content of a document, in the same way in which a term set can be viewed as characteristic of the domain to which technical prose belongs. In other words, the question concerns the wider applicability of linguistic processing targeted at term identification, relation extraction, and object cross-classification. Can a set of phrases derived in this way provide a representational base which enables rapid, compact, and accurate appreciation of the information contained in an arbitrarily chosen document? Three problems arise when &quot;vanilla&quot; term sets are considered as the basis for a content characterisation task.
      Undergeneration. For a set of phrases to be truly representative of document content, it must provide an exhaustive description of the entities discussed in the document. That is, it ought to contain not just those expressions which satisfy the strict phrasal definition of &quot;technical term&quot;, but rather every expression which mentions a participant in the events described in the text. Phrasal analysis must therefore be extended to include pronouns and reduced descriptions, in addition to the more complex nominals which correspond to true technical terms.
      Overgeneration. Relaxation of the canonical phrasal definition of technical term leads to information overload. When applied to a document without regard to domain or genre, a system which extracts phrases on the basis of relaxed canonical terminology constraints will typically generate a term set far larger than a user can absorb without cognitive overhead. At the same time, the set may contain several distinct phrasal units which refer to the same discourse object. Without some means of resolving anaphoric relations, these crucial connections will be lost.
      Differentiation. Finally, while a list of terms may be topical for the particular source document in which they occur, other documents within the same domain are likely to yield similar, overlapping sets of terms. Unacceptably, this might result in two documents containing the same or similar terms being classified as &quot;about the same thing&quot;, when in fact they might focus on completely different subtopics within the general domain they share.
      Although we approach these problems in slightly different ways, the solutions are interconnected, and it is their interaction that is crucial to the derivation of capsule overviews from extended phrasal analyses. The exact mechanisms involved in the processing are described in more detail in Section 3; here we outline the modifications and extensions to traditional term identification technology which address the above problems. First, undergeneration is resolved by implementing a suitable generalisation, and relaxation, of the notion of a term, so that identification and extraction of phrasal</Paragraph>
      <Paragraph position="2"> but which results in an extended phrase set, containing an exhaustive listing of the objects mentioned in the text. Second, overgeneration is resolved through reduction of the extended phrase set in two ways. The extended phrase set is transformed, through the application of an anaphora resolution procedure (see Section 3 below, and Kennedy and Boguraev [13], [14]), into a set of expressions which uniquely identify the objects referred to in the text (hereafter a referent set). However, the data reduction arising from distilling the extended phrase set down to a smaller referent set is still not enough. In order to eliminate cognitive overload for the user, the referent set must be further reduced to a small, coherent, and easily absorbed listing of just those expressions which identify the most important objects in the text. An intuitive and straightforward means of accomplishing this involves ranking the members of the referent set according to a measure of the prominence, or importance, in the text of the objects they refer to. Such a ranking not only provides the basis for identifying topic stamps, it also solves the third problem above, that of differentiation. Although two related documents may instantiate the same term sets, if the documents are concerned with different topics, then the relative importance of the terms in the two documents will differ as a function of differences in use and grammatical distribution. The underlying intuition is that term sets can be differentiated in two ways: lexically, by virtue of containing different terms, or hierarchically, by virtue of the ordering of their members. Ordered term sets, in the latter case, provide distinct characterisations of documents, even if the overall lexical make-up of the term sets is similar. Given a formalised notion of &quot;importance&quot;, we can generate a coherent set of topic stamps from an undifferentiated referent set, while overcoming the lack of coherence inherent in unordered term sets.
      The challenge, then, is to define a suitable selection procedure, operating over a larger set of phrasal units than that generated by a typical term identification algorithm (including not only all terms, but term-like phrases, as well as their variants, reduced forms, and anaphoric references), making informed choices about the degree to which each phrase is representative of the text as a whole, and presenting its output in a form which retains contextual information for each phrase. The key to normalising the content of a document to a small set of distinguished, and discriminating, phrasal units is being able to establish a containment hierarchy of phrases (which would eventually be exploited for capsule overview presentation at different levels of granularity), and being able to make refined judgements concerning the degree of relevance of each unit, within its own (local) discourse segment. In other words, we need to be able to filter a term set in such a way that those expressions which are most representative of the content of the document are selected as topic stamps. The next Section describes the process of constructing exactly this type of &quot;importance-based&quot; ranking by building on and extending a crucial feature of the anaphora resolution procedure used to generate the referent set: salience.</Paragraph>
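      <Paragraph> The differentiation argument can be made concrete with a toy example. The sketch below is purely illustrative: the two score tables are invented, and lexical_overlap and top_terms are hypothetical helper names. It shows two documents whose term sets are lexically identical, yet whose importance orderings, and hence whose highest-ranked terms, differ.

```python
def lexical_overlap(terms_a, terms_b):
    """Jaccard overlap of two term sets; the ordering of terms is ignored."""
    a, b = set(terms_a), set(terms_b)
    return len(a.intersection(b)) / len(a.union(b))


def top_terms(scores, n):
    """The n highest-scoring terms, best first (ties broken alphabetically)."""
    return sorted(scores, key=lambda term: (-scores[term], term))[:n]


# Invented importance scores for two documents from the same domain.
doc_a = {"apple": 9, "microsoft": 7, "operating system": 2}
doc_b = {"apple": 3, "microsoft": 2, "operating system": 8}

print(lexical_overlap(doc_a, doc_b))  # the term sets are identical
print(top_terms(doc_a, 1), top_terms(doc_b, 1))
```

      An unordered comparison treats the two documents as being about the same thing; the ordered comparison separates them.</Paragraph>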
    </Section>
  </Section>
  <Section position="6" start_page="3" end_page="3" type="metho">
    <SectionTitle>
3 Salience-based content characterisation
</SectionTitle>
    <Paragraph position="0"> Salience is a measure of the relative prominence of objects in discourse: objects with high salience are the focus of attention; those with low salience are at the periphery. In an effort to resolve the problems facing a term-based approach to content characterisation, we have developed a procedure which uses a salience feature as the basis for the type of &quot;ranking by importance&quot; of referents discussed above, and ultimately for topic stamp identification. By determining the salience of the members of a referent set, an ordering can be imposed which, in connection with an appropriate choice of threshold value, provides the basis for a reduction of the entire term set to only those terms which identify the most prominent participants in the discourse. This reduced set of terms, in combination with relational information of the sort discussed in the previous section and folded into an appropriate presentation metaphor, may then be presented as a characterisation of a document's content. Crucially, this analysis satisfies the requirements mentioned above: it is concise, it is coherent, and it does not introduce the cognitive overload associated with a full-scale term analysis.
    This strategy for scaling up the phrasal analysis provided by standard term identification technology has at its core the utilisation of a crucial feature of discourse structure: the prominence, over some segment of text, of particular referents, something that is missing from the traditional technology for &quot;bare&quot; terminology identification. Below we describe the core details of our technology. First, we explain more concretely what we mean by &quot;segment of text&quot;, why segments are important, and how they are determined. Second, we present a method for determining salience which, when applied to arbitrary sets of phrasal units, generates an ordering that accurately represents the relative prominence of the objects referred to in a document. We also describe what linguistic information, available through scalable and robust identification technologies, can be leveraged to inform such a notion of salience. Finally, we give an overview of a linguistic processing environment which, while carrying out these tasks, remains open-ended with respect to the language, domain, style and genre of the texts we want to be able to handle.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.1 Discourse segmentation
</SectionTitle>
      <Paragraph position="0"> The example in Section 1 illustrates the importance of discourse segmentation. As it happens, the title in this case works as an overview of the content of the passage because the text itself is fairly short. As a text increases in length, the &quot;completeness&quot; of a short description as a characterisation of content deteriorates. If the intention is to use concise descriptions consisting of one or two topical phrases (topic stamps) plus modificational and relational information as the primary information-bearing units for capsule overview, then it follows that texts longer than (roughly) one to three paragraphs must be broken down into smaller units or segments. The approach to segmentation we adopt implements a similarity-based algorithm along the lines of the one developed by Hearst [8], which identifies topically coherent sections of text using a lexical similarity measure. In the final presentation of results, each segment is associated with a concise, phrasal-based description of its content without loss of accuracy. The set of such descriptions, ordered according to linear sequencing of the segments in the text, may then be used as the basis for a capsule overview. The problem of content characterisation of a large text, then, is reduced to the problem of finding topic stamps for each segment in the document.</Paragraph>
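      <Paragraph> A similarity-based segmenter of this general kind can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of [8] and not the component used in our system: it compares bag-of-words vectors for the windows of sentences on either side of each gap and places a boundary wherever their cosine similarity does not exceed a fixed threshold. The window size, the threshold, and the function names are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(sentences):
    """Lower-cased word counts for a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[a-z]+", sentence.lower()))
    return counts


def cosine(a, b):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def segment(sentences, window=2, threshold=0.1):
    """Group sentences into segments, splitting at low-similarity gaps."""
    segments, current = [], []
    for i, sentence in enumerate(sentences):
        if current:
            left = bag_of_words(sentences[max(0, i - window):i])
            right = bag_of_words(sentences[i:i + window])
            if threshold >= cosine(left, right):
                segments.append(current)
                current = []
        current.append(sentence)
    if current:
        segments.append(current)
    return segments


text = [
    "apple sells computers",
    "apple computers run software",
    "the pope visited the church",
    "the church welcomed the pope",
]
for block in segment(text):
    print(block)
```

      A fixed threshold is the crudest possible boundary criterion; it is used here only to keep the example short.</Paragraph>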
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.2 Local salience
</SectionTitle>
      <Paragraph position="0"> As noted in Section 2.2, the set of expressions generated by extended phrasal analysis typically contains a number of anaphoric expressions (pronouns, reduced descriptions, etc.) which must be resolved. Our anaphora resolution algorithm is based on a procedure developed by Lappin and Leass [16], and is described in detail in Kennedy and Boguraev [13], [14]; in essence, it develops an adaptation for deriving reliable interpretation from considerably shallower linguistic analysis of the input. We make the simplifying assumption that every phrase identified by extended phrasal analysis constitutes a &quot;mention&quot; of a participant in the discourse (see Mani and MacMillan [19] for discussion of the notion of &quot;mention&quot; in the context of proper names interpretation). Coreference is represented by equivalence classes of nominals, where each equivalence class corresponds to a unique referent in the discourse. The set of such equivalence classes constitutes the referent set discussed above.
      However, anaphora resolution is important not only for reducing the extended phrase set; it also plays a crucial role in the identification of topic stamps. The reason this is so is that it is based on a strict definition of the notion of salience. Roughly speaking, an antecedent for an anaphoric expression is located by first eliminating all impossible candidate antecedents, then ranking the remaining candidates according to a local salience measure, and selecting the most salient candidate as the antecedent. Local salience is a function of how a candidate satisfies a set of grammatical, syntactic, and contextual parameters. Following Lappin and Leass, we refer to these constraints as &quot;salience factors&quot;. Individual salience factors are associated with numerical values, as follows (see footnote 4):
        SENT: 100 iff the expression is in the current sentence
        CNTX: 50 iff the expression is in the current discourse segment
        SUBJ: 80 iff the expression is a subject
        EXST: 70 iff the expression is in an existential construction
        POSS: 65 iff the expression is a possessive
        ACC: 50 iff the expression is a direct object
        DAT: 40 iff the expression is an indirect object
        OBLQ: 30 iff the expression is the complement of a preposition
        HEAD: 80 iff the expression is not contained in another phrase
        ARG: 50 iff the expression is not contained in an adjunct
      The local salience of a candidate is the sum of the values of the salience factors that are satisfied by some member of the equivalence class to which the candidate belongs (note that values may be satisfied at most once by each member of the class). One important aspect of these numerical values is that they impose a relational structure among the salience factors; crucially, as observed by Lappin and Leass, such a structure reflects the relative ranking of the factors. This is justified both linguistically, as a consequence of the role played by the functional hierarchy in determining anaphoric relations (see e.g. Keenan and Comrie [12]), as well as by experimental results (see Lappin and Leass [16], Kennedy and Boguraev [13], [14] for discussion). An important feature of local salience is that it is variable: the salience of a referent decreases and increases according to the frequency with which new members are added to the equivalence class to which it belongs. When an anaphoric link is established, the anaphor is added to the equivalence class to which its antecedent belongs, and the salience of the class is boosted accordingly. If a referent ceases to be mentioned in the text, however, its local salience is incrementally decreased.</Paragraph>
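      <Paragraph> The salience computation just described can be illustrated with a short sketch. The weights are those listed above; everything else is an assumption made for the example. In particular, the sketch reads the definition as "each factor contributes once if any mention in the equivalence class satisfies it", omits the gradual decay of salience for referents that drop out of the text, and skips the filtering of impossible candidates. The names local_salience and pick_antecedent are hypothetical.

```python
SALIENCE_WEIGHTS = {
    "SENT": 100,  # in the current sentence
    "CNTX": 50,   # in the current discourse segment
    "SUBJ": 80,   # subject
    "EXST": 70,   # in an existential construction
    "POSS": 65,   # possessive
    "ACC": 50,    # direct object
    "DAT": 40,    # indirect object
    "OBLQ": 30,   # complement of a preposition
    "HEAD": 80,   # not contained in another phrase
    "ARG": 50,    # not contained in an adjunct
}


def local_salience(equivalence_class):
    """Sum the weights of every factor satisfied by at least one mention.

    equivalence_class is a list of sets, one set of factor names per mention.
    """
    satisfied = set()
    for mention_factors in equivalence_class:
        satisfied.update(mention_factors)
    return sum(SALIENCE_WEIGHTS[factor] for factor in satisfied)


def pick_antecedent(candidates):
    """Return the candidate referent with the highest local salience.

    candidates maps a referent name to its equivalence class.
    """
    return max(candidates, key=lambda name: local_salience(candidates[name]))


candidates = {
    "priest": [{"SENT", "CNTX", "SUBJ", "HEAD", "ARG"}],
    "pope": [{"CNTX", "ACC", "HEAD", "ARG"}],
}
print(local_salience(candidates["priest"]), local_salience(candidates["pope"]))
print(pick_antecedent(candidates))
```

      With these invented mentions, the subject in the current sentence outranks the direct object from an earlier sentence, which is the behaviour the factor weights are designed to produce.</Paragraph>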
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.3 Discourse salience
</SectionTitle>
      <Paragraph position="0"> Consider again the news article discussed in Section 1. Intuitively, the reason why &quot;priest&quot; is at the focus of the title is that there are no less than eight references to the same actor in the body of the story (these are marked by italicising them in the example); moreover, these references occur in prominent syntactic positions: five are subjects of main clauses, two are subjects of embedded clauses, and one is a possessor. Similarly, the reason why &quot;Pope&quot; is the secondary object of the title is that he also receives multiple mentions (five), but these references tend to occur in less prominent positions (two are direct objects). In order to generate such a broad picture of the prominence of referents across a discourse, we maintain a measure of the salience of referents both in the text as a whole, and in the discourse segments in which they occur. This is accomplished through an elaboration of the local salience computation described above, which interprets the same conditions with respect to a non-decreasing discourse salience value. Local salience, because of its variability, provides a realistic representation of the antecedent space for an anaphor. In contrast, discourse salience reflects the distributional properties of a referent as the text story unfolds. This non-decreasing salience measure underlies a detailed representation of discourse structure which, when overlaid onto the results of discourse segmentation, gives a coherent representation of the topical prominence of particular referents in specific segments of text. Specifically, it becomes the basis for exactly the type of importance-based ranking of referents discussed in Section 2.2. Using this ordering, we define the topic stamps</Paragraph>
      <Paragraph position="1"> for a segment S to be the n highest-ranked referents in S (where n is a scalable value).
      Footnote 4: Our salience factors mirror those used by Lappin and Leass, with the exception of POSS, which is sensitive to possessive expressions, and CNTX, which is sensitive to the discourse segment in which a candidate appears.</Paragraph>
      <Paragraph position="3"/>
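      <Paragraph> Selecting topic stamps from a non-decreasing salience measure might look like the sketch below. It is an illustration under stated assumptions, not the procedure of [13] and [14]: discourse salience is modelled as a running sum of per-mention salience contributions within each segment, with no decay, and the input triples and their scores are invented. The function name topic_stamps is hypothetical.

```python
from collections import defaultdict


def topic_stamps(mentions, n=2):
    """Return the n highest-ranked referents for each segment.

    mentions is an iterable of (segment_id, referent, salience_increment)
    triples. Scores only ever accumulate, so the measure is non-decreasing.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for segment_id, referent, increment in mentions:
        totals[segment_id][referent] += increment
    return {
        segment_id: sorted(scores, key=lambda r: (-scores[r], r))[:n]
        for segment_id, scores in totals.items()
    }


# Invented salience contributions for two segments.
mentions = [
    (1, "Apple", 310), (1, "Microsoft", 230), (1, "Gates", 180),
    (1, "Apple", 260),
    (2, "operating system", 310), (2, "desktop machines", 260),
    (2, "Amelio", 130),
]
print(topic_stamps(mentions, n=2))
```

      The parameter n plays the role of the scalable value in the definition above: raising it yields a more detailed, and longer, overview for each segment.</Paragraph>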
    </Section>
  </Section>
  <Section position="7" start_page="3" end_page="3" type="metho">
    <SectionTitle>
4 Example
</SectionTitle>
    <Paragraph position="0"> The operatxonal coml&gt;onents tocontent charactensaUon described here fall m the following categories dmcourse segmentation, phrasal analysm (of nonunal expressions and relaUons), anaphora resolution and generation of the referent set, calculatton of d~scourrse sahence and ranking of referents by segment, tdenttficat~on of topic stamps, and enriching topic stamps wtth relatwnal context(s) Some of the fimct~onahty follows duectly from teraunology ~dent~ficatton, m parttcular, both relation tdentff~cattton and extended phrasal analys~s are camed out by running a phrasal grammar over a stream of text tokans tagged for morphologtcal, syntactic, and grammat~ca.l fimct~on, thin ts m addtt~on to a grammar mmmg for terms and, generally6 referents (Base level lmgmsttc analys~s m pmwded by the LINGSOFT supertagger, \[11\] ) The later, more semant~cally-mtensxve algorithms are described m some detail m \[13\] and \[14\] We illustrate the procedure by htghhghtmg certain aspects of a capsule over~ew of a recent Forbes amcle (\[9\]) The document xs of medmm-to-large stze (approxtmately four pages m print), and focuses on the strategy of Gilbert Ameho (Apple Computer's CEO) concerning a new operating system for the Macintosh Too long to quote here m full, the following passage from the beginning of the article contains the first, second and third segments, as ~denUfied by the discourse segmantat~on component described m Section 31 (cf \[8\]), m the example below, segment boundaries are marked by extra verttcal space) ON'I~ DAY everythmgBdlC-ateshassoldyou up tonow whekhexst sWmdows 95 car Windows 97 wall become dbsolete, ' declan~ Calbert Ameho, the bossatAppleComputer Gates m vulnerable at that pomt And we want to make sum we re m~ly to come forward wtth a superior arawer&amp;quot; Bdl Gates vul~a~ble~ Apple wmdd swoop m and take Mtcrosoffs customem~ R~hculoos' lmpm.~ble~ In the last ~ year Apple lc~i: $816 Mtcm~ofi made $2 2 bdhon Macrosoft has a 
market value thirty times that of Apple. Outlandish and grandiose as Amelio's idea sounds, it makes sense for Apple to think in such big, bold terms. Apple is in a position where standing pat almost certainly means slow death. It's a bit like a patient with a probably terminal disease deciding to take a chance on an untested but promising new drug. A bold strategy is the least risky strategy. As things stand, customers and outside software developers alike are deserting the company. Apple needs something dramatic to persuade them to stay aboard. A radical redesign of the desktop computer might do the trick. If they think the redesign has merit, they may feel compelled to get on the bandwagon lest it leave them behind.

Lots of "ifs," but you can't accuse Amelio of lacking vision. Today's desktop machines, he says, are ill-equipped to handle the coming power of the Internet. Tomorrow's machines must accommodate rivers of data, multimedia and multitasking (juggling several tasks simultaneously). We're past the point of upgrading, he says. Time to scrap your operating system and start over. The operating system is the software that controls how your computer's parts (memory, disk drives, screen) interact with applications like games and Web browsers. Once you've done that, buy new applications to go with the reengineered operating system.

Amelio, 53, brings a lot of credibility to this task. His resume includes both a rescue of National Semiconductor from near-bankruptcy and 16 patents, including one for coinventing the charge-coupled device. But where is Amelio going to get this new operating system? From Be, Inc., in Menlo Park, Calif., a half-hour's drive from Apple's Cupertino headquarters, a hot little company founded by ex-Apple visionary Jean-Louis Gassee. Its BeOS, now undergoing clinical trials, is that radical redesign in operating systems that Amelio is talking about. Married to hardware from Apple and Apple cloners, the BeOS just might be a credible competitor to Microsoft's Windows, which runs on IBM-compatible hardware.

The capsule overview was automatically generated by a fully implemented, and 
operational, system, which incorporates all of the processing components identified above. The relevant sections of the overview (for the three segments of the passage quoted) are as follows.  The division of this passage into segments, and the segment-based assignment of topic stamps, exemplifies a capsule overview's "tracking" of the underlying coherence of a story. The discourse segmentation component recognizes shifts in topic: in this example, the shift from discussing the relation between Apple and Microsoft, to some remarks on the future of desktop computing, to a summary of Amelio's background and plans for Apple's operating system. Layered on top of segmentation are the topic stamps themselves, in their relational contexts, at a phrasal level of granularity. The first segment sets up the discussion by positioning Apple opposite Microsoft in the marketplace and focusing on their major products, the operating systems. The topic stamps identified for this segment, APPLE and MICROSOFT, together with their local contexts, are both indicative of the introductory character of the opening paragraphs and highly representative of the gist of the first segment. Note that the apparent uninformativeness of some relational contexts, for example "APPLE is in a position", does not pose a serious problem. An adjustment of the granularity, at capsule overview presentation time, reveals the larger context in which the topic stamp occurs (e.g., a sentence), which in turn inherits the high topicality ranking of its anchor: "APPLE is in a position where standing pat almost certainly means slow death." For the second segment of the sample, OPERATING SYSTEM and DESKTOP MACHINES have been identified as representative. The set of four phrases illustrated provides an encapsulated snapshot of the segment, which introduces Amelio's views on coming challenges for desktop machines and the general concept of an operating system. Again, even if some of these are somewhat under-specified, more detail is easily 
available by a change in granularity, which reveals the definitional nature of the even larger context: "The OPERATING SYSTEM is the software that controls how your computer's parts ..." The third segment of the passage exemplified above is associated with the stamps GILBERT AMELIO and NEW OPERATING SYSTEM. The reasons, and linguistic rationale, for the selection of these particular noun phrases as topical are essentially identical to the intuition behind "priest" and "Pope" being the central topics of the example in Section 1. The computational justification for the choices lies in the extremely high values of salience, resulting from taking into account a number of factors: coreferentiality between "Amelio" and "Gilbert Amelio"; coreferentiality between "Amelio" and "His"; syntactic prominence of "Amelio" (as a subject), promoting topical status higher than for instance "Apple" (which appears in adjunct positions); high overall frequency (four, counting the anaphor, as opposed to three for "Apple", even if the two get the same number of text occurrences in the segment); and a boost in global salience measures, due to "priming" effects of both referents for "Gilbert Amelio" and "operating system" in the prior discourse of the two preceding segments. Even if we are unable to generate a single phrase summary in the form of, say, "Amelio seeks a new operating system", the overview for the closing segment comes close; arguably, it is even better than any single phrase summary. As the discussion of this example illustrates, a capsule overview is derived by a process which facilitates partial understanding of the text by the user. The final set of topic stamps is designed to be representative of the core of the document content. It is compact, as it is a significantly cut-down version of the full list of identified terms. It is highly informative, as 
the terms included in it are the most prominent ones in the document. It is representative of the whole document, as a separate topic tracking module effectively maintains a record of where and how referents occur in the entire span of the text. As the topics are, by definition, the primary content-bearing entities in a document, they offer an accurate approximation of what that document is about.</Paragraph>
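    <Paragraph position="1"> The interplay of the salience factors discussed for the third segment (frequency over coreference classes, syntactic prominence, and priming from preceding segments) can be made concrete with a small sketch. It is an illustration only, not the algorithm of [13] and [14]: the names rank_referents, ROLE_WEIGHT, PRIMING_DECAY and example_segments, the numeric weights, and the hand-built mention lists are all assumptions introduced here, and anaphors are taken to be already resolved to their referents.

```python
from collections import defaultdict

# Illustrative weights only: the text names the factors, not their values.
ROLE_WEIGHT = {"subject": 3.0, "object": 2.0, "adjunct": 1.0}
PRIMING_DECAY = 0.5  # share of accumulated salience carried to the next segment


def rank_referents(segments, top_n=2):
    """Return the top_n most salient referents for each segment.

    segments is a list of segments; each segment is a list of
    (referent, role) mentions, with anaphors already resolved.
    """
    carried = defaultdict(float)  # salience primed by preceding segments
    stamps = []
    for mentions in segments:
        local = defaultdict(float)
        for referent, role in mentions:
            # Each mention adds weight according to its syntactic role,
            # so frequency and prominence are combined in one score.
            local[referent] += ROLE_WEIGHT[role]
        # Priming boosts only referents actually mentioned in this segment.
        total = {r: s + carried[r] for r, s in local.items()}
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(total, key=lambda r: (-total[r], r))
        stamps.append(ranked[:top_n])
        # Decay earlier salience and fold in this segment's contribution.
        for r in set(carried) | set(local):
            carried[r] = PRIMING_DECAY * (carried[r] + local.get(r, 0.0))
    return stamps


# Hand-built toy mentions loosely modelled on the three quoted segments.
example_segments = [
    [("Apple", "subject"), ("Microsoft", "subject"), ("Apple", "subject"),
     ("Microsoft", "subject"), ("Amelio", "adjunct")],
    [("desktop machines", "subject"), ("desktop machines", "subject"),
     ("operating system", "object"), ("operating system", "subject"),
     ("operating system", "adjunct"), ("Amelio", "adjunct")],
    [("Amelio", "subject"), ("Amelio", "subject"), ("Amelio", "subject"),
     ("Amelio", "subject"), ("Apple", "adjunct"), ("Apple", "adjunct"),
     ("Apple", "adjunct"), ("operating system", "object"),
     ("operating system", "object")],
]

if __name__ == "__main__":
    for number, stamp in enumerate(rank_referents(example_segments), start=1):
        print(number, stamp)
```

On these toy mentions the sketch selects Apple and Microsoft for the first segment, desktop machines and operating system for the second, and Amelio and operating system for the third. In the third segment Amelio outranks Apple because its four mentions are all subjects while Apple's three are adjuncts, and operating system outranks Apple partly through the salience carried over from the second segment.</Paragraph>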
  </Section>
</Paper>