<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0704">
  <Title>Automated Text Summarization in SUMMARIST</Title>
  <Section position="4" start_page="18" end_page="18" type="metho">
    <SectionTitle>
2. Interpretation
</SectionTitle>
    <Paragraph position="0"> Simply aggregating together frequently mentioned portions of the input text does not in itself make an abstract. What are the central, most important concepts in the following story? John and Bill wanted money. They bought ski-masks and guns and stole an old car from a neighbor. Wearing their ski-masks and waving their guns, the two entered the bank, and within minutes left the bank with several bags of $100 bills. They drove away happy, throwing away the ski-masks and guns in a sidewalk trash can. They were never caught. The popular method of simple word counting would indicate that the story is about ski-masks and guns, both of which are mentioned three times, more than any other word. Clearly, however, the story is about a robbery, and any summary of it must mention this fact. Some process of interpreting the individual words as part of some encompassing concept is required. One such process, word clustering, is an essential technique for topic identification in IR. This technique would match the words &amp;quot;gun&amp;quot;, &amp;quot;mask&amp;quot;, &amp;quot;money&amp;quot;, &amp;quot;caught&amp;quot;, &amp;quot;stole&amp;quot;, etc., against the set of words that form the so-called signature for the word &amp;quot;robbery&amp;quot;. Other, more sophisticated forms of word clustering and fusion are possible, including script matching, deductive inference, and concept clustering. 3. Generation: Two options exist: either the output is a verbatim quotation of some portion(s) of the input, or it must be generated anew. In the former case, no generator is needed, but the output is not likely to be high-quality text (although this might be sufficient for the application).</Paragraph>
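The signature-matching idea described above can be sketched as follows. This is a toy illustration, not the paper's implementation: the signature word sets and the crude plural stripping are invented for the example.

```python
# Toy sketch of signature-based topic identification (not the paper's code).
# The indicator sets below are invented for illustration.
SIGNATURES = {
    "robbery": {"gun", "mask", "money", "caught", "stole", "bank", "bags"},
    "cooking": {"oven", "recipe", "flour", "bake", "stir"},
}

def best_topic(words):
    """Return the signature head whose indicator set overlaps most with the text."""
    scores = {head: len(ind & set(words)) for head, ind in SIGNATURES.items()}
    return max(scores, key=scores.get)

story = "they bought ski masks and guns and stole money from the bank".split()
# Normalize crude plurals so "masks"/"guns" match the singular indicators.
story = [w.rstrip("s") if w.endswith("s") else w for w in story]
print(best_topic(story))  # robbery
```

Even this crude overlap count already prefers "robbery" over unrelated signatures, which is the intuition the story example is making.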
  </Section>
  <Section position="5" start_page="18" end_page="19" type="metho">
    <SectionTitle>
2 The Structure of SUMMARIST
</SectionTitle>
    <Paragraph position="0"> For each of the three steps of the above 'equation', SUMMARIST uses a mixture of symbolic world knowledge (from WordNet and similar resources) and statistical or IR-based techniques. Each stage employs several different, complementary methods (SUMMARIST will eventually contain several modules in each stage). To date, we have developed some methods for each stage of processing, and are busy developing additional methods and linking them into a single system. In the next sections we describe one method from each stage. The overall architecture is shown in Figure 1. Figure 1: Architecture of SUMMARIST.</Paragraph>
    <Paragraph position="2"> 2.1 Topic Identification. Several techniques for topic identification have been reported in the literature, including methods based on Position \[Luhn 58, Edmundson 69\], Cue Phrases \[Baxendale 58\], word frequency, and Discourse Segmentation \[Marcu 97\]. We describe here just our work on SUMMARIST's Position module. This method exploits the fact that in some genres, regularities of discourse structure and/or methods of exposition mean that certain sentence positions tend to carry more topic material than others. We defined the Optimal Position Policy (OPP) as a list that indicates in what ordinal positions in the text high-topic-bearing sentences occur. We developed a method of automatically training new OPPs, given a collection of genre-related texts with keywords. This work, described in \[Lin and Hovy 97a\], is the first systematic study and evaluation of the Position method reported. For the Ziff-Davis corpus (13,000 newspaper articles announcing computer products) we found that the title (T1) is the most likely to bear topics, followed by the first sentence of paragraph 2, the first sentence of paragraph 3, etc. In contrast, for the Wall Street Journal the OPP is \[T1 P1S1 P1S2\]. Evaluation: We evaluated the OPP method in various ways. In one of them, coverage is the fraction of the (human-supplied) keywords that are included verbatim in the sentences selected under the policy. (A random selection policy would extract sentences with a random distribution of topics; a good position policy would extract topic-rich sentences.) We measured the effectiveness of an OPP by taking cumulatively more of its sentences, first just the title, then the next position, and so on, counting single-word and multi-word keyword matches (window sizes 1 to 5). In the top ten sentence positions (R10), cumulative coverage reaches 95% over an extract of 10 sentences (approx. 15% of a typical Ziff-Davis text), an extremely encouraging result.</Paragraph>
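The cumulative coverage measurement can be sketched like this. It is a simplified illustration: the sentence indexing and keyword matching in \[Lin and Hovy 97a\] are more elaborate, and the document, policy, and keywords here are invented.

```python
# Sketch of cumulative keyword coverage under a position policy (illustrative only).
# A document is a list of sentences; position 0 here stands for the title T1.
def coverage(doc_sentences, keywords, policy, n):
    """Fraction of keywords appearing verbatim in the first n policy positions."""
    selected = " ".join(doc_sentences[p] for p in policy[:n] if p < len(doc_sentences))
    hit = sum(1 for kw in keywords if kw in selected)
    return hit / len(keywords)

doc = [
    "new disk drive announced",          # title (T1)
    "the company said sales rose",       # P1S1
    "the drive stores more data",        # P1S2
]
policy = [0, 1, 2]                       # e.g. a WSJ-style OPP [T1 P1S1 P1S2]
print(coverage(doc, ["disk", "sales"], policy, 1))  # 0.5: only "disk" is in the title
print(coverage(doc, ["disk", "sales"], policy, 2))  # 1.0
```

Plotting this value as the cutoff n grows gives exactly the kind of cumulative curve the evaluation above reports.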
    <Paragraph position="4"/>
  </Section>
  <Section position="6" start_page="19" end_page="22" type="metho">
    <SectionTitle>
OPP POSITIONS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="20" end_page="22" type="sub_section">
      <SectionTitle>
2.2 Topic Interpretation (Concept
Fusion)
</SectionTitle>
      <Paragraph position="0"> The second step in the summarization process is that of concept interpretation. In this step, a collection of extracted concepts is 'fused' into one (or more) higher-level unifying concept(s). Concept fusion can be as simple as part-whole construction, for example when wheel, chain, pedal, saddle, light, frame, and handlebars together fuse to bicycle. Generally, though, it is more complex, ranging from direct concept/word clustering as used in IR \[Paice 90\] to scriptally based inference as in scripts \[Schank and Abelson 77\]. Fusing topics into one or more characterizing concepts is the most difficult step of automated text summarization. Here, too, a variety of methods can be employed. All of them associate a set of concepts (the indicators) with a characteristic generalization (the fuser or head). The challenge is to develop methods that work reliably and to construct a large enough collection of indicator-fuser sets to achieve effective topic interpretation. Wavefront: A topic is a particular subject that we write about or discuss. To identify the topics of texts, IR researchers make the assumption that the more a word is used in a text, the more important it is in that text. 
But although word frequency counting operates robustly across different domains without relying on stereotypical text structure or semantic models, it cannot handle synonyms, pronominalization, and other forms of coreferentiality. Furthermore, word counting misses conceptual generalizations: John bought some vegetables, fruit, bread, and milk → John bought some groceries. The word counting method must be extended to recognize that vegetables, fruit, etc., relate to groceries. Recognizing this inherent problem, people started using Artificial Intelligence techniques \[Jacobs 90, Mauldin 91\] and statistical techniques \[Salton et al 94\] to incorporate semantic relations among words. Following this trend, we have developed a new way to identify topics by counting concepts instead of words, and generalizing them using a concept generalization taxonomy. As an approximation to such a hierarchy, we employ WordNet \[Miller et al 90\] (though we could have used any machine-readable thesaurus) for inter-concept relatedness links. In the limit case, when WordNet does not contain the words, this technique defaults to word counting. As described in \[Lin 95\], we locate the most appropriate generalization somewhere in the middle of the taxonomy by finding concepts on the interesting wavefront, a set of nodes representing concepts that each generalize a set of approximately equally strongly represented subconcepts (ones that have no obvious dominant subconcept to specialize to). Evaluation: We selected 26 articles about new computer products from BusinessWeek (1993-94), of average 750 words each. For each text we extracted the eight sentences containing the most interesting concepts using the wavefront technique, and compared them to the contents of a professional's abstracts of these 26 texts from an online service. We developed several weighting and scoring variations and tried various ratio and depth parameter settings for the algorithm. We also implemented a random sentence selection algorithm as a baseline comparison. The average recall (R) and precision (P) values over the three scoring variations were R=0.32 and P=0.35 when the system produces extracts of 8 sentences. In comparison, the random selection method had R=0.18 and P=0.22 in the same experimental setting. While these R and P values are not tremendous, they show that semantic knowledge--even as limited as that in WordNet--does enable improvements over traditional IR word-based techniques. However, the limitations of WordNet are serious drawbacks: there is no domain-specific knowledge, for example to relate customer, waiter, cashier, food, and menu together with restaurant. We thus developed a second technique of concept interpretation, using category signatures. We discuss this next. Can one automatically find a set of related words that can collectively be fused into a single word? To test this idea we developed the Concept Signature method \[Lin and Hovy 97b\]. We defined a signature to be a list of word indicators, each with relative strength of association, jointly associated with the signature head.</Paragraph>
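The idea of generalizing counted concepts over a taxonomy can be sketched like this. A toy hypernym tree stands in for WordNet, and the simple count propagation is only a stand-in for the wavefront selection criterion of \[Lin 95\].

```python
# Toy sketch of concept generalization over a hypernym taxonomy (not [Lin 95]'s code).
# PARENT maps each concept to its generalization; the tree here is invented.
PARENT = {
    "vegetables": "groceries", "fruit": "groceries",
    "bread": "groceries", "milk": "groceries",
    "groceries": "goods", "goods": None,
}

def propagate_counts(word_counts):
    """Sum each concept's count into all of its ancestors in the taxonomy."""
    totals = dict(word_counts)
    for concept, count in word_counts.items():
        node = PARENT.get(concept)
        while node:
            totals[node] = totals.get(node, 0) + count
            node = PARENT.get(node)
    return totals

totals = propagate_counts({"vegetables": 1, "fruit": 1, "bread": 1, "milk": 1})
print(totals["groceries"])  # 4: the generalization dominates its subconcepts
```

No single word occurs more than once, yet "groceries" accumulates the strongest count: this is the conceptual generalization that plain word counting misses in the John example above.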
      <Paragraph position="1"> To construct signatures automatically, we used a set of 30,000 texts from the Wall Street Journal (1987). The Journal editors have classified each text into one of 32 classes--AROspace, BNKing, ENVironment, TELecommunications, etc. We counted the occurrences of each content word (canonicalized morphologically to remove plurals, etc.) in the texts of a class, relative to the number of times they occur in the whole corpus (this is the standard tf.idf method). We then selected the top-scoring 300 terms for each category and created a signature with the category name as its head. The top terms of four example signatures are shown in Figure 3. It is quite easy to determine the identity of the signature head just by inspecting the top few signature indicators. SUMMARIST will use signatures for summary creation as follows. After the topic identification module(s) identify a set of words or concepts, the signature-based concept interpretation module will identify the most pertinent signatures subsuming the topic words, and the signature's head concept will then be used as the summarizing fuser concept. Matching the identified topic terms against all signature indicators involves several problems, including taking into account the relative frequencies of occurrence, resolving matches with multiple signatures, and specifying thresholds of acceptability. Evaluation: 
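Signature construction as described above can be sketched roughly as follows. This is a simplified tf.idf-style scoring over a tiny invented corpus; the actual WSJ data and the exact weighting of \[Lin and Hovy 97b\] differ.

```python
import math
from collections import Counter

# Rough sketch of building a category signature by tf.idf-style scoring
# (illustrative; the real corpus, weighting, and 300-term cutoff differ).
def build_signature(class_texts, all_texts, top_n=3):
    """Score each word by its frequency in the class, discounted by how many
    documents in the whole corpus contain it, and keep the top_n terms."""
    tf = Counter(w for text in class_texts for w in text.split())
    n_docs = len(all_texts)
    df = Counter(w for text in all_texts for w in set(text.split()))
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

banking = ["bank loan rates rose", "the bank raised loan rates"]
other = ["the weather was mild", "rates of rainfall rose"]
print(build_signature(banking, banking + other))
```

Words frequent in the class but common corpus-wide (like "rates" here) are discounted, so the surviving top terms are the class-distinctive indicators that make the signature head easy to guess.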
First, however, we had to evaluate the quality of the signatures formed by our algorithm. Recognizing the similarity of signature recognition to document categorization, we evaluated the effectiveness of each signature by seeing how well it serves as a selection criterion on new texts. As data we used a set of 2,204 previously unseen WSJ news articles from 1988. For each test text, we created a single-text 'document signature' using the same tf.idf measure as before, and then matched this document signature against the category signatures. The closest match provided the class into which the text was categorized. We tested four different matching functions, including a simple binary match (count 1 if a term match occurs, 0 otherwise), curve-fit match (minimize the difference in occurrence frequency of each term between document and concept signatures), and cosine match (minimize the cosine angle in the hyperspace formed when each signature is viewed as a vector and each word frequency specifies the distance along the dimension for that word). These matching functions all provided approximately the same results. The values for Recall and Precision (R=0.756625 and P=0.69309375) are very encouraging and compare well with recent IR results \[TREC 95\]. Extending this work will require the creation of concept signatures for hundreds, and eventually thousands, of different topics needed for robust summarization. We plan to investigate the effectiveness of a variety of methods for doing this.</Paragraph>
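The cosine matching function can be sketched like this. It is a minimal version over small term-weight vectors; the thresholds, tie handling, and 300-term signatures of the actual experiment are omitted, and the example signatures are invented.

```python
import math

# Minimal cosine match between a document signature and category signatures
# (illustrative sketch; real signatures hold 300 weighted terms each).
def cosine(sig_a, sig_b):
    """Cosine similarity of two {term: weight} signatures."""
    shared = set(sig_a) & set(sig_b)
    dot = sum(sig_a[t] * sig_b[t] for t in shared)
    norm = (math.sqrt(sum(v * v for v in sig_a.values()))
            * math.sqrt(sum(v * v for v in sig_b.values())))
    return dot / norm if norm else 0.0

def categorize(doc_sig, category_sigs):
    """Assign the document to the closest category signature."""
    return max(category_sigs, key=lambda head: cosine(doc_sig, category_sigs[head]))

categories = {
    "BNKing": {"bank": 3.0, "loan": 2.0, "rates": 1.0},
    "ENVironment": {"pollution": 3.0, "water": 2.0, "species": 1.0},
}
doc = {"bank": 2.0, "rates": 1.0, "water": 0.5}
print(categorize(doc, categories))  # BNKing
```

The binary match described above is the degenerate case where every weight is 1 and only the size of the shared term set matters.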
    </Section>
    <Section position="2" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
2.3 Summary Generation
</SectionTitle>
      <Paragraph position="0"> The final step in the summarization process is to generate the summary, consisting of the fused concepts, in English. A range of possibilities occurs here, from simple concept printing to sophisticated sentence planning and surface-form realization. Although, as mentioned in Section 1, simple extract summaries require no generation stage, eventually SUMMARIST will contain three generation modules, associated as appropriate with the various levels for various applications. 1. Topic output: Sometimes no summary is really needed; a simple list of the summarizing topics is enough. SUMMARIST will print the fuser concepts produced by stage 2 of the process, sorted by decreasing importance.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="22" end_page="22" type="metho">
    <SectionTitle>
2. Phrase concatenation
</SectionTitle>
    <Paragraph position="0"> SUMMARIST will include a rudimentary generator that composes noun phrase- and clause-sized units into simple sentences. It will extract the noun phrases and clauses from the input text, by following links from the fuser concepts through the words that support them back into the input text. 3. Full sentence planning and generation: SUMMARIST will employ the sentence planner being built at ISI (in collaboration with the HealthDoc project from the University of Waterloo) \[Hovy and Wanner 96\], together with a sentence generator such as Penman \[Penman 88, Matthiessen and Bateman 91\], FUF \[Elhadad 92\], or NitroGen \[Knight and Hatzivassiloglou 95\] to produce well-formed, fluent summaries, taking as input the fuser concepts and their most closely related concepts as identified by SUMMARIST's topic identification stage.</Paragraph>
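The phrase-concatenation idea can be sketched as follows. This is a toy illustration only: the links from fuser concepts back to supporting phrases in the input text are hand-built here rather than traced through the topic identification stage.

```python
# Toy sketch of phrase concatenation for summary generation (illustrative only;
# the concept-to-phrase links are invented, not traced from an input text).
def concatenate_phrases(fuser_concepts, support):
    """Join the noun phrases/clauses supporting each fuser concept
    into one simple sentence per concept."""
    sentences = []
    for concept in fuser_concepts:
        phrases = support.get(concept, [])
        if phrases:
            sentences.append(", ".join(phrases).capitalize() + ".")
    return " ".join(sentences)

support = {  # invented links from a fuser concept to phrases in the input
    "robbery": ["two men with guns", "stole bags of money from the bank"],
}
print(concatenate_phrases(["robbery"], support))
# Two men with guns, stole bags of money from the bank.
```

The output is stilted, which is exactly why the third module hands the fuser concepts to a full sentence planner and realizer instead.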
  </Section>
</Paper>