<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0706"> <Title>A Proposal for Task-based Evaluation of Text Summarization Systems</Title> <Section position="4" start_page="31" end_page="32" type="metho"> <SectionTitle> 3 Previous Evaluations </SectionTitle> <Paragraph position="0"> During the course of their development, most of the above systems were subject to some form of evaluation. Many of these evaluations relied on the presence of a human-generated target abstract, or the notion of a single 'best' abstract, although there is fairly uniform acceptance of the belief that any number of acceptable abstracts could effectively represent the content of a single document. Human-generated abstracts attempt to capture the central concept(s) of a document using the terminology of the document, along the lines of a generic summary. The comparisons made between the human-generated and machine-generated summaries were intended primarily for the developers' own benefit, and evaluate the technology itself rather than the utility of the technology for a given task. Other evaluations did focus on specific tasks and potential uses of automatic summaries, but only with respect to a single system and a limited document set. Many different techniques were attempted in the area of intrinsic or developer-oriented evaluations, which judge the quality of summaries. Edmundson (1969) compared sentence selection in the automatic abstracts to the target abstracts, and also performed a subjective evaluation of the content. Johnson et al. (1993) proposed matching a template of manually generated key concepts with the concepts included in the abstract, and performed one sample abstract evaluation. Paice and Jones (1993) used a set of statistics to determine if the summary effectively captured the focal concepts, the non-focal concepts, and conclusions. Using a strictly statistical measure, Kupiec et al. (1995) calculated the percentage of sentence matches and partial matches between their automatic summary and a manually generated abstract. The main problem with this type of evaluation is its reliance on the notion of a single 'correct' abstract. Since many different representations of a document can form an effective summary, this is an inappropriate measure. In extrinsic or task-oriented evaluations, the information retrieval notion of relevancy of a document to a specific topic is the common measure for summarization testing. Miike et al. (1994) analyzed key sentence coverage and also recorded timing and precision/recall statistics to make relevance decisions based on summaries for a domain-specific summarizer. Brandow et al. (1995) had news analysts compare the summaries generated using statistical and natural language processing (NLP) techniques to summaries using the initial sentences (called the "lead summaries") of the document. Brandow et al. (1995) discovered that, in general, experienced news analysts felt that the lead summaries were more acceptable than the summaries created using sophisticated NLP techniques. Mani and Bloedorn (1997) generated similar precision/recall and timing measures for an information retrieval experiment using a graph search and matching technique, and</Paragraph> <Paragraph position="2"> learned that their summaries were effective enough to support accurate retrieval.</Paragraph> </Section>
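The percentage-of-sentence-matches measure attributed above to Kupiec et al. (1995) can be made concrete with a short sketch. The fragment below is only an illustration, not the published procedure: the whitespace tokenization, the definition of a partial match, and the 0.5 threshold are assumptions introduced for the example.

```python
# Illustrative sentence-level comparison of an automatic extract against a
# single target abstract, in the spirit of the intrinsic measures above.
# Matching criteria are assumptions, not the procedure of Kupiec et al.

def token_overlap(candidate, target):
    """Fraction of the target sentence's tokens that also appear in the candidate."""
    cand = set(candidate.lower().split())
    targ = set(target.lower().split())
    return len(cand & targ) / max(len(targ), 1)

def sentence_match_rates(extract_sents, target_sents, partial_threshold=0.5):
    """Return (full_match_rate, partial_match_rate) over the target abstract."""
    full = partial = 0
    for target in target_sents:
        best = max((token_overlap(c, target) for c in extract_sents), default=0.0)
        if best >= 1.0:
            full += 1
        elif best >= partial_threshold:
            partial += 1
    n = max(len(target_sents), 1)
    return full / n, partial / n
```

Even a sketch this small presupposes a single reference abstract, which is precisely the limitation pointed out above.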
<Section position="5" start_page="32" end_page="34" type="metho"> <SectionTitle> 4 Proposed Evaluation </SectionTitle> <Paragraph position="0"> Full text summarization is a major task in TIPSTER Phase III. TIPSTER Phase I sponsored research in information extraction and information retrieval, and supported the Message Understanding Conferences (MUC) and Text REtrieval Conferences (TREC) for evaluating extraction and retrieval performance, respectively (Merchant, 1993).</Paragraph> <Paragraph position="1"> TIPSTER Phase II concentrated on defining a common architecture to facilitate integration of the two technologies. TIPSTER Phase III continues to advance research in extraction and retrieval, and adds text summarization in both the research and formal evaluation arenas (Merchant, 1996). This proposed evaluation will be a formal, large-scale, multiple-task, multiple-system evaluation independent of any single approach or methodology. As outlined in Table 1, the proposed evaluation for text summarization will be task-based, judging the utility of a summary for a particular task. It will be an evaluation for users, determining fitness for a particular purpose, rather than an evaluation strictly for developers. It is not intended to pick the best systems, but to understand some of the issues involved in building summarization systems and evaluating them. It will provide an environment in which systems will be judged independently on their applicability to a given task. We will begin with at least two tasks for the first evaluation, following the MUC and TREC examples of testing along multiple dimensions. We hope this will avoid any redirection of research efforts based on relative performance on any given task. Additional tasks will be added in subsequent years to evaluate other aspects of text summaries. These tasks will also reflect continued maturation of the technology.</Paragraph> <Section position="1" start_page="32" end_page="34" type="sub_section"> <SectionTitle> 4.1 Goals </SectionTitle> <Paragraph position="0"> Automatic text summarization systems lend themselves to many tasks. An informative summary may be used as the basis for executive decisions. An indicative summary may be used as an initial indicator of relevance prior to reviewing the full
text of a document (and possibly eliminating the need to view that full text). Summaries (used in place of full-text documents) may also be used to improve precision in information retrieval systems, since users would be searching only the content-relevant words or phrases within a document (Brandow et al., 1995). For this initial evaluation, we will concentrate on tasks that appear to offer the possibility of near-term payoff for users. We attempted to devise tasks that model the real-world activities of information analysts and consumers of large quantities of text. These tasks were designed based on interviews with users who spend a majority of their workday searching through volumes of on-line text for information relevant to their area of interest. We will begin with tasks that address the focus (generic or user-directed) of the summaries. The first task, categorization, will evaluate generic summaries, and the other, adhoc retrieval, will evaluate user-directed summaries, as described below. 4.1.1 Task 1 - Categorization While information routing systems are becoming prevalent in many work environments, there is still a role in many such places for a central review authority to scan and distribute all incoming documents based on their content, essentially performing a manual routing task. These reviewers deal both with a broad topic base and with data from multiple sources. They must browse a document quickly to determine the key concepts, and forward that document to the appropriate individual. A related task involves scanning a large set of documents that has been selected using an extremely broad indicator or concept. A user will browse through this data and categorize it according to various parameters. For example, on the World-Wide-Web (WWW), information seekers frequently enter short, broad queries that return hundreds or even thousands of documents. The user must determine which documents represent the greatest potential for providing information of interest. Integrating text summarization into each of the above scenarios, the user would be presented a generic summary in lieu of the full text, from which he or she will make a categorization decision. The evaluation task will simulate the manual routing scenario described above. The goal will be to decide quickly whether or not a document contains information about any of a limited number of topic areas. The document will be limited to a single topic. Selections from the TREC test collections of query topics and documents will be used as the data for the evaluation: we will select a minimum of five distinct topics, with approximately 200 documents per topic. At least two of the topics will be entity-based (i.e., based on the MUC categories of person, location, and organization). The topics will be related at a very broad level. The document set provided will be that returned as a result of five simple queries to a commonly used information retrieval system, which should provide an adequate mix of shorter and longer documents. The resulting documents will be randomly mixed. The TREC test collections are described in detail in Harman (1993). Only the documents will be provided to the evaluation participants. Summarization systems developed by the participants will automatically generate a generic summary of each document. There will not be any constraints on the format of the summary. All summaries submitted by the participants will be combined by the evaluation
organizers into a single group and randomly mixed. The full text of the document and the lead sentences of the document (up to the specified cutoff length) will be used as baselines. The summaries provided by the participants, the baseline lead summaries, and the full-text documents will be mixed together, resulting in N+2 versions of a single document, where N is the number of evaluation participants. This document set will be randomly divided among the assessors. Assessors for the evaluation will be professional information analysts. Each assessor will read a summary or document and categorize it into one of the five topic areas that were selected by the organizers, or 'none of the above', which can be considered a sixth category. No assessor will read more than one version (summary or full text) of a single document. The assessor's decision-making process will be timed. The assessor will then move on to the next document or summary. In addition to the TREC relevance judgments, a minimum of two additional assessors will read all of the full-text documents to establish a ground truth relevance decision for each. The assessors will be timed, and their categorization decisions will be compared to the ground truth assessments. This methodology will assure that the assessors' own categorization performance can be measured along with the performance of the summarization systems.
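The bookkeeping implied by this design can be sketched briefly. The fragment below is a hypothetical illustration of the two mechanical constraints just described: each document circulates in N+2 versions, no assessor may see more than one version of the same document, and each decision is timed. The data layout and function names are assumptions, not part of the proposal.

```python
# Hypothetical bookkeeping for the categorization task: deal out the N+2
# versions of every document (N participant summaries, the lead-sentence
# baseline, and the full text) so that no assessor receives two versions
# of the same document, and time each categorization decision.
import random
import time

def assign_versions(doc_ids, participant_ids, assessors, seed=0):
    """Map each (document, version) pair to an assessor."""
    rng = random.Random(seed)
    versions = list(participant_ids) + ["lead", "fulltext"]  # N+2 versions
    # The no-repeat constraint requires at least N+2 assessors (an assumption;
    # the proposal does not state the assessor pool size).
    assert len(assessors) >= len(versions), "need at least N+2 assessors"
    assignments = {}
    for doc in doc_ids:
        for version, assessor in zip(versions, rng.sample(assessors, len(versions))):
            assignments[(doc, version)] = assessor
    return assignments

def timed_decision(categorize, text, topics):
    """Return the chosen topic (or 'none of the above') and the seconds taken."""
    start = time.perf_counter()
    choice = categorize(text, topics + ["none of the above"])
    return choice, time.perf_counter() - start
```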
4.1.2 Task 2 - Adhoc Retrieval Both the volume of data available on-line and the prevalence of information retrieval engines have created an immediate application for implementing a text summarization filter as a back end to an information retrieval engine, whereby the user could quickly and accurately judge the relevancy of the retrieved documents. Applying text summarization to the above scenario, the user would be presented a summary based on the query (a user-directed summary), instead of the full text, from which he or she will make a relevance assessment. The second evaluation task will simulate the adhoc retrieval scenario described above. The goal will be to decide the relevancy of a retrieved document by looking only at the user-directed summary that has been generated by the system under evaluation. The TREC collection will also provide the common test data used for this task, in the same proportions as for the categorization task: five hand-selected topics and approximately 200 documents for each topic. The document set provided will be that returned as a result of five queries to a commonly used information retrieval system. In this case, both the topics and the documents will be provided to the participants. Summarization systems developed by the participants will then automatically generate a summary using the topic as the indication of user interest. The full text and a keyword-in-context (KWIC) list will be used as baselines. Assessors will work with one topic at a time. All summaries received from the participants for a given topic, along with the full text and the KWIC summaries, will be combined into a single group, randomly mixed, and divided among the assessors. Each assessor will review a topic, then read each summary or document and judge whether or not it is relevant to the topic at hand. The assessor will then move on to the next topic. No assessor will read more than one representation of a single document. In addition to the TREC relevance judgments, a minimum of two additional assessors will read all of the full-text documents to establish a ground truth relevance decision.</Paragraph> </Section>
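The keyword-in-context baseline for this task can be pictured with a minimal sketch; the window size and formatting below are assumptions, since the proposal does not specify how the KWIC list is rendered.

```python
# Minimal sketch of a keyword-in-context (KWIC) list of the kind used here as
# a low-end baseline: every occurrence of a topic/query term is shown with a
# small window of surrounding words.  Window size and punctuation handling
# are illustrative choices only.

def kwic_list(document_text, query_terms, window=5):
    tokens = document_text.split()
    query = {t.lower() for t in query_terms}
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,;:!?\"'") in query:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines
```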
<Section position="2" start_page="34" end_page="34" type="sub_section"> <SectionTitle> 4.2 Evaluation Criteria </SectionTitle> <Paragraph position="0"> Both evaluations highlight the acceptability of a summary for a given task, with the assumption that there is not a single 'correct' summary. The main purpose will be to determine if the evaluator would make the same decision if given the full text, and how much longer it would take to make that decision. The ideal outcome would be that the decision could be made with the same accuracy in shorter time, given the document summary. For each task, we will record the time required to make each decision, and the actual decision. The decision for each evaluator will then be compared to the relevance decision for the baselines. Analysis of the results will include consideration of the effects of summary length on the time taken to make the relevance decision as well as its effects on decision accuracy. Quantitative measures</Paragraph> </Section> </Section> <Section position="6" start_page="34" end_page="34" type="metho"> <SectionTitle> * Categorization/Relevance Decisions </SectionTitle> <Paragraph position="0"> Determining relevance to a given topic is an inherently subjective activity. We intend to mitigate this by using a sound statistical model to determine the appropriate number of summaries to evaluate, and by structuring the evaluation in such a way as to avoid bias from any single assessor. As previously discussed, we will establish low-end and high-end baselines and use multiple assessors to create ground truth decisions.</Paragraph> </Section> <Section position="7" start_page="34" end_page="34" type="metho"> <SectionTitle> * Time Required </SectionTitle> <Paragraph position="0"> The time required to make a relevance or categorization decision using a summary will be recorded and compared with the time required to make the same decision using the full text.</Paragraph> </Section> <Section position="8" start_page="34" end_page="35" type="metho"> <SectionTitle> * Summary Length </SectionTitle> <Paragraph position="0"> In previous studies, 20-30% of the full document length was often used as the optimal cutoff length for informative summaries, with the supposition that indicative summaries would require far less information (Brandow et al., 1995; Kupiec et al., 1995). For the initial evaluation, which will use indicative summaries only, a document cutoff length will be established at 10% of the original document length. Any summary exceeding that margin will be truncated.</Paragraph> <Section position="1" start_page="34" end_page="35" type="sub_section"> <SectionTitle> Qualitative measures * User Preference </SectionTitle> <Paragraph position="0"> Evaluators will be asked to indicate whether they prefer the full text or the summary as a basis for decision-making. In addition to this qualitative assessment, the evaluator will be encouraged to provide feedback as to why the summary was or was not acceptable for a given task. This feedback will then be made available to system developers. It could also provide a basis for subsequent evaluations.</Paragraph> </Section> </Section> <Section position="9" start_page="35" end_page="36" type="metho"> <SectionTitle> 5 Future Direction of Evaluation </SectionTitle> <Paragraph position="0"> This initial evaluation will address only a limited number of issues involving automatic text summarization technology. As we gain more experience working with these systems and integrating them into a user's work flow, the scope of the evaluations will necessarily grow and change. Some additional features and tasks potentially to be addressed in future evaluations have already been identified, including the cohesiveness of a summary, the optimal length of a summary, and multi-document summaries. Selected tasks are outlined in Table 2 and described briefly below.</Paragraph> <Section position="1" start_page="35" end_page="36" type="sub_section"> <SectionTitle> 5.1 Tasks and Measures </SectionTitle> <Paragraph position="0"> We are addressing two information retrieval types of tasks during the first evaluation; however, potential applications go beyond this limited scope. One of the frequently mentioned uses of a text summary is as a substitute for the document during the indexing process of an information retrieval system. The notion is that indexing based on summaries would result in more precise retrievals because only the key concepts and content-bearing words would have been indexed. This idea could be evaluated using standard precision and recall information retrieval measures. Summarizing across multiple documents is another extremely useful application. While single-document summaries are expected to provide improved efficiency for the end-user, much of the information reviewed from one summary to the next will be redundant. Automatically generated summaries could result in even larger efficiency gains and productivity improvements by distilling the information from multiple documents into a single summary. An evaluation of this type of summary would be much more complex, possibly comparing at a phrase-matching or key-concept level the combined factual information included in a single summary with manually identified key information in individual documents. The evaluation would verify that the relevant aspects of key facts across documents have been successfully identified and combined in the resulting summary. A third application could focus on a decision-making task based on an informative summary. An evaluation of this type of summary could include filling out a template indicating key concepts in a document, similar to the Paice and Jones (1993) and Johnson et al. (1993) evaluations, possibly augmented by a question/answer measure based on the full text and the summary.</Paragraph> </Section>
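The standard measures mentioned above for the summary-based indexing idea reduce to a few lines; a minimal sketch, assuming the retrieved and relevant document IDs for a topic are already available as sets:

```python
# Standard precision/recall over a retrieval run, as would be computed for a
# summary-based index versus a full-text index on the same topics.
# Inputs are assumed to be collections of document identifiers.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Running the same topics against both indexes and comparing the resulting precision/recall pairs would show whether indexing only the content-bearing terms of a summary preserves recall while improving precision.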
<Section position="2" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 5.2 Data </SectionTitle> <Paragraph position="0"> Newspaper articles, such as those which will be used for the first evaluation, represent only a small portion of the type of information available on-line. A useful, effective summarizer should be able to accept text in a variety of formats. With each subsequent evaluation, new sources of data will be added. These new sources could be news feeds or web pages. They will tend to be less formatted, vary greatly in length, and cover multiple topics.</Paragraph> <Paragraph position="1"> At some point, we hope to introduce documents in languages other than English, for summarization either into their native language or into English.</Paragraph> </Section> </Section> <Section position="10" start_page="36" end_page="36" type="metho"> <SectionTitle> 6 Acknowledgments </SectionTitle> <Paragraph position="0"> The author is grateful to Donna Harman and Beth Sundheim for their support and assistance in designing the evaluation. The views expressed in this paper are those of the author and do not necessarily reflect the views of the Department of Defense or any of its agencies.</Paragraph> </Section> </Paper>