<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0702">
<Title>Challenges in Evaluating Summaries of Short Stories</Title>
<Section position="3" start_page="0" end_page="8" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> In recent years the automatic text summarization community has increased its focus on reliable evaluation. The widely used evaluation methods based on sentence overlap with reference summaries have been called into question (Mani, 2001), as they provide only a rough approximation of the semantic similarity between summaries. A number of deeper, more semantically motivated approaches have been proposed, such as the factoid method (van Halteren and Teufel, 2003) and the pyramid method (Nenkova and Passonneau, 2004). These methods measure similarity between reference and generated summaries more reliably but, unfortunately, have the disadvantage of being very labour-intensive. This paper describes experiments in evaluating automatically produced summaries of literary short stories. It presents an approach that evaluates summaries from two different perspectives: comparing computer-made summaries to those produced by humans on the basis of sentence overlap, and measuring the usefulness and informativeness of the summaries in their own right - a step critical when creating and evaluating summaries of a relatively unexplored genre. The paper also points out several challenges specific to evaluating summaries of fiction, such as the questionable suitability of traditional metrics (those based on sentence overlap), the lack of clearly defined criteria for judging the &quot;goodness&quot; of a summary, and a higher degree of redundancy in such texts.</Paragraph>
<Paragraph position="1"> We achieve these goals by performing a two-step evaluation of our summaries.</Paragraph>
<Paragraph position="2"> Initially, for each story in the test set we compare sentence overlap between the summaries generated by the system and those produced by three human subjects.</Paragraph>
<Paragraph position="3"> These experiments reveal that inter-rater agreement measures tend to be pessimistic where fiction is concerned. This seems due to a higher degree of redundancy and paraphrasing in such texts. The second stage of the evaluation process seeks to measure the usefulness of the summaries in a more tangible way. To this end, three subjects answered a number of questions, first after</Paragraph>
</Section>
</Paper>
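
The first evaluation step described above compares a system summary to several human summaries by sentence overlap. The sketch below (not the authors' code; the normalization, data, and function names are illustrative assumptions) shows one simple way such an overlap score could be computed and averaged over three human subjects. Real experiments of this kind would typically match sentences by their position in the source story rather than by string comparison.

# Minimal sketch, assuming extractive summaries represented as lists of sentences.
# All names and example data here are hypothetical, for illustration only.

def normalize(sentence: str) -> str:
    """Lowercase and drop punctuation so that trivially different strings still match."""
    return "".join(ch for ch in sentence.lower() if ch.isalnum() or ch.isspace()).strip()

def sentence_overlap(system_summary: list[str], reference_summary: list[str]) -> float:
    """Fraction of system sentences that also appear in the reference summary."""
    if not system_summary:
        return 0.0
    reference = {normalize(s) for s in reference_summary}
    hits = sum(1 for s in system_summary if normalize(s) in reference)
    return hits / len(system_summary)

if __name__ == "__main__":
    system = ["The old man walked to the shore.", "He found a bottle with a note."]
    # Summaries produced by three (hypothetical) human subjects for the same story.
    human_refs = [
        ["The old man walked to the shore.", "A storm was coming."],
        ["He found a bottle with a note.", "The note changed his life."],
        ["The old man walked to the shore.", "He found a bottle with a note."],
    ]
    scores = [sentence_overlap(system, ref) for ref in human_refs]
    print("per-reference overlap:", scores)
    print("mean overlap:", sum(scores) / len(scores))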