<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0702">
  <Title>Challenges in Evaluating Summaries of Short Stories</Title>
  <Section position="4" start_page="8" end_page="8" type="metho">
    <SectionTitle>
A MATTER OF MEAN ELEVATION. By O. Henry (1862-1910).
</SectionTitle>
    <Paragraph position="0"> On the camino real along the beach the two saddle mules and the four pack mules of Don Señor Johnny Armstrong stood, patiently awaiting the crack of the whip of the arriero, Luis. These articles Don Johnny traded to the interior Indians for the gold dust that they washed from the Andean streams and stored in quills and bags against his coming.</Paragraph>
    <Paragraph position="1"> It was a profitable business, and Señor Armstrong expected soon to be able to purchase the coffee plantation that he coveted. Armstrong stood on the narrow sidewalk, exchanging garbled Spanish with old Peralto, the rich native merchant who had just charged him four prices for half a gross of pot-metal hatchets, and abridged English with Rucker, the little German who was Consul for the United States. [...] Armstrong waved a good-bye and took his place at the tail of the procession. Armstrong concurred, and they turned again upward toward Tacuzama. [...] Peering cautiously inside, he saw, within three feet of him, a woman of marvelous, imposing beauty, clothed in a splendid loose robe of leopard skins. The hut was packed close to the small space in which she stood with the squatting figures of Indians. [...] I am an American. If you need assistance tell me how I can render it. [...] The woman was worthy of his boldness. Only by a sudden flush of her pale cheek did she acknowledge understanding of his words. [...] &amp;quot; I am held a prisoner by these Indians. God knows I need help. [...] Look, Mr. Armstrong, there is the sea! [End of Figure 1 example.] Each subject answered a set of questions twice: after reading only the summary and then after reading the complete story. The set included both factual questions (e.g. can you tell where this story takes place?) and subjective questions (e.g. how readable did you find this summary?).</Paragraph>
    <Paragraph position="2"> Finally, we compare the two types of results with a surprising discovery: overlap-based measures and human judgment do not correlate well in our case.</Paragraph>
    <Paragraph position="3"> This paper is organized in the following manner. Section 2 briefly describes our summarizer of short stories. Section 3.1 discusses experiments comparing generated summaries to reference ones based on sentence overlap. The experiments involving human judgment of the summaries are presented in Section 3.2 and the two types of experiments are compared in Section 3.3.</Paragraph>
    <Paragraph position="4"> Section 4 draws conclusions and outlines possible directions for future work.</Paragraph>
  </Section>
  <Section position="5" start_page="8" end_page="9" type="metho">
    <SectionTitle>
2 Background: System Description
</SectionTitle>
    <Paragraph position="0"> A detailed description of our summarizer of short stories is outside the scope of this paper. For completeness, this section gives an overview of the system's inner workings.</Paragraph>
    <Paragraph position="1"> The interested reader is referred to our previous work (Kazantseva 2006) for more information.</Paragraph>
    <Paragraph position="2"> The system is designed to create a particular type of indicative generic summary - namely, summaries that help readers decide whether they would like to read a given story. Because of this, a summary, as defined here, is not meant to summarize the plot of a story. It is intended to raise adequate expectations and to enable a reader to make informed decisions based on the summary alone. We achieve this goal by identifying the salient portions of the original texts that lay out the setting of a story, namely its location and main characters. The present prototype of our system creates summaries by extracting sentences from original documents. An example summary produced by the system appears in Figure 1.</Paragraph>
    <Paragraph position="3"> The system works in two stages. First it attempts to identify important entities in stories (locations and characters). Next, sentences that are descriptive and set out the background of a story are separated from those that relate events of the plot. Finally, the system selects summary-worthy sentences in a way that favours descriptive ones that focus on important entities and occur early in the text.</Paragraph>
    <Paragraph position="4"> The identification of important entities is achieved by processing the stories using a gazetteer. Pronominal and noun phrase anaphora are very common in fiction, so we resolve anaphoric expressions of these two types. The anaphora resolution module is restricted to resolving singular anaphoric expressions that denote animate entities (people and, sometimes, animals). The main characters are then identified using normalized frequency counts.</Paragraph>
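To make the character-identification step concrete, here is a minimal Python sketch of ranking characters by normalized mention frequency once anaphora have been resolved. The function name, the normalization by sentence count and the cut-off value are illustrative assumptions, not the system's actual parameters.

```python
from collections import Counter

def rank_characters(resolved_mentions, num_sentences, threshold=0.05):
    """Rank candidate characters by mention counts normalized by story length.

    resolved_mentions: one character identifier per mention, after pronominal
    and noun-phrase anaphora have been resolved.
    The normalization by sentence count and the 0.05 cut-off are illustrative
    assumptions, not the paper's actual parameters.
    """
    counts = Counter(resolved_mentions)
    scores = {name: n / num_sentences for name, n in counts.items()}
    main = [name for name, s in sorted(scores.items(), key=lambda kv: -kv[1])
            if s >= threshold]
    return scores, main

# Toy usage: Armstrong is mentioned far more often than anyone else.
mentions = ["Armstrong"] * 40 + ["Peralto"] * 6 + ["Luis"] * 3
scores, main_characters = rank_characters(mentions, num_sentences=244)
print(main_characters)   # ['Armstrong']
```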
    <Paragraph position="5"> The next stage of the process attempts to identify sentences that set out the background in each story. The stories are parsed using the Conexor Machinese Syntax Parser (Tapanainen and Jarvinen 1997) and sentences are split into clauses.</Paragraph>
    <Paragraph position="6"> Each clause is represented as a vector of features that approximate its aspectual type. The features are designed to help identify state clauses (John was a tall man) and serial situations (John always drops things) (Huddleston and Pullum 2002, pp. 123-124).</Paragraph>
    <Paragraph position="7"> Four groups of features represent each clause: character-related, location-related, aspect-related and others. Character-related features capture such information as the presence of a mention of one of the main characters in a clause, its syntactic function, how early in the text this mention occurs, etc. Location-related features state whether a clause contains a location name and whether this name is embedded in a prepositional phrase. Aspect-related features reflect a number of properties of a clause that influence its aspectual type. They include the main verb's lexical aspect, the tense, the presence and the type of temporal expressions, voice, and the presence of modal verbs.</Paragraph>
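A hypothetical sketch of these four feature groups as a clause-level feature vector follows; the field names and value sets are our own illustrative guesses, not the exact schema used by the system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClauseFeatures:
    """A clause as a feature vector. Field names are illustrative guesses at the
    four feature groups described above, not the paper's exact schema."""
    # Character-related features
    mentions_main_character: bool
    character_syntactic_role: Optional[str]  # e.g. "subject" or "object"
    first_mention_sentence: Optional[int]    # how early the character first appears
    # Location-related features
    contains_location_name: bool
    location_in_prepositional_phrase: bool
    # Aspect-related features
    lexical_aspect: str                      # e.g. "state", "activity", "achievement"
    tense: str                               # e.g. "past", "present"
    has_temporal_expression: bool
    voice: str                               # "active" or "passive"
    has_modal_verb: bool

# A stative clause about the main character ("John was a tall man").
example = ClauseFeatures(
    mentions_main_character=True, character_syntactic_role="subject",
    first_mention_sentence=3, contains_location_name=False,
    location_in_prepositional_phrase=False, lexical_aspect="state",
    tense="past", has_temporal_expression=False, voice="active",
    has_modal_verb=False)
```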
    <Paragraph position="8"> In our experiments we create two separate representations for each clause: fine-grained and coarse-grained. Both contain features from all four feature groups. The difference between them is only in the number of features and in the cardinality of the set of possible values.</Paragraph>
    <Paragraph position="9"> Two different procedures perform the actual selection. The first performs decision-tree induction using C5.0 (Quinlan 1992) to select the most likely candidate sentences. The training data for this process consists of short stories annotated at the clause level by the first author of this paper. The second procedure applies a set of manually created rules to select summary-worthy sentences.</Paragraph>
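The manually created rules themselves are not listed in this section; the hypothetical rule below merely illustrates the kind of test such a procedure might apply, using the illustrative feature names sketched above (the thresholds are assumptions).

```python
def is_summary_worthy(clause, position_ratio):
    """A hypothetical hand-written rule in the spirit of the second selection
    procedure: keep descriptive clauses about the setting that occur early.

    clause: a dict of clause features (cf. the feature groups above);
    position_ratio: relative position of the clause in the story (0.0 = start).
    Feature names and thresholds are illustrative assumptions.
    """
    descriptive = clause["lexical_aspect"] == "state" and not clause["has_modal_verb"]
    about_setting = clause["mentions_main_character"] or clause["contains_location_name"]
    return descriptive and about_setting and position_ratio < 0.2

# A stative clause about the main character near the beginning is selected.
clause = {"lexical_aspect": "state", "has_modal_verb": False,
          "mentions_main_character": True, "contains_location_name": False}
print(is_summary_worthy(clause, position_ratio=0.05))  # True
```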
    <Paragraph position="10"> The corpus for the experiments contains 47 short stories from Project Gutenberg (http://www.gutenberg.org) divided into a training set (27 stories) and a test set (20 stories). These are classical works written in English or translated into English by authors including O. Henry, Jerome K. Jerome, Katherine Mansfield and Anton Chekhov.</Paragraph>
    <Paragraph position="11"> They have on average 3,333 tokens and 244 sentences (4.5 letter-sized pages). The target compression rate was set at 6%, counted in sentences. This rate was selected because it corresponded to the compression rate achieved by the first author when creating the initial training and test data.</Paragraph>
  </Section>
  <Section position="6" start_page="9" end_page="14" type="metho">
    <SectionTitle>
3 Evaluation: Experimental Setup
</SectionTitle>
    <Paragraph position="0"> We designed our evaluation procedure to yield easily interpreted, meaningful results while keeping the amount of labour reasonable.</Paragraph>
    <Paragraph position="1"> We worked with six subjects (not the authors of this paper) who performed two separate tasks.</Paragraph>
    <Paragraph position="2"> In Task 1, each subject was asked to read a story and create its summary by selecting 6% of its sentences. It was explained to the subjects that their summaries were meant to raise expectations about the story, not to reveal what happens in it.</Paragraph>
    <Paragraph position="3"> In Task 2 the subjects made a number of judgments about the summaries before and after reading the original stories. The subjects read a summary similar to the one shown in Figure 1. Next, they were asked six questions, three of which were factual in nature and three others were subjective. The subjects had to answer these questions using the summary as the only source of information. Subsequently, they read the original story and answered almost the same questions (see Section 4). This process allowed us to understand how informative the summaries were by themselves, without access to the originals, and also whether they were misleading or incomplete.</Paragraph>
    <Paragraph position="4"> The experiments were performed on a test set of 20 stories and involved six participants divided into two groups of three people. Group 1 performed Task 1 on stories 1-10 of the test set and Group 2 performed it on stories 11-20. During Task 2, Group 1 worked on stories 11-20 and Group 2 on stories 1-10.</Paragraph>
    <Paragraph position="5"> By adjusting a number of system parameters, we produced four different summaries per story. All four versions were compared with human-made summaries using sentence overlap-based measures.</Paragraph>
    <Paragraph position="6"> However, because the experiments are rather time-consuming, it was not possible to evaluate more than one set of summaries using human judgments (Task 2). That is why only summaries generated using the coarse-grained dataset and manually composed rules were evaluated in Task 2.</Paragraph>
    <Paragraph position="7"> We selected this version because the differences between this set of summaries and gold-standard summaries are easiest to interpret. That is to say, decisions based on a set of rules employing a smaller number of parameters are easier to track than those taken using machine learning or more elaborate rules.</Paragraph>
    <Paragraph position="8"> On average, the subjects reported that completing both tasks required between 15 and 35 hours of work. Four out of six subjects were native speakers of English.</Paragraph>
    <Paragraph position="9"> The other two had near-native and very good command of English, respectively. The participants were given the data in the form of files and had four weeks to complete the tasks.</Paragraph>
    <Section position="1" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
3.1 Creating Gold-Standard Summaries: Task 1
</SectionTitle>
      <Paragraph position="0"> During this task each participant had to create extract-based summaries for 10 different stories. The criteria (making a summary indicative rather than informative) were explained, and one example of an annotated story was shown. The instructions for these experiments are available at &lt;http://www.site.uottawa.ca/~ankazant/instructions.zip&gt;.</Paragraph>
      <Paragraph position="1"> Table 1 presents several measures of agreement between judges within each group and with the first author of this paper (included in the agreement figures because this person created the initial training data and test data for the preliminary experiments).</Paragraph>
      <Paragraph position="2"> The measurement names are displayed in the first column of Table 1. Cohen denotes Cohen's kappa (Cohen 1960). PABAK denotes Prevalence and Bias Adjusted Kappa (Bland and Altman 1986). ICC denotes the Intra-class Correlation Coefficient (Shrout and Fleiss 1979). The numbers 3 and 4 state whether the statistic is computed only for the 3 subjects participating in the evaluation or for 4 subjects (including the first author of the paper).</Paragraph>
      <Paragraph position="1"> As can be seen in Table 1, the agreement statistics are computed for each group separately. This is because the sets of stories that they annotated are disjoint. The column Average provides an average of these figures to give a better overall idea.</Paragraph>
      <Paragraph position="2"> Cohen's kappa in its original form can only be computed for a pair of raters. For this reason we computed it for each possible pairwise combination of raters within a group and then the numbers were averaged.</Paragraph>
      <Paragraph position="3"> The PABAK statistic was computed in the same manner, using Cohen's kappa as its basis. ICC is a statistic that measures inter-rater agreement and can be computed for more than 2 judges. It was computed for all 3 or 4 raters at the same time. ICC was computed for a two-way mixed model and measures the average reliability of the ratings taken together. The numbers in parentheses are confidence intervals at the 99% confidence level.</Paragraph>
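The following Python sketch shows how pairwise Cohen's kappa and PABAK figures of this kind can be computed over binary sentence labels and averaged across rater pairs. The toy ratings are invented, and the ICC, which requires a mixed-model formulation, is omitted here.

```python
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's kappa for two binary ratings (1 = summary-worthy, 0 = not)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    p_a, p_b = sum(a) / n, sum(b) / n                    # per-rater positive rates
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)              # chance agreement
    return (p_o - p_e) / (1 - p_e)

def pabak(a, b):
    """Prevalence- and bias-adjusted kappa: 2 * observed agreement - 1."""
    p_o = sum(x == y for x, y in zip(a, b)) / len(a)
    return 2 * p_o - 1

def average_pairwise(metric, ratings):
    """Average a pairwise agreement metric over all pairs of raters."""
    pairs = list(combinations(ratings, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)

# Toy example: three raters labelling ten sentences of one story.
raters = [
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
]
print(average_pairwise(cohen_kappa, raters))
print(average_pairwise(pabak, raters))
```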
      <Paragraph position="4"> We compute three different agreement measures because each of these statistics has its own weaknesses and distorts the results in a different manner. Cohen's kappa is known to be a pessimistic measurement in the presence of a severe class imbalance, as is the case in our setting (Sim and Wright 2005). PABAK is a measure that takes class imbalance into account, but it is too optimistic because it artificially removes the class imbalance present in the original setting. ICC has weaknesses similar to Cohen's kappa (sensitivity to class imbalance). Besides, it assumes that the sample of targets to be rated (sentences in our case) is a random sample of targets drawn from a larger population. This is not necessarily the case, as the corpus was not compiled randomly.</Paragraph>
    </Section>
    <Section position="3" start_page="11" end_page="12" type="sub_section">
      <SectionTitle>
Annotator A.
</SectionTitle>
      <Paragraph position="0"> The Rev. Augustus Cracklethorpe would be quitting Wychwood-on-the-Heath the following Monday, never to set foot [...] in the neighbourhood again. The Rev. Augustus Cracklethorpe, M.A., might possibly have been of service to his Church in, say, [...] some mission station far advanced amid the hordes of heathendom. In picturesque little Wychwood-on-the-Heath [...] these qualities made only for scandal and disunion. Churchgoers who had not visited St. Jude's for months had promised themselves the luxury of feeling they were listening to the Rev.</Paragraph>
      <Paragraph position="1"> Augustus Cracklethorpe for the last time. The Rev. Augustus Cracklethorpe had prepared a sermon that for plain speaking and directness was likely to leave an impression.</Paragraph>
      <Paragraph position="2"> Annotator B.</Paragraph>
      <Paragraph position="3"> The Rev. Augustus Cracklethorpe would be quitting Wychwood-on-the-Heath the following Monday, never to set foot [...] in the neighbourhood again. The Rev. Augustus Cracklethorpe, M.A., might possibly have been of service to his Church in, say, [...] some mission station far advanced amid the hordes of heathendom. What marred the entire business was the impulsiveness of little Mrs. Pennycoop. Mr. Pennycoop, carried away by his wife's eloquence, added a few halting words of his own. Other ladies felt it their duty to show to Mrs. Pennycoop that she was not the only Christian in Wychwood-on-the-Heath.</Paragraph>
      <Paragraph position="4"> Annotator C.</Paragraph>
      <Paragraph position="5"> The Rev. Augustus Cracklethorpe would be quitting Wychwood-on-the-Heath the following Monday, never to set foot [...] in the neighbourhood again. The Rev. Augustus Cracklethorpe, M.A., might possibly have been of service to his Church in, say, [...] some mission station far advanced amid the hordes of heathendom. For the past two years the Rev. Cracklethorpe's parishioners [...] had sought to impress upon him, [...] their cordial and daily-increasing dislike of him, both as a parson and a man. The Rev. Augustus Cracklethorpe had prepared a sermon that for plain speaking and directness was likely to leave an impression. The parishioners of St. Jude's, Wychwood-on-the-Heath, had their failings, as we all have. The Rev. Augustus flattered himself that he had not missed out a single one, and was looking forward with pleasurable anticipation to the sensation that his remarks, from his &amp;quot;firstly&amp;quot; to his &amp;quot;sixthly and lastly,&amp;quot; were likely to create.</Paragraph>
      <Paragraph position="7"> We hope that these three measures, although insufficient individually, provide an adequate understanding of inter-rater agreement in our evaluation. We note that the average overlap (intersection) between judges in each group is 1.8% out of 6% of summary-worthy sentences.</Paragraph>
      <Paragraph position="8"> All of these agreement measures and, in fact, all measures based on computing sentence overlap are inherently incomplete where fiction is concerned, because any two different sentences are not necessarily &amp;quot;equally different&amp;quot;. The matter is exemplified in Figure 2. It displays segments of summaries produced for the same story by three different annotators.</Paragraph>
      <Paragraph position="9"> Computing Cohen's kappa between these fragments gives agreement of 0.521 between annotators A and B and 0.470 between annotators A and C. However, a closer look at these fragments reveals that there are more differences between summaries A and B than between summaries A and C. This is because many of the sentences in summaries A and C describe the same information (personal qualities of Rev. Cracklethorpe) even though they do not overlap. On the other hand, sentences from summaries A and B are not only distinct; they &amp;quot;talk&amp;quot; about different facts. This problem is not unique to fiction, but in this context it is more acute because literary texts exhibit more redundancy.</Paragraph>
      <Paragraph position="10"> Tables 2-4 show the results of comparing four different versions of computer-made summaries against gold-standard summaries produced by humans. The tables also display the results of two baseline algorithms. The LEAD baseline refers to the version of summaries produced by selecting the first 6% of sentences in each story. The LEAD CHAR baseline is obtained by selecting the first 6% of sentences that mention an important character. The improvements over the baselines are significant at the 99% confidence level in all cases.</Paragraph>
      <Paragraph position="11"> By combining summaries created by human annotators in different ways we create three distinct gold-standard summaries.</Paragraph>
      <Paragraph position="12"> The majority gold-standard summary contains all sentences that were selected by at least two judges. It is the most commonly accepted way of creating gold-standard summaries and it is best suited to give an overall picture of how similar computer-made summaries are to man-made ones.</Paragraph>
      <Paragraph position="13"> The union gold standard is obtained by considering all sentences that were judged summary-worthy by at least one judge.</Paragraph>
      <Paragraph position="14"> Union summaries provide a more relaxed measurement. Precision for the union gold standard gives one an idea of how many irrelevant sentences a given summary contains (sentences not selected by any of the three judges are more likely to prove irrelevant).</Paragraph>
      <Paragraph position="15"> The intersection summaries are obtained by combining the sentences that all three judges deemed important. The intersection gold standard is the strictest way to measure the goodness of a summary. Recall for the intersection gold standard tells one how many of the most important sentences were included in summaries by the system (sentences selected by all three judges are likely to be the most important ones).</Paragraph>
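A minimal sketch of how the three gold standards and the overlap-based precision and recall can be derived from per-judge sentence selections follows; the sentence indices are invented for illustration and this is not the authors' code.

```python
from collections import Counter

def gold_standards(judge_selections):
    """Build majority, union and intersection gold standards from per-judge sets
    of selected sentence indices (three judges per story, as in the paper)."""
    counts = Counter(i for judge in judge_selections for i in judge)
    union = {i for i, c in counts.items() if c >= 1}
    majority = {i for i, c in counts.items() if c >= 2}
    intersection = {i for i, c in counts.items() if c == len(judge_selections)}
    return majority, union, intersection

def precision_recall(system, gold):
    """Sentence-overlap precision and recall of a system summary against a gold set."""
    tp = len(system & gold)
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Toy usage with sentence indices selected by three judges and by the system.
judges = [{1, 5, 12}, {1, 5, 20}, {1, 12, 20}]
system = {1, 5, 33}
majority, union, intersection = gold_standards(judges)
print(precision_recall(system, majority))      # overall similarity to human choices
print(precision_recall(system, union))         # precision: how many irrelevant sentences
print(precision_recall(system, intersection))  # recall: coverage of the most important ones
```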
      <Paragraph position="16"> It should be noted, however, that the numbers in Tables 2-4 do not give a complete picture of the quality of the summaries, for the same reason that the agreement measures do not fully reveal the extent of inter-judge agreement: sentences that are not part of the reference summaries are not necessarily equally unsuitable for inclusion in the summary.</Paragraph>
    </Section>
    <Section position="4" start_page="12" end_page="13" type="sub_section">
      <SectionTitle>
3.2 Human Judgment of Computer-Made Summaries: Task 2
</SectionTitle>
      <Paragraph position="0"> In order to evaluate one summary in Task 2, a participant had to read it and answer six questions using the summary as the only source of information. The participant was then required to read the original story and to answer another six questions. The questions asked before and after reading the original were the same with one exception: question Q4 was replaced by Q11 (see Table 6). The subjects were asked not to correct their answers after the fact.</Paragraph>
      <Paragraph position="1"> [Fragment of Table 6: How helpful was this summary for deciding whether you would like to read the story or not? 4.52 1.37 4.6 1.21] Three of the questions were factual and the three others subjective. Table 5 displays the factual questions along with the resulting answers. The participants had to answer questions Q1 and Q2 in their own words, and question Q3 was a multiple-choice question where a participant selected the century when the story took place. Q1 and Q2 were ranked on a scale from -1 to 3: a score of 3 means that the answer was complete and correct, 2 - slightly incomplete, 1 - very incomplete, 0 - the subject could not find the answer in the text, and -1 - the answer was incorrect. Q3 was ranked on a binary scale (0 or 1).</Paragraph>
      <Paragraph position="2"> Questions Q3-Q7 asked the participants to pronounce a subjective judgment on a summary. These were multiple-choice questions where a participant needed to select a score from 1 to 6, with 1 indicating a strong negative property and 6 indicating a strong positive property. The questions and results appear in Table 6.</Paragraph>
      <Paragraph position="3"> The results displayed in Tables 5 and 6 suggest that the subjects can answer simple questions based on the summaries alone.</Paragraph>
      <Paragraph position="4"> They also seem to indicate that the subjects found the summaries quite helpful. It is interesting to note that even after reading complete stories the subjects are not always capable of answering the factual questions with perfect precision.</Paragraph>
    </Section>
    <Section position="5" start_page="13" end_page="14" type="sub_section">
      <SectionTitle>
3.3 Putting Sentence Overlap and
Human Judgment Together
</SectionTitle>
      <Paragraph position="0"> In order to check whether the two types of statistics measure the same or different qualities of the summaries, we explored whether the two are correlated.</Paragraph>
      <Paragraph position="1"> Table 7 displays the values of the Spearman rank correlation coefficient between median values of answers to questions from Task 2 and measurements obtained by comparing computer-made summaries against the majority gold-standard summaries. All questions, except Q10 (relevance) and Q11 (completeness), are those asked and answered using the summary as the only source of information. Sentence overlap values (F-score, precision and recall) were discretized (banded) in order to be used in this test. These results are based on the values obtained for 20 stories in the test set - a relatively small sample - which prohibits drawing definite conclusions. However, in most cases the correlation coefficient between human opinions and sentence overlap measurements is below the cut-off value for 99% confidence, which is 0.57 (the exceptions are highlighted). This suggests that in our case the measurements using sentence overlap as their basis are not correlated with the opinions of subjects about the summaries.</Paragraph>
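As an illustration of this test, the sketch below computes a Spearman rank correlation with SciPy on invented data for 20 stories; it is not the paper's actual data or code.

```python
from scipy.stats import spearmanr

# Median answers to one subjective question (1-6 scale) and banded F-scores for
# 20 test stories; the values below are invented for illustration only.
median_answers = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 5, 4, 3, 4, 5, 4, 4, 3, 5, 4]
banded_f_scores = [2, 1, 3, 2, 1, 3, 2, 2, 3, 1, 2, 3, 1, 2, 2, 3, 1, 2, 3, 2]

rho, p_value = spearmanr(median_answers, banded_f_scores)
# With n = 20, |rho| must exceed roughly 0.57 to be significant at the 99% level.
print(rho, p_value)
```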
      <Paragraph position="2"> We also performed a one-way ANOVA test using human judgments as independent factors and sentence-overlap-based measures as dependent variables. The results are in line with those obtained using the Spearman coefficient. They are shown in Table 8. The F-values which are statistically significant at the 99% confidence level are highlighted (the cut-off value for questions Q4-Q12 is 4.89, for Q1 and Q2 - 6.1, and for Q3 - 8.29).</Paragraph>
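A corresponding one-way ANOVA can be run as sketched below, again on invented data, with summaries grouped by the rating they received on one question and the overlap F-score as the dependent variable.

```python
from scipy.stats import f_oneway

# Sentence-overlap F-scores grouped by the discrete rating a summary received
# on one question; group sizes and values are invented for illustration only.
f_low = [0.18, 0.22, 0.25, 0.20]
f_mid = [0.24, 0.27, 0.21, 0.26, 0.23]
f_high = [0.22, 0.28, 0.25, 0.30, 0.26, 0.24]

f_value, p_value = f_oneway(f_low, f_mid, f_high)
print(f_value, p_value)   # compare f_value against the appropriate critical value
```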
    </Section>
  </Section>
</Paper>