<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1012"> <Title>Automatic Evaluation of Summaries Using Document Graphs</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"> The ranking orders obtained based on sentence precisions and recalls are shown in Tables 1 and 2. The results indicate that for sentence precision and recall, the ranking order for different summarization systems is not affected by the summarization compression ratio. The ranking results for 200-word extracts and 400-word extracts are exactly the same.</Paragraph> <Paragraph position="1"> Since the comparison is between the machine generated extracts and the human created model extracts, we believe that the rankings should represent the performance of 10 different automated summarization systems, to some degree. The experiments using DGs instead of sentence matc hing give two very similar ranking orders (Spearman rank correlation coefficient [Myers and Well, 1995] is 0.988) where only systems 24 and 19 are reversed in their ranks (Tables 1 and 2). The results show that when the evaluation is based on the comparison between machine generated extracts and the model extracts, our DG-based evaluation approach will provide roughly the same ranking results as the sentence precision and recall approach. Notice that the F-factors obtained by experiments using DGs are higher than those calculated based on sentence matching. This is because our DG-based evaluation approach compares the two extracts at a more fine grained level than sentence matching does since we compare the similarity at the level of concepts/entities and their relations, not just whole sentences. The similarity of the two extracts should actually be higher than the score obtained with sentence matching because there are sentences that are equivalent in meaning but not syntactically identical.</Paragraph> <Paragraph position="2"> Since we believe that the DGs captures the semantic information content contained in the respective documents, we rank the automatic summarization systems by comparing the DGs of their extract outputs against the DGs of the orig inal documents. This approach does not need the model summaries, and hence no human involvement is needed in the evaluation. The results are shown in Tables 3 and 4. As we can see, our rankings are different from the ranking results based on comparison against the model extracts. System 28 has the largest change in rank in both 200-word and 400-word summaries. It was ranked as the worst by our DG based approach instead of number 7 (10 is the best) by the approaches comparing to the model extracts. We investigated the extract content of system 28 and found that many extracts generated summaries. Ranking results for 400 words extracts generated by system 28 included sentences that contain little information, e.g., author's names, publishers, date of public ation, etc. 
<Paragraph position="2"> The following are sample extracts produced for document 120 by systems 28 and 29 (the best ranked) and by a human judge, at 200 words.</Paragraph> <Paragraph position="3"> [Extract for Document 120 by System 28] John Major, endorsed by Margaret Thatcher as the politician closest to her heart, was elected by the Conservative Party Tuesday night to succeed her as prime minister.</Paragraph> <Paragraph position="5"> Aides said Thatcher is &quot;thrilled&quot;.</Paragraph> <Paragraph position="6"> Hurd also quickly conceded.</Paragraph> <Paragraph position="7"> ONE year ago tomorrow, Mr John Major surprised everyone but himself by winning the general election.</Paragraph> <Paragraph position="8"> It has even been suggested that the recording of the prime minister's conversation with Michael Brunson, ITN's political editor, in which Major used a variety of four-, six- and eight-letter words to communicate his lack of fondness for certain colleagues, may do him good.</Paragraph> <Paragraph position="9"> favourite to replace Mr Major, if he is forced out. The Labour Party controls 90 local councils, whereas the Conservatives only control 13, with a sharp contrast in strength between the two sides. If he did not see the similarity, that is still more revealing.</Paragraph> <Paragraph position="10"> [Extract for Document 120 by a human judge - model extract] John Major, endorsed by Margaret Thatcher as the politician closest to her heart, was elected by the Conservative Party Tuesday night to succeed her as prime minister.</Paragraph> <Paragraph position="11"> While adopting a gentler tone on the contentious issue of Britain's involvement in Europe, he shares her opposition to a single European currency and shares her belief in tight restraint on government spending.</Paragraph> <Paragraph position="12"> FT 08 APR 93 / John Major's Year: Major's blue period - A year on from success at the polls, the prime minister's popularity has plunged.</Paragraph> <Paragraph position="13"> The past 12 months have been hijacked by internal party differences over Europe, by the debacle surrounding UK withdrawal from the exchange rate mechanism of the European Monetary System, and by a continuing, deep recession which has disappointed and alienated many traditional Tory supporters in business.</Paragraph> <Paragraph position="14"> Its Leader&quot;] [Text] In local government elections across Britain yesterday, the Conservatives suffered their worst defeat ever, losing control of 17 regional councils and 444 seats.</Paragraph> <Paragraph position="15"> Even before all of the results were known, some Tories openly announced their determination to challenge John Major's position and remove him from office as early as possible.</Paragraph> <Paragraph position="16"> The extract generated by system 28 has 8 sentences, of which only one contains relevant information. When comparing by sentence precision and recall, all three extracts share only one matching sentence, the first one. If we calculate the F-factors based on the model extract shown above, system 28 has a score of 0.143 and system 29 has a lower score of 0.118. Yet after reading all three extracts, it is clear that the extract generated by system 29 contains much more relevant information than that generated by system 28.</Paragraph>
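To make the sentence-matching scores above concrete, here is a short Python sketch of ours, assuming the F-factor is the standard harmonic mean F1 = 2PR/(P + R) and that sentences match only when identical. Under these assumptions, one matching sentence gives F = 2/(|system| + |model|), so an 8-sentence extract against the 6-sentence model extract scores 2/14, which is approximately 0.143, matching the figure reported for system 28; the 0.118 for system 29 is consistent with an 11-sentence extract, though the paper does not state its length.

```python
# A minimal sketch (not from the paper) of sentence precision/recall and the
# F-factor, assuming F is the harmonic mean F1 = 2PR / (P + R) and that
# sentences match only when they are identical strings.

def sentence_f_factor(system_sentences, model_sentences):
    matches = sum(1 for s in system_sentences if s in set(model_sentences))
    if matches == 0:
        return 0.0
    precision = matches / len(system_sentences)
    recall = matches / len(model_sentences)
    return 2 * precision * recall / (precision + recall)

# Placeholder sentences: system 28's 8-sentence extract shares exactly one
# sentence (the first) with the 6-sentence model extract.
system_28 = ["shared first sentence"] + [f"sys28 sentence {i}" for i in range(7)]
model = ["shared first sentence"] + [f"model sentence {i}" for i in range(5)]

print(round(sentence_f_factor(system_28, model), 3))  # 0.143, as reported
```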
<Paragraph position="16"> The information missing from system 28's extract, namely that John Major and the Conservatives were losing popularity in 1993, one year after John Major had won the general election, should be the most important content in the extract. In our DG-based approach, the scores assigned to systems 28 and 29 are 0.063 and 0.100, respectively, which indicates that system 29 did a better job than system 28.</Paragraph> <Paragraph position="17"> Of the 59 200-word extracts submitted by system 28, 39 suffer from the problem of containing uninformative sentences. Across all of system 28's extracts, 103 of the 406 sentences are of this kind; on average, each extract contains 1.75 such sentences out of 6.88 sentences in total. For the 400-word extracts, 54 of the 59 submitted summaries have the same problem: 206 of the 802 sentences carry little information, i.e., about 3.49 uninformative sentences per extract, where the average extract length is 13.59 sentences. Thus, a large portion of each extract does not contribute to its information content. Such an extract will not be considered a good summary on either the criterion of summary coherence or that of summary informativeness, where coherence is how the summary reads and informativeness is how much information from the source is preserved in the summary (Mani, 2001).</Paragraph> <Paragraph position="18"> From the results based on comparing extracts against the original documents, we found that several systems perform very similarly, especially in the experiments with 400-word extracts (Table 4).</Paragraph> <Paragraph position="19"> The results show that, except for systems 22 and 28, which perform significantly worse, all other systems are very similar from the point of view of informativeness.</Paragraph> <Paragraph position="20"> Finally, we generated DGs for the model extracts and compared them against their original documents. The average F-factors are listed in Table 5 along with the scores of the different automatic summarization systems.</Paragraph> <Paragraph position="21"> Intuitively, a system whose extracts contain more information than those of other systems will get a higher score. As we can see from the data, at 200 words, the extracts generated by systems 21, 31, 24, 19, and 29 contain roughly the same amount of information as those created by humans, while the other five systems performed worse than the human judges. At 400 words, when the compression ratio of the extracts is decreased, more systems perform well; only systems 22 and 28 generated summaries that contain much less information than the model summaries.</Paragraph> </Section></Paper>
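To make the DG-based scoring used throughout this section more tangible, the following is an illustrative Python sketch, ours rather than the paper's algorithm: it treats each text as a set of concept/entity nodes and relation edges and scores the overlap with an F-factor, in contrast to whole-sentence matching. The toy triples and the exact scoring form are assumptions; the paper's actual DG construction and matching are more elaborate.

```python
# Illustrative sketch only (our stand-in, not the paper's algorithm): compare
# two texts as sets of concept/entity nodes and relation edges rather than as
# whole sentences, and score the overlap with an F-factor.

def dg_f_factor(nodes_a, edges_a, nodes_b, edges_b):
    """F-factor over matched nodes and edges; graph A plays the role of the
    extract, graph B the model extract or source document."""
    matched = len(nodes_a & nodes_b) + len(edges_a & edges_b)
    if matched == 0:
        return 0.0
    precision = matched / (len(nodes_a) + len(edges_a))
    recall = matched / (len(nodes_b) + len(edges_b))
    return 2 * precision * recall / (precision + recall)

# Toy graphs loosely based on the example extracts: texts equivalent in
# meaning but not sentence-identical can still share concepts and relations.
extract_nodes = {"john major", "conservative party", "prime minister"}
extract_edges = {("john major", "elected by", "conservative party"),
                 ("john major", "succeeds as", "prime minister")}
model_nodes = {"john major", "conservative party", "prime minister", "thatcher"}
model_edges = {("john major", "elected by", "conservative party"),
               ("john major", "succeeds as", "prime minister"),
               ("thatcher", "endorsed", "john major")}

print(round(dg_f_factor(extract_nodes, extract_edges,
                        model_nodes, model_edges), 3))  # 0.833
```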