<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0507"> <Title>Text Summarization Challenge 2 Text summarization evaluation at NTCIR Workshop 3</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Participants </SectionTitle> <Paragraph position="0"> We had 4 participating systems for Task A and 5 systems for Task B at the dry run, and 8 participating systems for Task A and 9 systems for Task B at the formal run. In total, there were 8 participating groups, all of them Japanese, from universities, governmental research institutes, and companies in Japan. Table 1 shows the breakdown of the groups.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Results 6.1. Results of Evaluation by ranking </SectionTitle> <Paragraph position="0"> Table 2 shows the results of evaluation by ranking for task A, and Table 3 shows the results of evaluation by ranking for task B. Each score is the average of the scores over 30 articles for task A and over 30 topics for task B. In Tables 2 and 3, F01* and F02* are the labels for the participating systems, respectively. In Table 2, 'TF' indicates a baseline system based on the term-frequency method, and 'Human' indicates human-produced summaries that are different from the key data used in the ranking judgement.</Paragraph> <Paragraph position="1"> In Table 3, 'Human' indicates human-produced summaries that are different from the key data used in the ranking judgement.</Paragraph> <Paragraph position="2"> In Appendix A, we also show tables giving the fraction of times that each system beats the baseline, one human summary, or two human summaries for task A. In Appendix B, we show tables giving the fraction of times that each system beats the baseline, the benchmark, or the human summary for task B.</Paragraph> <Paragraph position="3"> In comparison with the system results (Table 2 and Table 3), the scores for the human summaries, the baseline systems, and the benchmark system (the summaries to be compared) are shown in Tables 4 and 5. Table 6 shows the results of evaluation by revision for task A at the 40% rate, and Table 7 shows the results of evaluation by revision for task A at the 20% rate. Table 8 shows the results of evaluation by revision for task B long summaries, and Table 9 shows the results of evaluation by revision for task B short summaries. All the tables show the average values. Please note that in Tables 6 to 9, UIM stands for unimportant, RD for readability, IM for important, and C for content. These labels give the reason for the editing operations; e.g. 'unimportant' marks a deletion operation applied to a part judged to be unimportant, and 'content' marks a replacement operation due to excess or deficiency of content.</Paragraph> <Paragraph position="4"> In Tables 6 and 7, 'ld' means a baseline system using the lead method, 'free' means free summaries produced by humans (abstract type 2), and 'part' means human-produced summaries (abstract type 1); these three provide the baseline and reference scores for task A.</Paragraph> <Paragraph position="5"> In Tables 8 and 9, 'human' means human-produced summaries that are different from the key data, 'ld' means a baseline system using the lead method, and 'stein' means a benchmark system using the Stein method; these three provide the baseline, reference, and benchmark scores for task B.</Paragraph> <Paragraph position="6"> To determine the plausibility of the judges' revisions, the revised summaries were again evaluated with the evaluation methods in section 5. In Tables 6 to 9, 'edit' means the evaluation results for the revised summaries. As a measure of the degree of revision, we also counted the number of revised characters for each of the three editing operations and the number of documents that the judges gave up revising. The detailed data can be found in the NTCIR Workshop 3 data booklet.</Paragraph>
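<Paragraph> As an illustration, a minimal Python sketch of how the degree of revision could be computed is given below. The log format, the field names, and the operation names (insertion, deletion, and replacement are assumed here) are hypothetical choices for this sketch, not the TSC2 data format; only the quantities counted (revised characters per operation and documents given up) follow the text above.

# Illustrative sketch only; the log format and field names are hypothetical.
# Degree of revision = number of revised characters, counted per editing
# operation (assumed here to be insertion, deletion, and replacement),
# plus the number of documents that the judges gave up revising.

from collections import Counter

def degree_of_revision(documents):
    """documents: list of dicts such as
       {"given_up": False,
        "operations": [("deletion", "UIM", 12), ("replacement", "C", 30)]}
       where the last element of each tuple is the number of characters touched."""
    chars_per_op = Counter()      # revised characters per operation type
    chars_per_reason = Counter()  # revised characters per reason code (UIM, RD, IM, C)
    given_up = 0
    for doc in documents:
        if doc["given_up"]:
            given_up += 1
            continue
        for op, reason, n_chars in doc["operations"]:
            chars_per_op[op] += n_chars
            chars_per_reason[reason] += n_chars
    return chars_per_op, chars_per_reason, given_up

# Hypothetical example: two documents, the second abandoned by the judge.
docs = [
    {"given_up": False,
     "operations": [("deletion", "UIM", 12), ("replacement", "C", 30)]},
    {"given_up": True, "operations": []},
]
print(degree_of_revision(docs))
</Paragraph>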
</Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Discussion </SectionTitle> <Paragraph position="0"> 7.1. Discussion for Evaluation by ranking We here look further into how the participating systems perform by analysing the ranking results in terms of the differences between the scores for content and those for readability.</Paragraph> <Paragraph position="1"> First, consider task A. Figure 1 shows the differences in scores for content and readability for each system. 'C20-R20' means the score for content at 20% minus the score for readability at 20%. 'C40-R40' means the score for content at 40% minus the score for readability at 40%.</Paragraph> <Paragraph position="2"> Figure 1 indicates how much the scores for content and readability vary for summaries at the same summarization rate. It shows that the readability scores tend to be higher than those for content, and this is especially clear at the 40% rate.</Paragraph> <Paragraph position="3"> Figure 2 shows the differences in scores between the two summarization rates of task A, i.e. 20% and 40%. 'C20-C40' means the score for content at 20% minus the score for content at 40%. 'R20-R40' means the score for readability at 20% minus the score for readability at 40%.</Paragraph> <Paragraph position="4"> Figure 2 tells us that the ranking scores for 20% summarization tend to be higher than those for 40%, and this holds for the baseline system and the human summaries as well.</Paragraph> <Paragraph position="5"> Second, consider task B. Figure 3 shows the differences in scores for content and readability for each system for task B. 'CS-RS' means the score for content of short summaries minus the score for readability of short summaries. 'CL-RL' is computed in the same way for long summaries.</Paragraph> <Paragraph position="6"> Figure 3 shows, like Figure 1, that the scores for readability tend to be higher than those for content for both short and long summaries, so the differences take negative values. In addition, the differences are larger than those we saw for task A.</Paragraph> <Paragraph position="7"> Figure 4 shows the differences in scores between the two summarization lengths of task B, i.e. short and long summaries. 'CS-CL' means the score for content of short summaries minus the score for content of long summaries. 'RS-RL' means the score for readability of short summaries minus the score for readability of long summaries.</Paragraph> <Paragraph position="8"> Figure 4 tells us that, unlike Figure 2, the scores for short summaries tend to be lower than those for long summaries. This tendency is very clear for the readability ranking scores.</Paragraph> <Paragraph position="9"> Figures 1 and 3 show that, when we compare the ranking scores for content and readability, the readability scores tend to be higher than those for content, which means that the evaluation for readability is worse than that for content. Figures 2 and 4 show contradicting tendencies: Figure 2 indicates that short (20%) summaries have higher ranking scores, i.e. are evaluated as worse, whereas Figure 4 indicates the opposite.</Paragraph> <Paragraph position="10"> Intuitively, longer summaries can have better readability since they have more words to work with, and this is what Figure 2 shows. However, it is not the case with the task B ranking results: longer summaries had worse scores, especially in the readability evaluation.</Paragraph>
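<Paragraph> To make the difference metrics used in Figures 1 to 4 concrete, a minimal Python sketch is given below. The data layout, the system labels, and the score values are hypothetical; only the definitions of the differences follow the text above.

# Illustrative sketch of the difference metrics used in Figures 1 to 4.
# The dictionary layout, system labels, and numbers are hypothetical.

# Task A: average ranking scores per system, by evaluation aspect and rate.
task_a = {
    "F0101": {"content": {20: 2.1, 40: 1.8}, "readability": {20: 2.4, 40: 2.3}},
}

# Task B: average ranking scores per system, by aspect and summary length.
task_b = {
    "F0201": {"content": {"short": 2.0, "long": 2.2},
              "readability": {"short": 2.3, "long": 2.6}},
}

for name, s in task_a.items():
    c20_r20 = s["content"][20] - s["readability"][20]       # Figure 1: C20-R20
    c40_r40 = s["content"][40] - s["readability"][40]       # Figure 1: C40-R40
    c20_c40 = s["content"][20] - s["content"][40]           # Figure 2: C20-C40
    r20_r40 = s["readability"][20] - s["readability"][40]   # Figure 2: R20-R40
    print(name, c20_r20, c40_r40, c20_c40, r20_r40)

for name, s in task_b.items():
    cs_rs = s["content"]["short"] - s["readability"]["short"]    # Figure 3: CS-RS
    cl_rl = s["content"]["long"] - s["readability"]["long"]      # Figure 3: CL-RL
    cs_cl = s["content"]["short"] - s["content"]["long"]         # Figure 4: CS-CL
    rs_rl = s["readability"]["short"] - s["readability"]["long"] # Figure 4: RS-RL
    print(name, cs_rs, cl_rl, cs_cl, rs_rl)
</Paragraph>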
<Paragraph position="11"> 7.2. Discussion for Evaluation by revision To determine the plausibility of the judges' revisions, the revised summaries were again evaluated with the evaluation methods in section 5. As Tables 6 to 9 show, the degree of revision for the revised summaries is rather smaller than that for the original ones and is almost the same as that for the human summaries.</Paragraph> <Paragraph position="12"> Tables 10 and 11 show the results of evaluation by ranking for the revised summaries for task A and task B, respectively. Compared with Tables 2 to 5, Tables 10 and 11 show that the scores for the revised summaries are rather smaller than those for the original ones and are almost the same as those for the human summaries.</Paragraph> <Paragraph position="13"> From these results, the quality of the revised summaries can be considered the same as that of the human summaries.</Paragraph> </Section> </Paper>