<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-3007">
  <Title>Investigations on Event-Based Summarization</Title>
  <Section position="7" start_page="39" end_page="40" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="39" end_page="39" type="sub_section">
      <SectionTitle>
5.1 Dataset and Evaluation Metrics
</SectionTitle>
      <Paragraph position="0"> DUC 2001 dataset is employed to evaluate our summarization approaches. It contains 30 clusters and a total of 308 documents. The number of documents in each cluster is between 3 and 20.</Paragraph>
      <Paragraph position="1"> These documents are from some English news agencies, such as Wall Street Journal. The contents of each cluster are about some specific topic, such as the hurricane in Florida. For each cluster, there are 3 different model summaries, which are provided manually. These model summaries are created by NIST assessors for the DUC task of generic summarization. Manual summaries with 50 words, 100 words, 200 words and 400 words are provided.</Paragraph>
      <Paragraph position="2"> Since manual evaluation is time-consuming and may be subjective, the typical evaluation package, ROUGE (Lin and Hovy, 2003), is employed to test the quality of summaries. ROUGE compares the machine-generated summaries with manually provided summaries, based on uni-gram overlap, bigram overlap, and overlap with long distance. It is a recall-based measure and requires that the length of the summaries be limited to allow meaningful comparison. ROUGE is not a comprehensive evaluation method and intends to provide a rough description about the performance of machine generated summary.</Paragraph>
    </Section>
    <Section position="2" start_page="39" end_page="39" type="sub_section">
      <SectionTitle>
5.2 Experimental Configuration
</SectionTitle>
      <Paragraph position="0"> In the following experiments for independent event-based summarization, we investigate the effectiveness of the approach. In addition, we attempt to test the importance of contextual information in scoring event terms. The number of associated event terms and the type of event terms are considered to set the weights of event terms. The weights parameters in the following experiments are chosen according to empirical  Weight of any verb/action noun, which is between two entities and the first entity is person or organization, is 5. Weight of any verb/action noun, which is between two entities and the first entity is not person and not organization, is 3. Weight of any verb/action noun, which is just after a person or organization, is 2. Weight of any verb/action noun, which is just before one entity, is 1. Weight of any verb/action noun, which is just after one entity and the entity is not person and not organization, is 1.</Paragraph>
      <Paragraph position="1"> In the following experiments, we investigate the effectiveness of our approaches on under different length limitation of summary. Based on the algorithm of experiment 3, we design experiment to generate summaries with length 50 words, 100 words, 200 words, 400 words. They are named Experiment 4, Experiment 5, Experiment 3 and Experiment 6.</Paragraph>
      <Paragraph position="2"> In other experiments for relevant event-based summarization, we investigate the function of relevance between events. The configurations are described as follows.</Paragraph>
      <Paragraph position="3"> Experiment 7: Event terms and event elements are identified as we discussed in Section 3. In this experiment, event elements just include named entities. Occurrences of event terms or event elements are linked with by exact matches. Finally, the PageRank is employed to select important events and then important sentences.</Paragraph>
      <Paragraph position="4"> Experiment 8: For reference, we select one of the four model summaries as the final summary for each cluster of documents. ROUGE is employed to evaluate the performance of these manual summaries.</Paragraph>
    </Section>
    <Section position="3" start_page="39" end_page="40" type="sub_section">
      <SectionTitle>
5.3 Experimental Results
</SectionTitle>
      <Paragraph position="0"> The experiment results on independent event-based summarization are shown in table 1. The results for relevant event-based summarization are shown in table 3.</Paragraph>
      <Paragraph position="1"> Exp. 1 Exp. 2 Exp. 3  summarization (summary with length of 200 words) From table 1, we can see that results of Experiment 2 are better than those of Experiment 1. It proves our assumption that importance of event terms is different when these event terms occur with different number of event elements. Results of Experiment 3 are not significant better than those of Experiment 2, so it seems that the  assumption that importance of event terms is not very different when these event terms occur with different types of event elements. Another possible explanation is that after adjustment of the weight for event terms, the difference between the results of Experiment 2 and Experiment 3 may be extended.</Paragraph>
      <Paragraph position="2">  summarization (summary with different length) Four experiments of table 2 show that performance of our event based summarization are getting better, when the length of summaries is expanded. One reason is that event based approach prefers sentences with more event terms and more event elements, so the preferred lengths of sentences are longer. While in a short summary, people always condense sentences from original documents, and use some new words to substitute original concepts in documents. Then the Rouge score, which evaluates recall aspect, is not good in our event-based approach. In contrast, if the summaries are longer, people will adopt detail event descriptions in  summarization and a reference experiment (summary with length of 200 words) In table 3, we found the Rouge-score of relevant event-based summarization (Experiment 7) is better than independent approach (Experiment 1). In Experiment 1, we do not discriminate the weight of event element and event terms. In Experiment 7, we also did not discriminate the weight of event element and event terms. It is fair to compare Experiment 7 with Experiment 1 and it's unfair to compare Experiment 7 with Experiment 3. It looks like the relevance between nodes (event terms or event elements) can help to improve the performance. However, performance of both dependent and independent event-based summarization need to be improved further, compared with human performance in Experiment 8.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="40" end_page="41" type="evalu">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> As discussed in Section 2, event-based approaches are also employed in previous works.</Paragraph>
    <Paragraph position="1"> We evaluate our work in this context. As event-based approaches in this paper are similar with that of Filatovia and Hatzivassiloglou (2004), and the evaluation data set is the same one, the results are compared with theirs.</Paragraph>
    <Paragraph position="2"> Exp. 4 Exp. 5 Exp. 3 Exp. 6  the ROUGE scores according to each cluster of DUC 2001 data collection in Figure 2. In this figure, the bold line represents their event-based approach and the light line refers to tf*idf approach. It can be seen that the event-based approach performs better. The evaluation of the relevant event-based approach presented this paper is shown in Figure 3. The proposed approach achieves significant improvement on most document clusters. The reason seems that the relevance between events is exploited.</Paragraph>
    <Paragraph position="3"> Centroid is a successful term-based summarization approach. For caparison, we employ MEAD (Radev et.al., 2004) to generate Centroid-based summaries. Results show that Centroid is better than our relevant event-based approach. After comparing the summaries given by the two approaches, we found some limitation of our approach.</Paragraph>
    <Paragraph position="4">  Event-based approach does not work well on documents with rare events. We plan to discriminate the type of documents and apply event-based approach on suitable documents. Our relevant event-based approach is instance-based and too sensitive to number of instances of entities. Concepts seem better to represent meanings of events, as they are really things we care about. In the future, the event map will be build based on concepts and relationships between them. External knowledge may be exploited to refine this concept map.</Paragraph>
  </Section>
class="xml-element"></Paper>