
<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1012">
  <Title>Automatic Evaluation of Summaries Using Document Graphs</Title>
  <Section position="6" start_page="0" end_page="0" type="concl">
    <SectionTitle>
5 Discussion and Future Work
</SectionTitle>
    <Paragraph position="0"> In DUC 2002 data collection, 9 human ju dges were involved in creating model extracts; however, there are only 2 model extracts generated for each document set. The sentence precisions and recalls obtained from comparing the machine generated extracts and human generated model extracts are distributed along with raw data (DUC2002. http://www-nlpir.nist.gov/projects/duc), with the intent to use them in system performance comparison. Van Halteren (2002) argued that only two manually created extracts could not be used to form a sufficient basis for a good benchmark. To explore this issue, we obtained a ranking order for each human judge based on the extracts he/she generated. The results showed that the ranking orders obtained from 9 different judges are actually similar to each other, with the average Spearman correlation efficient to be 0.901. From this point of view, if the ranking orders obtained by sentence precision and recall based on the model extracts could not form a good basis for a benchmark, it is because of its binary nature (Jing et al., 1998), not the lack of sufficient model extracts in DUC 2002 data.</Paragraph>
    <Paragraph position="1"> Van Halteren and Teufel (2003) proposed to evaluate summaries via factoids, a pseudo-semantic representation based on atomic information units. However, sufficient manually created model summaries are need; and factoids are also manually annotated. Donaway et al. (2000) suggested that it might be possible to use content-based measures for summarization evaluation without generating model summaries. Here, we presented our approach to evaluate the summaries base on document graphs, which is generated automatically. It is not very surprising that different measures rank summaries differently. A similar observation has been reported previously (Radev, et al, 2003). Our document graph approach on summarization evaluation is a new automatic way to evaluate machine-generated summaries, which measures the summaries from the point of view of informativeness. It has the potential to evaluate the quality of summaries, including extracts, abstracts, and multi-document summaries, without human involvement. To improve the performance of our system and better represent the content of the summaries and source documents, we are working in several areas: 1) Improve the results of natural language processing to capture information more accurately; 2) Incorporate a knowledge base, such as WordNet (Fellbaum, 1998), to address the synonymy problem; and, 3) Use more heuristics in our relation extraction and generation. We are also going to extend our experiments by comparing our approach to content-based measure approaches, such as cosine similarity based on term frequencies and LSI approaches, in both extracts and abstracts.</Paragraph>
  </Section>
class="xml-element"></Paper>