<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1012">
  <Title>Automatic Evaluation of Summaries Using Document Graphs</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>3 Data and Experimental Design</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>3.1 Data</SectionTitle>
      <Paragraph position="0">Because the data from DUC-2003 were short (~100 words per extract for the multi-document task), we chose to use multi-document extracts from DUC-2002 (~200 and ~400 words per extract for the multi-document task) in our experiment. In this corpus, each of ten information analysts from the National Institute of Standards and Technology (NIST) chose one set of newswire/paper articles in each of the following topic types (Over and Liggett, 2002):
* A single natural disaster event, with documents created within at most a 7-day window
* A single event of any type, with documents created within at most a 7-day window
* Multiple distinct events of the same type (no time limit)
* Biographical (discussing a single person)
Each assessor then chose 2 more sets of articles, so that we ended up with a total of 15 document sets of each type. Each set contains about 10 documents. All documents in a set are mainly about a specific "concept." A total of ten automatic summarizers participated, producing machine-generated summaries. Two extracts of different lengths, 200 and 400 words, were generated for each document set.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>3.2 Experimental Design</SectionTitle>
      <Paragraph position="0">A total of 10 different automatic summarization systems submitted their summaries to DUC. We obtained a ranking order of these 10 systems based on sentence precision/recall by comparing the machine-generated extracts to the human-generated model summaries. The F-factor is calculated from the following equation (Rijsbergen, 1979):</Paragraph>
      <Paragraph position="1">F = 2PR / (P + R)</Paragraph>
      <Paragraph position="2">where P is the precision and R is the recall. We think this ranking order gives us some idea of how human judges assess the performance of the different systems.</Paragraph>
      <Paragraph position="3">For our evaluation based on DGs, we also calculated F-factors based on precision and recall, where P = Sim(DG1, DG2) and R = Sim(DG2, DG1). In the first experiment, we ranked the 10 automatic summarization systems by comparing the DGs generated from their outputs to the DGs generated from the model summaries. In this case, DG1 is the DG of a machine-generated extract and DG2 is the DG of a human-generated model extract. In the second experiment, we ranked the systems by comparing machine-generated extracts to the original documents. In this case, DG1 is the DG of an extract and DG2 is the DG of the corresponding original document. Since the extracts were generated from multi-document sets, we used the average of the F-factors for ranking purposes.</Paragraph>
    </Section>
  </Section>
</Paper>
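The F-factor arithmetic and the averaging step described in Section 3.2 can be illustrated with a short Python sketch. This is a minimal sketch under stated assumptions, not the authors' implementation: the similarity values and system names below are hypothetical placeholders, standing in for the paper's asymmetric document-graph similarities Sim(DG1, DG2) and Sim(DG2, DG1).

# Minimal sketch (assumptions noted above): compute the F-factor from two
# directed DG similarities and rank systems by their average F-factor.

def f_factor(p, r):
    """F-factor with P = Sim(DG1, DG2) and R = Sim(DG2, DG1)."""
    if p + r == 0:
        return 0.0
    return 2.0 * p * r / (p + r)

def rank_systems(f_factors_by_system):
    """Rank systems by their average F-factor over all document sets."""
    averages = {name: sum(fs) / len(fs) for name, fs in f_factors_by_system.items()}
    return sorted(averages, key=averages.get, reverse=True)

# Hypothetical scores for two systems on three document sets.
scores = {
    "system_A": [f_factor(0.42, 0.38), f_factor(0.51, 0.47), f_factor(0.35, 0.40)],
    "system_B": [f_factor(0.30, 0.33), f_factor(0.44, 0.41), f_factor(0.29, 0.31)],
}
print(rank_systems(scores))  # -> ['system_A', 'system_B']

Averaging over document sets before sorting mirrors the paper's use of mean F-factors for ranking when extracts come from multiple document sets.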