<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0507">
  <Title>Text Summarization Challenge 2 Text summarization evaluation at NTCIR Workshop 3</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
4 Evaluation Methods for each task
</SectionTitle>
    <Paragraph position="0"> We use summaries prepared by human as key data for evaluation. The same two intrinsic evaluation methods are used for both tasks. They are evaluation by ranking summaries and by measuring the degree of revisions. Here are the details of the two methods. We use 30 articles for task A and 30 sets of documents (30 topics) for task B at formal run evaluation.</Paragraph>
    <Paragraph position="1"> Unfortunately, due to the limitation of the budget, only an evaluator evaluates a system's result for an article(or a set).</Paragraph>
    <Paragraph position="2"> 4.1. Evaluation by ranking This is basically the same as the evaluation method used for TSC1 task A-2 (subjective evaluation). We ask human judges, who are experienced in producing summaries, to evaluate and rank the system summaries in terms of two points of views.</Paragraph>
    <Paragraph position="3">  1. Content: How much the system summary covers the important content of the original article.</Paragraph>
    <Paragraph position="4"> 2. Readability: How readable the system summary is.  The judges are given 4 types of summaries to be evaluated and rank them in 1 to 4 scale (1 is the best, 2 for the second, 3 for the third best, and 4 for the worst). For task A, the first two types are human-produced abstract-type type1 and type2 summaries. The third is system results, and the fourth is summaries produced by lead method.</Paragraph>
    <Paragraph position="5"> For task B, the first is human-produced free summaries of the given set of documents, and the second is system results. The third is the results of the baseline system based on lead method where the first sentence of each document is used. The fourth is the results of the benchmark system using Stein method  ([7]) whose procedure is as follows: 1. Produce a summary for each document.</Paragraph>
    <Paragraph position="6"> 2. Group the summaries into several clusters. The number of clusters is adjusted to be less than the half of the number of the documents.</Paragraph>
    <Paragraph position="7"> 3. Choose the most representative summary as the summary of the cluster.</Paragraph>
    <Paragraph position="8"> 4. Compute the similarity among the clusters and output the representative summaries in such order that the similarity of neighboring summaries is high.</Paragraph>
    <Paragraph position="9"> 4.2. Evaluation by revision  It is a newly introduced evaluation method in TSC2 to evaluate the summaries by measuring the degree of revision to system results. The judges read the original documents and revise the system summaries in terms of the content and readability. The revisions are made by one of three editing operations (insertion, deletion, replacement). The degree of the revision is computed based on the number of the operations and the number of revised characters. The revisers could be completely free in what they did, though they were instructed to do minimum revision.</Paragraph>
    <Paragraph position="10"> As baseline for task A, lead-method results are used. As reference for task A, human produced summaries (abstract type1 and abstract type 2) are used. And as baseline, reference, and benchmark for task B, lead-method results, human produced summaries that are different from the key data, and the results based on the Stein method are used respectively.</Paragraph>
    <Paragraph position="11"> When more than half of the document needs to be revised, the judges can 'give up' revising the document.</Paragraph>
  </Section>
</Paper>