<?xml version="1.0" standalone="yes"?>
<Paper uid="X98-1022">
  <Title>AN NTU-APPROACH TO AUTOMATIC SENTENCE EXTRACTION FOR SUMMARY GENERATION</Title>
  <Section position="7" start_page="166" end_page="166" type="evalu">
    <SectionTitle>
5. EXPERIMENT RESULTS
</SectionTitle>
    <Paragraph position="0"> In general, the results are evaluated by assessors, and then measured by recall (R), precision (P), F-score (F) and the normalized F-score (NormF). Table 1 shows the contingence table of the real answer against the assessors.</Paragraph>
    <Paragraph position="1">  Each group could provide up to two kinds of summary. One is the fixed-length summary and the other is the best summary. In order to level off the effect of length of summary, compression factor is introduced to normalize the F-score.</Paragraph>
    <Paragraph position="3"> Table 2 shows the result of our adhoc summary task. Table 3 shows the result of our categorization summary task. The NormF of the best summary and that of the fixed summary for adhoc tasks are 0.456 and 0.447, respectively. In comparison to other systems, the performance of our system is not good.</Paragraph>
    <Paragraph position="4"> One reason is that we have not developed an appropriate method to determine the threshold for selection of sentence. Besides, we are the only one team not from Indo-European language family. This maybe has some impacts on the performance.</Paragraph>
    <Paragraph position="5"> However, considering the time factor, our system perform much better than many systems.</Paragraph>
    <Paragraph position="6"> The NormF of the best summary and that of the fixed summary for categorization task are 0.4090 and 0.4023, respectively. Basically, this task is like the traditional categorization problem. Our system performs much well. However, there is no significant difference among all participating systems.</Paragraph>
    <Paragraph position="7"> Table 4 shows our system's performance against average performance of all systems. Although some measures of our performance are worse than that those of the average performance, the difference is not very significant. In categorization task, we outperform the average performance of all systems.</Paragraph>
    <Paragraph position="8"> Table 5 is the standard deviation of all systems.</Paragraph>
    <Paragraph position="9"> Essentially, the difference of all systems is not significant. Figure 3 shows each measure of performance for our system. Figure 4 shows our system against the best system.</Paragraph>
    <Paragraph position="10">  SUMMAC also conducts a series of baseline experiments to compare the system performance.</Paragraph>
    <Paragraph position="11"> From the report of these experiments, we find that for categorization task, the fixed-length summary is pretty good enough. For adhoc task, the best summary will do the better job. Another important finding is that the assessors are highly inconsistent. How to find out a fair and consistent evaluation methodology is worth further investigating.</Paragraph>
  </Section>
class="xml-element"></Paper>