<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0113"> <Title>Data Reliability and Its Effects on Automatic Abstracting</Title> <Section position="4" start_page="119" end_page="124" type="evalu"> <SectionTitle> 3. EVALUATION AND DISCUSSION </SectionTitle>
<Paragraph position="0"> We discarded data sets with K > 0.5 because they lacked a sufficient number of sentences for evaluation: the column-type data has only 19 sentences at 0.8 (cf. Table 7). This left us with nine sets of data with associated threshold values: 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, and 0.5. (Data with the threshold = 0.1, for instance, consists of coded representations of texts whose agreement rate is greater than or equal to 0.1.) Texts contained in the evaluation data ranged in length from 314 to 535 sentences. A part of a generated decision tree is given in Figure 1. See the caption for explanations.</Paragraph>
<Paragraph position="1"> The procedure for evaluation consists of the following steps: (1) choose at random 200 cases of category &quot;no&quot; and 40 of category &quot;yes&quot; from each of the data sets to form evaluation data; (2) divide the data so chosen into a training set and a test set; (3) build a decision tree from the training set, running C4.5 with the default options; and (4) evaluate its performance on the test data. Since the accuracy of evaluation can vary widely depending on the way the data is divided into training and test sets, the re-sampling method of cross-validation is used here, which gives the average over possible partitions of the data into training and test sets. In particular, we use a 10-fold cross-validation method where the data are divided into 10 blocks of cases, of which 9 blocks are used for training and the remaining one for the test.</Paragraph>
<Paragraph position="2"> Note that this method gives rise to 10 possible divisions and an equal number of corresponding decision tree models. The average performance of the generated models is then obtained and used as a summary estimate of the decision tree strategy for a particular set of evaluation data.</Paragraph>
<Paragraph position="3"> Further, we use the information retrieval metrics recall and precision to quantify the performance of the decision tree approach. Precision is the ratio of cases correctly assigned to the &quot;yes&quot; category to the total cases assigned to the &quot;yes&quot; category. Recall is the ratio of cases correctly assigned to the &quot;yes&quot; category to the total &quot;yes&quot; cases. Furthermore, because different samplings of evaluation data from a source data set could produce wide variations in performance, we performed 50 runs of the evaluation procedure on each of the nine data sets. Each run used a separately (and randomly) sampled set of evaluation data. Results of the multiple runs of the procedure on a data set were then averaged to give a representative performance rating for that data set. (A sketch of this procedure in code is given below.)</Paragraph>
<Paragraph position="4"> Table 9 lists the average precision ratings for the nine data sets. Despite some fluctuations in the figures, the results exhibit clear patterns (Figure 2); the kappa coefficient is strongly correlated with performance for texts of editorial type and of news-report type, but the correlation for column-type texts is only marginal. There are also marked differences in performance between text types; the decision tree method performs best on news reports and editorials, but worst on columns.</Paragraph>
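The evaluation pipeline described above (sampling 200 &quot;no&quot; and 40 &quot;yes&quot; cases, 10-fold cross-validation with a decision tree learner, precision and recall measured on the &quot;yes&quot; category, averaged over 50 random samplings) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: scikit-learn's DecisionTreeClassifier stands in for C4.5, which has no standard Python implementation, and the arrays X and y (the coded sentence representations and their &quot;yes&quot;/&quot;no&quot; labels for one threshold's data set) are assumed inputs.

```python
# Minimal sketch of the paper's evaluation procedure (not the authors' code).
# Assumptions: DecisionTreeClassifier stands in for C4.5; X holds feature
# vectors for coded sentences and y holds labels (1 = "yes", 0 = "no").
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

def one_run(X, y):
    """One run: sample 200 'no' + 40 'yes' cases at random, then
    10-fold cross-validate a decision tree on that sample."""
    sample = np.concatenate([
        rng.choice(np.flatnonzero(y == 0), size=200, replace=False),
        rng.choice(np.flatnonzero(y == 1), size=40, replace=False),
    ])
    Xs, ys = X[sample], y[sample]

    precisions, recalls = [], []
    # 10 blocks of cases: 9 blocks train the tree, the remaining one tests it,
    # rotated so every block serves once as the test set.
    for train, test in StratifiedKFold(n_splits=10, shuffle=True).split(Xs, ys):
        pred = DecisionTreeClassifier().fit(Xs[train], ys[train]).predict(Xs[test])
        # Precision and recall are computed with respect to the "yes" category.
        precisions.append(precision_score(ys[test], pred, zero_division=0))
        recalls.append(recall_score(ys[test], pred, zero_division=0))
    return np.mean(precisions), np.mean(recalls)

def evaluate(X, y, runs=50):
    """Average 50 independently sampled runs into one representative rating."""
    return np.mean([one_run(X, y) for _ in range(runs)], axis=0)
```

The outer 50-run average guards against sampling variance, while the inner 10-fold average guards against variance from any single train/test split, mirroring the two levels of averaging the section describes.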
<Paragraph position="5"> This means that the attributes used are effective only for texts of certain types. The results suggest, further, that if the attributes used are indeed a good predictor of summary extracts, their strength as a predictor will be enhanced by the reliability, or quality, of human judgements. Thus the method's poor performance on column-type texts, despite the fact that the texts become increasingly reliable, suggests a need to devise a set of attributes different from those for editorials and news reports.</Paragraph> </Section> </Paper>