<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1204"> <Title>Evaluation of Features for Sentence Extraction on Different Types of Corpora</Title> <Section position="7" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Evaluation results </SectionTitle> <Paragraph position="0"> In this section, we show our evaluation results on the three sets of data for the sentence extraction system described in the previous section.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Evaluation results for the TSC data </SectionTitle> <Paragraph position="0"> Table 1 shows the evaluation results for our system and some baseline systems on the task of sentence extraction at TSC-2001. The figures in Table 1 are values of the F-measure¹. The 'System' column shows the performance of our system and its rank among the nine systems applied to the task, and the 'Lead' column shows the performance of a baseline system that extracts as many sentences as the threshold allows from the beginning of a document. Since all participants could output as many sentences as the allowed upper limit, the values of recall, precision, and F-measure were the same.</Paragraph> <Paragraph position="1"> Our system obtained better results than the baseline systems, especially when the compression ratio was 10%. Its average performance was second among the nine systems.</Paragraph> <Paragraph position="2"> ¹ The F-measure is calculated as F = (2 × Recall × Precision) / (Recall + Precision), with Recall = COR/GLD and Precision = COR/SYS, where COR is the number of correct sentences marked by the system, GLD is the total number of correct sentences marked by humans, and SYS is the total number of sentences marked by the system. After calculating these scores for each transcription, the average is calculated as the final score.</Paragraph> </Section>
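To make footnote 1 concrete, the following is a minimal sketch in Python (not from the paper); the input format, per-document sets of sentence IDs marked by the system and by humans, is an assumption for illustration only.

```python
# Minimal sketch (not from the paper) of the footnote's scoring scheme.
# Assumed input: for each document, the set of sentence IDs marked by the
# system and the set marked by humans.

def f_measure(system_ids, gold_ids):
    """Recall = COR/GLD, Precision = COR/SYS, F = 2RP / (R + P)."""
    cor = len(set(system_ids) & set(gold_ids))  # COR: correct sentences marked by the system
    gld = len(gold_ids)                         # GLD: correct sentences marked by humans
    sys_marked = len(system_ids)                # SYS: sentences marked by the system
    recall = cor / gld if gld else 0.0
    precision = cor / sys_marked if sys_marked else 0.0
    if recall + precision == 0.0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

def average_f(doc_pairs):
    """Average the per-document F scores, as the footnote describes."""
    return sum(f_measure(s, g) for s, g in doc_pairs) / len(doc_pairs)

# Example with two documents:
print(average_f([({1, 3, 5}, {1, 2, 5}), ({2, 4}, {2, 4})]))  # ~0.833
```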
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Evaluation results for the DUC data </SectionTitle> <Paragraph position="0"> Table 2 shows the results of a subjective evaluation in the SDS task at DUC-2001. In this subjective evaluation, assessors scored each system's outputs on a zero-to-four scale (where four is best) against summaries made by humans. The figures shown are the average scores over all documents. The 'System' column shows the performance of our system and its rank among the 12 systems applied to this task. The 'Lead' column shows the performance of a baseline system that always outputs the first 100 words of a given document, while the 'Avg.' column shows the average for all systems. Our system ranked 5th in grammaticality and was ranked at the top for all the other measurements, including the total value.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Evaluation results for the CSJ data </SectionTitle> <Paragraph position="0"> The evaluation results for sentence extraction with the CSJ data are shown in Table 3. We compared the system's results with each annotator's key data. As mentioned previously, we used 50 transcriptions for training and 10 for testing.</Paragraph> <Paragraph position="1"> These results are comparable with the performance on sentence extraction for written documents, because the system's performance for the TSC data was 0.363 when the compression ratio was set to 10%. The results of our experiments thus show that sentence extraction for transcriptions achieves results comparable to those for written documents, provided the sentences are well defined.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Contributions of features </SectionTitle> <Paragraph position="0"> Table 4 shows the contribution vectors for each set of training data. The contribution here means the product of the optimized weight and the standard deviation of the score for the test data. The vectors were normalized so that the sum of their components equals 1, and the selected function types for the features are also shown in the table. Our system used the NE-based headline function (HL (N)) for the DUC data, the word-based function (HL (W)) for the CSJ data, and both functions for the TSC data. The columns for the TSC data show the contributions when the compression ratio was 10%.</Paragraph> <Paragraph position="1"> We can see that the feature with the biggest contribution varies among the data sets. While the position feature was the most effective for the TSC and DUC data, the length feature was dominant for the CSJ data. Most of the short sentences in the lectures were specific expressions, such as &quot;This is the result of the experiment.&quot; or &quot;Let me summarize my presentation.&quot; Since these sentences were not extracted as key sentences by the annotators, we believe that the function giving short sentences a penalty score matched the manual extraction results.</Paragraph> <Paragraph position="2"> 5 Analysis of the summarization data </Paragraph> <Paragraph position="3"> In Section 4, we showed that our system, which combines major features, performs well compared with current summarization systems. However, the evaluation results alone do not sufficiently explain how such a combination of features is effective. In this section, we investigate the correlations between each pair of features. We also match feature pairs against the distributions of extracted key sentences as answer summaries, to find effective combinations of features for sentence extraction.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Correlation between features </SectionTitle> <Paragraph position="0"> Table 5 shows Spearman's rank correlation coefficients among the four features. Significantly correlated feature pairs are indicated by '/' (α = 0.001). Here, the word-based feature is used as the headline feature. We see the following tendencies for all of the data sets:
- &quot;Position&quot; is relatively independent of the other features.
- &quot;Length&quot; and &quot;Tf*idf&quot; have high correlation².
These results show that while combinations of these four features enabled us to obtain good evaluation results, as shown in Section 4, the features are not necessarily independent of one another.</Paragraph> <Paragraph position="1"> ² Here we used equation T1 for the tf*idf feature, and the score of each sentence was normalized by the sentence length. Hence, the high correlation between &quot;Length&quot; and &quot;Tf*idf&quot; is not trivial.</Paragraph> </Section>
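As an illustration of this analysis, here is a minimal sketch using SciPy's spearmanr over made-up per-sentence feature scores; the values are invented for demonstration, and only the significance level α = 0.001 is taken from the text (the paper's Table 5 is computed from the actual corpora).

```python
# Minimal sketch (not from the paper) of the Section 5.1 analysis:
# Spearman's rank correlation for every pair of features, computed
# over made-up per-sentence feature scores.
from itertools import combinations

from scipy.stats import spearmanr

# Hypothetical scores for a handful of sentences (one list per feature).
features = {
    "Position": [1.0, 0.8, 0.6, 0.4, 0.2, 0.9, 0.1],
    "Length":   [0.3, 0.9, 0.5, 0.7, 0.1, 0.6, 0.2],
    "Tf*idf":   [0.2, 0.8, 0.6, 0.9, 0.1, 0.5, 0.3],
    "Headline": [0.0, 1.0, 0.0, 0.5, 0.0, 0.5, 0.0],
}

ALPHA = 0.001  # significance level used in Table 5

for (name_a, a), (name_b, b) in combinations(features.items(), 2):
    rho, p_value = spearmanr(a, b)  # returns (correlation, p-value)
    mark = " *" if p_value < ALPHA else ""
    print(f"{name_a:8s} vs {name_b:8s}: rho = {rho:+.2f}, p = {p_value:.4f}{mark}")
```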
<Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Combination of features </SectionTitle> <Paragraph position="0"> Tables 6 and 7 show the distributions of extracted key sentences as answer summaries for two pairs of features: sentence position and the tf*idf value, and sentence position and the headline information. In these tables, each sentence is ranked by each of the two feature values, and the rankings are split every 10 percent. For example, if a sentence is ranked in the first 10 percent by sentence position and the last 10 percent by the tf*idf feature, the sentence belongs to the cell with a position rank of 0.1 and a tf*idf rank of 1.0 in Table 6.</Paragraph> <Paragraph position="1"> Each cell thus contains two letters: the left letter categorizes the number of key sentences in the cell, and the right letter categorizes the ratio of key sentences to all sentences in the cell. The left letter shows how the number of key sentences differs from the average expected if all key sentences appeared uniformly, regardless of feature values. Let T be the total number of key sentences, M (= T/100) be the average number of key sentences per cell (the two decile rankings define 100 cells), and S be the standard deviation of the number of key sentences among all cells. The number of key sentences T_{i,j} for a cell is then categorized according to one of the following letters:
A: T_{i,j} ≥ M + 2S
B: M + S ≤ T_{i,j} < M + 2S
C: M - S ≤ T_{i,j} < M + S
D: M - 2S ≤ T_{i,j} < M - S
E: T_{i,j} < M - 2S
O: T_{i,j} = 0
Similarly, the right letter in a cell shows how the ratio of key sentences differs from the average ratio expected if all key sentences appeared uniformly, regardless of feature values. Let N be the total number of sentences, m (= T/N) be the average ratio of key sentences, and s be the standard deviation of the ratio among all cells. The ratio of key sentences t_{i,j} for a cell is then categorized according to one of the following letters:
a: t_{i,j} ≥ m + 2s
b: m + s ≤ t_{i,j} < m + 2s
c: m - s ≤ t_{i,j} < m + s
d: m - 2s ≤ t_{i,j} < m - s
e: t_{i,j} < m - 2s
o: t_{i,j} = 0
-: No sentences exist in the cell.</Paragraph>
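A minimal sketch of this labeling scheme follows (not from the paper). It assumes each sentence carries decile ranks for the two features plus a key-sentence flag, and it emits the uppercase count label per cell; the lowercase ratio labels would follow the same pattern with m and s.

```python
# Minimal sketch (not from the paper) of the Section 5.2 cell labeling.
# Assumed input: per-sentence decile ranks (1..10) for two features plus
# a flag marking key sentences; output: the uppercase count label per cell.
import statistics

def count_label(t, mean, sd):
    """Label a cell's key-sentence count T_{i,j} (A/B/C/D/E/O as defined above)."""
    if t == 0:
        return "O"
    if t >= mean + 2 * sd:
        return "A"
    if t >= mean + sd:
        return "B"
    if t >= mean - sd:
        return "C"
    if t >= mean - 2 * sd:
        return "D"
    return "E"

# Hypothetical sentences: (position decile, tf*idf decile, is_key_sentence).
sentences = [(1, 9, True), (1, 8, True), (1, 9, True), (2, 9, False),
             (9, 1, False), (10, 2, True), (5, 5, False)]

# Count key sentences per cell of the 10x10 grid.
counts = {}
for pos_d, tf_d, is_key in sentences:
    if is_key:
        counts[(pos_d, tf_d)] = counts.get((pos_d, tf_d), 0) + 1

# M = T/100 and S, computed over all 100 cells (empty cells count as zero).
all_cells = [counts.get((i, j), 0) for i in range(1, 11) for j in range(1, 11)]
mean, sd = statistics.mean(all_cells), statistics.pstdev(all_cells)

for cell in sorted(counts):
    print(cell, counts[cell], count_label(counts[cell], mean, sd))
```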
<Paragraph position="2"> When key sentences appear uniformly regardless of feature values, every cell is defined as 'Cc'. We show both the number range and the ratio range of key sentences, because both are necessary to show how effectively a cell captures key sentences. If a cell includes many sentences, the number of key sentences can be large even though the ratio is not. On the other hand, when the ratio of key sentences is large but the number is not, the contribution to key sentence extraction is small.</Paragraph> <Paragraph position="3"> Table 6 shows the distributions of key sentences when the sentence position and tf*idf features were combined. For the DUC data, both the number and ratio of key sentences were large when the sentence position was ranked within the first 20 percent and the tf*idf feature was ranked in the bottom 50 percent (i.e., Pst. ≤ 0.2, Tf*idf ≥ 0.5). On the other hand, for the CSJ data both the number and ratio of key sentences were large when the sentence position was ranked in the last 10 percent and the tf*idf feature was ranked after the first 30 percent (i.e., Pst. = 1.0, Tf*idf ≥ 0.3). When the tf*idf feature was low, the number and ratio of key sentences were not large, regardless of the sentence position values. These results show that the tf*idf feature is effective when its values are used as a filter after the sentences are ranked by sentence position.</Paragraph> <Paragraph position="4"> Table 7 shows the distributions of key sentences for the combination of the sentence position and headline features. About half the sentences did not share words with the headlines and thus had a value of 0 for the headline feature. As a result, the cells in the middle of the table have no corresponding sentences. Unlike the tf*idf feature, the headline feature cannot be used as a filter, because many key sentences are found when the value of the headline feature is 0. A high value of the headline feature is, however, a good indicator of key sentences when it is combined with the position feature. For the DUC data, the ratio of key sentences was large when the headline ranking was high and the sentence was near the beginning (Pst. ≤ 0.2, Headline ≥ 0.7). For the CSJ data, the ratio of key sentences was also large when the headline ranking was within the top 10 percent (Pst. = 0.1, Headline = 1.0), as well as for sentences near the ends of speeches.</Paragraph> <Paragraph position="5"> These results indicate that the number and ratio of key sentences sometimes vary discretely with changes in feature values when features are combined for sentence extraction. That is, the performance of a sentence extraction system can be improved by categorizing feature values into several ranges and then combining the ranges. While most sentence extraction systems use sequential combinations of features, as we do in our system based on Equation 1, their performance could be improved by introducing such categorization of feature values, without adding any new features. We have shown that discrete combinations match the distributions of key sentences in two different corpora, the DUC data and the CSJ data. This indicates that discrete combinations of features are effective across both different languages and different types of corpora. Hirao et al. (2002) reported the results of a sentence extraction system using an SVM, which categorized sequential feature values into ranges in order to make the features binary. Some effective combinations of the binary features in that report also indicate the effectiveness of discrete combinations of features.</Paragraph> </Section> </Section> </Paper>