<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1014">
  <Title>Evaluation Measures Considering Sentence Concatenation for Automatic Summarization by Sentence or Word Extraction</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Evaluation Metrics for Extraction
</SectionTitle>
    <Paragraph position="0"> In summarization through sentence or word extraction under a specific summarization ratio, the order of the sentences or words and the length of the summaries are restricted by the original documents or sentences. Metrics based on the accuracy of the components in the summary is a straight-forward approach to measuring similarities between the target and automatic summaries.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Accuracy
</SectionTitle>
      <Paragraph position="0"> In the field of speech recognition, automatic recognition results are compared with manual transcription results. The conventional metric for speech recognition is recognition accuracy calculated based on word accuracy:</Paragraph>
      <Paragraph position="2"> where Sub, Ins, Del, and Len are the numbers of substitutions, insertions, deletions, and words in the manual transcription, respectively. Although word accuracy cannot be used to directly evaluate the meanings of sentences, higher accuracy indicates that more of the original information has been preserved. Since the meaning of the original documents is generated by combining sentences, this metric can be applied to the evaluation for sentence extraction. Sentence accuracy defined by eq. (1) with words replaced by sentences represents how much the automatic result is similar to the answer and how well it preserves the original meaning.</Paragraph>
      <Paragraph position="3"> Accuracy is the simplest and most efficient metric when the target for the automatic summaries can be set as only one answer. However, there are usually multiple targets for each automatic summary due to the variation in manual summarization among humans. Therefore, it is not easy to use accuracy to evaluate automatic summaries. Subjective variation results into two problems: how to consider all possible correct answers in the manual summaries, and how to measure the similarity between the evaluation sentence and multiple manual summaries. null If we could collect all possible manual summaries, the one most similar to the automatic result could be chosen as the correct answer and used for the evaluation. The sentence or word accuracy compared with the most similar manual summary is denoted as NrstACCY. However, in real situations, the number of manual summaries that could be collected is limited. The coverage of correct answers in the collected manual summaries is unknown. When the coverage is low, the summaries are compared with inappropriate targets, and the NrstACCY obtained by such comparison does not provide an efficient measure.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 N-gram Precision
</SectionTitle>
      <Paragraph position="0"> One way to cope with the coverage problem is to use local matching of components or component strings with all the manual summaries instead of using a measure comparing a word sequence as a whole sentence, such as NrstACCY. The similarity can be measured by counting the precision, i.e., the number of sentence or word n-gram overlapping between the automatic result and all the references.</Paragraph>
      <Paragraph position="1"> Even if there are multiple targets for an automatic summary, the precision of components in each original can be used to evaluate the similarity between the automatic result and the multiple references.</Paragraph>
      <Paragraph position="2"> Precision is an efficient way of evaluating the similarity of component occurrence between automatic results and targets with a different order of components and different lengths.</Paragraph>
      <Paragraph position="3"> In the evaluation of summarization through extraction, a component occurring in a different loca-tion in the original is considered to be a different component even if it is the same component as one in the result. When an answer for the automatic result can be unified and the lengths of the automatic result and its answer are the same, accuracy counts insertion errors and deletion errors and thus has both the precision and recall characteristics.</Paragraph>
      <Paragraph position="4"> Since meanings are basically conveyed by word strings rather than single words, word string precision (Hori and Furui, 2000b) can be used to evaluate linguistic precision and the maintenance of the original meanings of an utterance. In this method, word strings of various lengths, that is n-grams, are used as components for measuring precision. The extraction ratio, pn, of each word string consisting of n words in a summarized sentence, V = v1;v2;::: ;vM, is given by</Paragraph>
      <Paragraph position="6"> un: each word string consisting of n words Un: a set of word strings consisting of n words in all manual summarizations.</Paragraph>
      <Paragraph position="7"> When n is 1, pn corresponds to the precision of each word, and when n is the same length as a summarized sentence (n = M), pn indicates the precision of the summarized sentence itself.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Summarization Accuracy: SumACCY
</SectionTitle>
      <Paragraph position="0"> Summarization accuracy (SumACCY) was proposed to cope with the problem of correct answer coverage and various references among humans (Hori and Furui, 2001). To cover all possible correct answers for summarization using a limited number of manual summaries, all the manual summaries are merged into a word network. In this evaluation method, the word sequence in the network closest to the evaluation word sequence is considered to be the target answer. The word accuracy of the automatic result is calculated in comparison with the target answer extracted from the network.</Paragraph>
      <Paragraph position="1"> Since summarization is processed by extracting words from an original; the words cannot be replaced by other words, and the order of words cannot be changed. Multiple manual summaries can be combined into a network that represents the variations. Each set of words that could be extracted from the network consists of words and word strings occurring at least once in all the manual summaries.</Paragraph>
      <Paragraph position="2"> The network made by the manual summaries can be considered to represent all possible variations of correct summaries.</Paragraph>
      <Paragraph position="3"> SUB The beautiful cherry blossoms in Japan bloom in spring A The cherry blossoms in Japan B cherry blossoms in Japan bloom C beautiful cherry bloom in spring D beautiful cherry blossoms in spring E The beautiful cherry blossoms bloom  The sentence &amp;quot;The beautiful cherry blossoms in Japan bloom in spring.&amp;quot; is assumed to be manually summarized as shown in Table 1. In this example, five words are extracted from the nine words. Therefore, the summarization ratio is 56%. The variations of manual summaries are merged into a word network, as shown in Fig. 1. We use &lt;s&gt; and &lt;/s&gt; as the beginning and ending symbols of a sentence. Although &amp;quot;Cherry blossoms bloom in spring&amp;quot; is not among the manual answers in Table 1, this sentence, which could be extracted from the network, is considered a correct answer.</Paragraph>
      <Paragraph position="4"> When references consisting of manual summaries cannot cover all possible answers and lack the appropriate answer for an automatic summary, SumACCY calculated using such a network is better than NrstACCY for evaluating the automatic result. This evaluation method gives a penalty for each word concatenation in the automatic results that is excluded in the network, so it can be used to evaluate the sentence-level appropriateness more precisely than matching each word in all the references. null</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Weighted SumACCY: WSumACCY
</SectionTitle>
      <Paragraph position="0"> In SumACCY, all possible sets of words extracted from the network of manually summarized sentences are equally used as target answers. However, the set of words containing word strings selected by many humans would presumably be better and give more reliable answers. To obtain reliability that reflects the majority of selections by humans, the summarization accuracy is weighted by a posterior probability based on the manual summarization network. The reliability of a sentence extracted from the network is defined as the product of the ratios of the number of subjects who selected each word to the total number of subjects. The weighted summarization accuracy is given by</Paragraph>
      <Paragraph position="2"> where ~P(v1 ::: vMjR) is the reliability score of a set of words v1 ::: vM in the manual summarization network, R, and M represents the total number of words in the target answer. The set of words ^v1 ::: ^v ^M represents the word sequence that maximizes the reliability score, ~P( jR), given by</Paragraph>
      <Paragraph position="4"> where vm is the m-th word in the sentence extracted from the network as the target answer, and C(x;yjR) indicates the number of subjects who selected the word connection of x and y. Here, &amp;quot;word connection&amp;quot; means an arc in the manual summarization network. HR is the number of subjects.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.5 Evaluation Experiments
</SectionTitle>
      <Paragraph position="0"> Newspaper articles and broadcast news speech were automatically summarized through sentence extraction and word extraction respectively under the given summarization ratio, which is the ratio of the numbers of sentences or words in the summary to that in the original.</Paragraph>
      <Paragraph position="1"> The automatic summarization results were subjectively evaluated by ten human subjects. The subjects read these summaries and rated each one from 1 (incorrect) to 5 (perfect). The automatic summaries were also evaluated by using the numerical metrics SumACCY, WSumACCY, NrstACCY, and n-gram precision (1 n 5) in comparison with reference summaries generated by humans. The precisions of 1-gram, :::, 5-gram are denoted PREC1, :::, PREC5. The numerical evaluation results were averaged over the number of automatic summaries.</Paragraph>
      <Paragraph position="2"> Note that the subjects who judged the automatic summaries did not include anyone who generated the references. To examine the similarity of the human judgments and that of the manual summaries, the kappa statistics, , was calculated using eq. (A1) in the Appendix.</Paragraph>
      <Paragraph position="3"> Finally, to examine how much the evaluation measures reflected the human judgment, the correlation coefficients between the human judgments and the numerical evaluation results were calculated.</Paragraph>
      <Paragraph position="4"> Sentence extraction Sixty articles in Japanese newspaper published in 94, 95, and 98 were automatically summarized with a 30% summarization ratio. Half the articles were general news report (NEWS), and other half were columns (EDIT).</Paragraph>
      <Paragraph position="5"> The automatic summarization was performed using a Support Vector Machine (SVM) (Hirao et al., 2003), random extraction (RDM), the lead method (LEAD) extracting sentences from the head of articles. In comparison with these automatic summaries, manual summaries (TSC) was also evaluated. null These 4 types of summaries, SVM, RDM, LEAD, and TSC were read and rated 1 to 5 by 10 humans.</Paragraph>
      <Paragraph position="6"> The summaries were evaluated in terms of extraction of significance information (SIG), coherence of sentences (COH), maintenance of original meanings (SEM), and appropriateness of summary as a whole (WHOLE).</Paragraph>
      <Paragraph position="7"> To numerically evaluate the results using the objective metrics, 20 other human subjects generated manual summaries through sentence extraction. These manual summaries were set as the target set for the automatic summaries.</Paragraph>
      <Paragraph position="8"> Word extraction Japanese TV news broadcasts aired in 1996 were automatically recognized and summarized sentence by sentence (Hori and Furui, 2003b). They consisted of 50 utterances by a female announcer. The out-of-vocabulary (OOV) rate for the 20k word vocabulary was 2.5%, and the test-set perplexity was 54.5. Fifty utterances with word recognition accuracy above 90%, which was the average rate over the 50 utterances, were selected and used for the evaluation. The summarization ratio was set to 40%. Nine automatic summaries with various summarization accuracies from 40% to 70% and a manual summary (SUB) were selected as a test set. These ten summaries for each utterance were judged in terms of the appropriateness of the summary as a whole (WHOLE).</Paragraph>
      <Paragraph position="9"> To numerically evaluate the results using the objective metrics, 25 humans generated manual summaries through word extraction. These manual summaries were set as a target set for the automatic summaries, and merged into a network. Note that a set of 24 manual summaries made by other subjects was used as the target for SUB.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.6 Evaluation Results
</SectionTitle>
      <Paragraph position="0"> Figures 2 and 3 show the correlation coefficients between the judgments of the subjects and the numerical evaluation results for EDIT and NEWS.</Paragraph>
      <Paragraph position="1"> They show that the measures based on accuracy much better reflected human judgments than those of the n-gram precisions for evaluating SIG and WHOLE for both EDIT and NEWS. On the other hand, PREC2 better reflected the human judgments for evaluating COH and SEM. These results show that measures taking into account sentence concatenations better reflected human judgments than single component precision. The precisions of longer</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>