<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0501"> <Title>Hedge Trimmer: A Parse-and-Trim Approach to Headline Generation</Title> <Section position="9" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> We conducted two evaluations. One was an informal human assessment and one was a formal automatic evaluation.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 HMM Hedge </SectionTitle> <Paragraph position="0"> We compared our current system to a statistical headline generation system we presented at the 2001 DUC Summarization Workshop (Zajic et al., 2002), which we will refer to as HMM Hedge. HMM Hedge treats the summarization problem as analogous to statistical machine translation. The verbose language, articles, is treated as the result of a concise language, headlines, being transmitted through a noisy channel.</Paragraph> <Paragraph position="1"> The result of the transmission is that extra words are added and some morphological variations occur. The Viterbi algorithm is used to calculate the most likely unseen headline to have generated the seen article. The Viterbi algorithm is biased to favor headline-like characteristics gleaned from observation of human performance of the headline-construction task. Since the 2002 Workshop, HMM Hedge has been enhanced by incorporating part of speech of information into the decoding process, rejecting headlines that do not contain a word that was used as a verb in the story, and allowing morphological variation only on words that were used as verbs in the story. HMM Hedge was trained on 700,000 news articles and headlines from the TIPSTER corpus.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Bleu: Automatic Evaluation </SectionTitle> <Paragraph position="0"> BLEU (Papineni et al, 2002) is a system for automatic evaluation of machine translation. BLEU uses a modified n-gram precision measure to compare machine translations to reference human translations. We treat summarization as a type of translation from a verbose language to a concise one, and compare automatically generated headlines to human generated headlines.</Paragraph> <Paragraph position="1"> For this evaluation we used 100 headlines created for 100 AP stories from the TIPSTER collection for August 6, 1990 as reference summarizations for those stories. These 100 stories had never been run through either system or evaluated by the authors prior to this evaluation. We also used the 2496 manual abstracts for the DUC2003 10-word summarization task as reference translations for the 624 test documents of that task. We used two variants of HMM Hedge, one which selects headline words from the first 60 words of the story, and one which selects words from the first sentence of the story. Table 1 shows the BLEU score using trigrams, and the 95% confidence interval for the score.</Paragraph> <Paragraph position="2"> These results show that although Hedge Trimmer scores slightly higher than HMM Hedge on both data sets, the results are not statistically significant. 
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle>5.3 Human Evaluation</SectionTitle>
<Paragraph position="0">Human evaluation indicates a significantly larger difference between the systems than the automatic evaluation suggests. For the 100 AP stories from the TIPSTER corpus for August 6, 1990, the output of Hedge Trimmer and HMM Hedge was evaluated by one human judge.</Paragraph>
<Paragraph position="1">Each headline was given a subjective score from 1 to 5, with 1 being the worst and 5 being the best. The average score of HMM Hedge was 3.01 with a standard deviation of 1.11; the average score of Hedge Trimmer was 3.72 with a standard deviation of 1.26. Using a t-test, the difference is significant at greater than 99.9% confidence (for these means and standard deviations with n = 100 per system, the two-sample t ≈ 4.2).</Paragraph>
<Paragraph position="2">The types of problems exhibited by the two systems are qualitatively different. The probabilistic system is more likely to produce an ungrammatical result or omit a necessary argument, as in the examples below.</Paragraph>
<Paragraph position="3">(15) HMM60: Nearly drowns in satisfactory condition satisfactory condition.</Paragraph>
<Paragraph position="4">(16) HMM60: A county jail inmate who noticed.</Paragraph>
<Paragraph position="5">In contrast, the parser-based system is more likely to fail by producing a grammatical but semantically useless headline.</Paragraph>
<Paragraph position="6">(17) HedgeTr: It may not be everyone's idea especially coming on heels.</Paragraph>
<Paragraph position="7">Finally, even when both systems produce acceptable output, Hedge Trimmer usually produces headlines that are more fluent or include more useful information.</Paragraph>
<Paragraph position="8">(18) a. HMM60: New Year's eve capsizing
b. HedgeTr: Sightseeing cruise boat capsized and sank.</Paragraph>
<Paragraph position="9">(19) a. HMM60: hundreds of Tibetan students demonstrate in Lhasa.
b. HedgeTr: Hundreds demonstrated in Lhasa demanding that Chinese authorities respect culture.</Paragraph>
</Section>
</Section>
</Paper>