<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3256"> <Title>Multi-document Biography Summarization</Title> <Section position="8" start_page="0" end_page="0" type="evalu"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Overview </SectionTitle> <Paragraph position="0"> Extrinsic and intrinsic evaluations are the two classes of text summarization evaluation methods (Sparck Jones and Galliers, 1996). Measuring content coverage, or summary informativeness, is an approach commonly used for intrinsic evaluation.</Paragraph> <Paragraph position="1"> It measures how much of the source content was preserved in the summary.</Paragraph> <Paragraph position="2"> A complete evaluation should include evaluations of the accuracy of the components involved in the summarization process (Schiffman et al., 2001). The performance of the sentence classifier was shown in Section 4. Here we show the performance of the resulting summaries.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Coverage Evaluation </SectionTitle> <Paragraph position="0"> An intrinsic evaluation of biography summaries was recently conducted under the guidance of the Document Understanding Conference (DUC2004) using the automatic summarization evaluation tool ROUGE (Recall-Oriented Understudy for Gisting Evaluation) by Lin and Hovy (2003). 50 TREC English document clusters, each containing on average 10 news articles, were the input to the system. Summary length was restricted to 665 bytes; longer summaries were truncated by brute force.</Paragraph> <Paragraph position="1"> The ROUGE-L metric is based on Longest Common Subsequence (LCS) overlap (Saggion et al., 2002). Figure 2 shows that our system (86) performs at a level equivalent to the best systems, 9 and 10; that is, both lie within our system's 95% upper confidence interval. 
The 2-class classification module was used in generating the answers. The figure also shows the performance data evaluated with lower and upper confidence bounds set at 95%. The performance data are from the official DUC results.</Paragraph> <Paragraph position="2"> Figure 3 shows the performance of our system 86, using 10-class sentence classification, compared to other systems from DUC by replicating the official evaluation process. Only system 9 performs slightly better, its score being higher than our system's 95% upper confidence interval.</Paragraph> <Paragraph position="3"> A baseline system (5) that takes the first 665 bytes of the most recent text from the set as the resulting biography was also evaluated among the peer systems. Clearly, humans still perform at a level much superior to that of any system.</Paragraph> <Paragraph position="4"> Measuring fluency and coherence is also important in reflecting the true quality of machine-generated summaries. There is currently no automated tool for this purpose. We plan to incorporate one in the future development of this work.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.3 Discussion </SectionTitle> <Paragraph position="0"> N-gram recall scores are computed by ROUGE, in addition to the ROUGE-L scores shown here. While cosine similarity and unigram and bigram overlap provide a sufficient measure of content coverage, they are not sensitive to how information is sequenced in the text (Saggion et al., 2002). In evaluating and analyzing MDS results, metrics such as ROUGE-L, which consider linguistic sequence, are essential.</Paragraph> <Paragraph position="1"> Radev and McKeown (1998) point out that when summarizing interesting news events from multiple sources, one can expect reports with contradictory and redundant information. An intelligent summarizer should attain as much information as possible, combine it, and present it in the most concise form to the user. 
When we look at the different attributes of a person's life reported in news articles, a person is described by the job positions that he/she has held, by the educational institutions that he/she has attended, and so on. Those data are confirmed biographical information and do not necessarily carry the contradictions associated with evolving news stories. However, we do feel the need to address and resolve discrepancies if we are to create comprehensive and detailed biographies of people in the news, since miscellaneous personal facts are often overlooked and told in conflicting reports. Misrepresented biographical information may well become controversial and may never be clarified. The scandal element from our corpus study (Section 3) is sufficient to identify information of this disputed kind. (In Figure 3, 86 is our system with 10-class biography classification; the baseline is system 5.)</Paragraph> <Paragraph position="2"> Extraction-based MDS summarizers, such as this one, present the inherent problem of lacking discourse-level fluency. While sentence ordering for single-document summarization can be determined from the ordering of sentences in the input article, sentences extracted by an MDS system may come from different articles and thus need an ordering strategy to produce a fluent surface summary (Barzilay et al., 2002). Previous summarization systems have used temporal sequence as the guideline for ordering. This is especially appropriate in generating biographies, where a person is represented by a sequence of events that occurred in his/her life. Barzilay et al. also introduced a combined method with an alternative strategy that approximates the information relatedness across the input texts. We plan to use a fixed-form structure for the majority of answer construction, fitted for biographies only. This will be a top-down ordering strategy, contrary to the bottom-up algorithm presented by Barzilay et al.</Paragraph> </Section> </Section> </Paper>
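Section 6.2 scores summaries with ROUGE-L, which measures Longest Common Subsequence overlap between a candidate summary and a reference. The sketch below is an illustrative reimplementation of the metric's core idea, not the official ROUGE toolkit: the dynamic-programming LCS length yields recall, precision, and an F-measure, and, unlike unigram/bigram overlap, it rewards content that appears in the same order as in the reference.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L recall, precision, and F-measure over whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    recall, precision = lcs / len(ref), lcs / len(cand)
    if recall + precision == 0:
        return 0.0, 0.0, 0.0
    f = (1 + beta**2) * recall * precision / (recall + beta**2 * precision)
    return recall, precision, f

# The two sentences share all seven unigrams, yet the LCS is only 5 tokens
# long because "1950" and "ohio" are transposed -- sequence matters.
r, p, f = rouge_l("he was born in 1950 in ohio",
                  "he was born in ohio in 1950")
print(round(r, 3), round(p, 3), round(f, 3))  # → 0.714 0.714 0.714
```

Note how a pure unigram-recall metric would give this pair a perfect score, which is exactly the insensitivity to sequencing that Section 6.3 attributes to cosine similarity and n-gram overlap.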
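Section 6.3 notes that sentences extracted from different articles need an ordering strategy, and that prior systems used temporal sequence. A minimal sketch of that strategy, under the assumption that article publication date plus within-article position serve as a proxy for event time (the `Extracted` record and field names are hypothetical, not from the paper's system):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Extracted:
    text: str
    pub_date: date  # publication date of the source article (proxy for event time)
    position: int   # sentence index within that article (tie-breaker)

def order_temporally(sentences):
    """Order cross-article extracts by (article date, position in article)."""
    return sorted(sentences, key=lambda s: (s.pub_date, s.position))

summary = order_temporally([
    Extracted("He retired in 2003.", date(2003, 6, 1), 0),
    Extracted("He was appointed CEO.", date(1998, 2, 10), 3),
    Extracted("He was born in Ohio.", date(1998, 2, 10), 1),
])
print(" ".join(s.text for s in summary))
# → He was born in Ohio. He was appointed CEO. He retired in 2003.
```

This is the bottom-up flavor of ordering; the fixed-form, biography-specific structure the authors propose would instead slot extracts into predetermined sections top-down.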