<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1057">
<Title>A Formal Model for Information Selection in Multi-Sentence Text Extraction</Title>
<Section position="6" start_page="0" end_page="0" type="evalu">
<SectionTitle> 5 Experiments </SectionTitle>
<Paragraph position="0"> To empirically establish the effectiveness of the presented model, we ran experiments comparing evaluation scores on summaries obtained with a baseline algorithm that does not account for redundancy of information and with the two variants of greedy algorithms described in Section 4. We chose summarization as the evaluation task because &quot;ideal&quot; output (prepared by humans) and methods for scoring arbitrary system output were available for this task, but not for evaluating long answers to questions.</Paragraph>
<Paragraph position="1"> Data We chose as our input data the document sets used in the evaluation of multi-document summarization during the Document Understanding Conference (DUC), organized by NIST in 2001 (Harman and Voorhees, 2001). This collection contains 30 test document sets, each containing approximately 10 news stories on different events; document sets vary significantly in their internal coherence. For each document set 12 human-constructed summaries are provided, 3 for each of the target lengths of 50, 100, 200, and 400 words. We selected DUC 2001 because, unlike later DUCs, ideal summaries are available for multiple lengths. We consider sentences as our textual units.</Paragraph>
<Paragraph position="2"> Features In our experiments we used two sets of features (i.e., conceptual units). First, we chose a fairly basic and widely used set of lexical features, namely the list of words present in each input text. We set the weight of each feature to its tf*idf value, taking idf values from http://elib.cs.berkeley.edu/docfreq/.</Paragraph>
<Paragraph position="4"> Our alternative set of conceptual units was the list of weighted atomic events extracted from the input texts. An atomic event is a triplet consisting of two named entities extracted from a sentence and a connector, expressed by a verb or an event-related noun, that appears between these two named entities.</Paragraph>
<Paragraph position="5"> The score of an atomic event depends on the frequency of the named entity pair for the input text and the frequency of the connector for that named entity pair. Filatova and Hatzivassiloglou (2003) define the procedure for extracting atomic events in detail, and show that these triplets capture the most important relations connecting the major constituent parts of events, such as locations, dates, and participants. Our hypothesis is that using these events as conceptual units would provide a reasonable basis for summarizing texts that are supposed to describe one or more events.</Paragraph>
<Paragraph position="6"> Evaluation Metric Given the difficulties in coming up with a universally accepted evaluation measure for summarization, and the fact that judgments by humans are time-consuming and labor-intensive, we adopted an automated process for comparing system-produced summaries to the ideal summaries written by humans. The ROUGE method (Lin and Hovy, 2003) is based on n-gram overlap between the system-produced and ideal summaries. As such, it is a recall-based measure, and it requires that the length of the summaries be controlled in order to allow for meaningful comparisons.
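For concreteness, the unigram form of this recall computation against a single reference can be sketched as follows (a minimal illustration of clipped unigram recall as we understand it, not the official ROUGE toolkit; the whitespace tokenization and the function name are our own simplifications):

```python
from collections import Counter

def rouge_1_recall(candidate: str, reference: str) -> float:
    """Clipped unigram recall of a candidate summary against one reference summary."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Each reference unigram is matched at most as many times as it occurs
    # in the candidate (clipping), then the overlap is normalized by the
    # total number of unigrams in the reference.
    overlap = sum(min(count, cand_counts[token])
                  for token, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)
```

Because the measure is recall-oriented, a longer candidate can only match more reference unigrams, which is why summary length must be held fixed when comparing systems.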
Although ROUGE is only a proxy measure of summary quality, it offers the advantage that it can be readily applied to compare the performance of different systems on the same set of documents, assuming that ideal summaries are available for those documents.</Paragraph>
<Paragraph position="7"> Baseline Our baseline method does not consider the overlap in information content between selected textual units. Instead, we fix the score of each sentence as the sum of its tf*idf values or atomic event scores. At every step we choose the remaining sentence with the largest score, until the stopping criterion for summary length is satisfied.</Paragraph>
<Paragraph position="8"> Results For every version of our baseline and approximation algorithms, and separately for the tf*idf-weighted words and event features, we get a sorted list of sentences extracted according to a particular algorithm. Then, for each DUC document set we create four summaries, one for each suggested length, by extracting respectively the first 50, 100, 200, and 400 words from the top sentences.</Paragraph>
<Paragraph position="9"> To evaluate the performance of our summarizers, we compare their outputs against the human models of the corresponding length provided by DUC, using the ROUGE unigram scores. Since scores are not comparable across different document sets, instead of average scores we report the number of document sets for which one algorithm outperforms another. We compare each of our approximation algorithms (adaptive and modified greedy) to the baseline.</Paragraph>
<Paragraph position="10"> Table 1 shows the number of data sets for which the adaptive greedy algorithm outperforms our baseline. This implementation of our information packing model improves the ROUGE scores in most cases when events are used as features, while the opposite is true when tf*idf provides the conceptual units. This may be partly explained by the nature of the tf*idf-weighted word features: it is possible that important words cannot be considered independently, and that the repetition of important words in a later sentence does not necessarily mean that the sentence offers no new information. Thus words may not provide independent enough features for our approach to work.</Paragraph>
<Paragraph position="12"> Table 2 compares our modified greedy algorithm to the baseline. In this case, the model offers gains in performance when both events and words are used as features, and in fact the gains are most pronounced with the word features. For both algorithms, the gains are generally minimal for 50-word summaries and most pronounced for the longest, 400-word summaries. This validates our approach, as the information packing model has a limited opportunity to alter the set of selected sentences when those sentences are very few (often one or two for the shortest summaries).</Paragraph>
<Paragraph position="13"> It is worth noting that in direct comparisons between the adaptive and modified greedy algorithms we found the latter to outperform the former. We also found events to lead to better performance than tf*idf-weighted words, with statistically significant differences. Events tend to be a particularly good representation for document sets with well-defined constituent parts (such as specific participants) that cluster around a narrow event.
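To make this event representation more concrete, the triplet weighting described under Features above can be sketched roughly as follows (our simplified reading of that description; the actual extraction procedure, including named entity recognition and connector identification, is defined in Filatova and Hatzivassiloglou (2003), and the names and normalization used here are ours):

```python
from collections import Counter
from typing import Dict, List, Tuple

# An atomic event is a triplet (named entity 1, connector, named entity 2),
# where the connector is a verb or event-related noun that appears between
# the two named entities in a sentence.
Triplet = Tuple[str, str, str]

def score_atomic_events(triplets: List[Triplet]) -> Dict[Triplet, float]:
    """Weight each distinct triplet by the relative frequency of its named
    entity pair in the input text and the relative frequency of its
    connector for that pair (a simplification of the scoring in the paper)."""
    if not triplets:
        return {}
    pair_counts = Counter((ne1, ne2) for ne1, _, ne2 in triplets)
    conn_counts = Counter(((ne1, ne2), conn) for ne1, conn, ne2 in triplets)
    total_pairs = sum(pair_counts.values())

    scores = {}
    for ne1, conn, ne2 in set(triplets):
        pair = (ne1, ne2)
        pair_weight = pair_counts[pair] / total_pairs          # salience of the entity pair
        conn_weight = conn_counts[(pair, conn)] / pair_counts[pair]  # typicality of the connector for the pair
        scores[(ne1, conn, ne2)] = pair_weight * conn_weight
    return scores
```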
Events not only give us higher absolute performance than plain words, but also lead to a more pronounced improvement when our model is employed. A more detailed analysis of the above experiments, together with a discussion of the advantages and disadvantages of our evaluation scheme, can be found in (Filatova and Hatzivassiloglou, 2004).</Paragraph> </Section> </Paper>