<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1037">
<Title>A Web-Trained Extraction Summarization System</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> The task of an extraction-based text summarizer is to select from a text the most important sentences, which together amount to a small percentage of the original yet remain nearly as informative as the full text (Kupiec et al., 1995).</Paragraph>
<Paragraph position="1"> Typically, trainable summarization systems characterize each sentence according to a set of predefined features and then learn from training material which feature combinations are indicative of good extract sentences.</Paragraph>
<Paragraph position="2"> To learn the characteristics of indicative summarizing sentences, a sufficiently large collection of (summary, text) pairs must be provided to the system.</Paragraph>
<Paragraph position="3"> Research in automated text summarization is constantly hampered by the difficulty of finding or constructing large collections of (extract, text) pairs.</Paragraph>
<Paragraph position="4"> Usually, (abstract, text) pairs are available and can be obtained easily (though not in sufficient quantity to support fully automated learning for large domains). But abstract sentences are seldom identical to sentences in the source text, which makes direct comparison difficult.</Paragraph>
<Paragraph position="5"> Therefore, algorithms have been introduced to generate (extract, text) pairs from (abstract, text) inputs (Marcu, 1999).</Paragraph>
<Paragraph position="6"> The explosion of the World Wide Web has made billions of documents and newspaper articles accessible.</Paragraph>
<Paragraph position="7"> If one could automatically find short forms of longer documents, one could build large training sets over time.
However, one cannot today directly retrieve short and long texts on the same topic.</Paragraph>
<Paragraph position="8"> News published on the Internet is an exception.</Paragraph>
<Paragraph position="9"> Although it is not ideally organized, the topic orientation and temporal nature of news make it possible to impose an organization and thereby obtain a training corpus on the same topic. We hypothesize that weekly articles are sophisticated summaries of daily ones, and that monthly articles are summaries of weekly ones, as shown in Figure 1. Under this hypothesis, how accurate an extract summarizer can one train? In this paper we first describe the corpus reorganization, then the training data formulation and the system (Section 3), the system evaluation (Section 4), and finally future work (Section 5).</Paragraph>
</Section>
</Paper>