File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-1204_metho.xml
Size: 11,950 bytes
Last Modified: 2025-10-06 14:08:36
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1204"> <Title>Evaluation of Features for Sentence Extraction on Different Types of Corpora</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Summarization data </SectionTitle>
<Paragraph position="0"> The summarization data we used for this research were prepared from Japanese newspaper articles, Japanese lectures, and English newspaper articles.</Paragraph>
<Paragraph position="1"> By using these three types of data, we could compare two languages and also two different types of corpora, a written corpus and a speech corpus.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Summarization data from Japanese newspaper articles </SectionTitle>
<Paragraph position="0"> Text Summarization Challenge (TSC) is an evaluation workshop for automatic summarization, which is run by the National Institute of Informatics in Japan (TSC, 2001). Three tasks were presented at TSC-2001: extracting important sentences, creating summaries to be compared with summaries prepared by humans, and creating summaries for information retrieval. We focus on the first task here, i.e., the sentence extraction task. At TSC-2001, a dry run and a formal run were performed. The dry run data consisted of 30 newspaper articles and manually created summaries of each. The formal run data consisted of another 30 pairs of articles and summaries. The average number of sentences per article was 28.5 (1709 sentences / 60 articles). Each data set contained 15 editorials and 15 news reports. The summaries were created from extracted sentences at three compression ratios (10%, 30%, and 50%). In our analysis, we used the extraction data for the 10% compression ratio.</Paragraph>
<Paragraph position="1"> In the following sections, we call these summarization data the &quot;TSC data&quot;. We use the TSC data as an example of a Japanese written corpus to evaluate the performance of sentence extraction.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Summarization data from Japanese lectures </SectionTitle>
<Paragraph position="0"> The speech corpus we used for this experiment is part of the Corpus of Spontaneous Japanese (CSJ) (Maekawa et al., 2000), which is being created by NIJLA, TITech, and CRL as an ongoing joint project. The CSJ is a large collection of monologues, such as lectures, and it includes transcriptions of each speech as well as the voice data. We selected 60 transcriptions from the CSJ for both sentence segmentation and sentence extraction. Since these transcription data do not have sentence boundaries, sentence segmentation is necessary before sentence extraction. Three annotators manually generated sentence segmentation and summarization results. The target compression ratio was set to 10%.</Paragraph>
<Paragraph position="1"> The results of sentence segmentation were unified to form the key data, and the average number of sentences per speech was 68.7 (4123 sentences / 60 speeches).</Paragraph>
<Paragraph position="2"> The results of sentence extraction, however, were not unified; each annotator's extraction results were used separately for evaluation.</Paragraph>
<Paragraph position="3"> In the following sections, we call these summarization data the &quot;CSJ data&quot;.
We use the CSJ data as an example of a Japanese speech corpus to evaluate the performance of sentence extraction.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Summarization data from English newspaper articles </SectionTitle>
<Paragraph position="0"> Document Understanding Conference (DUC) is an evaluation workshop in the U.S. for automatic summarization, which is sponsored by TIDES of the DARPA program and run by NIST (DUC, 2001).</Paragraph>
<Paragraph position="1"> At DUC-2001, there were two types of tasks: single-document summarization (SDS) and multi-document summarization (MDS). The organizers of DUC-2001 provided 30 sets of documents for a dry run and another 30 sets for a formal run. These data were shared by both the SDS and MDS tasks, and the average number of sentences per article was 42.5 (25779 sentences / 607 articles). Each document set had a topic, such as &quot;Hurricane Andrew&quot; or &quot;Police Misconduct&quot;, and contained around 10 documents relevant to the topic. We focus on the SDS task here, for which the size of each summary output was set to 100 words. Model summaries for the articles were also created by hand and provided. Since these summaries were abstracts, we created sentence extraction data from the abstracts by word-based comparison. In the following sections, we call these summarization data the &quot;DUC data&quot;. We use the DUC data as an example of an English written corpus to evaluate the performance of sentence extraction.</Paragraph> </Section> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Overview of our sentence extraction system </SectionTitle>
<Paragraph position="0"> In this section, we give an overview of our sentence extraction system, which uses multiple components. For each sentence, each component outputs a score. The system then combines these independent scores by interpolation. Some components have more than one scoring function, using various features. The weights and function types used are decided by optimizing the performance of the system on training data.</Paragraph>
<Paragraph position="1"> Our system includes parts that are either common to the TSC, CSJ, and DUC data or specific to one of these data sets. We indicate below which parts are specific.</Paragraph>
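To make this architecture concrete, the following Python sketch shows one way the interpolation and weight tuning could be organized: each feature component maps a sentence to a score, the scores are combined linearly with per-component weights, and the weights are chosen by trying values within manually set ranges on training data. All identifiers, the weight grid, the document fields, and the evaluation callback are illustrative assumptions, not the actual implementation.

```python
# Illustrative sketch (not the authors' code): per-feature scores are combined by
# weighted interpolation, and the weights are tuned by grid search on training data.
from itertools import product
from typing import Callable, Dict, List, Sequence

# A scoring component maps (sentence index, sentence text, document) to a score.
ScoreFn = Callable[[int, str, dict], float]

def total_score(i: int, sent: str, doc: dict,
                components: Dict[str, ScoreFn],
                weights: Dict[str, float]) -> float:
    """Interpolate the independent component scores with per-component weights."""
    return sum(weights[name] * fn(i, sent, doc) for name, fn in components.items())

def extract(doc: dict, components: Dict[str, ScoreFn],
            weights: Dict[str, float], n: int) -> List[int]:
    """Return the indices of the n highest-scoring sentences of the document."""
    scored = [(total_score(i, s, doc, components, weights), i)
              for i, s in enumerate(doc["sentences"])]
    return [i for _, i in sorted(scored, reverse=True)[:n]]

def tune_weights(train_docs: Sequence[dict], components: Dict[str, ScoreFn],
                 weight_grid: Dict[str, Sequence[float]],
                 evaluate: Callable[[dict, List[int]], float]) -> Dict[str, float]:
    """Try every weight setting within manually chosen ranges and keep the best one."""
    names = list(weight_grid)
    best_weights, best_score = {}, float("-inf")
    for values in product(*(weight_grid[name] for name in names)):
        weights = dict(zip(names, values))
        score = sum(evaluate(doc, extract(doc, components, weights, doc["summary_size"]))
                    for doc in train_docs)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights
```

The same search can also cover the choice among alternative scoring functions for a feature (see Section 3.2) by treating that choice as an extra discrete dimension of the grid.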
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Features for sentence extraction 3.1.1 Sentence position </SectionTitle>
<Paragraph position="0"> We implemented three functions for sentence position. The first function returns 1 if the position of the sentence is within a given threshold N from the beginning of the document, and 0 otherwise.</Paragraph>
<Paragraph position="2"> The threshold N is determined by the number of words in the summary.</Paragraph>
<Paragraph position="3"> The second function is the reciprocal of the position of the sentence, i.e., the score is highest for the first sentence, gradually decreases, and reaches its minimum at the final sentence.</Paragraph>
<Paragraph position="5"> These first two functions are based on the hypothesis that the sentences at the beginning of an article are more important than those in the remaining part.</Paragraph>
<Paragraph position="6"> The third function is the maximum of the reciprocals of the position counted from the beginning and from the end of the document.</Paragraph>
<Paragraph position="8"> This method is based on the hypothesis that the sentences at both the beginning and the end of an article are more important than those in the middle.</Paragraph>
<Paragraph position="9"> The second type of scoring function uses sentence length to determine the significance of sentences. We implemented three scoring functions for sentence length. The first function simply returns the length of each sentence (Li).</Paragraph>
<Paragraph position="11"> The second function returns a negative value as a penalty when the sentence is shorter than a certain length (C).</Paragraph>
<Paragraph position="13"> The third function combines the above two approaches, i.e., it returns the length of a sentence that has at least a certain length, and otherwise returns a negative value as a penalty.</Paragraph>
<Paragraph position="15"> For the TSC and CSJ data, the length of a sentence is the number of characters, and based on the results of an experiment with the training data, we set C to 20. For the DUC data, the length of a sentence is the number of words, and we set C to 10 during the training stage.</Paragraph>
<Paragraph position="16"> The third type of scoring function is based on term frequency (tf) and document frequency (df). We applied three scoring functions for tf*idf, in which the term frequencies are calculated differently. The first function uses the raw term frequencies, while the other two normalize the frequencies in two different ways; in these definitions, DN is the number of documents given.</Paragraph>
<Paragraph position="18"> For the TSC and CSJ data, we only used the third method (T3), which was reported to be effective for the task of information retrieval (Robertson and Walker, 1994). The target words for these functions are nouns (excluding temporal or adverbial nouns).</Paragraph>
<Paragraph position="19"> For each of the nouns in a sentence, the system calculates a tf*idf score; the sum of these scores is the significance of the sentence. The word segmentation was generated by Juman3.61 (Kurohashi and Nagao, 1999). We used articles from the Mainichi newspaper in 1994 and 1995 to count document frequencies. For the DUC data, the raw term frequency (T1) was selected during the training stage from among the three tf*idf definitions. A list of stop words was used to exclude functional words, and articles from the Wall Street Journal in 1994 and 1995 were used to count document frequencies.</Paragraph> </Section> </Section>
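A minimal Python sketch of the position, length, and raw tf*idf scores described above is given below. It assumes 0-based sentence indices, a fixed penalty value, and a standard log(DN/df) form of idf; these details are not fixed by the text above and are therefore assumptions, and the normalized term-frequency variants (T2, T3) are omitted because their definitions are not reproduced here.

```python
# Illustrative sketch (assumptions, not the exact formulas) of the position,
# length, and raw tf*idf scoring functions described above.
import math
from typing import Dict, List

# --- Sentence position (three variants); i is a 0-based index, n the number of sentences ---
def pos_threshold(i: int, n: int, N: int) -> float:
    """1 if the sentence is within the first N sentences, 0 otherwise."""
    return 1.0 if i < N else 0.0

def pos_reciprocal(i: int, n: int) -> float:
    """Highest for the first sentence, decreasing toward the last."""
    return 1.0 / (i + 1)

def pos_both_ends(i: int, n: int) -> float:
    """Favors sentences near either the beginning or the end of the document."""
    return max(1.0 / (i + 1), 1.0 / (n - i))

# --- Sentence length (three variants); length is characters for TSC/CSJ (C=20),
# --- words for DUC (C=10); the penalty value itself is an assumption ---
def len_raw(length: int) -> float:
    return float(length)

def len_penalty(length: int, C: int, penalty: float = -1.0) -> float:
    """Negative score for sentences shorter than C, zero otherwise."""
    return penalty if length < C else 0.0

def len_combined(length: int, C: int, penalty: float = -1.0) -> float:
    """Length for sufficiently long sentences, penalty otherwise."""
    return float(length) if length >= C else penalty

# --- Raw tf*idf (the T1 variant); target terms are nouns (TSC/CSJ) or
# --- non-stop-words (DUC), with df counted on a reference newspaper corpus ---
def tfidf_score(sent_terms: List[str], tf: Dict[str, int],
                df: Dict[str, int], DN: int) -> float:
    """Sum of tf*idf over the target terms of a sentence; DN is the reference corpus size."""
    return sum(tf.get(w, 0) * math.log(DN / df[w])
               for w in sent_terms if df.get(w, 0) > 0)
```

The normalized variants T2 and T3, as well as the named-entity and headline-based scores of the following subsection, would plug into the same per-sentence interface.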
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 3.1.4 Headline </SectionTitle>
<Paragraph position="0"> We used a similarity measure between the sentence and the headline as another type of scoring function. The basic idea is that the more words a sentence shares with the headline, the more important the sentence is. The function estimates the relevance between a headline (H) and a sentence (Si) by using the tf*idf values of the words (w) in the headline.</Paragraph>
<Paragraph position="2"> We also evaluated another method based on this scoring function by using only named entities (NEs) instead of words for the TSC data and DUC data.</Paragraph>
<Paragraph position="3"> Only the term frequency was used for NEs, because we judged that the document frequency of an entity is usually quite small, making the differences between entities negligible.</Paragraph>
<Paragraph position="4"> For the DUC data, we used dependency patterns as a type of scoring function. These patterns were extracted by pattern discovery during information extraction (Sudo et al., 2001). We do not explain the details of this approach here, because this feature is not among the features we analyze in Section 5. The definition of the function appears in (Nobata et al., 2002).</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Optimal weight </SectionTitle>
<Paragraph position="0"> Our system sets a weight for each scoring function in order to calculate the total score of a sentence. The total score of a sentence (Si) is defined as a weighted combination of the scoring functions (Scorej()) with weights (fij).</Paragraph>
<Paragraph position="2"> We estimated the optimal values of these weights from the training data. After the range of each weight was set manually, the system varied the weights within their ranges and summarized the training data for each weight setting. The evaluation score for each setting was recorded, and the weights with the best scores were stored.</Paragraph>
<Paragraph position="3"> A particular scoring method was also selected for features that have more than one defined scoring method. We used the dry run data from each workshop as TSC and DUC training data. For the TSC data, since the 30 articles contained 15 editorials and 15 news reports, we estimated optimal values separately for editorials and news reports. For the CSJ data, we used 50 transcriptions for training and 10 for testing, as mentioned in Section 2.2.</Paragraph> </Section> </Section> </Paper>