<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1103"> <Title>Automatic Text Categorization using the Importance of Sentences</Title>
<Section position="3" start_page="2" end_page="2" type="metho"> <SectionTitle> 2. Empirical Evaluation </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.1 Data Sets and Experimental Settings </SectionTitle>
<Paragraph position="0"> To test our proposed system, we used two newsgroup data sets written in two different languages: English and Korean.</Paragraph>
<Paragraph position="1"> The Newsgroups data set, collected by Ken Lang, contains about 20,000 articles evenly divided among 20 UseNet discussion groups (McCallum et al., 1998). 4,000 documents (20%) were used as test data and the remaining 16,000 documents (80%) as training data; 4,000 documents from the training data were then selected as a validation set. After removing words that occur only once or appear on a stop word list, the vocabulary from the training data contains 51,018 words (no stemming was applied).</Paragraph>
<Paragraph position="2"> The second data set was gathered from Korean UseNet groups. It contains a total of 10,331 documents in 15 categories. 3,107 documents (30%) were used as test data and the remaining 7,224 documents (70%) as training data. The resulting vocabulary from the training data contains 69,793 words. As its statistics show, this data set is unevenly distributed across categories. We used statistical feature selection (Yang et al., 1997). To evaluate our method, we implemented Naive Bayes, k-NN, Rocchio, and SVM classifiers. The k in k-NN was set to 30, and α = 16 and β = 4 were used in our Rocchio classifier; these choices were based on parameter optimization over the validation set. For SVM, we used the linear model offered by SVMlight.</Paragraph>
<Paragraph position="3"> As performance measures, we followed the standard definitions of recall, precision, and the F measure. To average performance across categories, we used both the micro-averaging and macro-averaging methods.</Paragraph> </Section>
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.2 Experimental Results </SectionTitle>
<Paragraph position="0"> We tested our system through the following steps. First, using the validation set of the Newsgroups data set, we set the number of features and the constant weights (k1 and k2) for the combination of the two importance values described in Section 1.2.3. Then, using the resulting values, we conducted experiments and compared our system with a basis system: the basis system used the conventional TF, while our system used the WTF of formula (4).</Paragraph>
<Paragraph position="1"> First, we set the number of features for each classifier using the validation set of the training data. The number of features in this experiment was varied from 1,000 to 20,000 by feature selection. Figure 2 displays the performance curves for the proposed system and the basis system using SVM; we simply set both constant weights (k1 and k2) to the same value. As shown in Figure 2, the proposed system achieved better performance than the basis system over all intervals. We set the number of features for SVM to 7,000, considering the convergence of the performance curve and the running time. In the same way, the number of features for the other classifiers was set: 7,000 for Naive Bayes, 10,000 for Rocchio, and 9,000 for k-NN.</Paragraph>
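The WTF indexing of formula (4) and the importance combination of formula (3) are defined earlier in the paper and are not reproduced in this extraction. The sketch below is one plausible, simplified reading under stated assumptions: each sentence S receives an importance score k1*Sim(S,T) + k2*Cen(S), and every term occurrence is weighted by the importance of the sentence it appears in. Here Sim(S,T) is approximated by cosine similarity to the title and Cen(S) by cosine similarity to the whole-document vector; these approximations and all identifiers are illustrative, not the paper's exact definitions.

```python
from collections import Counter
import math


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weighted_term_frequencies(sentences, title_tokens, k1=1.0, k2=1.0):
    """Sketch of sentence-importance-weighted term frequency (WTF).

    Assumed reading: importance(S) = k1*Sim(S,T) + k2*Cen(S), and a term's
    WTF is its frequency accumulated with each occurrence weighted by the
    importance of the sentence it occurs in.  Sim(S,T) is taken as cosine
    similarity to the title, Cen(S) as cosine similarity to the document
    vector; both are illustrative stand-ins for the paper's formulas.
    """
    sentence_vectors = [Counter(tokens) for tokens in sentences]
    title_vector = Counter(title_tokens)
    document_vector = sum(sentence_vectors, Counter())

    wtf = Counter()
    for sentence_vector in sentence_vectors:
        importance = (k1 * cosine(sentence_vector, title_vector)
                      + k2 * cosine(sentence_vector, document_vector))
        for term, tf in sentence_vector.items():
            wtf[term] += importance * tf
    return wtf


# Toy usage: a two-sentence "document" with a title.
sentences = [["text", "categorization", "with", "weighted", "terms"],
             ["sentence", "importance", "weights", "the", "terms"]]
title = ["automatic", "text", "categorization"]
print(weighted_term_frequencies(sentences, title, k1=1.0, k2=1.0))
```

In this reading, a term that appears only in low-importance sentences contributes less to the document vector than a term of equal raw frequency concentrated in important sentences, which is the effect the indexing method aims for.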
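For reference, the standard recall, precision, and F measure, together with the micro- and macro-averaging used throughout this evaluation, can be computed as in the following minimal sketch. The per-category (tp, fp, fn) counts and the category names in the usage line are made up for illustration.

```python
def f1(precision: float, recall: float) -> float:
    """F measure: harmonic mean of precision and recall."""
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)


def micro_macro_f1(counts):
    """Micro- and macro-averaged F1 from per-category contingency counts.

    `counts` maps a category name to a (tp, fp, fn) tuple.  Micro-averaging
    pools the counts over all categories before computing one precision and
    recall; macro-averaging computes F1 per category and averages the scores.
    """
    per_category = []
    for tp, fp, fn in counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        per_category.append(f1(p, r))
    macro = sum(per_category) / len(per_category)

    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    micro = f1(p, r)
    return micro, macro


# Toy usage with made-up counts for three categories.
print(micro_macro_f1({"comp": (80, 10, 20), "sci": (60, 30, 15), "rec": (90, 5, 10)}))
```

Micro-averaging favors large categories (every decision counts equally), whereas macro-averaging weights every category equally, which matters for the uneven Korean data set.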
<Paragraph position="2"> Note that, over all intervals and all classifiers, the performance of the proposed system was better than that of the basis system. Before the experiment for setting the constant weights, we evaluated the two importance measures and their combination individually; we simply used the same value for k1 and k2 in the combination method (formula (3)). We observed the results at each interval as the constant weights were varied from 0.0 to 3.0. In Figure 3, Sim(S,T) denotes the method using the title, Cen(S) the method using the importance of terms, and Sim(S,T)+Cen(S) their combination. In this experiment, we used SVM as the classifier and set the number of features to 7,000. The combination method mostly produced the best performance.</Paragraph>
<Paragraph position="3"> In order to set the constant weights (k1 and k2), we selected the best-performing pair for each classifier (SVM included) on the validation set; the selected pairs were 1.9 and 3.0 for Naive Bayes, 2.0 and 0.0 for Rocchio, and 0.8 and 2.8 for k-NN. These constant weights were used for each classifier in the following experiments.</Paragraph>
<Paragraph position="4"> In this section, we report results on the two newsgroup data sets using the parameters determined in the experiments above. On both data sets, the proposed system produced better performance with all classifiers. These results indicate that the proposed system is useful for all four classifiers and for both languages.</Paragraph> </Section> </Section>
<Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3. Discussions </SectionTitle>
<Paragraph position="0"> Salton et al. (1975) stated that a collection of small, tightly clustered documents with wide separation between individual clusters should produce the best performance. Hence we employed the method of Salton et al. (1975) to verify our approach, conducting experiments on the English newsgroup data (the Newsgroups data set) and observing the resulting values.</Paragraph>
<Paragraph position="1"> We define the cohesion within a category and the cohesion between categories. The cohesion within a category measures the similarity between documents in the same category, while the cohesion between categories measures the similarity between the categories themselves. The former is calculated by formula (6) and the latter by formula (7). An indexing method with high cohesion within a category and low cohesion between categories should produce better performance in text categorization. First, we measured the cohesion within a category for each indexing method: the basis method using the conventional TF value, the method using the title (Sim(S,T)), the method using the importance of terms (Cen(S)), and the combination method (Sim(S,T)+Cen(S)). Figure 4 shows the resulting curves for each constant weight; we simply used the same value for k1 and k2. As shown in Figure 4, Cen(S) gives the highest cohesion values, while Sim(S,T) has little effect on the cohesion compared with the method using the conventional TF value.</Paragraph>
<Paragraph position="2"> Figure 5 displays the resulting curves of the cohesion between categories, in the same manner as Figure 4. We obtained the lowest cohesion values with Sim(S,T). Using Cen(S), the resulting cohesion values are slightly higher than those of the method using the conventional TF value.</Paragraph>
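Formulas (6) and (7) are referenced above but not reproduced in this extraction. The following is a minimal sketch of cohesion measures in the spirit of Salton et al. (1975), assuming that cohesion within a category is the average similarity of its documents to the category centroid, and that cohesion between categories is the average pairwise similarity between category centroids. These assumed definitions and all identifiers are illustrative, not the paper's exact formulas.

```python
from collections import Counter
from itertools import combinations
import math


def cosine(a, b):
    """Cosine similarity between two (possibly fractional) term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Mean term vector of a set of document vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {term: weight / len(vectors) for term, weight in total.items()}


def cohesion_within(category_docs):
    """Average similarity of each document to its own category centroid."""
    c = centroid(category_docs)
    return sum(cosine(d, c) for d in category_docs) / len(category_docs)


def cohesion_between(categories):
    """Average pairwise similarity between category centroids."""
    centroids = [centroid(docs) for docs in categories.values()]
    pairs = list(combinations(centroids, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy usage: two tiny categories of bag-of-words documents.
cats = {
    "sports": [Counter(["goal", "match", "team"]), Counter(["team", "score"])],
    "politics": [Counter(["vote", "party"]), Counter(["party", "election"])],
}
print(cohesion_within(cats["sports"]), cohesion_between(cats))
```

Under these assumed definitions, an indexing scheme that raises cohesion_within while lowering cohesion_between corresponds to the tight, well-separated clusters described above.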
<Paragraph position="3"> In both Figure 4 and Figure 5, the cohesion values of the combination method lie between those of Sim(S,T) and Cen(S).</Paragraph>
<Paragraph position="4"> From the results in Figure 4 and Figure 5, we can observe that the proposed indexing method reshapes the vector space toward better performance: high cohesion within a category and low cohesion between categories. Using the proposed indexing method, the document vectors within a category are located closer together and individual categories are separated more widely. These effects were also observed in our experiments. Owing to the properties of each classifier, k-NN benefits from a vector space with high cohesion within a category, while Rocchio benefits from a vector space with low cohesion between categories. We obtained corresponding results in our experiments: k-NN produced better performance with Cen(S), and Rocchio produced better performance with Sim(S,T). Table 4 summarizes the results of each individual method for k-NN and Rocchio.</Paragraph>
<Paragraph position="5"> In this paper, we have presented a new indexing method for text categorization that uses two kinds of text summarization techniques: one uses the title and the other uses the importance of terms. For our experiments, we used newsgroup data sets in two different languages and four kinds of classifiers, and achieved better performance than the basis system with all classifiers and in both languages. We then verified the effect of the proposed indexing method by measuring the two kinds of cohesion, and confirmed that it can reshape the document vector space for better performance in text categorization. As future work, we need additional research on applying more of the structural information of documents to text categorization and on testing the proposed method on other types of texts, such as newspapers with a fixed form.</Paragraph> </Section> </Paper>