<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1103"> <Title>Automatic Text Categorization using the Importance of Sentences</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> Automatic text categorization is the problem of automatically assigning text documents to predefined categories. In order to classify text documents, we must extract good features from them. In previous research, a text document is commonly represented by the term frequency and the inverse document frequency of each feature. Since the sentences in a document differ in importance, features from more important sentences should be given more weight than other features. In this paper, we measure the importance of sentences using text summarization techniques. A document is then represented as a vector of features whose weights differ according to the importance of each sentence. To verify our new method, we conducted experiments on newsgroup data sets in two languages: one written in English and the other in Korean. Four kinds of classifiers were used in our experiments: Naive Bayes, Rocchio, k-NN, and SVM. We observed that our new method yielded a significant improvement for all classifiers on both data sets.</Paragraph> <Paragraph position="1"> Introduction The goal of text categorization is to classify documents into a certain number of pre-defined categories. Text categorization is an active research area in information retrieval and machine learning. A wide range of supervised learning algorithms has been applied to this problem using a training data set of categorized documents.
For example, these include Naive Bayes (McCallum et al., 1998; Ko et al., 2000), Rocchio (Lewis et al., 1996), Nearest Neighbor (Yang et al., 2002), and Support Vector Machines (Joachims, 1998).</Paragraph> <Paragraph position="2"> A text categorization task consists of a training phase and a text classification phase. The former includes the feature extraction process and the indexing process. The vector space model has been used as the conventional method for text representation (Salton et al., 1983). This model represents a document as a vector of features using Term Frequency (TF) and Inverse Document Frequency (IDF). It simply counts TF without considering where the term occurs. However, the sentences in a document differ in importance for identifying its content. Thus, by assigning each term a weight according to the importance of the sentence in which it occurs, we can achieve better results. Several techniques have been studied for this problem. First, term weights were adjusted according to the location of a term, so that the structural information of a document was applied to term weights (Murata et al., 2000). But this method assumes that only a few sentences, located at the beginning or the end of a document, carry the important meaning. Hence it can be applied only to documents with a fixed form, such as articles. The next technique used the title of a document to choose the important terms (Mock et al., 1996). Terms in the title were treated as important. A drawback of this method is that titles that do not convey the meaning of the document well can instead increase the ambiguity of the meaning.
This case often arises in documents with an informal style, such as newsgroup articles and e-mail. To overcome these problems, we have studied text summarization techniques with great interest.</Paragraph> <Paragraph position="3"> Text summarization techniques include statistical methods and linguistic methods (Radev et al., 2000; Marcu et al., 1999). Since the former are simpler and faster than the latter, we apply the former to text categorization. We therefore employ two kinds of text summarization techniques: one measures the importance of sentences by the similarity between the title and each sentence in a document, and the other by the importance of the terms in each sentence.</Paragraph> <Paragraph position="4"> In this paper, we use two kinds of text summarization techniques to distinguish important sentences from unimportant ones.</Paragraph> <Paragraph position="5"> The importance of each sentence is measured by these techniques. Then the term weights in each sentence are modified in proportion to the calculated sentence importance. To test our proposed method, we used two different newsgroup data sets: one is a well-known data set, the Newsgroup data set by Ken Lang, and the other was gathered from the Korean UseNet discussion group. As a result, our proposed method showed better performance than the baseline system on both data sets.</Paragraph> <Paragraph position="6"> The rest of this paper is organized as follows. Section 1 explains the proposed text categorization system in detail. In Section 2, we discuss the empirical results of our experiments. Section 3 is devoted to the analysis of our method. The final section presents conclusions and future work.</Paragraph> </Section> </Paper>