<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4002">
<Title>MMR-based feature selection for text categorization</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Text categorization is the problem of automatically assigning predefined categories to free-text documents. A growing number of statistical classification methods and machine learning techniques have been applied to text categorization in recent years [9].</Paragraph>
<Paragraph position="1"> A major characteristic, or difficulty, of text categorization problems is the high dimensionality of the feature space [10]. The native feature space consists of the unique terms that occur in the documents, which can amount to tens or hundreds of thousands of terms even for a moderate-sized text collection. This is prohibitively high for many machine learning algorithms. Reducing the set of features considered by the algorithm serves two purposes: it considerably decreases the running time of the learning algorithm, and it can increase the accuracy of the resulting model. Along this line, a number of studies have recently addressed the issue of feature subset selection [2][4][8]. In their experiments, Yang and Pedersen found information gain (IG) and the chi-square test (CHI) most effective for aggressive term removal without loss of categorization accuracy [8].</Paragraph>
<Paragraph position="2"> Another major characteristic of text categorization problems is the high level of feature redundancy [11].</Paragraph>
<Paragraph position="3"> While there are generally many different features relevant to the classification task, several such cues often occur in a single document, and these cues are partly redundant. Naive Bayes, a popular learning algorithm, is commonly justified using assumptions of conditional independence or linked dependence [12]. However, these assumptions are generally accepted to be false for text. To remove these violations, more complex dependence models have been developed [13].</Paragraph>
<Paragraph position="4"> Most previous work on feature selection emphasized only the reduction of the high dimensionality of the feature space [2][4][8]. The most popular feature selection method is IG, which works well on text and has often been used. IG considers each feature in isolation and measures how important it is for predicting the correct class label. When the features are not redundant with one another, IG is very appropriate.</Paragraph>
<Paragraph position="5"> But when many features are highly redundant with one another, we must resort to other means, for example, more complex dependence models.</Paragraph>
<Paragraph position="6"> In this paper, to address both the high dimensionality of the feature space and the high level of feature redundancy, we propose a new feature selection method that selects each feature according to a combined criterion of information gain and novelty of information, where the latter measures the degree of dissimilarity between the feature being considered and the previously selected features.</Paragraph>
<Paragraph position="7"> Maximal Marginal Relevance (MMR) provides precisely such functionality [5]. We therefore propose an MMR-based feature selection method that strives to reduce redundancy between features while maintaining information gain when selecting appropriate features for text categorization.</Paragraph>
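<Paragraph> For concreteness, such a selection procedure can be viewed as a greedy loop that repeatedly picks the candidate feature maximizing a weighted combination of information gain and dissimilarity from the features already selected. The Python sketch below is only an illustration of this idea under simplifying assumptions (binary term features, a correlation-based dissimilarity, a fixed trade-off parameter lambda_, and the hypothetical helper names information_gain and mmr_feature_selection); it is not the exact criterion used in this paper, which is defined in section 3.

import numpy as np

def information_gain(x, y):
    """IG of one binary feature x with respect to the class labels y."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()
    ig = entropy(y)
    for v in (0, 1):
        mask = (x == v)
        if mask.any():
            ig -= mask.mean() * entropy(y[mask])  # subtract conditional entropy
    return ig

def mmr_feature_selection(X, y, k, lambda_=0.7):
    """Greedily pick k columns of the binary document-term matrix X,
    balancing information gain against redundancy with the features
    selected so far (an MMR-style criterion)."""
    n_features = X.shape[1]
    ig = np.array([information_gain(X[:, j], y) for j in range(n_features)])
    selected, candidates = [], set(range(n_features))

    def mmr_score(j):
        # Redundancy: highest similarity to any already-selected feature,
        # approximated here by absolute Pearson correlation.
        sim = max((abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected),
                  default=0.0)
        return lambda_ * ig[j] - (1.0 - lambda_) * sim

    for _ in range(min(k, n_features)):
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
</Paragraph>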
<Paragraph position="8"> In the machine learning field, greedy methods that add or subtract a single feature at a time have been developed for feature selection [3][14]. Della Pietra et al. proposed a method for incrementally constructing random fields [14]. Their method builds increasingly complex fields to approximate the empirical distribution of a set of training examples. Features are incrementally added to the field using a top-down greedy algorithm, with the intent of capturing the salient properties of the empirical sample while allowing generalization to new configurations. However, the method is not simple, which is problematic both computationally and statistically in large-scale problems.</Paragraph>
<Paragraph position="9"> Koller and Sahami proposed another greedy feature selection method, which provides a mechanism for eliminating features whose predictive information with respect to the class is subsumed by other features [3]. The method is based on the Kullback-Leibler divergence and seeks to minimize the amount of predictive information lost during feature elimination.</Paragraph>
<Paragraph position="10"> To compare the performance of our method with that of greedy feature selection methods, we implemented Koller and Sahami's method and evaluated it empirically in section 4.</Paragraph>
<Paragraph position="11"> In section 4, we also compare the performance of conventional machine learning algorithms using our feature selection method with that of a Support Vector Machine (SVM) using all features. Previous work shows that SVMs consistently achieve good performance on text categorization tasks, substantially and significantly outperforming existing methods [10][11]. Because of their ability to generalize well in high-dimensional feature spaces with a high level of feature redundancy, SVMs are known not to require any feature selection [11].</Paragraph>
<Paragraph position="12"> The remainder of this paper is organized as follows.</Paragraph>
<Paragraph position="13"> In section 2, we describe Maximal Marginal Relevance, and in section 3, we describe the MMR-based feature selection method. Section 4 presents in-depth experiments and their results. Section 5 concludes the paper.</Paragraph>
</Section>
</Paper>