File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/w00-1214_metho.xml
Size: 21,081 bytes
Last Modified: 2025-10-06 14:07:26
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1214"> <Title>Machine Learning Methods for Chinese Web Page Categorization</Title>
<Section position="3" start_page="93" end_page="93" type="metho"> <SectionTitle> 2 Feature Selection and Extraction </SectionTitle>
<Paragraph position="0"> A prerequisite of text categorization is to extract a suitable feature representation of the documents. Typically, word stems are suggested as the representation units by information retrieval research. However, unlike English and other Indo-European languages, Chinese text does not have a natural delimiter between words. As a consequence, word segmentation is a major issue in Chinese document processing. Chinese word segmentation methods have been extensively discussed in the literature. Unfortunately, perfect precision and disambiguation cannot be reached.</Paragraph>
<Paragraph position="1"> As a result, the inherent errors caused by word segmentation always remain a problem in Chinese information processing.</Paragraph>
<Paragraph position="2"> In our experiments, a word-class bi-gram model is adopted to segment each training document into a set of tokens. The lexicon used by the segmentation model contains 64,000 words in 1,006 classes. High-precision segmentation is not the focus of our work. Instead, we aim to compare different classifiers' performance on a noisy document set, as long as the errors caused by word segmentation are reasonably low.</Paragraph>
<Paragraph position="3"> To select keyword features for classification, the chi-square (CHI) statistic is adopted as the ranking metric in our experiments. A prior study on several well-known corpora, including Reuters-21578 and OHSUMED, has shown that the CHI statistic generally outperforms other feature ranking measures, such as term strength (TS), document frequency (DF), mutual information (MI), and information gain (IG) (Yang and Pedersen, 1997).</Paragraph>
<Paragraph position="4"> During keyword extraction, the document is first segmented and converted into a keyword frequency vector $(tf_1, tf_2, \ldots, tf_M)$, where $tf_i$ is the in-document term frequency of keyword $w_i$, and M is the number of keyword features selected. A term weighting method based on inverse document frequency (IDF) (Salton, 1988) and L1-normalization is then applied to the frequency vector to produce the keyword feature vector</Paragraph>
<Paragraph position="5"> $\mathbf{x} = (x_1, x_2, \ldots, x_M)$, (1)</Paragraph>
<Paragraph position="6"> in which $x_i$ is computed by</Paragraph>
<Paragraph position="7"> $x_i = \frac{tf_i \log(n/n_i)}{\sum_{j=1}^{M} tf_j \log(n/n_j)}$, (2)</Paragraph>
<Paragraph position="8"> where n is the number of documents in the whole training set, and $n_i$ is the number of training documents in which the keyword $w_i$ occurs at least once.</Paragraph> </Section>
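The following is a minimal illustrative sketch (not the authors' code) of the feature pipeline described in Section 2: rank candidate keywords by the chi-square statistic, keep the top M, then build an IDF-weighted, L1-normalized feature vector per document as in equations (1)-(2). Function and variable names (docs, labels, top_m) are assumptions introduced here for illustration.

```python
import math
from collections import Counter

def chi_square(n, n_pos, n_term, n_pos_term):
    """Chi-square score of a term with respect to the positive category.
    n: total docs, n_pos: positive docs, n_term: docs containing the term,
    n_pos_term: positive docs containing the term."""
    a = n_pos_term                      # positive docs that contain the term
    b = n_term - n_pos_term             # negative docs that contain the term
    c = n_pos - n_pos_term              # positive docs without the term
    d = n - n_pos - b                   # negative docs without the term
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_features(docs, labels, top_m):
    """docs: list of token lists; labels: 1 for the positive category, 0 otherwise."""
    n, n_pos = len(docs), sum(labels)
    df, df_pos = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
            df_pos[t] += y
    ranked = sorted(df, key=lambda t: chi_square(n, n_pos, df[t], df_pos[t]), reverse=True)
    vocab = ranked[:top_m]
    idf = {t: math.log(n / df[t]) for t in vocab}
    return vocab, idf

def feature_vector(tokens, vocab, idf):
    """TF-IDF weighting followed by L1-normalization, as in equations (1)-(2)."""
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in vocab]
    s = sum(raw)
    return [v / s for v in raw] if s else raw
```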
<Section position="4" start_page="93" end_page="96" type="metho"> <SectionTitle> 3 The Classifiers </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="93" end_page="94" type="sub_section"> <SectionTitle> 3.1 k Nearest Neighbor </SectionTitle>
<Paragraph position="0"> k Nearest Neighbor (kNN) is a traditional statistical pattern recognition algorithm (Dasarathy, 1991). It has been studied extensively for text categorization (Yang and Liu, 1999). In essence, kNN makes its prediction based on the k training patterns that are closest to the unseen (test) pattern, according to a distance metric. The distance metric that measures the similarity between two normalized patterns can be either a simple L1-distance function or an L2-distance function, such as the plain Euclidean distance defined by $D(a, b) = \sqrt{\sum_i (a_i - b_i)^2}$. (3) The class assignment to the test pattern is based on the class assignment of the closest k training patterns. A commonly used method is to label the test pattern with the class that has the most instances among the k nearest neighbors. Specifically, the class index y(x) assigned to the test pattern x is given by $y(x) = \arg\max_{c_i} \{ n(d_j, c_i) \mid d_j \in kNN \}$, (4) where $n(d_j, c_i)$ is the number of training patterns $d_j$ in the k-nearest-neighbor set that are associated with class $c_i$.</Paragraph>
<Paragraph position="1"> The drawback of kNN is the difficulty of deciding an optimal k value. Typically it has to be determined through conducting a series of experiments using different values.</Paragraph> </Section>
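Below is a minimal sketch of the kNN decision rule in equations (3)-(4), assuming dense, already-normalized feature vectors; it is not the authors' implementation, and the variable names are illustrative.

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k=5):
    """train_x: (N, M) array, train_y: length-N numpy label array, test_x: (M,) vector."""
    dists = np.sqrt(((train_x - test_x) ** 2).sum(axis=1))   # Euclidean distance, eq. (3)
    nearest = np.argsort(dists)[:k]                           # indices of the k closest patterns
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]                          # majority vote, eq. (4)

# Example usage: odd k avoids ties in binary classification, as noted in Section 4.2.
# pred = knn_predict(train_x, train_y, test_x, k=9)
```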
<Section position="2" start_page="94" end_page="94" type="sub_section"> <SectionTitle> 3.2 Support Vector Machines </SectionTitle>
<Paragraph position="0"> Support Vector Machines (SVM) is a relatively new class of machine learning techniques first introduced by Vapnik (Cortes and Vapnik, 1995). Based on the structural risk minimization principle from computational learning theory, SVM seeks a decision surface that separates the training data points into two classes and makes decisions based on the support vectors, which are selected as the only effective elements of the training set.</Paragraph>
<Paragraph position="1"> Given a set of linearly separable points $S = \{x_i \in R^n \mid i = 1, 2, \ldots, N\}$, each point $x_i$ belongs to one of two classes, labeled $y_i \in \{-1, +1\}$. A separating hyper-plane divides S into two sides, each side containing points with the same class label only. The separating hyper-plane can be identified by the pair (w, b) that satisfies</Paragraph>
<Paragraph position="2"> $w \cdot x_i + b \geq +1$ if $y_i = +1$, (5) and $w \cdot x_i + b \leq -1$ if $y_i = -1$, (6)</Paragraph>
<Paragraph position="3"> where $w \cdot x$ denotes the dot product for vectors w and x. Thus the goal of SVM learning is to find the optimal separating hyper-plane (OSH) that has the maximal margin to both sides. This can be formulated as: minimize $\frac{1}{2} w \cdot w$ subject to $y_i (w \cdot x_i + b) \geq 1$. (7) The points that are closest to the OSH are termed support vectors (Fig. 1).</Paragraph>
<Paragraph position="4"> The SVM problem can be extended to the linearly non-separable case and the non-linear case. Various quadratic programming algorithms have been proposed and extensively studied to solve the SVM problem (Cortes and Vapnik, 1995; Joachims, 1998; Joachims, 1999). During classification, SVM makes its decision based on the OSH instead of the whole training set: it simply finds out on which side of the OSH the test pattern is located. [Figure 1: The optimal separating hyper-plane; support vectors are the points on the dashed lines, which identify the maximal margin.]</Paragraph>
<Paragraph position="5"> This property makes SVM highly competitive, compared with other traditional pattern recognition methods, in terms of computational efficiency and predictive accuracy (Yang and Liu, 1999).</Paragraph>
<Paragraph position="6"> In recent years, Joachims has done much research on the application of SVM to text categorization (Joachims, 1998). His SVM^light system, published via http://www-ai.cs.uni-dortmund.de/FORSCHUNG/VERFAHREN/SVM_LIGHT/svm_light.eng.html, is used in our benchmark experiments.</Paragraph> </Section>
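The experiments in this paper use Joachims' SVM^light; the sketch below is only an illustrative stand-in using scikit-learn's linear SVM (an assumption, not the authors' tool) to show the same decision rule: fit a maximum-margin hyper-plane (w, b), then classify a test vector by which side of the hyper-plane it falls on. The names train_x, train_y, and test_x are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_and_classify(train_x, train_y, test_x):
    """train_y must contain labels in {-1, +1}; test_x is an (n_test, M) array."""
    clf = LinearSVC(C=1.0)              # soft-margin linear SVM
    clf.fit(train_x, train_y)
    w, b = clf.coef_[0], clf.intercept_[0]
    # sign(w . x + b) gives the side of the separating hyper-plane, cf. eqs (5)-(7)
    return np.sign(test_x @ w + b)
```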
<Section position="3" start_page="94" end_page="96" type="sub_section"> <SectionTitle> 3.3 Adaptive Resonance Associative Map </SectionTitle>
<Paragraph position="0"> Adaptive Resonance Associative Map (ARAM) is a class of predictive self-organizing neural networks that performs incremental supervised learning of recognition categories (pattern classes) and multidimensional maps of patterns. An ARAM system can be visualized as two overlapping Adaptive Resonance Theory (ART) modules consisting of two input fields $F_1^a$ and $F_1^b$ with an $F_2$ category field (Tan, 1995; Tan, 1997) (Fig. 2). For classification problems, the $F_1^a$ field serves as the input field containing the document feature vector and the $F_1^b$ field serves as the output field containing the class prediction vector. The $F_2$ field contains the activities of the recognition categories that are used to encode the patterns. [Figure 2: The Adaptive Resonance Associative Map architecture (ART_a and ART_b modules with input vectors A and B).]</Paragraph>
<Paragraph position="1"> When performing classification tasks, ARAM formulates recognition categories of input patterns and associates each category with its respective prediction. During learning, given an input pattern (document feature) presented at the $F_1^a$ input field and an output pattern (known class label) presented at the $F_1^b$ output field, the category field $F_2$ selects a winner that receives the largest overall input. The winning node selected in $F_2$ then triggers a top-down priming on $F_1^a$ and $F_1^b$, monitored by separate reset mechanisms. Code stabilization is ensured by restricting encoding to states where resonance is reached in both modules.</Paragraph>
<Paragraph position="2"> By synchronizing the unsupervised categorization of two pattern sets, ARAM learns a supervised mapping between the pattern sets. Due to the code stabilization mechanism, fast learning in a real-time environment is feasible.</Paragraph>
<Paragraph position="3"> The knowledge that ARAM discovers during learning is compatible with IF-THEN rule-based representation. Specifically, each node in the $F_2$ field represents a recognition category associating the $F_1^a$ patterns with the $F_1^b$ output vectors. Learned weight vectors, one for each $F_2$ node, constitute a set of rules that link antecedents to consequences. At any point during the incremental learning process, the system architecture can be translated into a compact set of rules. Similarly, domain knowledge in the form of IF-THEN rules can be inserted into the ARAM architecture.</Paragraph>
<Paragraph position="4"> The ART modules used in ARAM can be ART 1, which categorizes binary patterns, or analog ART modules such as ART 2, ART 2-A, and fuzzy ART, which categorize both binary and analog patterns. The fuzzy ARAM (Tan, 1995) algorithm based on fuzzy ART (Carpenter et al., 1991) is introduced below.</Paragraph>
<Paragraph position="5"> Parameters: Fuzzy ARAM dynamics are determined by the choice parameters $\alpha_a > 0$ and $\alpha_b > 0$; the learning rates $\beta_a \in [0, 1]$ and $\beta_b \in [0, 1]$; the vigilance parameters $\rho_a \in [0, 1]$ and $\rho_b \in [0, 1]$; and the contribution parameter $\gamma \in [0, 1]$.</Paragraph>
<Paragraph position="6"> Weight vectors: Each $F_2$ category node j is associated with two adaptive weight templates $w_j^a$ and $w_j^b$. Initially, all category nodes are uncommitted and all weights equal ones. After a category node is selected for encoding, it becomes committed.</Paragraph>
<Paragraph position="7"> Category choice: Given the $F_1^a$ and $F_1^b$ input vectors A and B, for each $F_2$ node j, the choice function $T_j$ is defined by $T_j = \gamma \frac{|A \wedge w_j^a|}{\alpha_a + |w_j^a|} + (1 - \gamma) \frac{|B \wedge w_j^b|}{\alpha_b + |w_j^b|}$, (8) where the fuzzy AND operation $\wedge$ is defined by $(p \wedge q)_i \equiv \min(p_i, q_i)$, (9) and where the norm $|\cdot|$ is defined by $|p| \equiv \sum_i p_i$ (10) for vectors p and q.</Paragraph>
<Paragraph position="8"> The system is said to make a choice when at most one $F_2$ node can become active. The choice is indexed at J where $T_J = \max\{T_j : \text{for all } F_2 \text{ node } j\}$. (11) When a category choice is made at node J, $y_J = 1$; and $y_j = 0$ for all $j \neq J$.</Paragraph>
<Paragraph position="9"> Resonance or reset: Resonance occurs if the match functions, $m_J^a$ and $m_J^b$, meet the vigilance criteria in their respective modules: $m_J^a = \frac{|A \wedge w_J^a|}{|A|} \geq \rho_a$ (12) and $m_J^b = \frac{|B \wedge w_J^b|}{|B|} \geq \rho_b$. (13) Learning then ensues, as defined below. If any of the vigilance constraints is violated, mismatch reset occurs, in which the value of the choice function $T_J$ is set to 0 for the duration of the input presentation. The search process repeats to select another new index J until resonance is achieved.</Paragraph>
<Paragraph position="10"> Learning: Once the search ends, the weight vectors $w_J^a$ and $w_J^b$ are updated according to the equations $w_J^{a(new)} = (1 - \beta_a) w_J^{a(old)} + \beta_a (A \wedge w_J^{a(old)})$ (14) and $w_J^{b(new)} = (1 - \beta_b) w_J^{b(old)} + \beta_b (B \wedge w_J^{b(old)})$ (15) respectively. Fast learning corresponds to setting $\beta_a = \beta_b = 1$ for committed nodes.</Paragraph>
<Paragraph position="11"> Classification: During classification, using the choice rule, only the $F_2$ node J that receives maximal $F_1^a \rightarrow F_2$ input $T_J$ predicts the ART_b output. In simulations, the prediction vector is given by $(b_1, b_2, \ldots, b_N) = w_J^b$, (16) where $b_i$ indicates the likelihood or confidence of assigning a pattern to category i.</Paragraph>
<Paragraph position="12"> Rule insertion: Rule insertion proceeds in two phases. The first phase parses the rules for keyword features. When a new keyword is encountered, it is added to a keyword feature table containing keywords obtained through automatic feature selection from training documents. Based on the keyword feature table, the second phase of rule insertion translates each rule into an M-dimensional vector a and an N-dimensional vector b, where M is the total number of features in the keyword feature table and N is the number of categories. Given a rule of the format IF $x_1, x_2, \ldots, x_m$ THEN $y_1, y_2, \ldots, y_n$, (17) where $x_1, \ldots, x_m$ are antecedents and $y_1, \ldots, y_n$ are consequences, the algorithm derives a pair of vectors a and b such that for each index $i = 1, \ldots, M$, $a_i = 1$ if $w_i \in \{x_1, \ldots, x_m\}$ and $a_i = 0$ otherwise, (18) where $w_i$ is the i-th entry in the keyword feature table; and for each index $i = 1, \ldots, N$, $b_i = 1$ if $w_i \in \{y_1, \ldots, y_n\}$ and $b_i = 0$ otherwise, (19) where $w_i$ is the class label of category i.</Paragraph>
<Paragraph position="13"> The vector pairs derived from the rules are then used as training patterns to initialize an ARAM network. During rule insertion, the vigilance parameters $\rho_a$ and $\rho_b$ are each set to 1 to ensure that only identical attribute vectors are grouped into one recognition category. Contradictory symbolic rules are detected during rule insertion when identical input attribute vectors are associated with distinct output attribute vectors.</Paragraph> </Section> </Section>
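The following is a minimal sketch (not the authors' code) of the fuzzy ARAM operations in equations (8)-(16): category choice, vigilance test, learning, and prediction. It assumes input vectors A (features) and B (class labels) with values in [0, 1]; complement coding, voting, and other details from Section 4.2 are omitted, and all parameter defaults here are illustrative.

```python
import numpy as np

class FuzzyARAM:
    def __init__(self, m, n, alpha_a=0.1, alpha_b=0.1, beta_a=1.0, beta_b=1.0,
                 rho_a=0.8, rho_b=1.0, gamma=0.5):
        self.m, self.n = m, n
        self.alpha_a, self.alpha_b = alpha_a, alpha_b
        self.beta_a, self.beta_b = beta_a, beta_b
        self.rho_a, self.rho_b, self.gamma = rho_a, rho_b, gamma
        self.wa, self.wb = [], []          # weight templates of committed F2 nodes

    def _choice(self, A, B):
        # Choice function T_j, eq. (8); fuzzy AND is element-wise min, eqs. (9)-(10).
        return [self.gamma * np.minimum(A, wa).sum() / (self.alpha_a + wa.sum())
                + (1 - self.gamma) * np.minimum(B, wb).sum() / (self.alpha_b + wb.sum())
                for wa, wb in zip(self.wa, self.wb)]

    def learn(self, A, B):
        T = self._choice(A, B)
        for J in np.argsort(T)[::-1]:      # search nodes in order of choice value, eq. (11)
            wa, wb = self.wa[J], self.wb[J]
            # Match functions and vigilance criteria, eqs. (12)-(13).
            if (np.minimum(A, wa).sum() / A.sum() >= self.rho_a and
                    np.minimum(B, wb).sum() / B.sum() >= self.rho_b):
                # Weight updates, eqs. (14)-(15); beta = 1 gives fast learning.
                self.wa[J] = (1 - self.beta_a) * wa + self.beta_a * np.minimum(A, wa)
                self.wb[J] = (1 - self.beta_b) * wb + self.beta_b * np.minimum(B, wb)
                return
        # No committed node resonates: commit a new node (initial weights are all ones,
        # so fast learning yields templates equal to A and B).
        self.wa.append(np.minimum(A, np.ones(self.m)))
        self.wb.append(np.minimum(B, np.ones(self.n)))

    def predict(self, A):
        # Classification uses only the F1a -> F2 input; the winner's ART_b
        # template is the prediction vector, eq. (16).
        T = [np.minimum(A, wa).sum() / (self.alpha_a + wa.sum()) for wa in self.wa]
        return self.wb[int(np.argmax(T))]
```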
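As a smaller companion sketch, the second phase of rule insertion (equations (17)-(19)) can be read as building a binary antecedent vector over the keyword feature table and a binary consequent vector over the category labels; the function and example names below are hypothetical.

```python
def rule_to_vectors(antecedents, consequences, keyword_table, category_labels):
    """antecedents/consequences: lists of keywords and class labels from one rule."""
    a = [1 if w in antecedents else 0 for w in keyword_table]     # eq. (18)
    b = [1 if w in consequences else 0 for w in category_labels]  # eq. (19)
    return a, b

# Example: a positive rule linking two keyword antecedents to the 'IT' category.
# a, b = rule_to_vectors(['软件', '网络'], ['IT'], keyword_table, category_labels)
```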
<Section position="5" start_page="96" end_page="99" type="metho"> <SectionTitle> 4 Empirical Evaluation </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="96" end_page="97" type="sub_section"> <SectionTitle> 4.1 The Chinese Web Corpus </SectionTitle>
<Paragraph position="0"> The Chinese web corpus, collected in-house, consists of web pages downloaded from various Chinese web sites covering a wide variety of topics. Our experiments are based on a subset of the corpus consisting of 8 top-level categories and over 6,000 documents. For each category, we conduct binary classification experiments in which we tag the current category as the positive category and the other seven categories as the negative categories. The corpus is further partitioned into training and testing data such that the number of training documents is at least 2.5 times that of the testing documents (Table 1).</Paragraph> </Section>
<Section position="2" start_page="97" end_page="97" type="sub_section"> <SectionTitle> 4.2 Experiment Paradigm </SectionTitle>
<Paragraph position="0"> kNN experiments used the plain Euclidean distance defined by equation (3) as the similarity measure. On each pattern set containing a varying number of documents, different values of k ranging from 1 to 29 were tested and the best results were recorded. Only odd k were used to ensure that a prediction can always be made.</Paragraph>
<Paragraph position="1"> SVM experiments used the default built-in inductive SVM parameter set in SVM^light, which is described in detail on the web site and elsewhere (Joachims, 1999).</Paragraph>
<Paragraph position="2"> ARAM experiments employed a standard set of parameter values of fuzzy ARAM. In addition, using a voting strategy, 5 ARAM systems were trained using the same set of patterns in different orders of presentation and were combined to yield a final prediction vector.</Paragraph>
<Paragraph position="3"> To derive domain theory on web page classification, a varying number (ranging from 10 to 30) of training documents from each category were reviewed. A set of domain knowledge consisting of 56 rules, with about one to ten rules for each category, was generated. Only positive rules that link keyword antecedents to positive category consequences were included (Table 2).</Paragraph> </Section>
<Section position="3" start_page="97" end_page="98" type="sub_section"> <SectionTitle> 4.3 Performance Measures </SectionTitle>
<Paragraph position="0"> Our experiments adopt the most commonly used performance measures, including the recall, precision, and F1 measures. Recall (R) is the percentage of the documents for a given category that are classified correctly. Precision (P) is the percentage of the predicted documents for a given category that are classified correctly. The F1 rating is one of the commonly used measures to combine R and P into a single rating, defined as</Paragraph>
<Paragraph position="1"> $F_1 = \frac{2 P R}{P + R}$. (20)</Paragraph>
<Paragraph position="2"> These scores are calculated for a series of binary classification experiments, one for each category. Micro-averaged scores and macro-averaged scores on the whole corpus are then produced across the experiments. With micro-averaging, the performance measures are produced across the documents by adding up all the document counts across the different tests, and calculating using these summed values. With macro-averaging, each category is assigned the same weight and the performance measures are calculated across the categories. Micro-averaged scores and macro-averaged scores therefore reflect a classifier's performance on large categories and on small categories, respectively (Yang and Liu, 1999).</Paragraph> </Section>
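A short sketch of the evaluation measures in Section 4.3, assuming per-category binary confusion counts are available: per-category precision, recall, and F1, with micro-averaging (sum the counts first) and macro-averaging (average the per-category scores). The helper names are illustrative.

```python
def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0      # eq. (20)
    return p, r, f1

def micro_macro(counts):
    """counts: list of (tp, fp, fn) tuples, one per category."""
    micro = prf(sum(c[0] for c in counts),
                sum(c[1] for c in counts),
                sum(c[2] for c in counts))
    per_cat = [prf(*c) for c in counts]
    macro = tuple(sum(s[i] for s in per_cat) / len(per_cat) for i in range(3))
    return micro, macro
```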
<Section position="4" start_page="98" end_page="99" type="sub_section"> <SectionTitle> 4.4 Results and Discussions </SectionTitle>
<Paragraph position="0"> Table 3 summarizes the three classifiers' performance on the test corpus in terms of precision, recall, and F1 measures. The micro-averaged scores produced by the trio, which were predominantly determined by the classifiers' performance on the large categories (such as Biz, IT, and Joy), were roughly comparable. Among the three, kNN seemed to be marginally better than SVM and ARAM.</Paragraph>
<Paragraph position="1"> Inserting rules into ARAM did not have a significant impact. This showed that domain knowledge was not very useful for categories that already have a large number of training examples. The differences in the macro-averaged scores produced by the three classifiers, however, were much more significant. The macro-averaged F1 score obtained by ARAM was noticeably better than that of kNN, which in turn was higher than that of SVM. This indicates that ARAM (and kNN) tends to outperform SVM in small categories that have a smaller number of training patterns. We are particularly interested in the classifiers' learning ability on small categories. In certain applications, such as personalized content delivery, a large pre-labeled training corpus may not be available. Therefore, a classifier's ability to learn from a small training pattern set is a major concern. The different approaches adopted by these three classifiers in learning categorization knowledge are best seen in the light of the distinct learning peculiarities they exhibit on the small training sets.</Paragraph>
<Paragraph position="2"> kNN is a lazy learning method in the sense that it does not carry out any off-line learning to generate a particular category knowledge representation. Instead, kNN performs on-line scoring to find the training patterns that are nearest to a test pattern and makes the decision based on the statistical presumption that patterns in the same category have similar feature representations. The presumption is basically true for most pattern instances.</Paragraph>
<Paragraph position="3"> Thus kNN exhibits a relatively stable performance across small and large categories.</Paragraph>
<Paragraph position="4"> SVM identifies the optimal separating hyper-plane (OSH) across the training data points and makes classification decisions based on the representative data instances (known as support vectors). Compared with kNN, SVM is more computationally efficient during classification for large-scale training sets. However, the OSH generated using small training sets may not be very representative, especially when the training patterns are sparsely distributed and there is a relatively narrow margin between the positive and negative patterns. In our experiments on small training sets, including Art, Belief, Edu, and Sci, SVM's performance was generally lower than that of kNN and ARAM.</Paragraph>
<Paragraph position="5"> ARAM generates recognition categories from the input training patterns. The incrementally learned rules abstract the major representations of the training patterns and eliminate minor inconsistencies in the data patterns. During classification, it works in a similar fashion to kNN. The major difference is that ARAM uses the learned recognition categories as the similarity-scoring unit whereas kNN uses the raw, unprocessed training patterns as the distance-scoring unit. It follows that ARAM is notably more scalable than kNN, owing to its pattern abstraction capability, and is therefore more suitable for handling very large data sets.</Paragraph>
<Paragraph position="6"> The overall improvement in predictive performance obtained by inserting rules into ARAM is also of particular interest to us.</Paragraph>
<Paragraph position="7"> ARAM's performance was more likely to be improved by rule insertion in categories that are well defined and have relatively few training patterns.
As long as a user is able to abstract the category knowledge into a specific rule representation, domain knowledge could complement the limited knowledge acquired through a small training set quite effectively.</Paragraph> </Section> </Section> </Paper>