File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/w00-1214_intro.xml

Size: 2,830 bytes

Last Modified: 2025-10-06 14:00:59

<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1214">
  <Title>Machine Learning Methods for Chinese Web Page Categorization</Title>
  <Section position="2" start_page="0" end_page="93" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Text categorization refers to the task of automatically assigning one or more pre-defined category labels to free-text documents. Whereas an extensive range of methods has been applied to English text categorization, relatively few have been benchmarked for Chinese text categorization. Typical approaches to Chinese text categorization, such as Naive Bayes (NB) (Zhu, 1987), the Vector Space Model (VSM) (Zou et al., 1998; Zou et al., 1999) and Linear Least Squares Fit (LLSF) (Cao et al., 1999; Yang, 1994), have a well-studied theoretical basis derived from information retrieval research, but are not known to be the best classifiers (Yang and Liu, 1999; Yang, 1999). In addition, there is a lack of publicly available Chinese corpora for evaluating Chinese text categorization systems.</Paragraph>
    <Paragraph position="1"> This paper reports our application of three statistical machine learning methods, namely k Nearest Neighbor (kNN) (Dasarathy, 1991), Support Vector Machines (SVM) (Cortes and Vapnik, 1995), and Adaptive Resonance Associative Map (ARAM) (Tan, 1995), to Chinese web page categorization. kNN and SVM have been reported as the top-performing methods for English text categorization (Yang and Liu, 1999). ARAM belongs to a well-known family of predictive self-organizing neural networks which until recently had not been used for document classification. The trio has previously been evaluated on a Chinese corpus consisting of news articles extracted from People's Daily (He et al., 2000). This article reports experiments on the much more challenging task of classifying Chinese web pages. The Chinese web corpus was created by downloading pages from various Chinese web sites covering a wide variety of topics. There is great diversity among the web pages in terms of document length, style, and content. The objectives of our experiments are twofold. First, we examine and compare the capabilities of these methods in learning categorization knowledge from real-life web documents. Second, we investigate whether incorporating domain knowledge derived from the category description can enhance ARAM's predictive performance.</Paragraph>
    <Paragraph position="2"> The rest of this article is organized as follows. Section 2 describes our choice of feature selection and extraction methods. Section 3 gives a summary of kNN and SVM, and presents the less familiar ARAM algorithm in more detail. Section 4 presents our evaluation paradigm and reports the experimental results.</Paragraph>
  </Section>
</Paper>