<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1026">
  <Title>Manipulating Large Corpora for Text Classification</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Classifiers
2.1 NB
</SectionTitle>
    <Paragraph position="0"> Naive Bayes(NB) probabilistic classifiers are commonly studied in machine learning(Mitchell, 1996).</Paragraph>
    <Paragraph position="1"> The basic idea in NB approaches is to use the joint probabilities of words and categories to estimate the probabilities of categories given a document. The NB assumption is that all the words in a text are conditionally independent given the value of a classification variable. There are several versions of the NB classifiers. Recent studies on a Naive Bayes classifier which is proposed by McCallum et. al.</Paragraph>
    <Paragraph position="2"> reported high performance over some other commonly used versions of NB on several data collections(McCallum et al., 1998). We use the model of NB by McCallum et. al. which is shown in formula</Paragraph>
    <Paragraph position="4"> a87a88a89a87 refers to the number of vocabularies, a87a90a91a87 denotes the number of labeled training documents, and a87a92a93a87 shows the number of categories. a87a94a41a95a96a87 denotes document length. a61a98a97 a30 a49 is the word in position a99 of document a94 a95 , where the subscript ofa61 , a94 a95a40a100 indicates an index into the vocabulary. a101a91a102 a61a104a103a59a105 a94 a95a59a106 denotes the number of times worda61 a103 occurs in document a94a83a95 , and</Paragraph>
    <Paragraph position="6"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 SVMs
</SectionTitle>
      <Paragraph position="0"> SVMs are introduced by Vapnik(Vapnik, 1995) for solving two-class pattern recognition problems. It is defined over a vector space where the problem is to find a decision surface(classifier) that 'best' separates a set of positive examples from a set of negative examples by introducing the maximum 'margin' between two sets. The margin is defined by the distance from the hyperplane to the nearest of the positive and negative examples. The decision surface produced by SVMs for linearly separable space is a hyperplane which can be written as a117a119a118a17a120 + a8 = 0 (a120 ,</Paragraph>
      <Paragraph position="2"> a120 is an arbitrary data point, and a117 = (a61a126a125 ,a118a76a118a17a118 ,a61 a123 ) and a8 are learned from a training set of linearly separable data. Figure 1 shows an example of a simple two-dimensional problem that is linearly separable2.</Paragraph>
      <Paragraph position="3">  In the linearly separable case maximizing the margin can be expressed as an optimization problem:</Paragraph>
      <Paragraph position="5"> and a157 a95 is a label corresponding the a19 -th training example. In formula (3), each element of w, a61 a100 (1 a158  a1 ) corresponds to each word in the training examples, and the larger value of a61 a100a15a162 a163a95a41a164 a95 a157 a95 a156 a95a40a100 is, the more the word a61 a100 features positive examples.</Paragraph>
      <Paragraph position="6"> We note that SVMs are basically introduced for solving binary classification, while text classification is a multi-class, multi-label classification problem. Several methods using SVMs which were intended for multi-class, multi-label data have been proposed(Weston and Watkins, 1998). We use a165 a1a4a0 -</Paragraph>
      <Paragraph position="8"> a0a5a3a7a167 version of the SVMs model in the work. A time complexity of SVMs is known as a165a169a102 a10a130a170 a106a132a171 a165a98a102 a10a173a172 a106 , where a10 is the number of training data. We consider a time complexity of a165 a1a166a0 -</Paragraph>
      <Paragraph position="10"> a0a5a3a7a167 method. Let a10 be the number of training data with a99 categories. The average size of the training data per category is a163a100 . Let also a99a82a174a175a118a52a176a169a102 a10 a174 a106 be the time needed to train all categories, where a176a177a102 a10 a174 a106 represents the time for learning one binary classifier using a10 a174 training data, and a99 a174 is the number of binary classifier. The time for</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Hierarchical classification
</SectionTitle>
      <Paragraph position="0"> A well-known technique for classifying a large, heterogeneous collection such as web content is to use category hierarchies. Following the approaches of Koller and Sahami(Koller and Sahami, 1997), and Dumais's(Dumais and Chen, 2000), we employ a hierarchy by learning separate classifiers at each internal node of the tree, and then labeling a document using these classifiers to greedily select sub-branches until a leaf is reached.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Manipulating training data
</SectionTitle>
      <Paragraph position="0"> Our hypothesis regarding NB is that it can work well for documents which are assigned to only one category within the same category level in the hierarchical structure. We base this on some recent papers claiming that NB methods perform surprisingly well for an 'accuracy' measure which is equivalent to the standard precision under the one-category-perdocument assumption on classifiers and also equivalent to the standard recall, assuming that each document has one and only one correct category per category level(Lewis and Ringuette, 1994), (Koller and Sahami, 1997). SVMs, on the other hand, have the potential to handle more complex problems without sacrificing accuracy, even though the computation of the SVM classifiers is far less efficient than NB. We thus use NB for simple classification problems and SVMs for more complex data, i.e., the data which cannot be classified correctly by NB classifiers. We use ten-fold cross validation: All of the training data were randomly shuffled and divided into ten equal folds. Nine folds were used to train the NB classifiers while the remaining fold(held-out test data) was used to evaluate the accuracy of the classification. For each category level, we apply the following procedures. Let a101a98a179 be the total number of nine folds training documents, and a101a149a180 be the number of the remaining fold in each class level. Figure 2 illustrates the flow of our system.</Paragraph>
      <Paragraph position="1">  1. Extracting training data using NB 1-1 NB is applied to the a101a126a179 documents, and clas- null sifiers for each category are induced. They are evaluated using the held-out test data, the a101 a180 documents.</Paragraph>
      <Paragraph position="2"> 1-2 This process is repeated ten times so that each fold serves as the source of the test data once.</Paragraph>
      <Paragraph position="3"> The threshold, the probability value which produces the most accurate classifier through ten runs, is selected.</Paragraph>
      <Paragraph position="4"> 1-3 The held-out test data which could not be classified correctly by NB classifiers with the optimal parameters are extracted (a101a45a181a110a182a110a182a67a183a67a182 in Figure 2). They are used to train SVMs.</Paragraph>
      <Paragraph position="5"> The procedure is applied to each category level.</Paragraph>
      <Paragraph position="6">  2. Classifying test data 2-1 We use all the training data, a101a126a179 +a101a184a180 , to train NB classifiers and the data which is produced by procedure 1-3 to train SVMs.</Paragraph>
      <Paragraph position="7"> 2-2 NB classifiers are applied to the test data. The test data is judged to be the category a108 whose probability is larger than the threshold which is obtained by 1-2.</Paragraph>
      <Paragraph position="8"> 2-3 If the test data is not assigned to any one of the  categories, the test data is classified by SVMs classifiers. The test data is judged to be the category a108 whose distance</Paragraph>
      <Paragraph position="10"> We employ the hierarchy by learning separate classifiers at each internal node of the tree and then assign categories to a document by using these classifiers to greedily select sub-branches until a leaf is reached.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>