<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1105">
  <Title>Poisson Naive Bayes for Text Classification with Feature Weighting</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Naive Bayes Text Classification
</SectionTitle>
    <Paragraph position="0"> A naive Bayes classifier is a well-known and highly practical probabilistic classifier, and has been employed in many applications. It assumes that all attributes of the examples are independent of each other given the context of the class, that is, an independent assumption. Several studies show that naive Bayes performs surprisingly well in many domains(Domingos and Pazzani, 1997) in spite of its wrong independent assumption.</Paragraph>
    <Paragraph position="1"> In the context of text classification, the probability of class CR given a document CS  , which is a form of log ratio similar to the BIM retrieval model(Jones et al., 2000). It means that the linked independence assumption(Cooper et al., 1992), which explains that the strong independent assumption can be relaxed in the BIM model, is sufficient for the use of naive Bayes text classification model.</Paragraph>
    <Paragraph position="2"> With this framework, two representative naive Bayes text classification approaches are well introduced in (McCallum and Nigam, 1998). They designated the pure naive Bayes as multivariate Bernoulli model, and the unigram language model classifier as multinomial model. Instead, we introduce multivariate Poisson model to improve the pure naive Bayes text classification in the next section.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Poisson Naive Bayes Text Classification
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Overview
</SectionTitle>
      <Paragraph position="0"> The Poisson distribution is most commonly used to model the number of random occurrences of some phenomenon in a specified unit of space or time, for example, the number of phone calls received by a telephone operator in a 10-minute period. If we think that the occurrence of each term is a random occurrence in a fixed unit of space (i.e. a length of document) the Poisson distribution is intuitively suitable to model the term frequencies in a given document. For that reason, the use of Poisson model is widely investigated in the IR literature, but it is rarely used for the text classification task(Lewis, 1998). It motivates us to adopt the Poisson model for learning the naive Bayes text classification.</Paragraph>
      <Paragraph position="1"> Our model assumes that CS</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
CY
</SectionTitle>
    <Paragraph position="0"> is generated by multi-variate Poisson model. In other words, a document  representing the characteristic of each class? We propose the possible answers in the next subsection. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> with a fixed length according to the definition of Poisson distribution, we should normalize the actual term frequencies in the documents with the different length. In addition, many earlier works in NLP and IR fields recommend that smoothing term frequencies is necessary in order to build a more accurate model.</Paragraph>
      <Paragraph position="1"> Thus, we estimate CU</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="2" type="metho">
    <SectionTitle>
CXCY
</SectionTitle>
    <Paragraph position="0"> as the normalized and smoothed frequency of actual term frequency DC  can be regarded as the value estimated by the following steps : 1) Add AI of all CYCE CY terms to the document CS  is the interpolation of the uniform probability and the probability proportional to the length of the document, and the interpolation is calculated as follows:</Paragraph>
    <Paragraph position="2"> from the long documents can be more reliable than those in the short documents, hence we try to interpolate between the two different probabilities with the parameter AB ranging from 0 to 1. Consequently,</Paragraph>
  </Section>
  <Section position="8" start_page="2" end_page="2" type="metho">
    <SectionTitle>
AL
CX
</SectionTitle>
    <Paragraph position="0"> is a weighted average over all training documents belonging to the class CR, and AM</Paragraph>
  </Section>
  <Section position="9" start_page="2" end_page="2" type="metho">
    <SectionTitle>
CX
</SectionTitle>
    <Paragraph position="0"> for the class AMCR can be estimated in the same manner.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Feature Weighting
</SectionTitle>
      <Paragraph position="0"> Feature selection is often performed as a preprocessing step for the purpose of both reducing the feature space and improving the classification performance. Text classifiers are then trained with various machine learning algorithms in the resulting feature space. (Yang and Pedersen, 1997) investigated some measures to select useful term features including mutual information(MI), information gain(IG), and</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="2" end_page="2" type="metho">
    <SectionTitle>
AV
BE
</SectionTitle>
    <Paragraph position="0"> -statistics(CHI), etc. On the contrary, (Joachims, 1998) claimed that there is no useless term features, and it is preferable to use all term features. It is clear that learning and classification become very efficient when the feature space is considerably reduced. However, there is no definite conclusion about the contribution of feature selection to improve overall performances of the text classification systems. It may considerably depend on the employed learning algorithm. We believe that proper external feature selection or weighting is required to improve the performances of naive Bayes since the naive Bayes has no framework of the discriminative optimizing process in itself. Of the two possible approaches, feature selection is very inefficient in case that the additional training documents are provided continuously. It is because the feature set should be redefined according to the modified term statistics in the new training document set, and classifiers should be trained again with this new feature set. For that reason, we prefer to use feature weighting to improve naive Bayes rather than feature selection.</Paragraph>
    <Paragraph position="1"> With the feature weighting method, our DE  not labeled as CR cd each term feature: information gain, AV</Paragraph>
  </Section>
  <Section position="11" start_page="2" end_page="2" type="metho">
    <SectionTitle>
BE
</SectionTitle>
    <Paragraph position="0"> -statistics and probability ratio. Information gain (or average mutual information) is an information-theoretic measure defined by the amount of reduced uncertainty given a piece of information. We use the information gain value as the weight of each term for the class CR, and the value is calculated using a document event model as follows:</Paragraph>
    <Paragraph position="2"> where, for example, D4B4CRB5 is the number of documents belonging to the class CR divided by the total number of documents, and D4B4AMDBB5 is the number of documents without the term DB divided by the total number of documents, etc.</Paragraph>
    <Paragraph position="3"> Second measure we used is AV</Paragraph>
  </Section>
  <Section position="12" start_page="2" end_page="2" type="metho">
    <SectionTitle>
BE
</SectionTitle>
    <Paragraph position="0"> - statistics developed for the statistical test of the hypothesis. In the text classification, given a two-way contingency table for each term D8</Paragraph>
    <Paragraph position="2"> where, CP,CQ,CR and CS indicate the number of documents for each cell in the above contingency table.</Paragraph>
    <Paragraph position="3"> (Yang and Pedersen, 1997) compared the various feature selection methods, and concluded that these two measures are most effective for their kNN and LLSF classification models.</Paragraph>
    <Paragraph position="4"> Finally, we introduce a new measure - probability ratio. Probability ratio is defined by,</Paragraph>
    <Paragraph position="6"> This measure calculates the sum of the ratio of two class-conditional probabilities from each class and its reciprocal. The former term and the latter term are representing the degree of predicting positive and negative class respectively. The weight using this measure always has a positive value higher than 2.</Paragraph>
    <Paragraph position="7"> We have conducted the experiments with these three measures for the feature weighting test, and the results are given in Section 4.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.4 Implementation Issues
</SectionTitle>
      <Paragraph position="0"> sented in Section 2 becomes impossible. However, it is trivial since most of IR systems do not have interest on exact posterior probability. In addition, all the parameters in our model is guaranteed to be calculated by the incremental way.</Paragraph>
    </Section>
  </Section>
  <Section position="13" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.1 Data and Evaluation Measure
</SectionTitle>
      <Paragraph position="0"> Our experiments were performed on the two datasets: Reuters21578 and KoreanNews2002 collection. Reuters21578 collection is the most widely used benchmark dataset for the text categorization research. We have used &amp;quot;ModApte&amp;quot; split version, which consists of 9603 training documents and 3299 test documents. There are 90 categories, and each document has one or more of the categories.</Paragraph>
      <Paragraph position="1"> We have built another benchmark collection - KoreanNews2002 collection. KoreanNews2002 collection is composed of 15,000 news articles published during the year of 2002. The articles are collected from a number of Korean news portal websites, and each article is labeled with exactly one of the 46 classes. All the documents have date stamps attached and have been ordered according to their date stamps. With this date order, we divided them into the former 10,000 documents for training and the latter 5,000 documents for testing.</Paragraph>
      <Paragraph position="2"> The performances are evaluated using popular F1 measure, and the F1 values for each class are microaveraged(MicroF1) and macro-averaged(MacroF1) to examine the general classification performances.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.2 Proposed Model : PNB (vs. UM)
</SectionTitle>
      <Paragraph position="0"> Figure 1 shows the performances of our new model named Poisson naive Bayes(PNB) classifiers ac- null cording to the interpolation parameter AB for estimating Poisson mean AL and AM. The baseline method is a unigram model classifier (UM) which is also referred to multinomial naive Bayes classifier described in (McCallum and Nigam, 1998). Our proposed PNB clearly outperforms the UM.</Paragraph>
      <Paragraph position="1"> Although there is no significant difference of MicroF1 values among the various AB values, the F1 value of each class is considerably affected by the AB values. Figure 2 presents the fluctuations of the F1 values for 4 classes in Reuters21578 collection.</Paragraph>
      <Paragraph position="2"> From this result, we can assume that there is no global optimal value of AB, but each class has its own optimal AB. In our experiments, many of the classes have the highest F1 value when AB is about 0.8 or 0.9 except some classes such as corn class which shows the highest F1 value at AB BP BCBMBF. Similar results are obtained in the KoreanNews2002 collections. null Table 2 and 3 shows the MicroF1 and MacroF1 values of the unigram model classifiers and our PNB on the two collections, where PNB(min) and PNB(max) are the highest and lowest values at different AB. In any cases, PNB is superior to UM.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.3 Feature Weighting : PNB-{IG, CHI, PrR}
</SectionTitle>
      <Paragraph position="0"> We have fixed the interpolation parameter AB at 0.8, and evaluated the following feature weighting methods: PNB-IG with information gain, PNB-CHI with</Paragraph>
    </Section>
  </Section>
  <Section position="14" start_page="2" end_page="2" type="metho">
    <SectionTitle>
AV
BE
</SectionTitle>
    <Paragraph position="0"> -statistic, and PNB-PrR with probability ratio. In these experiments, some important behaviors of feature weighted PNB classifiers are observed from the results. In order to explain the phenomenon, we have grouped the classes into the bins according to  The different average F1 performance of each bin is shown in Figure 3. The clear observation from this result is that feature weighting is highly effective in the bins of the classes with a small number of training documents, but hardly contributes the performances for the bins of the classes with sufficiently many training documents. In the bins with enough training documents, simple PNB classifiers show the similar performances to the PNB with feature weighting methods. This tendency is more clearly captured in the Reuters21578 collection, where a third of the classes have fewer than 10 training documents. In contrast, two thirds of the classes in the KoreanNews2002 collection have more than a hundred of training documents.</Paragraph>
    <Paragraph position="1"> Among the feature weighting methods, PNB-PrR performs stably than PNB-IG and PNB-CHI.</Paragraph>
    <Paragraph position="2"> PNB-IG or PNB-CHI somewhat degrades the performance in the classes with the large number of training documents, while PNB-PrR maintains the good performances in those classes on both of the collections. On the other hand, PNB-IG and PNB-CHI considerably improve the performances in the rare categories though the improvement is somewhat different from the two collections. For example, PNB-CHI significantly improves the simple PNB on the Reuters21578 collection while PNB-IG is very effective on the KoreanNews2002 collection. Thus, we can realize that the proper feature weighting method depends on the characteristics of the collection, and different feature weighting strategies should be adopted to improve the naive Bayes text classification.</Paragraph>
    <Paragraph position="3"> From these observations, we tested another classifier PNB  which employ different feature weighting method for each bin to obtain the near optimal performances. Table 4 and 5 show the summary of the performances including PNB  on the both collections. Our proposed model with feature weighting methods are very effective compared to the baseline UM method. Moreover, the performance of bin-optimized PNB  in Reuters21578 collection shows that Poisson naive Bayes with feature weighting methods can achieve the state-of-the-art performances achieved by SVM or kNN which are reported in (Yang and Liu, 1999; Joachims, 1998).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>