<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0306">
  <Title>Mistake-Driven Learning in Text Categorization</Title>
  <Section position="4" start_page="55" end_page="56" type="metho">
    <SectionTitle>
2 Text Categorization
</SectionTitle>
    <Paragraph position="0"> In text categorization, given a text document and a collection of potential classes, the algorithm decides which classes it belongs to, or how strongly it belongs to each class. For example, possible classes (categories) may be {bond}, {loan}, {interest}, {acquisition}.  Documents that have been categorized by humans are usually used as training data for a text categorization system; later on, the trained system is used to categorize new documents. Algorithms used to train text categorization systems in information retrieval (IR) are often ad-hoc and poorly understood.</Paragraph>
    <Paragraph position="1"> In particular, very little is known about their generalization performance, that is, their behavior on documents outside the training data. Only recently, some machine learning techniques for training linear classifiers have been used and shown to be effective in this domain (Lewis et al., 1996; Cohen and Singer, 1996). These techniques have the advantage that they are better understood from a theoretical standpoint, leading to performance guarantees and guidance in parameter settings. Continuing this line of research we present different algorithms and focus on adjusting them to the unique characteristics of the domain, yielding good performance on the categorization task.</Paragraph>
    <Section position="1" start_page="55" end_page="55" type="sub_section">
      <SectionTitle>
2.1 Training Text Classifiers
</SectionTitle>
      <Paragraph position="0"> Text classifiers represent a document as a set of features $d = \{f_1, f_2, \ldots, f_m\}$, where $m$ is the number of active features in the document, that is, features that occur in the document. A feature $f_i$ may typically represent a word $w$, a set $w_1, \ldots, w_k$ of words (Cohen and Singer, 1996) or a phrasal structure (Lewis, 1992; Tzeras and Hartmann, 1993). The strength of the feature f in the document d is denoted by s(f, d). The strength is usually a function of the number of times f appears in d (denoted by n(f, d)). The strength may be used only to indicate the presence or absence of f in the document, in which case it takes on only the values 0 or 1, it may be equal to n(f, d), or it can take other values to reflect also the size of the document.</Paragraph>
      <Paragraph position="1"> In order to rank documents, for each category, a text categorization system keeps a function Fc which, when evaluated on d, produces a score Fc(d).</Paragraph>
      <Paragraph position="2"> A decision is then made by assigning to the category c only those documents whose score exceeds some threshold, or just by placing the documents with the highest scores at the top of the ranking. A linear text classifier represents a category as a weight vector $w_c = (w(f_1, c), w(f_2, c), \ldots, w(f_n, c)) = (w_1, w_2, \ldots, w_n)$, where n is the total number of features in the domain and w(f, c) is the weight of the feature f for this category. It evaluates the score of the document by computing the dot product: $F_c(d) = \sum_{f \in d} s(f, d) \cdot w(f, c)$.</Paragraph>
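      <Paragraph> The scoring rule can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the paper: documents and weight vectors are stored as dictionaries keyed by feature, and the names score and assign are hypothetical.

```python
def score(strengths, weights):
    """Dot product over the active features of one document.

    strengths: dict mapping feature -> s(f, d), active features only.
    weights:   dict mapping feature -> w(f, c); absent features count as 0.
    """
    return sum(s * weights.get(f, 0.0) for f, s in strengths.items())


def assign(strengths, weights, threshold):
    """Assign the document to the category when its score exceeds the threshold."""
    return score(strengths, weights) > threshold
```

Because only the active features are visited, the cost of scoring depends on the document length rather than on the total number of features n.</Paragraph>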
      <Paragraph position="4"> The problem is modeled as a supervised learning problem. The algorithms use the training data, where each document is labeled by zero or more categories, to learn a classifier which classifies new texts. A document is considered as a positive example for all categories with which it is labeled, and as a negative example to all others.</Paragraph>
      <Paragraph position="5"> The task of a training algorithm for a linear text classifier is to find a weight vector which best classifies new text documents. While a linear text classifier is a linear separator in the space defined by the features, it may not be linear with respect to the document, if one chooses to use complex features such as conjunctions of simple features. In addition, a training algorithm may also give advice on the issue of feature selection, by reducing the weight of unimportant features and thus effectively discarding them.</Paragraph>
    </Section>
    <Section position="2" start_page="55" end_page="56" type="sub_section">
      <SectionTitle>
2.2 Related Work
</SectionTitle>
      <Paragraph position="0"> Many of the techniques previously used in text categorization make use of linear classifiers, mainly for reasons of efficiency. The classical vector space model, which ranks documents using a nonlinear similarity measure (the "cosine correlation") (Salton and Buckley, 1983), can also be recast as a linear classification by incorporating length normalization into the weight vector and the document vector's feature values. State of the art IR systems determine the strength of a term based on three values: (1) the frequency of the feature in the document (tf), (2) an inverse measure of the frequency of the feature throughout the data set (idf), and (3) a normalization factor that takes into account the length of the document. In Sections 4.1 and 4.3 we discuss how we incorporate those ideas in our setting.</Paragraph>
      <Paragraph position="1"> Most relevant to our work are non-parametric methods, which seem to yield better results than parametric techniques. Rocchio's algorithm (Rocchio, 1971), one of the most commonly used techniques, is a batch method that works in a relevance feedback context. Typically, classifiers produced by the Rocchio algorithm are restricted to having non-negative weights. An important distinction between most of the classical non-parametric methods and the learning techniques we study here is that in the former case, there was no theoretical work that addressed the generalization ability of the learned classifier, that is, how it behaves on new data.</Paragraph>
      <Paragraph position="2"> The methods that are most similar to our techniques are the on-line algorithms used in (Lewis et al., 1996) and (Cohen and Singer, 1996). In the first, two algorithms, a multiplicative-update algorithm and an additive-update algorithm suggested in (Kivinen and Warmuth, 1995a), are evaluated in the text categorization domain and are shown to perform somewhat better than Rocchio's algorithm. While both these works make use of multiplicative update algorithms, as we do, there are two major differences between those studies and the current one. First, there are some important technical differences between the algorithms used. Second, the algorithms we study here are mistake-driven; they update the weight vector only when a mistake is made, and not after every example seen. The Experts algorithm studied in (Cohen and Singer, 1996) is very similar to a basic version of the BalancedWinnow algorithm which we study here. The way we treat the negative weights is different, though, and significantly more efficient, especially in sparse domains (see Section 3.1). Cohen and Singer also experiment, using the same algorithm, with more complex features (sparse n-grams) and show that, as expected, it yields better results.</Paragraph>
      <Paragraph position="3"> Our additive update algorithm, Perceptron, is somewhat similar to what is used in (Wiener, Pedersen, and Weigend, 1995). They use a more complex representation, a multi-layer network, but this additional expressiveness seems to make training more complicated, without contributing to better results.</Paragraph>
    </Section>
    <Section position="3" start_page="56" end_page="56" type="sub_section">
      <SectionTitle>
2.3 Methodology
</SectionTitle>
      <Paragraph position="0"> We evaluate our algorithms on the Reuters-22173 text collection (Lewis, 1992), one of the most commonly used benchmarks in the literature.</Paragraph>
      <Paragraph position="1"> In the experiments reported in Section 3.2 we explore and compare different variations of the algorithms; we evaluate those on two disjoint pairs of a training set and a test set, both subsets of the Reuters collection. Each pair consists of 2000 training documents and 1000 test documents, and was used to train and test the classifier on a sample of 10 topical categories. The figures reported are the average results on the two test sets.</Paragraph>
      <Paragraph position="2"> In addition, we have tested our final version of the classifier on two common partitions of the complete Reuters collection, and compare the results with those of other works. The two partitions used are those of Lewis (Lewis, 1992) (14704 documents for training, 6746 for testing) and Apte (Apte, Damerau, and Weiss, 1994) (10645 training, 3672 testing, omitting documents with no topical category).</Paragraph>
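      <Paragraph position="3"> To evaluate performance, the usual measures of recall and precision were used. Specifically, we measured the effectiveness of the classification by keeping track of the following four numbers: $p_1$, the number of correctly classified category members; $p_2$, the number of misclassified category members; $n_1$, the number of correctly classified non-members; and $n_2$, the number of misclassified non-members. In those terms, the recall measure is defined as $p_1/(p_1 + p_2)$, and the precision is defined as $p_1/(p_1 + n_2)$.</Paragraph>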
      <Paragraph position="4"> Performance was further summarized by a break-even point - a hypothetical point, obtained by interpolation, in which precision equals recall.</Paragraph>
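      <Paragraph> As a rough illustration of these measures, the sketch below computes recall and precision from the counts and finds a break-even point by linear interpolation between neighbouring operating points. The interpolation scheme is an assumption (the text only says the point is obtained by interpolation), the function names are hypothetical, and p1, p2 and n2 are taken to be correctly classified members, missed members and wrongly accepted non-members, as the two formulas imply.

```python
def recall(p1, p2):
    """p1 / (p1 + p2); returns 0.0 when the category has no members."""
    total = p1 + p2
    return p1 / total if total > 0 else 0.0


def precision(p1, n2):
    """p1 / (p1 + n2); returns 0.0 when nothing was assigned to the category."""
    total = p1 + n2
    return p1 / total if total > 0 else 0.0


def break_even(points):
    """Interpolated value at which precision equals recall.

    points: (recall, precision) pairs measured at successive thresholds.
    Returns None when no crossing is found between neighbouring pairs.
    """
    for (r0, q0), (r1, q1) in zip(points, points[1:]):
        d0, d1 = q0 - r0, q1 - r1
        if d0 == 0:
            return r0
        if d1 == 0:
            return r1
        if d0 * d1 > 0:
            continue  # same sign: precision and recall do not cross here
        t = d0 / (d0 - d1)  # fraction of the way from the first point
        return r0 + t * (r1 - r0)
    return None
```

For example, the operating points (0.5, 0.9) and (0.9, 0.5) cross halfway between them, giving a break-even value of about 0.7.</Paragraph>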
    </Section>
  </Section>
  <Section position="5" start_page="56" end_page="60" type="metho">
    <SectionTitle>
3 On-Line learning algorithms
</SectionTitle>
    <Paragraph position="0"> In this section we present the basic versions of the learning algorithms we use. The algorithms are used to learn a classifier Fc for each category c. These algorithms use the training data, where each document is labeled by zero or more categories, to learn a weight vector which is used later on, in the test phase, to classify new text documents. A document is considered as a positive example for all categories with which it is labeled, and as a negative example to all others. The algorithms are on-line and mistake-driven. In the on-line learning model, learning takes place in a sequence of trials. On each trial, the learner first makes a prediction and then receives feedback which may be used to update the current hypothesis (the vector of weights). A mistake-driven algorithm updates its hypothesis only when a mistake is made. In the training phase, given a collection of examples, we may repeat this process a few times, by iterating on the data. In the testing phase, the same process is repeated on the test collection, only that the hypothesis is not updated.</Paragraph>
    <Paragraph position="1"> Let n be the number of features of the current category. For the remainder of this section we denote a training document with m active features by $d = (s_{i_1}, s_{i_2}, \ldots, s_{i_m})$, where $s_{i_j}$ stands for the strength of the $i_j$-th feature. The label of the document is denoted by y; y takes the value 1 if the document is relevant to the category and 0 otherwise. Notice that we care only about the active features in the domain, following (Blum, 1992). The algorithms have three parameters: a threshold $\theta$, and two update parameters, a promotion parameter $\alpha$ and a demotion parameter $\beta$.</Paragraph>
    <Paragraph position="2"> Positive Winnow (Littlestone, 1988): The algorithm keeps an n-dimensional weight vector $w = (w_1, w_2, \ldots, w_n)$, $w_i$ being the weight of the ith feature, which it updates whenever a mistake is made. Initially, the weight vector is typically set to assign equal positive weight to all features. (We use the value $\theta/d$, where d is the average number of active features in a document; in this way initial scores are close to $\theta$.) The promotion parameter is $\alpha &gt; 1$ and the demotion parameter is $0 &lt; \beta &lt; 1$.</Paragraph>
    <Paragraph position="3"> For a given instance $(s_{i_1}, s_{i_2}, \ldots, s_{i_m})$ the algorithm predicts 1 iff $\sum_{j=1}^{m} w_{i_j} s_{i_j} &gt; \theta$,</Paragraph>
    <Paragraph position="5"> where $w_{i_j}$ is the weight corresponding to the active feature indexed by $i_j$. The algorithm updates its hypothesis only when a mistake is made, as follows: (1) If the algorithm predicts 0 and the label is 1 (positive example) then the weights of all the active features are promoted: the weight $w_{i_j}$ is multiplied by $\alpha$. (2) If the algorithm predicts 1 and the received label is 0 (negative example) then the weights of all the active features are demoted: the weight $w_{i_j}$ is multiplied by $\beta$. In both cases, weights of inactive features maintain the same value.</Paragraph>
    <Paragraph position="6"> Perceptron (Rosenblatt, 1958): As in PositiveWinnow, in Perceptron we also keep an n-dimensional weight vector $w = (w_1, w_2, \ldots, w_n)$ whose entries correspond to the set of potential features, which is updated whenever a mistake is made. As above, the initial weight vector is typically set to assign equal weight to all features. The only difference between the algorithms is that in this case the weights are updated in an additive fashion. A single update parameter $\alpha &gt; 0$ is used, and a weight is promoted by adding $\alpha$ to its previous value, and is demoted by subtracting $\alpha$ from it. In both cases, all other weights maintain the same value.</Paragraph>
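    <Paragraph> The additive update can be sketched in the same style. As before, the dictionary representation and the name perceptron_step are illustrative assumptions; the update adds or subtracts alpha for each active feature, as the description above states, without scaling by the feature strength.

```python
def perceptron_step(weights, doc, label, theta, alpha):
    """Run one mistake-driven Perceptron trial; return True on an update.

    weights: dict feature -> weight (may become negative).
    doc:     dict active feature -> strength.
    label:   1 for a positive example, 0 for a negative one.
    """
    score = sum(weights[f] * s for f, s in doc.items())
    prediction = 1 if score > theta else 0
    if prediction == label:
        return False
    delta = alpha if label == 1 else -alpha  # additive promotion or demotion
    for f in doc:  # inactive features keep their weights
        weights[f] += delta
    return True
```
</Paragraph>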
    <Paragraph position="7"> Balanced Winnow (Littlestone, 1988): In this case, the algorithm keeps two weights, $w^+$ and $w^-$, for each feature. The overall weight of a feature is the difference between these two weights, thus allowing for negative weights. For a given instance $(s_{i_1}, s_{i_2}, \ldots, s_{i_m})$ the algorithm predicts 1 iff $\sum_{j=1}^{m} (w^+_{i_j} - w^-_{i_j}) s_{i_j} &gt; \theta$, (1)</Paragraph>
    <Paragraph position="9"> where $w^+_{i_j}$ and $w^-_{i_j}$ correspond to the active feature indexed by $i_j$. In our implementation, the weights $w^+$ are initialized to $2\theta/d$ and the weights $w^-$ are set to $\theta/d$, where d is the average number of active features in a document in the collection.</Paragraph>
    <Paragraph position="10"> The algorithm updates the weights of active features only when a mistake is made, as follows: (1) In the promotion step, following a mistake on a positive example, the positive part of the weight is promoted, $w^+_{i_j} \leftarrow \alpha \cdot w^+_{i_j}$, while the negative part of the weight is demoted, $w^-_{i_j} \leftarrow \beta \cdot w^-_{i_j}$. Overall, the coefficient of $s_{i_j}$ in Eq. 1 increases after a promotion. (2) In the demotion step, following a mistake on a negative example, the coefficient of $s_{i_j}$ in Eq. 1 is decreased: the positive part of the weight is demoted, $w^+_{i_j} \leftarrow \beta \cdot w^+_{i_j}$, while the negative part of the weight is promoted, $w^-_{i_j} \leftarrow \alpha \cdot w^-_{i_j}$. In both cases, all other weights maintain the same value.</Paragraph>
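    <Paragraph> The two-weight scheme can be sketched as below, under the same illustrative assumptions as before (dictionary storage, hypothetical function names). The initial values follow the ones given above for the positive and negative parts.

```python
def init_balanced(features, theta, avg_active):
    """Initial weights: 2*theta/d for the positive part, theta/d for the negative."""
    w_pos = {f: 2 * theta / avg_active for f in features}
    w_neg = {f: theta / avg_active for f in features}
    return w_pos, w_neg


def balanced_winnow_step(w_pos, w_neg, doc, label, theta, alpha, beta):
    """Run one mistake-driven trial; return True if the weights changed."""
    score = sum((w_pos[f] - w_neg[f]) * s for f, s in doc.items())
    prediction = 1 if score > theta else 0
    if prediction == label:
        return False
    # Promotion raises the positive part and lowers the negative part;
    # demotion does the opposite.
    up, down = (alpha, beta) if label == 1 else (beta, alpha)
    for f in doc:  # only active features are updated
        w_pos[f] *= up
        w_neg[f] *= down
    return True
```

With theta = 1 and an average of d = 2 active features, each feature starts with an effective weight of 0.5, so a typical document initially scores close to the threshold.</Paragraph>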
    <Paragraph position="11"> In this algorithm (see Eq. 1) the coefficient of the ith feature can take negative values, unlike the representation used in PositiveWinnow. There are other versions of the Winnow algorithm that allow the use of negative features: (1) Littlestone, when introducing the Balanced version, also introduced a simpler version: a version of PositiveWinnow with a duplication of the number of features. (2) A version of the Winnow algorithm with negative features is used in (Cohen and Singer, 1996). In both cases, however, whenever there is a need to update the weights, all the weights are updated (actually, n out of the 2n). In the version we use, only weights of active features are updated; this gives a significant computational advantage when working in a sparse high dimensional space.</Paragraph>
    <Section position="1" start_page="57" end_page="58" type="sub_section">
      <SectionTitle>
3.1 Properties of the Algorithms
</SectionTitle>
      <Paragraph position="0"> Winnow and its variations were introduced in Littlestone's seminal paper (Littlestone, 1988); the theoretical behavior of multiplicative weight-updating algorithms for learning linear functions has been studied extensively since then. In particular, Winnow has been shown to learn efficiently any linear threshold function (Littlestone, 1988). These are functions $F : \{0, 1\}^n \to \{0, 1\}$ for which there exist real weights $w_1, \ldots, w_n$ and a real threshold $\theta$ such that $F(s_1, \ldots, s_n) = 1$ iff $\sum_{i=1}^{n} w_i s_i &gt; \theta$. In particular, these functions include Boolean disjunctions and conjunctions on $k \le n$ variables and r-of-k threshold functions ($1 &lt; r &lt; k \le n$). While Winnow is guaranteed to find a perfect separator if one exists, it also appears to be fairly successful when there is no perfect separator. The algorithm makes no independence or any other assumptions on the features, in contrast to other parametric estimation techniques (typically, Bayesian predictors) which are commonly used in statistical NLP.</Paragraph>
      <Paragraph position="1"> Theoretical analysis has shown that the algorithm has exceptionally good behavior in the presence of  irrelevant features, noise, and even a target function changing in time (Littlestone, 1988; Littlestone, 1991; Littlestone and Warmuth, 1994; Herbster and Warmuth, 1995), and there is already some empirical support for these claims (Littlestone, 1995; Golding and Roth, 1996; Blum, 1995). The key feature of Winnow is that its mistake bound grows linearly with the number of relevant features and only logarithmically with the total number of features. A second important property is being mistake driven.</Paragraph>
      <Paragraph position="2"> Intuitively, this makes the algorithm more sensitive to the relationships among the features -- relationships that may go unnoticed by an algorithm that is based on counts accumulated separately for each attribute. This is crucial in the analysis of the algorithm as well as empirically (Littlestone, 1995; Golding and Roth, 1996).</Paragraph>
      <Paragraph position="3"> The discussion above holds for both versions of Winnow studied here, PositiveWinnow and BalancedWinnow. The theoretical results differ only slightly in the mistake bounds, but have the same flavor. However, the major difference between the two algorithms, one using only positive weights and the other allowing also negative weights, plays a significant role when applied in the current domain, as discussed in Section 4.</Paragraph>
      <Paragraph position="4"> Winnow is closely related to, and has served as the motivation for, a collection of recent works on combining the "advice" of different "experts" (Littlestone and Warmuth, 1994; Cesa-Bianchi et al., 1995; Cesa-Bianchi et al., 1994). The features used are the "experts" and the learning algorithm can be viewed as an algorithm that learns how to combine the classifications of the different experts in an optimal way.</Paragraph>
      <Paragraph position="5"> The additive-update algorithm that we evaluate here, the Perceptron, goes back to (Rosenblatt, 1958). While this algorithm is also known to learn the target linear function when it exists, the bounds given by the Perceptron convergence theorem (Duda and Hart, 1973) may be exponential in the optimal mistake bound, even for fairly simple functions (Kivinen and Warmuth, 1995b). We refer to (Kivinen and Warmuth, 1995a) for a thorough analysis of multiplicative update algorithms versus additive update algorithms. In particular, it is shown that the number of mistakes the additive and multiplicative update algorithms make depends differently on the domain characteristics. Informally speaking, it is shown that the multiplicative update algorithms have advantages in high dimensional problems (i.e., when the number of features is large) and when the target weight vector is sparse (i.e., contains many weights that are close to 0). This explains the recent success in using these methods on high dimensional problems (Golding and Roth, 1996) and suggests that multiplicative-update algorithms might do well on IR applications, provided that a good set of features is selected. On the other hand, it is shown that additive-update algorithms have advantages when the examples are sparse in the feature space, another typical characteristic of the IR domain, which motivates us to study experimentally an additive-update algorithm as well.</Paragraph>
    </Section>
    <Section position="2" start_page="58" end_page="58" type="sub_section">
      <SectionTitle>
3.2 Evaluating the Basic Versions
</SectionTitle>
      <Paragraph position="0"> We started by evaluating the basic versions of the three algorithms. The features we use throughout the experiments are single words, at the lemma level, for nouns and verbs only, with minimal frequency of 3 occurrences in the corpus. In the basic versions the strength of the feature is taken to indicate only the presence or absence of f in the document, that is, it is either 1 or 0. The training algorithm was run iteratively on the training set, until no mistakes were made on the training collection or until some upper bound (50) on the number of iterations was reached.</Paragraph>
      <Paragraph position="1"> The results for the basic versions are shown in the first column of Table 1.</Paragraph>
      <Paragraph position="2"> 4 Extensions to the Basic algorithms</Paragraph>
    </Section>
    <Section position="3" start_page="58" end_page="59" type="sub_section">
      <SectionTitle>
4.1 Length Variation and Negative features
</SectionTitle>
      <Paragraph position="0"> Text documents vary widely in their length and a text classifier needs to tolerate this variation. This issue is a potential problem for a linear classifier which scores a document by summing the weights of all its active features: a long document may have a better chance of exceeding the threshold merely by its length.</Paragraph>
      <Paragraph position="1"> This problem was identified early on and attracted considerable attention in the classical IR literature (Salton and Buckley, 1983), as we have indicated in Section 2.2. The treatment described there addresses at the same time at least two different concerns: length variation of documents and feature repetition. In this section we consider the first of those, and discuss how it applies to the algorithms we investigate. The second concern is discussed in Section 4.3.</Paragraph>
      <Paragraph position="2"> Algorithms that allow the use of negative features, such as BalancedWinnow and Perceptron, tolerate variation in document length naturally, and thus have a significant advantage in this respect.</Paragraph>
      <Paragraph position="3"> In these cases, it can be expected that the cumulative contribution of the weights and, in particular, those that are not indicative to the current category, does not count towards exceeding the threshold, but rather averages out to 0. Indeed, as we found out, no special normalization is required when using these algorithms. Their significant advantage over the unnormalized version of PositiveWinnow is readily seen in Table 1.</Paragraph>
      <Paragraph position="4"> In addition, using negative weights gives the text classifier more flexibility in capturing "truly negative" features, where the presence of a feature indicates that the document is irrelevant to the category. However, we found that this phenomenon only rarely occurs in text categorization and thus the main use of the negative features is to tolerate the length variation of the documents.</Paragraph>
      <Paragraph position="5"> Table 1 (caption fragment): each figure is an average result for two pairs of training and testing sets, each containing 2000 training documents and 1000 test documents.</Paragraph>
      <Paragraph position="6"> When using PositiveWinnow, which uses only positive weights, we no longer have this advantage and we seek a modification that tolerates the variation in length. As in the standard IR solution, we suggest to modify s(f, d), the strength of the feature f in d, by using a quantity that is normalized with respect to the document size.</Paragraph>
      <Paragraph position="7"> Formally, we replace the strength s(f, d) (which may be determined in several ways according to feature frequency, as explained below) by a normalized strength, $s_n(f, d) = \frac{s(f, d)}{\sum_{f' \in d} s(f', d)}$. In this case (which applies, as discussed above, only for PositiveWinnow), we also change the initial weight vector and initialize all the weights to $\theta$.</Paragraph>
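      <Paragraph> A minimal sketch of this normalization, assuming strengths are held in a dictionary keyed by feature (the name normalize is hypothetical):

```python
def normalize(strengths):
    """Divide each strength by the sum of strengths in the document."""
    total = sum(strengths.values())
    if total == 0:
        return dict(strengths)  # empty or all-zero document: nothing to scale
    return {f: s / total for f, s in strengths.items()}
```

After normalization the strengths of a document sum to 1, so when every weight equals the same value the score equals that value regardless of how long the document is.</Paragraph>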
      <Paragraph position="8"> Using normalization gives an effect that is similar to the use of negative weights, but to a lesser degree. The reason is that it is used uniformly; in long documents, the number of indicative features does not increase significantly, but their strength, nevertheless, is reduced proportionally to the total number of features in the document. In the long version of the paper we present a more thorough analysis of this issue.</Paragraph>
      <Paragraph position="9"> The results presented in Table 1 (second column) show the significant improvements achieved in PositiveWinnow performance when normalization is used. In all the results presented from this point on, PositiveWinnow is normalized.</Paragraph>
    </Section>
    <Section position="4" start_page="59" end_page="60" type="sub_section">
      <SectionTitle>
4.2 Using Threshold range
</SectionTitle>
      <Paragraph position="0"> Training a linear text classifier is a search for a weight vector in the feature space. The search is for a linear separator that best separates documents that are relevant to the category from those that are not.</Paragraph>
      <Paragraph position="1"> In general, there is no guarantee that a weight vector of this sort exists, even in the training data, but a good selection of features makes this more likely.</Paragraph>
      <Paragraph position="2"> While the basic versions of our algorithms search for linear separators, we have modified those so that our search for a linear classifier is biased to look for "thick" classifiers. To understand this, consider, for the moment, the case in which all the data is perfectly linearly separable. Then there will generally be many linear classifiers that separate the training data we actually see. Among these, it seems plausible that we have a better chance of doing well on the unseen test data if we choose a linear separator that separates the positive and negative training examples as "widely" as possible. The idea of having a wide separation is less clear when there is no perfect separator, but we can still appeal to the basic intuition.</Paragraph>
      <Paragraph position="3"> Using a "thick" separator is even more important when documents are ranked rather than simply classified; that is, when the actual score produced by the classifier is used in the decision process. The reason is that if Fc(d) is the score produced by the classifier Fc when evaluated on the document d then, under some assumptions on the dependencies among the features, the probability that the document d is relevant to the category c is given by $\mathrm{Prob}(d \in c) = \frac{1}{1 + e^{-(F_c(d) - \theta)}}$. This function, known as the sigmoid function, "flattens" the decision region in a way that only scores that are far apart from the threshold value indicate that the decision is made with significant probability.</Paragraph>
      <Paragraph position="4"> Formally, among those weight vectors we would like to choose the hyper-plane with the largest "separating parameter", where the separating parameter $\tau$ is defined as the largest value for which there exists a classifier $F_c$ (defined by a weight vector w) such that for all positive examples d, $F_c(d) &gt; \theta + \tau/2$ and for all negative d, $F_c(d) &lt; \theta - \tau/2$.</Paragraph>
      <Paragraph position="5"> In this implementation we do not try to find the optimal $\tau$ (as is done in (Cortes and Vapnik, 1995)), but rather determine it heuristically. In order to find a "thick" separator, we modify, in all three algorithms, the update rule used during the training phase as follows: Rather than using a single threshold we use two separate thresholds, $\theta^+$ and $\theta^-$, such that $\theta^+ - \theta^- = \tau$. During training, we say that the algorithm predicts 0 (and makes a mistake, if the example is labeled positive) when the score it assigns an example is below $\theta^-$. Similarly, we say that the algorithm predicts 1 when the score exceeds $\theta^+$. All examples with scores in the range $[\theta^-, \theta^+]$ are considered mistakes. (Parameters used: $\theta^- = 0.9$, $\theta^+ = 1.1$, $\theta = 1$.)</Paragraph>
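      <Paragraph> The modified mistake test used during training can be sketched as a single predicate. The function name is hypothetical, and the default values are the two thresholds quoted above; at test time the single threshold would be used instead.

```python
def is_training_mistake(score, label, theta_minus=0.9, theta_plus=1.1):
    """Decide whether a training example triggers an update.

    A positive example is handled correctly only if its score exceeds
    theta_plus; a negative example only if its score is below theta_minus.
    Any score inside the closed range between the two counts as a mistake.
    """
    if label == 1:
        return theta_plus >= score
    return score >= theta_minus
```

Replacing the plain comparison against the threshold in any of the three update rules with this predicate gives the threshold-range variant: examples that are classified correctly but too close to the threshold still cause a promotion or demotion.</Paragraph>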
      <Paragraph position="6">  The results presented in the third column of Table 1 show the improvements obtained when the threshold range is used. In all the results presented from this point on, all the algorithms use the threshold range modification.</Paragraph>
    </Section>
    <Section position="5" start_page="60" end_page="60" type="sub_section">
      <SectionTitle>
4.3 Feature Repetition
</SectionTitle>
      <Paragraph position="0"> Due to the bursty nature of term occurrence in documents, as well as the variation in document length, a feature may occur in a document more than once.</Paragraph>
      <Paragraph position="1"> It is therefore important to consider the frequency of a feature when determining its strength. On one hand, there are cases where a feature is more indicative to the relevance of the document to a category when it appears several times in a document. On the other hand, in any long document, there may be some random feature that is not significantly indicative to the current category although it repeats many times. While the weight of f in the weight vector of the category, w(f, c), may be fairly small, its cumulative contribution might be too large if we increase its strength, s(f, d), in proportion to its frequency in the document.</Paragraph>
      <Paragraph position="2"> As mentioned in Section 2.2, the classical IR literature has addressed this problem using the tf and idf factors. We note that the standard treatment in IR suggests a solution to this problem that suits batch algorithms, that is, algorithms that determine the weight of a feature after seeing all the examples. We, on the other hand, seek a solution that can be used in an on-line algorithm. Thus, the frequency of a feature throughout the data set, for example, cannot be taken into account and we take into account only the tf term. We have experimented with three alternative ways of adjusting the value of s(f, d) according to the frequency of the feature in the document: (1) Our default is to let the strength indicate only the activity of the feature. That is, s(f, d) = 1 if the feature is present in the document (active feature) and s(f, d) = 0 otherwise. (2) s(f, d) = n(f, d), where n(f, d) is the number of occurrences of f in d; and (3) $s(f, d) = \sqrt{n(f, d)}$ (as in (Wiener, Pedersen, and Weigend, 1995)). These three alternatives examine the tradeoff between the positive and negative impacts of assigning a strength in proportion to feature frequency. In most of our experiments, on different data sets, the choice of using $\sqrt{n(f, d)}$ performed best. The results of the comparative evaluation appear in columns 3, 4, and 5 of Table 1, corresponding to the three alternatives above.</Paragraph>
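      <Paragraph> The three alternatives can be compared side by side with a small helper. This is an illustrative sketch: the function name, the mode labels and the token-list input are assumptions, and the tokens are taken to be already reduced to features.

```python
import math
from collections import Counter


def feature_strengths(tokens, mode="binary"):
    """Map each active feature to its strength s(f, d).

    mode "binary": 1 for every active feature (the default above).
    mode "count":  n(f, d), the number of occurrences.
    mode "sqrt":   the square root of n(f, d).
    """
    counts = Counter(tokens)
    if mode == "binary":
        return {f: 1.0 for f in counts}
    if mode == "count":
        return {f: float(n) for f, n in counts.items()}
    if mode == "sqrt":
        return {f: math.sqrt(n) for f, n in counts.items()}
    raise ValueError("unknown mode: " + mode)
```

For a document in which one feature occurs four times and another once, the three modes give the repeated feature strengths of 1, 4 and 2 respectively, which shows how the square root dampens the effect of repetition without ignoring it.</Paragraph>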
    </Section>
    <Section position="6" start_page="60" end_page="60" type="sub_section">
      <SectionTitle>
4.4 Discarding features
</SectionTitle>
      <Paragraph position="0"> Multiplicative update algorithms are known to tolerate a very large number of features. However, it seems plausible that most categories depend only on fairly small subsets of indicative features and not on all the features that occur in documents that belong to this class. Efficiency reasons, as well as the occasional need to generate comprehensible explanations of the classifications, suggest that discarding irrelevant features is a desirable goal in IR applications. If done correctly, discarding irrelevant features may also improve the accuracy of the classifier, since irrelevant features contribute noise to the classification score.</Paragraph>
      <Paragraph position="1"> An important property of the algorithms investigated here is that they do not require a feature selection pre-processing stage. Instead, they can run in the presence of a large number of features, and allow for discarding features "on the fly", based on their contribution to an accurate classification. This property is especially important if one is considering enriching the set of features, as is done in (Golding and Roth, 1996; Cohen and Singer, 1996); in these cases it is important to allow the algorithm to decide for itself which of the features contribute to the accuracy of the classification.</Paragraph>
      <Paragraph position="2"> We filter features that are irrelevant for the category based on the weights they were assigned in the first few training rounds.</Paragraph>
      <Paragraph position="3"> The algorithm is given as input a range of weight values, which we call the filtering range. First, the training algorithm is run for several iterations, until the number of mistakes on the training data drops below a certain threshold. After this initial training, we filter out all the features whose weights lie in this filtering range. Training then continues as usual.</Paragraph>
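      <Paragraph> The train, filter, and continue procedure can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the learner is a plain Winnow-style multiplicative update without the normalization or threshold range discussed in this paper, and all names (run_epoch, train_with_filtering, mistake_threshold, max_epochs) and the toy data are hypothetical.
```python
def score(weights, doc):
    """Weighted sum over the features of doc that are still in the model."""
    return sum(weights[f] * s for f, s in doc.items() if f in weights)


def run_epoch(weights, data, theta, alpha, beta):
    """One mistake-driven pass over the data; returns the number of mistakes."""
    mistakes = 0
    for doc, label in data:
        predicted = score(weights, doc) > theta
        if predicted == label:
            continue
        mistakes += 1
        # Promote active features on a missed positive, demote on a false positive.
        factor = alpha if label else beta
        for f in doc:
            if f in weights:
                weights[f] *= factor
    return mistakes


def train_with_filtering(data, features, init_w, theta, alpha, beta,
                         filter_range, mistake_threshold, max_epochs=50):
    weights = {f: init_w for f in features}
    low, high = filter_range
    filtered = False
    for _ in range(max_epochs):
        mistakes = run_epoch(weights, data, theta, alpha, beta)
        if not filtered and mistake_threshold > mistakes:
            # Initial training is done: drop features inside the filtering range.
            weights = {f: w for f, w in weights.items()
                       if not (high >= w >= low)}
            filtered = True
        elif mistakes == 0:
            break  # assumed stopping rule for this sketch
    return weights


data = [
    ({"bond": 1.0, "the": 1.0}, True),
    ({"loan": 1.0, "the": 1.0}, False),
    ({"bond": 1.0}, True),
    ({"loan": 1.0}, False),
]
model = train_with_filtering(data, ["bond", "the", "loan", "rare"],
                             init_w=1.0, theta=3.5, alpha=2.0, beta=0.5,
                             filter_range=(0.5, 2.0), mistake_threshold=1)
print(model)
```
In this toy run only the feature that was promoted twice survives the filter; training then continues on the reduced feature set.</Paragraph>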
      <Paragraph position="4"> There are various ways to determine the filtering range. The obvious one may be to filter out all features whose weight is very close to 0, but there are a few subtle issues involved due to the normalization done in the PositiveWinnow algorithm. In the results presented here we have used a different filtering range instead: it is centered around the initial value assigned to the weights (as specified earlier for each algorithm), and is bounded above and below by the values obtained after one promotion or demotion step relative to the initial value. Thus, with high likelihood, we discard features which have not contributed to many mistakes - those that were promoted or demoted at most once (possibly along with additional promotions and demotions that canceled each other out).</Paragraph>
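      <Paragraph> A small sketch of how such a filtering range could be derived. It assumes that promotion multiplies a weight by alpha and demotion multiplies it by beta; the values 1.1 and 0.9, the initial weight 1.0, and the helper names are illustrative assumptions rather than the settings used in the experiments.
```python
def filtering_range(init_w, alpha, beta):
    """Range spanning one demotion below to one promotion above init_w."""
    return (init_w * beta, init_w * alpha)


def in_range(w, rng):
    low, high = rng
    return high >= w >= low  # bounds included: one update is not enough to survive


rng = filtering_range(1.0, alpha=1.1, beta=0.9)
weights = {
    "bond": 1.1 ** 3,   # promoted three times
    "the": 1.0,         # never updated
    "of": 1.1 * 0.9,    # one promotion and one demotion, roughly cancelling
    "loan": 0.9 ** 2,   # demoted twice
}
kept = sorted(f for f, w in weights.items() if not in_range(w, rng))
print(rng, kept)
```
Features whose updates cancel out end up close to the initial value and are discarded together with features that were never updated, which matches the remark above about promotions and demotions that cancel each other.</Paragraph>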
      <Paragraph position="5"> The results of classification with feature filtering appear in the last column of Table 1. We hypothesize that the improved results are due to reduction in the noise introduced by irrelevant features. Further investigation of this issue will be presented in the long version of this paper. Typically, about two thirds of the features were filtered for each category, significantly reducing the output representation size.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="60" end_page="61" type="metho">
    <SectionTitle>
5 Summary of Experimental Results
</SectionTitle>
    <Paragraph position="0"> The study described in Section 3.2 was used to determine the version that performs best, out of those we have experimented with. Eventually, we selected the version of the BalancedWinnow algorithm that incorporates the threshold-range (θ-range) modification, the square root of the number of occurrences as the feature strength, and the discard-features modification (BalancedWinnow+ in Table 2). The two splits of the corpus are Lewis's split (Lewis, 1992), with 14704 documents for training and 6746 for testing, and Apte's split (Apte, Damerau, and Weiss, 1994), with 10645 documents for training and 3672 for testing, omitting documents with no topical category.</Paragraph>
    <Paragraph position="1"> On the complete Reuters corpus, we have compared this version with a few other algorithms that have appeared in the literature. Table 2 presents break-even points, as defined in Section 2.3, for BalancedWinnow+ and the other algorithms.</Paragraph>
    <Paragraph position="2"> The results are reported for two splits of the complete Reuters corpus as explained in Section 2.3. The algorithm was run with iterations, threshold range, feature filtering, and frequency-square-root feature strength.</Paragraph>
    <Paragraph position="3"> The first two rows in Table 2 compare the performance of BalancedWinnow+ with the two algorithms that most resemble our approach (see Section 2.2): the Experts algorithm from (Cohen and Singer, 1996) and a neural network approach presented in (Wiener, Pedersen, and Weigend, 1995).</Paragraph>
    <Paragraph position="4"> Rocchio's algorithm is one of the classical algorithms for this task, and it still performs very well compared to newly developed techniques (e.g., (Lewis et al., 1996)). We also compared with the Ripper algorithm presented in (Cohen and Singer, 1996) (we present the best results for this task, with negative tests), a simple decision tree learning system and a Bayesian classifier. The last two figures are taken from (Lewis and Ringuette, 1994), where they were evaluated only on Lewis's split. The last comparison is with the learning system used by (Apte, Damerau, and Weiss, 1994), SWAP, which was evaluated only on Apte's split.</Paragraph>
    <Paragraph position="5"> Our results significantly outperform (by at least 24%) all results which appear in that table and use the same set of features (based on single words). Of the results we know of in the literature, only a version of the Experts algorithm of (Cohen and Singer, 1996) which uses a richer feature set - sparse word trigrams - outperforms our result on the Lewis split, with a break-even point of 75.3%, compared with 74.6% for the unigram-based BalancedWinnow+. However, this version achieves only 75.9% on the Apte split (compared with 83.3% for BalancedWinnow+). In the long version of this paper we plan to present the results of our algorithm on a richer feature set as well.</Paragraph>
  </Section>
</Paper>