<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1006"> <Title>Document Classification Using a Finite Mixture Model</Title> <Section position="4" start_page="40" end_page="41" type="metho"> <SectionTitle> 3 Finite Mixture Model </SectionTitle> <Paragraph position="0"> We propose a method of document classification based on soft clustering of words. Let cl,--.,cn be categories. We first conduct the soft clustering. Specifically, we (a) define a vocabulary as a set W of words and define as clusters a number of its subsets kl,.--, k,n satisfying u'~=lk j = W; (notice that ki t3 kj = 0 (i ~ j) does not necessarily hold here, i.e., a word can be assigned to several different clusters); and (b) define for each cluster kj</Paragraph> <Paragraph position="2"> where w denotes a random variable representing any word in the vocabulary. We then define for each category ci (i = 1,..., n) a distribution of the clusters P(kj Ici), and define for each category a linear combination of P(w\]kj):</Paragraph> <Paragraph position="4"> as the distribution over its words, which is referred to as afinite mixture model(e.g., (Everitt and Hand, 1981)).</Paragraph> <Paragraph position="5"> We treat the problem of classifying a document as that of conducting the likelihood ratio test over finite mixture models. That is, we view a document as a sequence of words,</Paragraph> <Paragraph position="7"> where wt(t = 1,.-.,N) represents a word. We assume that each word is independently generated according to an unknown probability distribution and determine which of the finite mixture models P(w\[ci)(i = 1,...,n) is more likely to be the probability distribution by observing the sequence of words. Specifically, we calculate the likelihood value for each category with respect to the document by:</Paragraph> <Paragraph position="9"> We then classify it into the category having the largest likelihood value with respect to it. Hereafter, we will refer to this method as FMM.</Paragraph> <Paragraph position="10"> FMM includes WBM and HCM as its special cases. If we consider the specific case (1) in which a word is assigned to a single cluster and P(wlkj) is given by {1.</Paragraph> <Paragraph position="11"> (9) P(wlkj)= O; w~k~, where Ikjl denotes the number of elements belonging to kj, then we will get the same classification result as in HCM. In such a case, the likelihood value for each category ci becomes:</Paragraph> <Paragraph position="13"> where kt is the cluster corresponding to wt. Since the probability P(wt\]kt) does not depend on eate- N gories, we can ignore the second term YIt=l P(wt Ikt) in hypothesis testing, and thus our method essentially becomes equivalent to HCM (c.f. Eq. (3)).</Paragraph> <Paragraph position="14"> Further, in the specific case (2) in which m = n,</Paragraph> <Paragraph position="16"> the likelihood used in hypothesis testing becomes the same as that in Eq.(2), and thus our method becomes equivalent to WBM.</Paragraph> </Section> <Section position="5" start_page="41" end_page="42" type="metho"> <SectionTitle> 4 Estimation and Hypothesis </SectionTitle> <Paragraph position="0"> Testing In this section, we describe how to implement our method.</Paragraph> <Paragraph position="1"> Creating clusters There are any number of ways to create clusters on a given set of words. As in the case of hard clustering, the way that clusters are created is crucial to the reliability of document classification. Here we give one example approach to cluster creation. 
</Section> <Section position="5" start_page="41" end_page="42" type="metho"> <SectionTitle> 4 Estimation and Hypothesis Testing </SectionTitle> <Paragraph position="0"> In this section, we describe how to implement our method.</Paragraph> <Paragraph position="1"> Creating clusters: There are any number of ways to create clusters on a given set of words. As in the case of hard clustering, the way that clusters are created is crucial to the reliability of document classification. Here we give one example approach to cluster creation. We let the number of clusters equal that of categories (i.e., m = n; one can certainly assume that m > n) and relate each cluster k_i to one category c_i (i = 1, ..., n). We then assign individual words to those clusters in whose related categories they most frequently appear. Letting γ (0 ≤ γ < 1) be a predetermined threshold value, if the following inequality holds:

f(w|c_i) / f(w) ≥ γ,

then we assign w to k_i, the cluster related to c_i, where f(w|c_i) denotes the frequency of the word w in category c_i, and f(w) denotes the total frequency of w. Using the data in Tab. 1, we create two clusters, k_1 and k_2, and relate them to c_1 and c_2, respectively. For example, when γ = 0.4, we assign 'goal' to k_2 only, as the relative frequency of 'goal' in c_2 is 0.75 and that in c_1 is only 0.25. We ignore in document classification those words which cannot be assigned to any cluster by this method, because they are not indicative of any specific category. (For example, when γ ≥ 0.5, 'ball' will not be assigned to any cluster.) This helps to make classification efficient and accurate. Tab. 6 shows the results of creating clusters:

Tab. 6:  k_1 = {racket, stroke, shot, ball};  k_2 = {kick, goal, ball}.</Paragraph> <Section position="1" start_page="41" end_page="42" type="sub_section"> <SectionTitle> Estimating P(w|k_j) </SectionTitle> <Paragraph position="0"> We then consider the frequency of a word in a cluster. If a word is assigned to only one cluster, we view its total frequency as its frequency within that cluster. For example, because 'goal' is assigned only to k_2, we use as its frequency within that cluster the total count of its occurrences in all categories. If a word is assigned to several different clusters, we distribute its total frequency among those clusters in proportion to the frequency with which the word appears in each of their respective related categories. For example, because 'ball' is assigned to both k_1 and k_2, we distribute its total frequency among the two clusters in proportion to the frequency with which 'ball' appears in c_1 and c_2, respectively. After that, we obtain the frequencies of words in each cluster as shown in Tab. 7. [Tab. 7: frequencies of racket, stroke, shot, goal, kick, and ball in each cluster.] We then estimate the probabilities of words in each cluster, obtaining the results in Tab. 8.</Paragraph>
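The following Python sketch (ours, with hypothetical function names) illustrates the cluster-creation rule and the frequency-splitting estimate of P(w|k_j) just described; it assumes the training counts are given as a dict mapping each category to a {word: count} dict, and identifies each cluster k_i by its related category c_i.

```python
from collections import defaultdict

def create_clusters(freq, gamma):
    # freq: category -> {word: count}.  Assign word w to cluster k_i (related
    # to category c_i) whenever f(w|c_i) / f(w) >= gamma; a word may enter
    # several clusters, and words below the threshold everywhere are dropped.
    total = defaultdict(float)
    for counts in freq.values():
        for w, f in counts.items():
            total[w] += f
    return {c: {w for w, f in counts.items() if total[w] and f / total[w] >= gamma}
            for c, counts in freq.items()}

def estimate_p_w_given_k(freq, clusters):
    # Split each word's total frequency among the clusters containing it, in
    # proportion to its frequency in each cluster's related category, then
    # normalize within each cluster to obtain P(w|k).
    cluster_freq = {c: {} for c in clusters}
    all_words = {w for words in clusters.values() for w in words}
    for w in all_words:
        holders = [c for c in clusters if w in clusters[c]]
        denom = sum(freq[c].get(w, 0) for c in holders)
        total_w = sum(counts.get(w, 0) for counts in freq.values())
        for c in holders:
            cluster_freq[c][w] = total_w * freq[c].get(w, 0) / denom
    probs = {}
    for c, wf in cluster_freq.items():
        total_c = sum(wf.values())
        probs[c] = {w: f / total_c for w, f in wf.items()} if total_c else {}
    return probs
```

With the toy counts behind Tab. 1 and gamma = 0.4, create_clusters would place 'goal' only in the cluster related to c_2, as in the example above.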
<Paragraph position="1"> Let us next consider the estimation of P(k_j|c_i). There are two common methods for statistical estimation: the maximum likelihood estimation method,

\hat{P}(k_j|c_i) = f(k_j|c_i) / f(c_i),

where f(k_j|c_i) is the frequency of the cluster k_j in c_i and f(c_i) is the total frequency of clusters in c_i, and the Bayes estimation method. When implemented for estimating P(k_j|c_i), however, both of them suffer from computational intractability. The EM algorithm (Dempster, Laird, and Rubin, 1977) can be used to efficiently approximate the maximum likelihood estimator of P(k_j|c_i). We employ here an extended version of the EM algorithm (Helmbold et al., 1995). (We have also devised, on the basis of the Markov chain Monte Carlo (MCMC) technique (e.g., (Tanner and Wong, 1987; Yamanishi, 1996)), an algorithm to efficiently approximate the Bayes estimator of P(k_j|c_i); we have confirmed in a preliminary experiment that MCMC performs slightly better than EM in document classification, but we omit the details here due to space limitations.)</Paragraph> <Paragraph position="2"> For the sake of notational simplicity, for a fixed i, let us write P(k_j|c_i) as θ_j and P(w|k_j) as P_j(w). Then, letting θ = (θ_1, ..., θ_m), the finite mixture model in Eq. (6) may be written as

P(w|θ) = \sum_{j=1}^{m} θ_j P_j(w).

For a given training sequence w_1 ... w_N, the maximum likelihood estimator of θ is defined as the value which maximizes the following log likelihood function:

L(θ) = \sum_{t=1}^{N} \log \sum_{j=1}^{m} θ_j P_j(w_t).

The EM algorithm first arbitrarily sets the initial value of θ, which we denote as θ^(0), and then successively recalculates θ on the basis of its most recent value. Let s be a predetermined number of iterations. At the l-th iteration (l = 1, ..., s), we calculate

θ_j^(l) = θ_j^(l-1) (1 + λ ((1/N) \sum_{t=1}^{N} P_j(w_t) / P(w_t|θ^(l-1)) - 1)),

where λ (0 < λ ≤ 1) is a learning rate (when λ = 1, this simply becomes the standard EM algorithm), and P(w_t|θ^(l-1)) = \sum_{j=1}^{m} θ_j^(l-1) P_j(w_t). After s iterations, the EM algorithm outputs θ^(s) = (θ_1^(s), ..., θ_m^(s)) as an approximation of θ. It is theoretically guaranteed that the EM algorithm converges to a local maximum of the given likelihood (Dempster, Laird, and Rubin, 1977). For the example in Tab. 1, we obtain the results shown in Tab. 9.</Paragraph> <Paragraph position="3"> Testing: For the example in Tab. 1, we can calculate according to Eq. (8) the likelihood values of the two categories with respect to the document in Fig. 1 (Tab. 10 shows the logarithms of the likelihood values). We then classify the document into category c_2, as log_2 L(d|c_2) is larger than log_2 L(d|c_1).</Paragraph>
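Here is a small Python sketch (ours, not the paper's implementation) of the estimation step above for a single category: it iterates the damped update on a training word sequence, with lam = 1.0 reproducing the standard EM re-estimation; estimate_theta and its arguments are hypothetical names.

```python
def estimate_theta(train_words, p_w_given_k, s=50, lam=1.0):
    # Approximate the ML mixing weights theta_j = P(k_j|c) for one category
    # by iterating the (extended) EM update; lam = 1.0 is the standard EM step.
    ks = list(p_w_given_k)
    words = [w for w in train_words
             if any(p_w_given_k[j].get(w, 0.0) > 0.0 for j in ks)]
    theta = {j: 1.0 / len(ks) for j in ks}              # theta^(0): uniform start
    for _ in range(s):
        avg_post = {j: 0.0 for j in ks}
        for w in words:                                  # average posterior of each cluster
            denom = sum(theta[j] * p_w_given_k[j].get(w, 0.0) for j in ks)
            for j in ks:
                avg_post[j] += theta[j] * p_w_given_k[j].get(w, 0.0) / (denom * len(words))
        for j in ks:                                     # damped re-estimation step
            theta[j] += lam * (avg_post[j] - theta[j])
    return theta
```

With lam below 1 the update moves only part of the way toward the EM re-estimate, which is the sense in which the extended algorithm generalizes the standard one.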
</Section> </Section> <Section position="6" start_page="42" end_page="43" type="metho"> <SectionTitle> 5 Advantages of FMM </SectionTitle> <Paragraph position="0"> For a probabilistic approach to document classification, the most important thing is to determine what kind of probability model (distribution) to employ as a representation of a category. It must (1) appropriately represent a category, and (2) have a proper preciseness in terms of the number of parameters. The selection of a model directly affects classification results.</Paragraph> <Paragraph position="1"> The finite mixture model we propose is particularly well-suited to the representation of a category. Described in linguistic terms, a cluster corresponds to a topic and the words assigned to it are related to that topic. Though documents generally concentrate on a single topic, they may sometimes refer for a time to others, and while a document is discussing any one topic, it will naturally tend to use words strongly related to that topic. A document in the category of 'tennis' is more likely to discuss the topic of 'tennis,' i.e., to use words strongly related to 'tennis,' but it may sometimes briefly shift to the topic of 'soccer,' i.e., use words strongly related to 'soccer.' A human can follow the sequence of words in such a document, associate them with related topics, and use the distributions of topics to classify the document. The use of the finite mixture model can thus be considered a stochastic implementation of this process.</Paragraph> <Paragraph position="2"> The use of FMM is also appropriate from the viewpoint of the number of parameters. Tab. 11 shows the numbers of parameters in our method (FMM), HCM, and WBM, where |W| is the size of the vocabulary, |k| is the sum of the sizes of the word clusters (i.e., |k| = \sum_{i=1}^{m} |k_i|), n is the number of categories, and m is the number of clusters. The number of parameters in FMM is much smaller than that in WBM, which depends on |W|, a very large number in practice (notice that |k| is always smaller than |W| when we employ the clustering method described in Section 4 with γ > 0.5). As a result, FMM requires less data for parameter estimation than WBM and thus can handle the data sparseness problem quite well. Furthermore, it can economize on the space necessary for storing knowledge. On the other hand, the number of parameters in FMM is larger than that in HCM. It is able to represent the differences between categories more precisely than HCM, and thus is able to resolve the two problems, described in Section 2, which plague HCM.</Paragraph> <Paragraph position="3"> Another advantage of our method may be seen in contrast to the use of latent semantic analysis (Deerwester et al., 1990) in document classification and document retrieval. They claim that their method can solve the following problems. Synonymy problem: how to group synonyms, like 'stroke' and 'shot,' and make each relatively strongly indicative of a category even though some may individually appear in the category only very rarely. Polysemy problem: how to determine that a word like 'ball' in a document refers to a 'tennis ball' and not a 'soccer ball,' so as to classify the document more accurately. Dependence problem: how to use dependent words, like 'kick' and 'goal,' to make their combined appearance in a document more indicative of a category. As seen in Tab. 6, our method also helps resolve all of these problems.</Paragraph> </Section> <Section position="7" start_page="43" end_page="45" type="metho"> <SectionTitle> 6 Preliminary Experimental Results </SectionTitle> <Paragraph position="0"> In this section, we describe the results of the experiments we have conducted to compare the performance of our method with that of HCM and others.</Paragraph> <Paragraph position="1"> As a first data set, we used a subset of the Reuters newswire data prepared by Lewis, called Reuters-21578 Distribution 1.0 (Reuters-21578 is available at http://www.research.att.com/lewis). We selected nine overlapping categories, i.e., categories in which a document may belong to several different categories. We adopted the Lewis Split in the corpus to obtain the training data and the test data. Tabs. 12 and 13 give the details. We did not conduct stemming, or use stop words ('stop words' refers to a predetermined list of words containing those which are considered not useful for document classification, such as articles and prepositions). We then applied FMM, HCM, WBM, and a method based on cosine similarity, which we denote as COS (in this method, categories and documents to be classified are viewed as vectors of word frequencies, and the cosine value between the two vectors reflects their similarity (Salton and McGill, 1983)), to conduct binary classification. In particular, we learn the distribution for each category and that for its complement category from the training data, and then determine whether or not to classify into each category the documents in the test data. When applying FMM, we used our proposed method of creating clusters in Section 4 and set γ to be 0, 0.4, 0.5, and 0.7, because these are representative values. For HCM, we classified words in the same way as in FMM and set γ to be 0.5, 0.7, 0.9, and 0.95. (Notice that in HCM, γ cannot be set less than 0.5.)</Paragraph>
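For reference, here is a rough Python sketch (ours) of the COS baseline just described: categories and documents are represented as sparse word-frequency vectors and compared by cosine similarity; the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    # cosine similarity between two sparse word-frequency vectors (dicts)
    dot = sum(f * v.get(w, 0) for w, f in u.items())
    norm = math.sqrt(sum(f * f for f in u.values())) * math.sqrt(sum(f * f for f in v.values()))
    return dot / norm if norm else 0.0

def cos_classify(doc_words, category_vectors):
    # category_vectors: category -> aggregate word-frequency vector of its training documents
    d = Counter(doc_words)
    return max(category_vectors, key=lambda c: cosine(d, category_vectors[c]))
```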
<Paragraph position="2"> Tab. 12 (first data set):
Num. of doc. in training data: 707
Num. of doc. in test data: 228
Num. of (types of) words: 10902
Avg. num. of words per doc.: 310.6</Paragraph> <Paragraph position="3"> As a second data set, we used the entire Reuters-21578 data with the Lewis Split. Tab. 14 gives the details:
Num. of doc. in training data: 13625
Num. of doc. in test data: 6188
Num. of (types of) words: 50301
Avg. num. of words per doc.: 181.3
Again, we did not conduct stemming, or use stop words. We then applied FMM, HCM, WBM, and COS to conduct binary classification. When applying FMM, we used our proposed method of creating clusters and set γ to be 0, 0.4, 0.5, and 0.7. For HCM, we classified words in the same way as in FMM and set γ to be 0.5, 0.7, 0.9, and 0.95. We have not fully completed these experiments, however, and here we only give the results of classifying into the ten categories having the greatest numbers of documents in the test data (see Tab. 15: earn, acq, crude, money-fx, grain, interest, trade, ship, wheat, corn).</Paragraph> <Paragraph position="4"> For both data sets, we evaluated each method in terms of precision and recall by means of so-called micro-averaging (Lewis and Ringuette, 1994), in which precision is defined as the percentage of classified documents in all categories which are correctly classified, and recall is defined as the percentage of the total documents in all categories which are correctly classified. When applying WBM, HCM, and FMM, rather than use standard likelihood ratio testing, we used the following heuristics. For simplicity, suppose that there are only two categories, c_1 and c_2. Letting ε be a given number greater than or equal to 0, we assign a new document d in the following way:

(1/N) (\log L(d|c_1) - \log L(d|c_2)) > ε:  d → c_1;
(1/N) (\log L(d|c_2) - \log L(d|c_1)) > ε:  d → c_2;
otherwise: leave d unclassified,    (15)

where N is the size of the document d. (Notice that words which are discarded in the clustering process should not be counted in the document size.) One can easily extend the method to cases with a greater number of categories. For COS, we conducted classification in a similar way.</Paragraph>
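A small Python sketch (ours) of this decision rule for one category versus its complement, reusing log_likelihood from the earlier sketch; eps corresponds to ε, and N counts only the words kept after clustering, as noted above.

```python
def binary_decide(doc_words, p_k_given_c, p_k_given_cbar, p_w_given_k, eps):
    # Eq. (15): compare per-word log-likelihoods of a category and its
    # complement; leave the document unclassified inside the margin eps.
    n = sum(1 for w in doc_words
            if any(p_w_given_k[j].get(w, 0.0) > 0.0 for j in p_w_given_k))
    if n == 0:
        return None                                    # nothing to decide on
    diff = (log_likelihood(doc_words, p_k_given_c, p_w_given_k)
            - log_likelihood(doc_words, p_k_given_cbar, p_w_given_k)) / n
    if diff > eps:
        return True                                    # assign to the category
    if -diff > eps:
        return False                                   # assign to the complement
    return None                                        # unclassified
```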
<Paragraph position="5"> Figs. 2 and 3 show the precision-recall curves for the first data set and for the second data set, respectively. [Figures 2 and 3: precision-recall curves (recall on the horizontal axis) for HCM0.9, HCM0.95, FMM0, FMM0.4, FMM0.5, and FMM0.7.] In these graphs, the values given after FMM and HCM represent γ in our clustering method (e.g., FMM0.5, HCM0.5, etc.). We adopted the break-even point as a single measure for comparison; this is the point at which precision equals recall, and a higher break-even point indicates better performance. Tab. 16 shows the break-even point of each method for the first data set, and Tab. 17 shows that for the second data set. For the first data set, FMM0 attains the highest break-even point; for the second data set, FMM0.5 attains the highest.</Paragraph> <Paragraph position="6"> We considered the following questions: (1) The training data used in the experimentation may be considered sparse. Will a word-clustering-based method (FMM) outperform a word-based method (WBM) here? (2) Is it better to conduct soft clustering (FMM) than to do hard clustering (HCM)? (3) With our current method of creating clusters, as the threshold γ approaches 0, FMM behaves much like WBM and does not enjoy the effects of clustering at all (the number of parameters is as large as in WBM). This is because in this case (a) a word will be assigned to all of the clusters, (b) the distribution of words in each cluster will approach that in the corresponding category in WBM, and (c) the likelihood value for each category will approach that in WBM (recall case (2) in Section 3). Since creating clusters in an optimal way is difficult, when clustering does not improve performance we can at least make FMM perform as well as WBM by choosing γ = 0. The question now is: does FMM perform better than WBM when γ is 0?</Paragraph> <Paragraph position="7"> In looking into these issues, we found the following. (1) When γ >> 0, i.e., when we conduct clustering, FMM does not perform better than WBM for the first data set, but it performs better than WBM for the second data set. Evaluating classification results on the basis of each individual category, we found that for three of the nine categories in the first data set FMM0.5 performs best, and that for two of the ten categories in the second data set FMM0.5 performs best. These results indicate that clustering sometimes does improve classification results when we use our current way of creating clusters. (Fig. 4 shows the best result of each method for the category 'corn' in the first data set, and Fig. 5 shows that for 'grain' in the second data set.) (2) When γ >> 0, i.e., when we conduct clustering, the best of FMM almost always outperforms the best of HCM. (3) When γ = 0, FMM performs better than WBM for the first data set, and it performs as well as WBM for the second data set.</Paragraph> <Paragraph position="8"> In summary, FMM always outperforms HCM; in some cases it performs better than WBM; and in general it performs at least as well as WBM. For both data sets, the best FMM results are superior to those of COS throughout. This indicates that the probabilistic approach is more suitable than the cosine approach for document classification based on word distributions.</Paragraph> <Paragraph position="9"> Although we have not completed our experiments on the entire Reuters data set, we found that the results with FMM on the second data set are almost as good as those obtained by the other approaches reported in (Lewis and Ringuette, 1994). (The results are not directly comparable, because (a) the results in (Lewis and Ringuette, 1994) were obtained from an older version of the Reuters data, and (b) they used stop words, but we did not.) We have also conducted experiments on the Susanne corpus data and confirmed the effectiveness of our method. We omit an explanation of this work here due to space limitations.</Paragraph> </Section> </Paper>