<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1006">
  <Title>Document Classification Using a Finite Mixture Model</Title>
  <Section position="3" start_page="0" end_page="40" type="intro">
    <SectionTitle>
2 Previous Work
</SectionTitle>
    <Paragraph position="0"> Word-based method A simple approach to document classification is to view this problem as that of conducting hypothesis testing over word-based distributions. In this paper, we refer to this approach as the word-based method (hereafter, referred to as WBM).</Paragraph>
    <Paragraph position="1"> 2We borrow from (Pereira, Tishby, and Lee, 1993) the terms hard clustering and soft clustering, which were used there in a different task.</Paragraph>
    <Paragraph position="2">  Letting W denote a vocabulary (a set of words), and w denote a random variable representing any word in it, for each category ci (i = 1,...,n), we define its word-based distribution P(wIci) as a histogram type of distribution over W. (The number of free parameters of such a distribution is thus I W\[- 1). WBM then views a document as a sequence of words: d = Wl,''&amp;quot; , W N (1) and assumes that each word is generated independently according to a probability distribution of a category. It then calculates the probability of a document with respect to a category as</Paragraph>
    <Paragraph position="4"> and classifies the document into that category for which the calculated probability is the largest. We should note here that a document's probability with respect to each category is equivMent to the likelihood of each category with respect to the document, and to classify the document into the category for which it has the largest probability is equivalent to classifying it into the category having the largest likelihood with respect to it. Hereafter, we will use only the term likelihood and denote it as L(dlci).</Paragraph>
    <Paragraph position="5"> Notice that in practice the parameters in a distribution must be estimated from training data. In the case of WBM, the number of parameters is large; the training data size, however, is usually not sufficiently large for accurately estimating them. This is the data .sparseness problem that so often stands in the way of reliable statistical language processing (e.g.(Gale and Church, 1990)). Moreover, the number of parameters in word-based distributions is too large to be efficiently stored.</Paragraph>
    <Paragraph position="6"> Method based on hard clustering In order to address the above difficulty, Guthrie et.al, have proposed a method based on hard clustering of words (Guthrie, Walker, and Guthrie, 1994) (hereafter we will refer to this method as HCM). Let cl,...,c,~ be categories. HCM first conducts hard clustering of words. Specifically, it (a) defines a vocabulary as a set of words W and defines as clusters its subsets kl,..-,k,n satisfying t3~=xk j = W and ki fq kj = 0 (i j) (i.e., each word is assigned only to a single cluster); and (b) treats uniformly all the words assigned to the same cluster. HCM then defines for each category ci a distribution of the clusters P(kj \[ci) (j = 1,...,m). It replaces each word wt in the document with the cluster kt to which it belongs (t = 1,--., N). It assumes that a cluster kt is distributed according to P(kj\[ci) and calculates the likelihood of each category ci with respect to the document by</Paragraph>
    <Paragraph position="8"> racket stroke shot goal kick ball</Paragraph>
    <Paragraph position="10"/>
    <Paragraph position="12"> There are any number of ways to create clusters in hard clustering, but the method employed is crucial to the accuracy of document classification. Guthrie et. al. have devised a way suitable to documentation classification. Suppose that there are two categories cl ='tennis' and c2='soccer,' and we obtain from the training data (previously classified documents) the frequencies of words in each category, such as those in Tab. 1. Letting L and M be given positive integers, HCM creates three clusters: kl, k2 and k3, in which kl contains those words which are among the L most frequent words in cl, and not among the M most frequent in c2; k2 contains those words which are among the L most frequent words in cs, and not among the M most frequent in Cl; and k3 contains all remaining words (see Tab. 2). HCM then counts the frequencies of clusters in each category (see Tab. 3) and estimates the probabilities of clusters being in each category (see Tab. 4). 3 Suppose that a newly given document, like d in Fig. i, is to be classified. HCM cMculates the likelihood values 3We calculate the probabilities here by using the so-called expected likelihood estimator (Gale and Church, 1990): .f(kjlc, ) + 0.5 , P(k3lc~) = f-~--~-~ x m (4) where f(kjlci ) is the frequency of the cluster kj in ci, f(ci) is the total frequency of clusters in cl, and m is the total number of clusters.</Paragraph>
    <Paragraph position="13">  L(dlCl ) and L(dlc2) according to Eq. (3). (Tab. 5 shows the logarithms of the resulting likelihood values.) It then classifies d into cs, as log s L(dlcs ) is larger than log s L(dlc 1).</Paragraph>
    <Paragraph position="14"> d = kick, goal, goal, ball</Paragraph>
    <Paragraph position="16"> = 1 x log s .29 + 3 x log s .65 = -3.65 HCM can handle the data sparseness problem quite well. By assigning words to clusters, it can drastically reduce the number of parameters to be estimated. It can also save space for storing knowledge. We argue, however, that the use of hard clustering still has the following two problems: 1. HCM cannot assign a word C/0 more than one cluster at a time. Suppose that there is another category c3 = 'skiing' in which the word 'ball' does not appear, i.e., 'ball' will be indicative of both cl and c2, but not cs. If we could assign 'ball' to both kt and k2, the likelihood value for classifying a document containing that word to cl or c2 would become larger, and that for classifying it into c3 would become smaller. HCM, however, cannot do that.</Paragraph>
    <Paragraph position="17"> 2. HCM cannot make the best use of information about the differences among the frequencies of words assigned to an individual cluster. For example, it treats 'racket' and 'shot' uniformly because they are assigned to the same cluster kt (see Tab. 5). 'Racket' may, however, be more indicative of Cl than 'shot,' because it appears more frequently in cl than 'shot.' HCM fails to utilize this information. This problem will become more serious when the values L and M in word clustering are large, which renders the clustering itself relatively meaningless.</Paragraph>
    <Paragraph position="18"> From the perspective of number of parameters, HCM employs models having very few parameters, and thus may not sometimes represent much useful information for classification.</Paragraph>
  </Section>
class="xml-element"></Paper>