<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2172">
  <Title>DOCUMENT CLASSIFICATION BY MACHINE: Theory and Practice</Title>
  <Section position="4" start_page="0" end_page="1059" type="metho">
    <SectionTitle>
2 The Description of a Classification Scheme
</SectionTitle>
    <Paragraph position="0"> Suppose that we must classify a document into one of k types. These types are known. Here, k is any positive integer at least 2, and a typical value might be anywhere from 2 to 10. Denote these types T1, T2, ..., Tk.</Paragraph>
    <Paragraph position="1"> The set of words in the language is broken into m disjoint subsets W1, W2, ..., Wm. Now from a host of documents, or a large body of literature, on subject Ti, the frequencies pij of words in Wj are determined. So with subject Ti we have associated the vector of frequencies (pi1, pi2, ..., pim), and of course pi1 + pi2 + ... + pim = 1.</Paragraph>
    <Paragraph position="2"> Now, given a document on one of the possible k subjects, it is classified as follows. The document has n words in it: n1 words from W1, n2 words from W2, ..., and nm words from Wm. Based on this information, a calculation is made to determine from which subject the document is most likely to have come, and it is so classified. This calculation is key: there are many possible calculations on which a classification can be made, but some are better than others. We will prove that in this situation, there is a best one.</Paragraph>
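The calculation just described can be sketched as a small program. This is a minimal illustration, not the authors' implementation; the two frequency vectors are the first pair used later in Table 1, and the word counts in the usage example are hypothetical.

```python
import math

# Per-type word-class frequencies (pi1, ..., pim); these two rows are the
# first pair of frequency vectors from the paper's Table 1.
FREQS = {
    "T1": [0.08, 0.04, 0.88],
    "T2": [0.03, 0.06, 0.91],
}

def log_multinomial(counts, probs):
    """Log-probability of observing counts (n1, ..., nm) under a
    multinomial with the given class probabilities."""
    n = sum(counts)
    out = math.lgamma(n + 1)
    for nj, pj in zip(counts, probs):
        out -= math.lgamma(nj + 1)
        if nj:
            out += nj * math.log(pj)
    return out

def classify(counts):
    """Declare the type whose frequency vector makes the observed
    counts most probable."""
    return max(FREQS, key=lambda t: log_multinomial(counts, FREQS[t]))
```

For example, a 200-word document with 16 words from W1, 8 from W2, and 176 from W3 is declared to be of type T1, since its frequencies (.08, .04, .88) match T1's vector exactly.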
    <Paragraph position="3"> We elaborate on a specific case which seems to hold promise. The idea is that the frequency vectors (pi1, pi2, ..., pim) will be different enough from one i to another to distinguish between types of documents. From a document of word length n, let nj be the number of words in Wj. Thus the vector of word frequencies for that particular document is (n1/n, n2/n, ..., nm/n). The word frequencies from a document of type i should resemble the frequencies (pi1, pi2, ..., pim), and indeed, the classification scheme is to declare the document to be of type Ti if its frequencies "most closely resemble" the frequencies (pi1, pi2, ..., pim). Intuitively, if two of the vectors (pi1, pi2, ..., pim) are very nearly equal, then it will be difficult to distinguish documents of those two types. Thus the success of classification depends critically on the vectors (pi1, pi2, ..., pim) of frequencies. Equivalently, the sets Wj are critical, and must be chosen with great care. The particular situation we have in mind is this. Each of the types of documents is on a rather special topic, calling for a somewhat specialized vocabulary. The language is broken into k + 1 disjoint sets W1, W2, ..., Wk+1 of words. For i &lt;= k, the words in Wi are "specific" to subject i, and Wk+1 consists of the remaining words in the language. Now from a host of documents, or a large body of literature, on the subject Ti, we determine the frequencies pij of words in Wj. But first, the word sets Wi are needed, and it is also from such bodies of text that they will be determined. Doing this in a manner that is optimal for our models is a difficult problem, but doing it in such a way that our models are very effective seems quite routine.</Paragraph>
    <Paragraph position="4"> So with subject Ti we have associated the vector of frequencies (pi1, pi2, ..., pim), the vector being of length one more than the number of types of documents. Since the words in Wi are specific to documents of type Ti, these vectors of frequencies should be quite dissimilar and allow a sharp demarcation between document types. This particular scheme has the added advantage that m is small, being k + 1, only one more than the number of document types. Further, our scheme will involve only a few hundred words, those that appear in W1, W2, ..., Wk, with the remainder appearing in Wk+1. This makes it possible to calculate the probabilities of correct classification of documents of each particular type. Such calculations are intractable for large m, even on fast machines. There are classification schemes being used with m in the thousands, making an exact mathematical calculation of probabilities of correct classification next to impossible. But with k and m small, say no more than 10, such calculations are possible.</Paragraph>
  </Section>
  <Section position="5" start_page="1059" end_page="1059" type="metho">
    <SectionTitle>
3 The Mathematical Model
</SectionTitle>
    <Paragraph position="0"> A mathematical description of the situation just described is this. We are given k multinomial populations, with the i-th having frequencies (pi1, pi2, ..., pim). The i-th population may be envisioned to be an infinite set consisting of m types of elements, with the proportion of type j being pij. We are given a random sample of size n from one of the populations, and are asked to determine from which of the populations it came. If the sample came from population i, then the probability that it has nj elements of type j is given by the formula (n!/(n1! n2! ... nm!)) pi1^n1 pi2^n2 ... pim^nm. This is an elementary probabilistic fact. If a sample to be classified has nj elements of type j, we simply make this calculation for each i, and judge the sample to be from population i if the largest of the results was for the i-th population. Thus, the sample is judged to be from the i-th population if the probability of getting the particular nj's that were gotten is the largest for that population.</Paragraph>
    <Paragraph position="1"> To determine which of the quantities (n!/(n1! n2! ... nm!)) pi1^n1 pi2^n2 ... pim^nm is the largest, it is only necessary to determine which of the products pi1^n1 pi2^n2 ... pim^nm is largest, and that is an easy machine calculation. All numbers are known beforehand except the nj's, which are counted from the sample.</Paragraph>
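Since the multinomial coefficient is shared by all i, the comparison reduces to the products alone, which is conveniently done in log space. A minimal sketch (the frequency rows in the usage example are illustrative):

```python
import math

def best_population(counts, freq_rows):
    """Index of the population maximizing pi1^n1 * pi2^n2 * ... * pim^nm.
    The shared coefficient n!/(n1! n2! ... nm!) cancels, so it suffices to
    compare sum_j nj * log(pij) across populations."""
    scores = [
        sum(nj * math.log(p) for nj, p in zip(counts, row) if nj)
        for row in freq_rows
    ]
    return scores.index(max(scores))
```

For instance, `best_population([16, 8, 176], [[.08, .04, .88], [.03, .06, .91]])` picks the first row, agreeing with the full multinomial comparison.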
    <Paragraph position="2"> Before illustrating success rates with some calculations, some comments on our modeling of this document classification scheme are in order. The i-th multinomial population represents text of type Ti. This text consists of m types of things, namely words from each of the Wj. The frequencies (pi1, pi2, ..., pim) give the proportion of words from the classes W1, W2, ..., Wm in text of type Ti. A random sample of size n represents a document of word length n. This last representation is arguable: a document of length n is not a random sample of n words from its type of text.</Paragraph>
    <Paragraph position="3"> It is a structured sequence of such words. The validity of the model proposed depends on a document reflecting the properties of a random sample in the frequencies of its words of each type. Intuitively, long documents will do that; short ones may not. The success of any implementation will hinge on the frequencies (pi1, pi2, ..., pim). These frequencies must differ enough from document type to document type so that documents can be distinguished on the basis of them.</Paragraph>
  </Section>
  <Section position="6" start_page="1059" end_page="1060" type="metho">
    <SectionTitle>
4 Some Calculations
</SectionTitle>
    <Paragraph position="0"> We now illustrate with some calculations for a simple case: there are two kinds of documents, T1 and T2, and three kinds of words. We have in mind here that W1 consists of words specific to documents of type T1, W2 of words specific to T2, and that W3 consists of the remaining words in the language. So we have the frequencies (p11, p12, p13) and (p21, p22, p23). Of course pi3 = 1 - pi1 - pi2. Now we are given a document that we know is either of type T1 or of type T2, and we must discern which type it is on the basis of its word frequencies.</Paragraph>
    <Paragraph position="1"> Suppose it has nj words of type j, j = 1, 2, 3. We calculate the numbers ti = pi1^n1 pi2^n2 pi3^n3 for i = 1, 2, and declare the document to be of type Ti if ti is the larger of the two. Now what is the probability of success? Here is the calculation. If a document of size n is drawn from a trinomial population with parameters (p11, p12, p13), the probability of getting n1 words of type 1, n2 words of type 2, and n3 words of type 3 is (n!/(n1! n2! n3!)) p11^n1 p12^n2 p13^n3.</Paragraph>
    <Paragraph position="2"> Thus to calculate the probability of successfully classifying a document of type T1 as being of that type, we must add these expressions over all those triples (n1, n2, n3) for which t1 is larger than t2. This is a fairly easy computation, and we have carried it out for a host of different p's and n's. Table 1 contains results of some of these calculations.</Paragraph>
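That summation is easy to carry out exactly for the trinomial case. The following sketch enumerates all triples; note that the handling of ties (t1 = t2) is our assumption, since the text does not specify a tie-breaking rule.

```python
import math

def prob_correct_type1(n, p1, p2):
    """Probability that a length-n document drawn from the trinomial
    population p1 = (p11, p12, p13) is classified as type T1: the sum of
    multinomial probabilities over all triples (n1, n2, n3) summing to n
    for which t1 >= t2 (tie assumed to go to T1)."""
    total = 0.0
    for n1 in range(n + 1):
        for n2 in range(n - n1 + 1):
            n3 = n - n1 - n2
            t1 = n1 * math.log(p1[0]) + n2 * math.log(p1[1]) + n3 * math.log(p1[2])
            t2 = n1 * math.log(p2[0]) + n2 * math.log(p2[1]) + n3 * math.log(p2[2])
            if t1 >= t2:
                log_coef = (math.lgamma(n + 1) - math.lgamma(n1 + 1)
                            - math.lgamma(n2 + 1) - math.lgamma(n3 + 1))
                total += math.exp(log_coef + t1)
    return total
```

With the frequencies (.08, .04, .88) versus (.03, .06, .91), this reproduces success rates in the neighborhood of the 95.1 percent figure reported for n = 200, with the rate rising as n grows.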
    <Paragraph position="3"> Table 1 gives the probability of classifying a document of type T1 as of type T1, and of classifying a document of type T2 as of type T2. These probabilities are labeled Prob(1) and Prob(2), respectively. Of course, here we get for free the probability that a document of type T1 will be classified as of type T2, namely 1 - Prob(1). Similarly, 1 - Prob(2) is the probability that a document of type T2 will be classified as of type T1. The pij are the frequencies of words from Wj for documents of type Ti, and n is the number of words in the document.</Paragraph>
    <Paragraph position="4">  There are several things worth noting in Table 1.</Paragraph>
    <Paragraph position="5"> The frequencies used in the table were chosen to illustrate the behavior of the scheme, and not necessarily to reflect document classification reality. However, consider the first set of frequencies, (.08, .04, .88) and (.03, .06, .91). This represents a circumstance where documents of type T1 have eight percent of their words specific to that subject, and four percent specific to the other subject. Documents of type T2 have six percent of their words specific to its subject, and three percent specific to the other subject. These percentages seem to be easily attainable. Our scheme correctly classifies a document of length 200 and of type T1 95.1 percent of the time, and a document of length 400 99.1 percent of the time. The last set of frequencies, (.08, .04, .88) and (.07, .04, .89), are almost alike, and as the table shows, do not serve to classify documents correctly with high probability. In general, the probabilities of success are remarkably high, even for relatively small n, and in the experiment reported on in Section 6, it was easy to find word sets with satisfactory frequencies.</Paragraph>
    <Paragraph position="6"> It is a fact that the probability of success can be made as close to 1 as desired by taking n large enough, assuming that (p11, p12, p13) is not identical to (p21, p22, p23). However, since for reasonable frequencies the probabilities of success are high for n just a few hundred, this suggests that long documents would not have to be completely tabulated in order to be classified correctly with high probability. One could just use a random sample of appropriate size from the document.</Paragraph>
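The sampling idea is easy to sketch: represent a long document by the class index of each of its words, draw a uniform random sample of the desired size, and tally the sampled counts for classification. The document contents below are invented purely for illustration.

```python
import random

def sampled_counts(word_classes, sample_size, seed=0):
    """Draw a uniform random sample (without replacement) of words from a
    long document, each word represented by its class index 0..m-1, and
    tally the class counts of the sample."""
    rng = random.Random(seed)
    sample = rng.sample(word_classes, sample_size)
    counts = [0] * (max(word_classes) + 1)
    for j in sample:
        counts[j] += 1
    return counts

# A hypothetical 1000-word document of type T1 with 8% W1 words,
# 4% W2 words, and 88% W3 words.
document = [0] * 80 + [1] * 40 + [2] * 880
```

The sampled counts can then be fed to the multinomial comparison in place of the full tabulation of the document.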
    <Paragraph position="7"> The following table gives some success rates for the case where there are three kinds of documents and four kinds of words.</Paragraph>
  </Section>
  <Section position="7" start_page="1060" end_page="1061" type="metho">
    <SectionTitle>
5 Theoretical Results
</SectionTitle>
    <Paragraph position="0"> In this section, we prove our optimality result. But first we must give it a precise mathematical formulation. To say that there is no better classification scheme than some given one, we must know not only what "better" means, we must know precisely what a classification scheme is. The setup is as in Section 3. We have k multinomial populations with frequencies (pi1, pi2, ..., pim), i = 1, 2, ..., k. We are given a random sample of size n from one of the populations and are forced to assert from which one it came. The information at our disposal, besides the set of frequencies (pi1, pi2, ..., pim), is, for each j, the number nj of elements of type j in the sample. So the information from the sample is the tuple (n1, n2, ..., nm). Our scheme for specifying from which population it came is to say that it came from population i if (n!/(n1! n2! ... nm!)) pi1^n1 pi2^n2 ... pim^nm is maximum over the i's. This, then, determines which (n1, n2, ..., nm) results in which classification.</Paragraph>
    <Paragraph position="1"> Our scheme partitions the sample space, that is, the set of all the tuples (n1, n2, ..., nm), into k pieces, the i-th piece being those tuples (n1, n2, ..., nm) for which (n!/(n1! n2! ... nm!)) pi1^n1 pi2^n2 ... pim^nm is maximum. For a given sample (or document) size n, this leads to the definition of a scheme as any partition {A1, A2, ..., Ak} of the set of tuples (n1, n2, ..., nm) for which n1 + n2 + ... + nm = n into k pieces. The procedure then is to classify a sample as coming from the i-th population if the tuple (n1, n2, ..., nm) gotten from the sample is in Ai. It doesn't matter how this partition is arrived at. Our method is via the probabilities qi(n1, n2, ..., nm) = (n!/(n1! n2! ... nm!)) pi1^n1 pi2^n2 ... pim^nm. There are many ways we could define optimality.</Paragraph>
    <Paragraph position="2"> A definition that has particular charm is to define a scheme to be optimal if no other scheme has a higher overall probability of correct classification. But in this setup, we have no way of knowing the overall rate of correct classification because we do not know what proportion of samples come from what populations. So we cannot use that definition. An alternate definition that makes sense is to define a scheme to be optimal if no other scheme has, for each population, a higher probability of correct classification of samples from that population. But our scheme is optimal in a much stronger sense. We define a scheme A1, A2, ..., Ak to be optimal if for any other scheme B1, B2, ..., Bk, the sum over i of P(Ai|Ti) is at least the sum over i of P(Bi|Ti).</Paragraph>
    <Paragraph position="3"> Proofs of the theorems in this note will be given elsewhere.</Paragraph>
    <Paragraph position="4"> Theorem 1 Let T1, T2, ..., Tk be multinomial populations with the i-th population having frequencies (pi1, pi2, ..., pim). For a random sample of size n from one of these populations, let nj be the number of elements of type j. Let qi(n1, n2, ..., nm) = (n!/(n1! n2! ... nm!)) pi1^n1 pi2^n2 ... pim^nm. Then the scheme given by the partition of the sample space into the pieces Ai = {(n1, n2, ..., nm) : qi(n1, n2, ..., nm) is at least qj(n1, n2, ..., nm) for all j different from i} is an optimal scheme for determining from which of the populations a sample of size n came.</Paragraph>
    <Paragraph position="7"> An interesting feature of Table 1 is that for all frequencies Prob(1) + Prob(2) is greater for sample size 100 than for sample size 50. This supports our intuition that larger sample sizes should yield better results. This is indeed a fact.</Paragraph>
    <Paragraph position="8"> Theorem 2 The following inequality holds, with equality only in the trivial case that pik = pjk for all i, j, and k: sum_{n+1} max_i (((n+1)!/(n1! n2! ... nm!)) pi1^n1 pi2^n2 ... pim^nm) is at least sum_n max_i ((n!/(n1! n2! ... nm!)) pi1^n1 pi2^n2 ... pim^nm), where sum_{n+1} means to sum over those tuples (n1, n2, ..., nm) whose sum is n + 1, and sum_n means to sum over those tuples (n1, n2, ..., nm) whose sum is n.</Paragraph>
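The quantity in Theorem 2, the sum over tuples of the maximum population probability, equals the sum over i of P(Ai|Ti) for the optimal scheme, and for small m and n it can be evaluated exactly. A sketch that checks the inequality numerically for an illustrative pair of frequency vectors:

```python
import math

def success_sum(n, freq_rows):
    """Sum over all (n1, ..., nm) with n1 + ... + nm = n of
    max_i (n!/(n1! ... nm!)) pi1^n1 ... pim^nm -- the quantity that
    Theorem 2 asserts is non-decreasing in n."""
    m = len(freq_rows[0])
    total = 0.0

    def rec(prefix, remaining):
        nonlocal total
        if len(prefix) == m - 1:
            counts = prefix + [remaining]
            log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
            best = max(
                sum(c * math.log(p) for c, p in zip(counts, row) if c)
                for row in freq_rows
            )
            total += math.exp(log_coef + best)
        else:
            for v in range(remaining + 1):
                rec(prefix + [v], remaining - v)

    rec([], n)
    return total
```

For two populations the value lies between 1 and 2, and evaluating it at consecutive sample sizes exhibits the monotonicity the theorem guarantees.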
  </Section>
</Paper>