<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1027">
  <Title>A Probabilistic Model for Text Categorization: Based on a Single Random Variable with Multiple Values</Title>
  <Section position="5" start_page="0" end_page="163" type="metho">
    <SectionTitle>
2 A Brief Survey of Probabilistic Text
Categorization
</SectionTitle>
    <Paragraph position="0"> In this section, we will briefly review three major probabilistic models for text categorization. Originally, these models have been exploited for information retrieval, but the adaptation to text categorization is straightforward.</Paragraph>
    <Paragraph position="1"> In a model of probabilistic text categorization, P(cld ) = &amp;quot;the probability that a document d is categorized into a category c&amp;quot; (1) is calculated. Usually, a set of categories is defined beforehand. For every document di, probability P(cldi ) is calculated and all the documents are ranked in decreasing order according to their probabilities. The larger P(cldi) a document di has, the more probably it will be categorized into category c. This is called the Probabilistic Ranking Principle (PRP) (Robertson, 1977).</Paragraph>
    <Paragraph position="2"> Several strategies can be used to assign categories to a document based on PRP (Lewis, 1992).</Paragraph>
    <Paragraph position="3"> There are several ways to calculate P(c\[d). Three representatives are (Robertson and Sparck Jones, 1976), (Kwok, 1990), and (Fuhr, 1989).</Paragraph>
    <Section position="1" start_page="0" end_page="162" type="sub_section">
      <SectionTitle>
2.1 Probabilistic Relevance Weighting
</SectionTitle>
      <Paragraph position="0"> probability P(c\]d).</Paragraph>
      <Paragraph position="2"> where ~ means &amp;quot;not c&amp;quot;, that is &amp;quot;a document is not categorized into c.&amp;quot; Since this is a monotonic transformation of P(cld), PRP is still satisfied after transformation. null Using Bayes' theorem, Eq. (2) becomes</Paragraph>
      <Paragraph position="4"> Here, P(c) is the prior probability that a document is categorized into c. This is estimated from given training data, i.e., the number of documents assigned to the category c. P(dl c) is calculated as follows. If we assume that a document consists of a set of terms (usually nouns are used for the first approximation) and each term appears independently in a document, P(dlc ) is decomposed to</Paragraph>
      <Paragraph position="6"> where &amp;quot;c- d&amp;quot; is a set of terms that do not appear in d but appear in the training cases assigned to c.</Paragraph>
      <Paragraph position="7"> &amp;quot;ti&amp;quot; represents the name of a term and &amp;quot;T/ = 1, 0&amp;quot; represents whether or not the corresponding term '2i&amp;quot; appears in a document. Therefore, P(T/= 1, 0\[c) is the probability that a document does or does not contain the term ti, given that the document is categorized into c. This probability is estimated from the training data; the number of documents that are categorized into c and have the term tl. Substituting Eq. (4) into Eq. (3)</Paragraph>
      <Paragraph position="9"> We refer to Robertson and Sparck Jones' formulation as Probabilistic Relevance Weighting (PRW).</Paragraph>
      <Paragraph position="10"> While PRW is the first attempt to formalize well-known relevance weighting (Sparck Jones, 1972; Salton and McGill, 1983) by probability theory, there are several drawbacks in PRW.</Paragraph>
      <Paragraph position="11"> \[Problem 1\] no within-document term frequencies null PRW does not make use of within-document term frequencies. P(T = 1, 01c) in Eq. (5) takes into account only the existence/absence of the term t in a document. In general, frequently appearing terms in a document play an important role in information retrieval (Salton and McGill, 1983). Salton and Yang experimentally verified the importance of within-document term frequencies in their vector model (Salton and Yang, 1973). \[Problem 2\] no term weighting for target documents In the PRW formulation, there is no factor of term weighting for target documents (i.e., P(.Id)). According to Eq. (5), even if a term exists in a target document, only the importance of the term in a category (i.e., P(T = lie)) is considered for overall probability. Term weighting for target documents would also be necessary for sophisticated information retrieval (Fuhr, 1989; Kwok, 1990).</Paragraph>
      <Paragraph position="12"> \[Problem 3\] affected by having insufficient training cases In practical situations, the estimation of P(T = 1, 01c ) is not always straightforward. Let us consider the following case. In the training data, we are given R documents that are assigned to c. Among them, r documents have the term t. In this example, the straightforward estimate of P(T -- llc ) is &amp;quot;r/R.&amp;quot; If &amp;quot;r = 0&amp;quot; (i.e., none of the documents in c has t) and the target document d contains the term t, g(c\[d) becomes -0% which means that d is never categorized into c. Robertson and Sparck Jones mentioned other special cases like the above example (Robertson and Sparck Jones, 1976). A well-known remedy for this problem is to use &amp;quot;(r + 0.5)/(R + 1)&amp;quot; as the estimate of P(T = lie ) (Robertson and Sparck Jones, 1976).</Paragraph>
      <Paragraph position="13"> While various smoothing methods (Church and Gale, 1991; Jelinek, 1990) are also applicable to these situations and would be expected to work better, we used the simple &amp;quot;add one&amp;quot; remedy in the following experiments. null</Paragraph>
    </Section>
    <Section position="2" start_page="162" end_page="163" type="sub_section">
      <SectionTitle>
2.2 Component Theory (CT)
</SectionTitle>
      <Paragraph position="0"> To solve problems 1 and 2 of PRW, Kwok (1990) stresses the assumption that a document consists of terms. This theory is called the Component Theory (CT).</Paragraph>
      <Paragraph position="1"> To introduce within-document term frequencies (i.e., to solve problem 1), CT assumes that a document is completely decomposed into its constituting terms.</Paragraph>
      <Paragraph position="2"> Therefore, rather than counting the number of documents, as in PRW, CT counts the number of terms in a document for probability estimation. This leads to within-document term frequencies. Moreover, to incorporate term weighting for target documents (i.e., to solve problem 2), CT defines g(cld ) as the geometric mean probabilities over components of the target doc-</Paragraph>
      <Paragraph position="4"> For precise derivation, refer to (Kwok, 1990).</Paragraph>
      <Paragraph position="5"> Here, note that P(T = rid ) and P(T = tic ) represent the frequency of a term t within a target document d and that within a category c respectively. Therefore, CT is not subject to problems 1 and 2. However, problem 3 still affects CT. Furthermore, Fuhr (1989) pointed out that transformation, as in Eq. (6), is not monotonic of P(cld ). It follows then, that CT does not satisfy the probabilistic ranking principle (PRP) any more.</Paragraph>
    </Section>
    <Section position="3" start_page="163" end_page="163" type="sub_section">
      <SectionTitle>
2.3 Retrieval with Probabilistic Indexing (RPI)
</SectionTitle>
      <Paragraph position="0"> Fuhr (1989) solves problem 2 by assuming that a document is probabilistically indexed by its term vectors.</Paragraph>
      <Paragraph position="1"> This model is called Retrieval with Probabilistie Indexing (RPI).</Paragraph>
      <Paragraph position="2"> In RPI, a document d has a binary vector -- (T1,..., Tn) where each component corresponds to a term. 7} = 1 means that the document d contains the term ti. X is defined as the set of all possible indexings, where IX I = 2&amp;quot;. Conditioning P(cld ) for each possible indexing gives</Paragraph>
      <Paragraph position="4"> By assuming conditional independence between c and d given a~ 1 , and using Bayes' theorem, Eq. (8) becomes,</Paragraph>
      <Paragraph position="6"> Assuming that each term appears independently in a target document d and in a document assigned to c, Eq. (9) is rewritten as</Paragraph>
      <Paragraph position="8"> Here, all the probabilities are estimated from the training data using the same method described in Section 2.1.</Paragraph>
      <Paragraph position="9"> Since Eq. (10) includes the factor P(T = 1,01d) as well as P(T = 1,01c), RPI takes into account term weighting for target documents. While this in principle solves problem 2, if we use a simple estimation method counting the number of documents which have a term, P(T = 1,0\]d) reduces to 1 or 0 (i.e, binary, not weighted). For example, when a target document d has a term t, P(t = 1\]d) = 1 and when not,</Paragraph>
      <Paragraph position="11"> this binary estimation method, but non binary estimates could be used as in (Fuhr, 1989).</Paragraph>
      <Paragraph position="12"> 1More precisely, P( cld , x) = P( c\[x ) which assumes that if we know x, information for c is independent of that for d. This assumption sounds valid because x is a kind of representation of d.</Paragraph>
      <Paragraph position="13"> As far as other problems are concerned, RPI still problematic. In particular, because of problem 3, P(cld) would become an illegitimate value. In our experiments, as well as in Lewis' experiments (1992), P(cld ) ranges from 0 to more than 101deg.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="163" end_page="164" type="metho">
    <SectionTitle>
3 A Probabilistic Model Based on a Single Random Variable with Multiple Values (SVMV)
</SectionTitle>
    <Paragraph position="0"> In this section, we propose a new probabilistic model for text categorization, and compare it to the previous three models from several viewpoints. Our model is very simple, but yet solves problems 1, 2, and 3 in PRW.</Paragraph>
    <Paragraph position="1"> Document representation of our model is basically the same as CT, that is a document is a set of its constituting terms. The major difference between our model and others is the way of document characterization through probabilities. While almost all previous models assume that an event space for a document is whether the document is indexed or not by a term 2, our model characterizes a document as random sampling of a term from the term set that represents the document. For example, an event &amp;quot;T = ti&amp;quot; means that a randomly selected term from a document is ti. If we want to emphasis indexing process like other models, it is possible to interpret &amp;quot;T = ti&amp;quot; as a randomly selected element from a document being indexed by the term ti.</Paragraph>
    <Paragraph position="2"> Formally, our model can be seen as modifying Fuhr's derivation of P(cld ) by replacing an index vector with a single random variable whose value is one of possible terms. Conditioning P(cld ) for each possible event</Paragraph>
    <Paragraph position="4"> If we assume conditional independence between c and d, given T = ti, that is P(cid, T = ti) = P(c\]T = tl),</Paragraph>
    <Paragraph position="6"> Using Bayes' theorem, this becomes</Paragraph>
    <Paragraph position="8"> All the probabilities in Eq. (13) can be estimated from given training data based on the following definitions.</Paragraph>
    <Paragraph position="9"> * P(T =tilc) is the probability that a randomly selected term in a document is ti, given that the document is assigned to c. We used ~c as the estimator. NCi is the frequency of the term ti in the category c, and NC is the total frequency of terms in c.</Paragraph>
    <Paragraph position="10"> 2In section 2 explaining previous models, we simplified &amp;quot;a document is indexed by a term&amp;quot; as &amp;quot;a document contains a term&amp;quot; for ease of explanation.</Paragraph>
    <Paragraph position="11">  ,, P(T = tild) is the probability that a randomly selected term in a target document d is ti. We used ~D as the estimator. NDi is the frequency of the term ti in the document d, and ND is the total frequency of terms in d.</Paragraph>
    <Paragraph position="13"> selected term in a randomly selected document is ti. We used ~t as the estimator. Ni is the frequency of the term ti in the given training documents, and N is the total frequency of terms in the training documents.</Paragraph>
    <Paragraph position="14"> * P(c) is the prior probability that a randomly selected document is categorized into c. We used D_~ as the estimator. Dc is the frequency of docu- D meats that is categorized to c in the given training documents, and D is the frequency of documents in the training documents.</Paragraph>
    <Paragraph position="15"> Here, let us recall the three problems of PRW. Since SVMV's primitive probabilities are based on within-document term frequencies, SVMV does not have problem 1. Furthermore, SVMV does not have problem 2 either because Eq. (13) includes a factor P(T = tld), which accomplishes term weighting for a target document d.</Paragraph>
    <Paragraph position="16"> For problem 3, let us reconsider the previous example; R documents in the training data are categorized into a category c, none of the R documents has term ti, but a target document d does. If the straightforward estimate of P(T/= llc ) = 0 or P(T = tilc) = 0 is adopted, the document d would never be categorized into c in the previous models (PRW, CT, and RPI).</Paragraph>
    <Paragraph position="17"> In SVMV, the probability P(c\[d) is much less affected by such estimates. This is because P(cld ) in Eq. (13) takes the sum of each term's weight. In this example, the weight for ti is estimated to be 0 as in the other models, but this little affect the total value of P(c\[d). A similar argument applies to all other problems in (Robertson and Sparck Jones, 1976) that are caused by having insufficient training cases. SVMV is formally proven not to suffer from the serious effects (like never being assigned to a category or always being assigned to a category) by having insufficient training cases. In other words, SVMV can directly use the straightforward estimates. In addition, we experimentally verified that the value of P(dlc ) in SVMV is always a legitimate value (i.e., 0 to 1) unlike in RPI.</Paragraph>
    <Paragraph position="18">  As illustrated in the table, SVMV has better characteristics for text categorization compared to the previous models. In the next section, we will experimentally verify SVMV's superiority.</Paragraph>
  </Section>
class="xml-element"></Paper>