<?xml version="1.0" standalone="yes"?> <Paper uid="A94-1027"> <Title>A Probabilistic Model for Text Categorization: Based on a Single Random Variable with Multiple Values</Title> <Section position="7" start_page="164" end_page="166" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> This section describes experiments conducted to evaluate the performance of our model (SVMV) compared to the other three (PRW, CT, and RPI).</Paragraph> <Section position="1" start_page="164" end_page="164" type="sub_section"> <SectionTitle> 4.1 Data and Preprocessing </SectionTitle> <Paragraph position="0"> A collection of Wall Street Journal (WSJ) full-text news stories (Liberman, 1991) 3 was used in the experiments. We extracted all 12,380 articles from 1989/7/25 to 1989/11/2.</Paragraph> <Paragraph position="1"> The WSJ articles from 1989 are indexed with 78 categories (topics). Articles having no category were excluded. 8,907 articles remained; each having 1.94 categories on the average. The largest category is &quot;TENDER OFFERS, MERGERS, ACQUISITIONS (TNM)&quot; which encompassed 2,475 articles; the smallest one is &quot;RUBBER (RUB)&quot;, assigned to only 2 articles. On the average, one category is assigned to 443 articles.</Paragraph> <Paragraph position="2"> All 8,907 articles were tagged by the Xerox Part-of-Speech Tagger (Cutting et al., 1992) 4. From the tagged articles, we extracted the root words of nouns using the &quot;ispell&quot; program 5. As a result, each article has a set of root words representing it, and each element in the set (i.e. root word of a noun) corresponds to a term. We did not reduce the number of terms by using stop words list or feature selection method, etc. The number of terms amounts to 32,975.</Paragraph> <Paragraph position="3"> Before the experiments, we divided 8,907 articles into two sets; one for training (i.e., for probability estimation), and the other for testing. The division was made according to chronology. All articles that appeared from 1989/7/25 to 1989/9/29 went into a training set of 5,820 documents, and all articles from 1989/10/2 to 1989/11/2 went into a test set of 3,087 documents.</Paragraph> </Section> <Section position="2" start_page="164" end_page="165" type="sub_section"> <SectionTitle> 4.2 Category Assignment Strategies </SectionTitle> <Paragraph position="0"> In the experiments, the probabilities, P(c), P(Ti = llc), P(T =tilc), and so forth, were estimated from the 5,820 training documents, as described in the previous sections. Using these estimates, we calculated the posterior probability (P(cld)) for each document (d) of (c). The four probabilistic models are compared in this calculation.</Paragraph> <Paragraph position="1"> There are several strategies for assigning categories to a document based on the probability P(cld ). The simplest one is the k-per-doc strategy (Field, 1975) that assigns the top k categories to each document. A more sophisticated one is the probability threshold strategy, in which all the categories above a user-defined threshold are assigned to a document.</Paragraph> <Paragraph position="2"> Lewis proposed the proportional assignment strategy based on the probabilistic ranking principle (Lewis, 1992). Each category is assigned to its top scoring documents in proportion to the number of times the category was assigned in the training data. 
</Section> <Section position="3" start_page="165" end_page="166" type="sub_section"> <SectionTitle> 4.3 Results and Discussions </SectionTitle> <Paragraph position="0"> Using a category assignment strategy, several categories are assigned to each test document. The best-known measures for evaluating text categorization models are recall and precision, calculated by the following equations (Lewis, 1992):

Recall = (the number of categories that are correctly assigned to documents) / (the number of categories that should be assigned to documents)

Precision = (the number of categories that are correctly assigned to documents) / (the number of categories that are assigned to documents)

Note that recall and precision have somewhat mutually exclusive characteristics. To raise recall, one can simply assign many categories to each document; however, this degrades precision, since almost all of the assigned categories are then false. A breakeven point, the point at which recall and precision are equal, may be used to summarize the balance between them. For each strategy, we calculated breakeven points using the four probabilistic models.</Paragraph>
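To make the evaluation procedure concrete, the sketch below computes micro-averaged recall and precision from the assigned and gold-standard category sets, and locates an approximate breakeven point by sweeping a strategy parameter (such as the probability threshold or the proportionality constant). It is a minimal sketch under assumed data structures, not the authors' evaluation code; `recall_precision`, `breakeven_point`, and the dictionary layout are hypothetical.

```python
def recall_precision(assigned, gold):
    """Micro-averaged recall and precision over all document/category pairs.

    assigned -- dict: doc_id -> set of categories assigned by the system
    gold     -- dict: doc_id -> set of categories that should be assigned
    """
    correct = sum(len(assigned.get(d, set()) & cats) for d, cats in gold.items())
    should  = sum(len(cats) for cats in gold.values())
    did     = sum(len(cats) for cats in assigned.values())
    recall = correct / should if should else 0.0
    precision = correct / did if did else 0.0
    return recall, precision

def breakeven_point(assign_with, gold, params):
    """Sweep a strategy parameter (assign_with maps a parameter value to a
    doc_id -> categories assignment) and return the recall/precision value
    where the two are closest to equal, approximating the breakeven point."""
    best_gap, best_value = None, None
    for p in params:
        r, pr = recall_precision(assign_with(p), gold)
        gap = abs(r - pr)
        if best_gap is None or gap < best_gap:
            best_gap, best_value = gap, (r + pr) / 2.0
    return best_value
```

The breakeven values in Table 2 below correspond to the best such point found for each strategy.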
<Paragraph position="1"> Table 2 shows the best breakeven points identified for the three strategies, along with the models that achieved them.

Table 2: Best breakeven points for the three category assignment strategies
  Strategy                    Breakeven point
  Proportional assignment     0.63 (by SVMV)
  Probability thresholding    0.47 (by SVMV)
  k-per-doc                   0.43 (by SVMV)
</Paragraph> <Paragraph position="2"> From Table 2, we find that SVMV with proportional assignment gives the best result (0.63). The superiority of proportional assignment over the other strategies has already been reported by Lewis (1992), and our experiment confirms his result. In addition, for each of the three strategies, SVMV gives the highest breakeven point among the four probabilistic models. Figure 1 shows the recall/precision trade-off for the four probabilistic models with the proportional assignment strategy. As a reference, the recall/precision curve of a well-known vector model (Salton and Yang, 1973) (&quot;TF.IDF&quot;) is also presented, and Table 3 lists the breakeven point for each model. All the breakeven points were obtained when the proportionality constant was about 1.0.</Paragraph> <Paragraph position="3"> In the TF.IDF model we used, each element of the document vector is the &quot;term frequency&quot; multiplied by the &quot;inverse document frequency&quot;, and the similarity between every pair of vectors is measured by the cosine. Note that this is the simplest version of the TF.IDF model; many improvements exist that we did not consider in the experiments.</Paragraph> <Paragraph position="4"> Fig. 1: Recall/precision curves of the four probabilistic models and TF.IDF under the proportional assignment strategy.

From Figure 1 and Table 3, we can see that:
* as far as this dataset is concerned, SVMV with the proportional assignment strategy gives the best result among the four probabilistic models;
* the models that consider within-document term frequencies (SVMV, CT) are better than those that do not (PRW, RPI);
* the models that consider term weighting for target documents (SVMV, CT) are better than those that do not (PRW, RPI); and
* the models that are less affected by insufficient training cases (SVMV) are better than those that are (CT, RPI, PRW).</Paragraph> </Section> </Section> </Paper>