<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1041">
  <Title>Feature Selection and Feature Extraction for Text Categorization</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Data Sets and Tasks
</SectionTitle>
    <Paragraph position="0"> Our first data set was a set of 21,450 Reuters newswire stories from the year 1987 [4]. These stories have been manually indexed using 135 financial topic categories, to support document routing and retrieval. Particular care was taken in assigning categories [I]. All stories dated April 7, 1987 and earlier went into a set of 14,704 training documents, and all stories from April 8, 1987 or later went into a test set of 6,746 documents.</Paragraph>
    <Paragraph position="1"> The second data set consisted of 1,500 documents from the US. Foreign Broadcast Information Service (FBIS) that had previously been used in the MUC-3 evaluation of natural language processing systems [2]. The documents are mostly translations from Spanish to English, and include newspaper stories, transcripts of broadcasts, communiques, and other material.</Paragraph>
    <Paragraph position="2"> The MUC-3 task required extracting simulated database records (&amp;quot;templates&amp;quot;) describing terrorist incidents from these texts. Eight of the template slots had a limited number of possible fillers, so a simplification of the MUC-3 task is to view filling these slots as text categorization. There were 88 combinations of these 8 slots and legal fillers for the slots, and each was treated as a binary category. Other text categorization tasks can be defined for the MUC-3 data (see Riloff and Lehnert in this volume).</Paragraph>
    <Paragraph position="3"> We used for our test set the 200 official MUC-3 test documents, plus the first 100 training documents (DEV-MUC3-0001 through DEV-MUC3-0100). Templates for these 300 documents were encoded by the MUC-3 organizers. We used the other 1,200 MUC-3 training documents (encoded by 16 different MUC-3 sites) as our categorization training documents, Category assignments should be quite consistent on our test set, but less so on our training set.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. Categorization Method
</SectionTitle>
    <Paragraph position="0"> The statistical model used in our experiments was proposed by Fuhr [5] for probabilistic text retrieval, but the adaptation to text categorization is straightforward.</Paragraph>
    <Paragraph position="1"> Figure 1 shows the formula used. The model allows the possibility that the values of the binary features for a document is not known with certainty, though that aspect of the model was not used in our experiments.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1. Binary Categorization
</SectionTitle>
      <Paragraph position="0"> In order to compare text categorization output with an existing manual categorization we must replace probability estimates with explicit binary category assignments.</Paragraph>
      <Paragraph position="1"> Previous work on statistical text categorization has often ignored this step, or has not dealt with the case where documents can have zero, one, or multiple correct categories. null Given accurate estimates of P(Cj = 11 Dm), decision theory tells us that the optimal strategy, assuming all errors have equal cost, is to set a single threshold p and assign Cj to a document exactly when P(Cj = llD,) &gt;= p [6].</Paragraph>
      <Paragraph position="2"> However, as is common in probabilistic models for text classification tasks, the formula in Figure 1 makes assumptions about the independence of probabilities which do not hold for textual data. The result is that the estimates of P(Cj = llDm) can be quite inaccurate, as well as inconsistent across categories and documents.</Paragraph>
      <Paragraph position="3"> We investigated several strategies for dealing with this problem and settled on proportional assignment [4].</Paragraph>
      <Paragraph position="4"> Each category is assigned to its top scoring documents on the test set in a designated multiple of the percentage of documents it was assigned to on the training corpus.</Paragraph>
      <Paragraph position="5"> Proportional assignment is not very satisfactory from a theoretical standpoint, since the probabilistic model is supposed to already take into account the prior probability of a category. In tests the method was found to perform well as a standard decision tree induction method, however, so it is at least a plausible strategy.</Paragraph>
      <Paragraph position="6"> We are continuing to investigate other approaches.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2. Feature Selection
</SectionTitle>
      <Paragraph position="0"> A primary concern of ours was to examine the effect of feature set size on text categorization effectiveness.</Paragraph>
      <Paragraph position="1"> All potential features were ranked for each category by expected mutual information [7] between assignment of 0 WORDS-DF2: Starts with all words tokenized by parts. Capitalization and syntactic class ignored.</Paragraph>
      <Paragraph position="2"> Stopwords discarded based on syntactic tags. Tokens consisting solely of digits and punctuation removed. Words occurring in fewer than 2 training documents removed. Total terms: 22,791.</Paragraph>
      <Paragraph position="3"> 0 WC-MUTINFO-135: Starts with WORDS-DF2, and discards words occurring in fewer than 5 or more than 1029 (7%) training documents. RNN clustering used 135 metafeatures with value equal to mutual information between presence of the word and presence of a manual indexing category. Result is 1,442 clusters and 8,506 singlets, for a total of 9,948 terms.</Paragraph>
      <Paragraph position="4"> 0 PHRASE-DF2: Starts with all simple noun phrases bracketed by parts. Stopwords removed from phrases based on tags. Single word phrases discarded. Numbers replaced with the token NUM-BER. Phrases occurring in fewer than 2 training documents removed. Total terms: 32,521.</Paragraph>
      <Paragraph position="5"> 0 PC-W-GIVEN-C-44: Starts with PHRASE-DF2.</Paragraph>
      <Paragraph position="6"> Phrases occurring in fewer than 5 training documents removed. RNN clustering uses 44 metafeatures with value equal to our estimate of P(W =</Paragraph>
      <Paragraph position="8"> Reuters data set.</Paragraph>
      <Paragraph position="9"> that feature and assignment of that category. The top k features for each category were chosen as its feature set, and different values of k were investigated.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. Indexing Languages
</SectionTitle>
    <Paragraph position="0"> We investigated phrasal and term clustering methods only on the Reuters collection, since the smaller amount of text made the MUC-3 corpus less appropriate for clustering experiments. For the MUC-3 data set a single indexing language consisting of 8,876 binary features was tested, corresponding to all words occurring in 2 or more training documents. The original MUC-3 text was all capitalized. Stop words were not removed.</Paragraph>
    <Paragraph position="1"> For the Reuters data we adopted a conservative approach to syntactic phrase indexing. The phrasal indexing language consisted only of simple noun phrases, i.e. head nouns and their immediate premodifiers. Phrases were formed using parts, a stochastic syntactic class tagger and simple noun phrase bracketing program [8]. Words We estimate P(Cj = llDm) by:</Paragraph>
    <Paragraph position="3"> the goal of the categorization procedure. The index j ranges over categories to be assigned.</Paragraph>
    <Paragraph position="5"> All probabilities were estimated from the training corpus using the &amp;quot;add one&amp;quot; adjustment (the Jeffreys prior).</Paragraph>
    <Paragraph position="6"> Figure 1: Probabilistic model used for text categorization.</Paragraph>
    <Paragraph position="7"> that were tagged as function words were removed from 5. Evaluation phrases, and all items tagged as numbers were replaced The effectiveness measures used were recall (number of with the NUMBER' We wed the parts segcategories correctly assigned divided by the total nummentation to define the set of words indexed on.</Paragraph>
    <Paragraph position="8"> ber of categories that should be assigned) and precision Reciprocal nearest neighbor clustering was used for clus(number ofcategorie~ correctly assigned divided by total tering features. An RNN cluster consists of two items, number of categories assigned).</Paragraph>
    <Paragraph position="9"> each of which is the nearest neighbor of the other according to the similarity metric in use. Therefore, not all items are clustered. If this stringent clustering strategy does not bring together closely related features, it is unlikely that any clustering method using the same metafeatures would do so.</Paragraph>
    <Paragraph position="10"> Clustering features requires defining a set of metafeatures on which the similarity of the features will be judged. We experimented with forming clusters from words under three metafeature definitions, and from phrases under eight metafeature definitions 141. Metafeatures were based on presence or absence of features in documents, or on the strength of association of features with categories of documents. In all cases, similarity between metafeature vectors was measured using the cosine correlation. The sets of clusters formed were examined by the author, and categorization experiments were run with the three sets of word clusters and with the two sets of phrase clusters that appeared best. Figure 2 summarizes the properties of the most effective version of each representation type used in the experiments on the Reuters data.</Paragraph>
    <Paragraph position="11"> For a set of k categories and d documents a total of n = kd categorization decisions are made. We used microaveraging, which considers all kd decisions as a single group, to compute average effectiveness 191. The proportionality parameter in our categorization method was varied to show the possible tradeoffs between recall and precision. As a single summary figure for recall precision curves we took the breakeven point, i.e. the highest value (interpolated) at which recall and precision are equal.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="216" type="metho">
    <SectionTitle>
6. Results
</SectionTitle>
    <Paragraph position="0"> We first looked at effectiveness of proportional assignment with word-based indexing languages. Figure 3 shows results for the best feature set sizes found: 10 features on Reuters and 15 features on MUC-3. A breakeven point of 0.65 on Reuters and 0.48 on MUC-3 is reached. For comparison, the operational AIR/X system uses both rule-based and statistical techniques to achieve a microaveraged breakeven point of approximately 0.65 in indexing a physics database [lo].</Paragraph>
    <Paragraph position="1"> The CONSTRUE rule-based text categorization system achieves a microaveraged breakeven of around 0.90 on  (w/ 10 features) and MUC-3 (w/ 15 features) test sets. a different, and possibly easier, testset drawn from the Reuters data \[1\]. This level of performance, the result of a 9.5 person-year effort, is an admirable target for learning-based systems to shoot for.</Paragraph>
    <Paragraph position="2"> Comparison with published results on MUC-3 are difficult, since we simplified the complex MUC-3 task. However, in earlier experiments using the official MUC-3 test-set and scoring, proportional assignment achieved performance toward but within the low end of official MUC-3 scores achieved by a variety of NLP methods. This is despite being limited in most cases to 50% the score achievable by methods that attempted cross-referencing In\].</Paragraph>
    <Section position="1" start_page="0" end_page="215" type="sub_section">
      <SectionTitle>
6.1. Feature Selection
</SectionTitle>
      <Paragraph position="0"> Figure 4 summarizes our data on feature set size. We show the breakeven point reached for categorization runs with various size sets of words, again on both the Reuters and MUC-3 data sets. The results exhibit the classic peak associated with the &amp;quot;curse of dimensionality.&amp;quot; The surprise is the small number of features found to be optimal. With 14,704 and 1,300 training examples, peaks of 10 and 15 features respectively are smaller than one would expect based on sample size considerations.</Paragraph>
      <Paragraph position="1"> Overfitting, i.e. training a model on accidental as well as systematic relationships between feature values and  sets of words on Reuters and MUC-3 test sets, and on Reuters training set.</Paragraph>
      <Paragraph position="2"> category membership, was one possible villain \[6\]. We checked for overfitting directly by testing the induced classifiers on the training set. The thicker line in Figure 4 shows the effectiveness of the Reuters classifers when tested on the 14,704 stories used to train them. Surprisingly, effectiveness reaches a peak not much higher than that achieved on the unseen test set, and even drops off when a very large feature set is used. Apparently our probabilistic model is sufficiently constrained that, while overfitting occurs, its effects are limited3 Another possible explanation for the decrease in effectiveness with increasing feature set size is that the assumptions of the probabilistic model are increasingly violated. Fuhr's model assumes that the probability of observing a word in a document is independent of the probability of observing any other word in the document, both for documents in general and for documents known to belong to particular categories. The number of opportunities for groups of dependent features to be selected as predictor features for the same category increases as the feature set size grows.</Paragraph>
      <Paragraph position="3"> Finally, since features with a higher value on expected mutual information are selected first, we intuitively expect features with lower ratings, and thus appearing only in the larger feature sets, to simply be worse features.</Paragraph>
      <Paragraph position="4"> This intuition is curiously hard to justify. Any feature has some set of conditional and uncondltionM probabilities and, if the assumptions of the statistical model hold,  test set for WORDS-DF2 words (10 features), WC-MUTINFO-135 word clusters (10 features), PHRASE-DF2 phrases (180 features), and PC-W-GIVEN-C-44 phrase clusters (90 features).</Paragraph>
      <Paragraph position="5"> will be used in an appropriate fashion. It may be that the inevitable errors in estimating probabilities from a sample are more harmful when a feature is less strongly associated with a category.</Paragraph>
    </Section>
    <Section position="2" start_page="215" end_page="216" type="sub_section">
      <SectionTitle>
6.2. Feature Extraction
</SectionTitle>
      <Paragraph position="0"> The best results we obtained for each of the four basic representations on the Reuters test set are shown in Figure 5. Individual terms in a phrasal representation have, on the average, a lower frequency of appearance than terms in a word-based representation. So, not surprisingly, effectiveness of a phrasal representation peaks at a much higher feature set size (around 180 features) than that of a word-based representation (see Figure 6).</Paragraph>
      <Paragraph position="1"> More phrases are needed simply to make any distinctions among documents. Maximum effectiveness of the phrasal representation is also substantially lower than that of the word-based representation. Low frequency and high degree of synonymy outweigh the advantages phrases have in lower ambiguity.</Paragraph>
      <Paragraph position="2"> Disappointingly, as shown in Figure 5, term clustering did not significantly improve the quality of either a word-based or phrasal representation. Figure 7 shows some representative PC-W-GIVEN-C-44 phrase clusters.</Paragraph>
      <Paragraph position="3">  sized feature sets of words and phrases on Reuters test set.</Paragraph>
      <Paragraph position="4"> (The various abbreviations and other oddities in the phrases were present in the original text.) Many of the relationships captured in the clusters appear to be accidental rather than the systematic semantic relationships hoped for.</Paragraph>
      <Paragraph position="5"> Why did phrase clustering fail? In earlier work on the CACM collection \[3\], we identified lack of training data as a primary impediment to high quality cluster formation. The Reuters corpus provided approximately 1.5 million phrase occurrences, a factor of 25 more than CACM. Still, it remains the case that the amount of data was insufficient to measure the distributional properties '8 investors service inc, &lt; amo &gt; NUMBER accounts, slate regulators NUMBER elections, NUMBER engines federal reserve chairman paul volcker, private consumption additional NUMBER dlrs, america &gt; canadian bonds, cme board denmark NUMBER, equivalent price fund government-approved equity investments, fuji bank led its share price, new venture new policy, representative offices same-store sales, santa rosa  of many phrases encountered.</Paragraph>
      <Paragraph position="6"> The definition of metafeatures is a key issue to reconsider. Our original reasoning was that, since phrases have low frequency, we should use metafeatures corresponding to bodies of text large enough that we could expect cooccurrences of phrases within them. The poor quality of the clusters formed suggests that this approach is not effective. The use of such coarse-grained metafeatures simply gives many opportunities for accidental cooccurrences to arise, without providing a sufficient constraint on the relationship between phrases (or words). The fact that clusters captured few high quality semantic relationships, even when an extremely conservative clustering method was used, suggests that using other clustering methods with the same metafeature definitions is not likely to be effective.</Paragraph>
      <Paragraph position="7"> Finally, while phrases are less ambiguous than words, they are not all good content indicators. Even restricting phrase formation to simple noun phrases we see a substantial number of poor content indicators, and the impact of these are compounded when they are clustered with better content indicators.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="216" end_page="216" type="metho">
    <SectionTitle>
7. Future Work
</SectionTitle>
    <Paragraph position="0"> A great deal of research remains in developing text categorization methods. New approaches to setting appropriate category thresholds, estimating probabilities, and selecting features need to be investigated. For practical systems, combinations of knowledge-based and statistical approaches are likely to be the best strategy.</Paragraph>
    <Paragraph position="1"> On the text representation side, we continue to believe that forming groups of syntactic indexing phrases is an effective route to better indexing languages. We believe the key will be supplementing statistical evidence of phrase similarity with evidence from thesauri and other knowledge sources, along with using metafeatures which provide tighter constraints on meaning. Clustering of words and phrases based on syntactic context is a promising approach (see Strzalkowski in this volume).</Paragraph>
    <Paragraph position="2"> Pruning out of low quality phrases is also likely to be important.</Paragraph>
  </Section>
  <Section position="8" start_page="216" end_page="216" type="metho">
    <SectionTitle>
8. Summary
</SectionTitle>
    <Paragraph position="0"> We have shown a statistical classifier trained on manually categorized documents to achieve quite effective performance in assigning multiple, overlapping categories to documents. We have also shown, via studying text categorization effectiveness, a variety of properties of indexing languages that are difficult or impossible to measure directly in text retrieval experiments, such as effects of feature set size and performance of phrasal representations in isolation from word-based representations.</Paragraph>
    <Paragraph position="1"> Like text categorization, text retrieval is a text classification task. The results shown here for text categorization, in particular the ineffectiveness of term clustering with coarse-grained metafeatures, are likely to hold for text retrieval as well, though further experimentation is necessary.</Paragraph>
  </Section>
class="xml-element"></Paper>