<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0806"> <Title>Integrating a Lexical Database and a Training Collection for Text Categorization</Title> <Section position="4" start_page="39" end_page="40" type="metho"> <SectionTitle> 3 Integrating Resources in the Vector Space Model </SectionTitle> <Paragraph position="0"> The Vector Space Model (VSM) [Salton and McGill, 1983] is a very suitable environment for expressing our approaches to TC: it is supported by many experiences in text retrieval [Lewis, 1992; Salton, 1989]; it allows the seamless integration of multiple knowledge sources for text classification; and it makes it easy to identify the role of every knowledge source involved in the classification operation. In the next sections we present a straightforward adaptation of the VSM for TC, and the way we use the chosen resources for calculating several model elements.</Paragraph> <Section position="2" start_page="39" end_page="39" type="sub_section"> <SectionTitle> 3.1 Vector Space Model for Text Categorization </SectionTitle> <Paragraph position="0"> The bulk of the VSM for Information Retrieval (IR) is representing natural language expressions as term weight vectors. Each weight measures the importance of a term in a natural language expression, which can be a document or a query. Semantic closeness between documents and queries is computed by the cosine of the angle between document and query vectors.</Paragraph> <Paragraph position="1"> Exploiting an obvious analogy between queries and categories, the latter can be represented by term weight vectors. Then, a category can be assigned to a document when the cosine similarity between them exceeds a certain threshold, or when the category is highly ranked. In a closer look, and given three sets of N terms, M documents and L categories, the weight vector for document j is (wd_{1j}, wd_{2j}, ..., wd_{Nj}) and the weight vector for category k is (wc_{1k}, wc_{2k}, ..., wc_{Nk}). The similarity between document j and category k is obtained with the formula:</Paragraph> <Paragraph position="2"> sim(d_j, c_k) = \frac{\sum_{i=1}^{N} wd_{ij} \, wc_{ik}}{\sqrt{\sum_{i=1}^{N} wd_{ij}^{2}} \; \sqrt{\sum_{i=1}^{N} wc_{ik}^{2}}} </Paragraph> <Paragraph position="3"> Term weights for document vectors can be computed making use of well-known formulae based on term frequency. We use the following one from [Salton, 1989]: wd_{ij} = tf_{ij} \cdot \log_{2}\frac{M}{df_{i}}, where tf_{ij} is the frequency of term i in document j, and df_i is the number of documents in which term i occurs. Now, only the weights for category vectors are to be obtained. Next we will show how to do it depending on the resource used.</Paragraph> </Section>
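For concreteness, the following is a minimal Python sketch of the document weighting and cosine similarity just described. The whitespace tokenizer and all function names are illustrative assumptions, not part of the paper.

    import math
    from collections import Counter

    def document_vectors(docs):
        # wd_ij = tf_ij * log2(M / df_i), the Salton (1989) weight used above.
        M = len(docs)
        tokenized = [d.lower().split() for d in docs]   # assumed tokenizer
        df = Counter()
        for tokens in tokenized:
            df.update(set(tokens))
        return [{t: tf * math.log2(M / df[t]) for t, tf in Counter(tokens).items()}
                for tokens in tokenized]

    def cosine(doc_vec, cat_vec):
        # Cosine of the angle between a document vector and a category vector.
        num = sum(w * cat_vec.get(t, 0.0) for t, w in doc_vec.items())
        den = (math.sqrt(sum(w * w for w in doc_vec.values())) *
               math.sqrt(sum(w * w for w in cat_vec.values())))
        return num / den if den else 0.0

A category is then assigned to a document when this similarity exceeds the chosen threshold, or when the category is ranked high enough.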
<Section position="3" start_page="39" end_page="39" type="sub_section"> <SectionTitle> 3.2 Direct Approach </SectionTitle> <Paragraph position="0"> This approach to TC makes no use of any resource apart from the documents to be classified. It tests the intuition that the name of a content-based category is a good predictor for the occurrence of that category.</Paragraph> <Paragraph position="1"> For instance, the occurrence of the word "barley" in a document suggests that this one should be classified in the barley category. (All the following examples are taken from the Reuters category set and involve words that actually occur in the documents.)</Paragraph> <Paragraph position="2"> We have taken exactly the category names, although classification in more general categories like strategic-metal should rather rely on the occurrence of more specific words like "gold" or "zinc." In this approach, the terms used for the representation are just the categories themselves. The weight of term i in the vector for category j is 1 if i = j and 0 in other cases. Multiword categories imply the use of multiword terms. For example, the expression "balance of payments" is considered as one term. When categories consist of several synonyms (like iron-steel), all of them are used in the representation. Since the number of categories in Reuters is 135, and two of them are composite, this approach produces 137-component vectors.</Paragraph> </Section> <Section position="4" start_page="39" end_page="40" type="sub_section"> <SectionTitle> 3.3 WordNet-based Approach </SectionTitle> <Paragraph position="0"> Lexical databases contain many kinds of information (concepts; synonymy and other lexical relations; hyponymy and other conceptual relations; etc.). For instance, WordNet represents concepts as synonym sets, or synsets. We have selected this synonymy information, performing a "category expansion" similar to query expansion in IR. For any category, the synset it belongs to is selected, and any other term belonging to it is added to the representation. This technique increases the amount of evidence used to predict category occurrence.</Paragraph> <Paragraph position="1"> Unfortunately, the disambiguation of categories with respect to WordNet concepts is required. We have performed this task manually, because the small number of categories in the test collection made it affordable. We are currently designing algorithms for automating this operation.</Paragraph> <Paragraph position="2"> After locating categories in WordNet, a term set containing all the category's synonyms has been built.</Paragraph> <Paragraph position="3"> For the 135 categories used in this study, we have produced 368 terms. Although some meaningless terms occur and could be deleted, we have developed no automatic criteria for this at the moment.</Paragraph> <Paragraph position="4"> Let us look at one example. The fuel category has driven us to the addition of the terms "combustible" and "combustible material," since they belong to the same synset in WordNet. In general, the term weight vector for category k is 1 for every synonym of the category and 0 for any other term.</Paragraph> </Section>
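A rough sketch of this category expansion, using NLTK's WordNet interface, might look as follows. The sense index shown here only stands in for the manual disambiguation described above, and synset contents will differ across WordNet versions; all names are illustrative assumptions.

    from nltk.corpus import wordnet as wn  # assumes the WordNet corpus is installed

    def expand_category(category, sense_index=0):
        # Collect the category name plus every synonym in one of its synsets.
        # The paper picked the synset by hand; sense_index is a stand-in for that.
        synsets = wn.synsets(category.replace(" ", "_"))
        terms = {category}
        if synsets:
            terms.update(name.replace("_", " ")
                         for name in synsets[sense_index].lemma_names())
        return terms

    def wordnet_category_vector(category):
        # Weight 1 for every synonym of the category, 0 (implicit) for other terms.
        return {term: 1.0 for term in expand_category(category)}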
<Section position="5" start_page="40" end_page="40" type="sub_section"> <SectionTitle> 3.4 Training Collection Approach </SectionTitle> <Paragraph position="0"> The key assumption when using a training collection is that a term often occurring within a category and rarely within others is a good predictor for that category. A set of predictors is typically computed from term-to-category co-occurrence statistics, as a training step. The computation depends on the approach and algorithm selected. As Lewis [1992] has done before, we have replicated in the VSM early Bayesian experiments that had reported good results.</Paragraph> <Paragraph position="1"> Terms are selected according to the number of times they occur within categories. Those terms which co-occur with at least 1% and at most 10% of the categories are taken. Among them, the 286 with highest document frequency are selected. We work the weights out in the same way as in document vectors: wc_{ik} = tf_{ik} \cdot \log_{2}\frac{L}{cf_{i}}, where tf_{ik} is the number of times that term i occurs within documents assigned to category k, and cf_i is the number of categories within which term i occurs. For example, after selecting and weighting categories, the high-frequency term "export" shows its largest weight for category trade, but it also shows large weights for grain or wheat, and small weights for belgian-franc and wool. A less frequent term typically provides evidence for a smaller number of categories. For example, "private" has a large weight only for acq (acquisition), and medium weights for earn (earnings) and trade.</Paragraph> </Section> <Section position="6" start_page="40" end_page="40" type="sub_section"> <SectionTitle> 3.5 Integrating WordNet and a Training Collection </SectionTitle> <Paragraph position="0"> Several ways of integrating WordNet and Reuters have occurred to us. A sensible one is to use concepts instead of terms as representatives. However, and although promising, Voorhees [1993] reported no improvements with this idea. On the other hand, we have realized that the shortcomings in training can be corrected using WordNet to provide better forecasts of low-frequency categories.</Paragraph> <Paragraph position="1"> In general, we have linked WordNet weight vectors to training weight vectors. First we have removed those WordNet terms not occurring in the training collection. Then we have normalized both WordNet vectors and training vectors to separately add up across each category. This way we have smoothed the training weights (much larger than the WordNet ones), giving equal influence to each kind of term weight. This technique results in vectors of 461 term weights, 185 terms coming from WordNet and 286 from training; weights for terms occurring in both sets have been summed. Examples of terms coming from training are "import" or "government," with high weights for highly frequent categories, like acq. Examples of terms coming from WordNet are "petroleum" or "peanut," with weights only for the corresponding categories crude and groundnut respectively.</Paragraph> <Paragraph position="2"> We can clearly identify the role of each resource in this TC approach. WordNet supplies information on the semantic relatedness of terms and categories when training data is not available or reliable, and it directly contributes part of the terms used in the vector representation. On the other hand, the training collection supplies terms for those categories that are better trained. The problem of unavailability of training data is then overcome through the use of an external resource.</Paragraph> </Section> </Section>
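As an illustration, the sketch below derives training-based category weights and merges them with WordNet vectors roughly as described in Sections 3.4 and 3.5. The per-category sum-to-one normalization is one reading of "add up across each category" and, like all names here, is an assumption rather than the paper's exact procedure.

    import math
    from collections import Counter, defaultdict

    def training_category_vectors(term_counts_per_category):
        # term_counts_per_category: {category: Counter of term occurrences within
        # documents assigned to that category}.  wc_ik = tf_ik * log2(L / cf_i).
        L = len(term_counts_per_category)
        cf = Counter()
        for counts in term_counts_per_category.values():
            cf.update(set(counts))
        return {cat: {t: tf * math.log2(L / cf[t]) for t, tf in counts.items()}
                for cat, counts in term_counts_per_category.items()}

    def normalized(vec):
        total = sum(vec.values())
        return {t: w / total for t, w in vec.items()} if total else {}

    def merge_vectors(wordnet_vecs, training_vecs, training_vocabulary):
        # Drop WordNet terms unseen in training, normalize each source per
        # category, and sum weights for terms contributed by both sources.
        merged = defaultdict(lambda: defaultdict(float))
        for cat in set(wordnet_vecs) | set(training_vecs):
            wn_vec = {t: w for t, w in wordnet_vecs.get(cat, {}).items()
                      if t in training_vocabulary}
            for source in (wn_vec, training_vecs.get(cat, {})):
                for t, w in normalized(source).items():
                    merged[cat][t] += w
        return {cat: dict(vec) for cat, vec in merged.items()}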
<Section position="5" start_page="40" end_page="41" type="metho"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> Evaluation of TC and other text classification operations exhibits great heterogeneity. Several metrics and test collections have been used for different approaches or works. This results in a lack of comparability among the approaches, forcing researchers to replicate each other's experiments. Trying to minimize this problem, we have chosen a set of widely used metrics and a frequently used, freely available test collection for our work. The metrics are recall and precision, and the test collection is, as introduced before, Reuters-22173. Before stepping into the actual results, we take a closer look at these elements.</Paragraph> <Section position="1" start_page="40" end_page="41" type="sub_section"> <SectionTitle> 4.1 Evaluation metrics </SectionTitle> <Paragraph position="0"> The VSM promotes recall- and precision-based evaluation, but there are several ways of calculating or even defining them. We focus on recall; the discussion is analogous for precision. First, the definition can be given regarding categories or documents [Larkey and Croft, 1996]. Second, computation can be done macro-averaging or micro-averaging [Lewis, 1992].</Paragraph> <Paragraph position="1"> Recall can be defined as the number of documents correctly assigned to a category over the number of documents that should be assigned to the category.</Paragraph> <Paragraph position="2"> But a document-oriented definition is also possible: the number of categories correctly assigned to a document over the number of correct categories to be assigned to the document. This latter definition is more coherent with the task, but the former allows identifying the most problematic categories.</Paragraph> <Paragraph position="3"> Macro-averaging consists of computing recall and precision for every item (document or category) in one of the previous ways, and averaging afterwards.</Paragraph> <Paragraph position="4"> Micro-averaging consists of adding up all the numbers of correctly assigned items, items assigned, and items to be assigned, and calculating only one value of recall and precision. When micro-averaging, no distinction between document and category orientation can be made. Macro-averaging assigns equal weight to every category, while micro-averaging is influenced by the most frequent categories.</Paragraph> <Paragraph position="5"> Evaluation finally depends on the category assignment strategy: probability thresholding, k-per-doc assignment, etc. Strategies define the way to produce recall/precision tables. For instance, if similarities are normalized to the [0,1] interval, eleven levels of probability threshold can be set at 0.0, 0.1, and so on. When the system performs k-per-doc assignment, the value of k ranges from 1 to a reasonable maximum.</Paragraph> <Paragraph position="6"> We must assign an unknown number of categories to each document in Reuters, so the probability thresholding approach seems the most sensible one. We have therefore computed recall and precision for eleven levels of threshold, both macro- and micro-averaging. When macro-averaging, we have used the category-oriented definition of recall and precision. After that, we have calculated averages of those eleven values in order to get single figures for comparison.</Paragraph> </Section> <Section position="2" start_page="41" end_page="41" type="sub_section"> <SectionTitle> Figure 1: Example news article classified in bop and trade </SectionTitle> <Paragraph position="0"> ORGS: END-ORGS EXCHANGES: END-EXCHANGES COMPANIES: END-COMPANIES</Paragraph> <Paragraph position="1"> ITALIAN BALANCE OF PAYMENTS IN DEFICIT IN MAY</Paragraph> <Paragraph position="2"> ROME, June 18 - Italy's overall balance of payments showed a deficit of 3,211 billion lire in May compared with a surplus of 2,040 billion in April, provisional Bank of Italy figures show.</Paragraph> <Paragraph position="3"> The May deficit compares with a surplus of 1,555 billion lire in the corresponding month of 1986.</Paragraph> <Paragraph position="4"> For the first five months of 1987, the overall balance of payments showed a surplus of 299 billion lire against a deficit of 2,854 billion in the corresponding 1986 period.</Paragraph> <Paragraph position="5"> REUTER</Paragraph> </Section>
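The evaluation loop just described can be sketched as follows; the treatment of categories with no relevant or no assigned documents in the macro averages is our own assumption, as are all names.

    def evaluate(similarities, gold, threshold):
        # similarities: {(doc, category): similarity normalized to [0, 1]}
        # gold: {doc: set of correct categories}
        cats = {c for g in gold.values() for c in g}
        assigned = {d: {c for c in cats if similarities.get((d, c), 0.0) > threshold}
                    for d in gold}
        # Micro-averaging: pool every assignment decision into single counts.
        correct = sum(len(assigned[d] & gold[d]) for d in gold)
        n_assigned = sum(len(assigned[d]) for d in gold)
        n_relevant = sum(len(gold[d]) for d in gold)
        micro_r = correct / n_relevant if n_relevant else 0.0
        micro_p = correct / n_assigned if n_assigned else 0.0
        # Macro-averaging: category-oriented recall/precision, averaged afterwards.
        recalls, precisions = [], []
        for c in cats:
            hits = sum(1 for d in gold if c in assigned[d] and c in gold[d])
            rel = sum(1 for d in gold if c in gold[d])
            asg = sum(1 for d in gold if c in assigned[d])
            if rel:
                recalls.append(hits / rel)
            if asg:
                precisions.append(hits / asg)
        macro_r = sum(recalls) / len(recalls) if recalls else 0.0
        macro_p = sum(precisions) / len(precisions) if precisions else 0.0
        return micro_r, micro_p, macro_r, macro_p

    # Eleven threshold levels, as in the paper:
    # results = [evaluate(sims, gold, t / 10) for t in range(11)]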
<Section position="3" start_page="41" end_page="41" type="sub_section"> <SectionTitle> 4.2 The Test Collection </SectionTitle> <Paragraph position="0"> The Reuters-22173 collection consists of 22,173 newswire articles from Reuters collected during 1987.</Paragraph> <Paragraph position="1"> Documents in Reuters deal with financial topics, and were classified in several sets of financial categories by personnel from Reuters Ltd. and Carnegie Group Inc. Documents vary in length and in the number of categories assigned, from 1 line to more than 50, and from no categories to more than 8. There are five sets of categories: TOPICS, ORGANIZATIONS, EXCHANGES, PLACES, and PEOPLE. As others before, we have selected the 135 TOPICS for our experiments. An example of a news article classified in bop (balance of payments) and trade is shown in Figure 1. Some spurious formatting has been removed from it.</Paragraph> <Paragraph position="2"> Several partitions have been suggested for Reuters [Lewis, 1992], among which we have opted for the most general and difficult one. The first 21,450 news stories are used for training, and the last 723 are kept for testing. We summarize significant differences between the test and training sets in Table 2. These differences can bring noise into categorization, because training relies on the similarity between training and test documents. Nevertheless, this 21,450/723 partition has been used before [Lewis, 1992; Hayes and Weinstein, 1990] and involves the general case of documents with no categories assigned.</Paragraph> <Paragraph position="3"> We have worked with the raw data provided in the Reuters distribution. Control characters, numbers and several separators like "/" have been removed, and categories different from the TOPICS set have been ignored. For disambiguating categories with respect to WordNet senses, we first had to acquire their meaning, which is not always self-evident. This task has been performed by direct examination of training documents.</Paragraph> </Section>
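A minimal sketch of this preprocessing and the 21,450/723 split might be written as below; the regular expressions are our guess at "control characters, numbers and several separators," not the paper's actual cleanup code.

    import re

    TRAIN_SIZE = 21450  # first 21,450 stories for training, remaining 723 for testing

    def clean(text):
        text = re.sub(r"[\x00-\x1f]", " ", text)  # control characters
        text = re.sub(r"\d+", " ", text)          # numbers
        text = re.sub(r"/", " ", text)            # separators such as "/"
        return re.sub(r"\s+", " ", text).strip()

    def partition(stories):
        # stories: list of (text, topics) pairs in original collection order;
        # topics should already be restricted to the TOPICS category set.
        cleaned = [(clean(text), topics) for text, topics in stories]
        return cleaned[:TRAIN_SIZE], cleaned[TRAIN_SIZE:]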
<Section position="4" start_page="41" end_page="42" type="sub_section"> <SectionTitle> 4.3 Results and Interpretation </SectionTitle> <Paragraph position="0"> The results of our first series of experiments are summarized in Table 3. This table shows recall and precision averages calculated both macro- and micro-averaging for a threshold-based assignment strategy.</Paragraph> <Paragraph position="1"> Values for the integrated approach show some general advantage over the WordNet and training approaches, but the results are not decisive. Training results are comparable with those from Lewis [1992], and the WordNet approach is roughly equivalent to the training one.</Paragraph> <Paragraph position="2"> On the one hand, the integrated approach shows better performance than the WordNet one in general, although a precision problem is detected when macro-averaging. The influence of low-precision training has produced this effect. We are planning to strengthen the WordNet influence to overcome this problem. On the other hand, the integrated approach reports better general performance than the training approach.</Paragraph> <Paragraph position="3"> As expected, WordNet and training both beat the direct approach. When comparing the WordNet and training approaches, we observe that the former produces better results for categories of low frequency, while the latter performs better for highly frequent categories. However, both exhibit the same overall behaviour.</Paragraph> <Paragraph position="4"> These per-category differences are explained by the fact that micro-averaging is influenced by highly frequent elements, while macro-averaging depends on the results of many elements of low frequency.</Paragraph> </Section> </Section> </Paper>