<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0706">
  <Title>Text Classification Using WordNet Hypernyms</Title>
  <Section position="3" start_page="46" end_page="46" type="metho">
    <SectionTitle>
3. The Hypernym Density Representation
</SectionTitle>
    <Paragraph position="0"> The algorithm for computing hypernym density requires three passes through the corpus.</Paragraph>
    <Paragraph position="1"> a) During the first pass, the Brill tagger \[Brill 92\] assigns a part of speech tag to each word in the corpus.</Paragraph>
    <Paragraph position="2"> b) During the second pass, all nouns and verbs are looked up in WordNet and a global list of all synonym and hypernym synsets is assembled.</Paragraph>
    <Paragraph position="3"> Infrequently occurring synsets are discarded, and those that remain form the feature set. (A synset is defined as infrequent if its frequency of occurrence over the entire corpus is less than 0.05N, where N is the number of documents in the corpus.) c) During the third pass, the density of each synset (defined as the number of occurrences of a synset in the WordNet output divided by the number of words in the document) is computed for each example resulting in a set of numerical feature vectors.</Paragraph>
    <Paragraph position="4"> The calculations of frequency and density are influenced by the value of a parameter h that controls the height of generalization. This parameter can be used to limit the number of steps upward through the hypernym hierarchy for each word. At height h=O only the synsets that contain the words in the corpus will be counted. At height h&gt;O the same synsets will be counted as well as all the hypernym synsets that appear up to h steps above them in the hypernym hierarchy. A special value of h=max is defined as the level in which all hypernym synsets are counted, no matter how far up in the hierarchy they appear.</Paragraph>
    <Paragraph position="5"> In the new representation, each feature represents a set of either nouns or verbs. At h=max, features corresponding to synsets higher up in the hypernym hierarchy represent supersets of the nouns or verbs represented by the less general features. At lower values of h, the nouns and verbs represented by a feature (synset) will be those that map to synsets up to h steps below it in the hypernym hierarchy. The best value of h for a given text classification task will depend on characteristics of the text such as use of terminology, similarity of topics, and breadth of topics. It will also depend on the characteristics of WordNet itself. In general, if the value for h is too small, the learner will be unable to generalize effectively. If the value for h is too large, the learner will suffer from overgeneralization because of the overlap between the features.</Paragraph>
    <Paragraph position="6"> Note that no attempt is made at word sense disambiguation during the computation of hypernym density. Instead all senses returned by WordNet are judged equally likely to be correct, and all of them are included in the feature set. The use of the density measurement is an attempt to capture some measure of relevancy. The learner is aided by the fact that many different but synonymous or hyponymous words will map to common synsets, thus raising the densities of the &amp;quot;more relevant&amp;quot; synsets. In other words, a relatively low value for a feature indicates that little evidence was found for the meaningfulness of that synset to the document.</Paragraph>
    <Paragraph position="7"> (In the \[Rodrfguez et al. 97\] text classification paper, word sense disambiguation was performed by manual inspection. This approach was feasible in the context</Paragraph>
    <Paragraph position="9"> of that study because of the small number of words involved. In the current study, the words number in the tens of thousands, making manual disambiguation unfeasible. Automatic disambiguation is possible and often obtains good results as in \[Yarowski 95\] or \[Li et al. 95\], but this is left as future work.) Clearly the change of representation process leaves a lot of room for inaccuracies to be introduced to the feature set. Some sources of potential error are: a) the tagger, b) the lack of true word sense disambiguation, c) words missing from WordNet, and d) the shallowness of WordNet's semantic hierarchy in some knowledge domains.</Paragraph>
  </Section>
  <Section position="4" start_page="46" end_page="49" type="metho">
    <SectionTitle>
4. Experiments and results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="46" end_page="48" type="sub_section">
      <SectionTitle>
4.1. Accuracy
</SectionTitle>
      <Paragraph position="0"> The new hypernym density representation differs in three important ways from the bag-of-words: a) all words are discarded except nouns and verbs, b) filtered normalized density vectors replace binary vectors, and c) hypernym synsets replace words. To show convincingly that improvements in accuracy are derived solely from the use of synsets, two normalizing experiments were performed using the following representations: a) bag-of-words using only nouns and verbs, and b) filtered normalized density vectors for nouns and verbs.</Paragraph>
      <Paragraph position="1"> The results of these runs were compared to the bag-of-words approach using 10-fold cross-validation (see table 2) and in no case was any statistically significant difference found, leading to the conclusion that any improvements in accuracy derive mainly from the use of hypernyms.</Paragraph>
      <Paragraph position="2"> For the main experiments, average error rates over 10-fold cross-validation were compared for all six classification tasks using hypernym density representations with values of h ranging from 0 to 9 and h--max. For each classification task the same 10 partitions were used on every run so the results could be tested for significance using a paired-t test. Table 3 shows a comparison of three error rates: bag-ofwords, hypemym density with h=max, and finally hypernym density using the best value for h.</Paragraph>
      <Paragraph position="3"> In the case of the Reuters tasks, no improvements over bag-of-words were expected and none were observed. On the other hand, a dramatic reduction in  lO-fold cross-validation for the six data sets in the study. Statistically significant improvements over bag-of-words are shown in Italics.</Paragraph>
      <Paragraph position="4"> the error rate was seen for SONGI (47% drop in number of errors for h=9) and USENETI (34% drop for h=9). For the SONG2 and USENET2 data sets, the use of hypernyms produced error rates comparable to bag-of-words. The discussion of these results is left to section 4.3.</Paragraph>
    </Section>
    <Section position="2" start_page="48" end_page="48" type="sub_section">
      <SectionTitle>
4.2 Comprehensibility
</SectionTitle>
      <Paragraph position="0"> In the machine learning community, increasing weight is given to the idea that classification hypotheses should be comprehensible to the user.</Paragraph>
      <Paragraph position="1"> Rule induction systems like Ripper are known for producing more comprehensible output than, say multi-layer perceptrons. A systematic investigation of the comprehensibility of rules produced using hypernym density versus bag-of-words is beyond the scope of this work. However, we often see evidence of the better comprehensibility of the rules produced from the hypernym density representation. Figure 1 shows a comparison of the rules learned by Ripper on the USENETI data set. The results for both bag-of-words and h=max hypernym density are shown for the same fold of data.</Paragraph>
      <Paragraph position="2"> In the case of hypernyms, Ripper has learned a simple rule saying that if the synset possession has a low density, the document probably belongs in the history  for a document D, (&amp;quot;tax&amp;quot; ~ D &amp; '~istory&amp;quot; C/ D) OR (',\[tax&amp;quot; ~ D &amp; &amp;quot;s&amp;quot; ~ D &amp; &amp;quot;any&amp;quot; C/ D)OR ( tax&amp;quot; ~ D&amp; &amp;quot;is&amp;quot; ~ D&amp; &amp;quot;and&amp;quot; ~ D&amp; &amp;quot;if&amp;quot; ~ D &amp; &amp;quot;roth&amp;quot; ~ D) OR (&amp;quot;century&amp;quot; C/ D) OR (&amp;quot;great&amp;quot; E D) OR (&amp;quot;survey&amp;quot; C/ D) OR Ru/e teared us/rig &amp;quot;war&amp;quot; C/ D) ~ soc.history bag of words  Ripper using hypernym density with h=rravi (top) and bog of words (bottom) on a single fold of the USENETI data. The bottom rule produced twice as many errors on the testing set.</Paragraph>
      <Paragraph position="3"> category. Over the 10 folds of the data, seven folds produced a rule almost identical to the one shown. For the remaining three folds, the possession hypernym appeared along with other synsets in slightly different rules. The hyponyms of possession include words such as ownership, asset, and liability the sorts of words often used during discussions about taxes, but rarely during discussions about history. On the other hand, the rules learned on the bag-of-words data seem less comprehensible: they are more elaborate and less semantically clear. Furthermore, the rules tended to vary widely across the 10 folds, suggesting that they were less robust and more dependent on the specifics of the training data.</Paragraph>
    </Section>
    <Section position="3" start_page="48" end_page="49" type="sub_section">
      <SectionTitle>
4.3 Discussion
</SectionTitle>
      <Paragraph position="0"> Hypernym density has been observed to greatly improve classification accuracy in some cases, while in others the improvements are not particularly spectacular. In the case of the Reuters tasks, the lack of improvement is not a particular worry. It is very unlikely that any change of representation could have improved on the accuracy of bag-of-words for these tasks. But the cases of the SONG2 and USENET2 tasks are worth looking at in more detail.</Paragraph>
      <Paragraph position="1"> In the SONG2 task, the main problem seems to be that the classes (political and religion) are more closely semantically related than their class labels suggest. Visual inspection of these songs revealed that many of the political songs contain statements about religion, make references to religious concepts, or frame their messages in religious terminology.</Paragraph>
      <Paragraph position="2">  This was the source of the higher error rate reported in section 2 when these songs were classified by hand. Inspection of Ripper's output revealed that bag-of-words rules make heavy use of religious words such as Jesus, lord, and soul, while the hypernym density rules at h=max mostly contained highly abstract political synsets such as social group and political unit. It is possible that overgeneralization occurred when subtle differences in religious terminology (for instance between gospel hymns and political parodies of religion) were mapped to common synsets in WordNet.</Paragraph>
      <Paragraph position="3"> In the case of USENET2 the problem is two-fold.</Paragraph>
      <Paragraph position="4"> The classes are semantically closely related (microbiology and neuroscience) and the writers tend to use highly technical terms that are not found in WordNet 1.5. Some examples of missing words include neuroscientist, haemocytometer, HIV, kinase, neurobiology, and retrovirus 3. An attempt was made to add the missing words manually into the WordNet hierarchy, but even then the extended semantic hierarchy was not fine-grained enough to allow meaningful generalizations. Because of the shallowness of the hierarchy, overgeneralization quickly becomes a problem as the height of generalization increases. This is why the best error rate for USENET2 using hypernym density was found at h=2.</Paragraph>
      <Paragraph position="5"> Clearly the change of representation to hypernym density works best only with an appropriate value for the parameter h. We have introduced a new parameter into the learning task that must somehow be set by the user. This is certainly not unheard of in the machine learning community. All currently available machine learning systems contain a large number of parameters. The only difference is that h modifies the feature set rather than the learning algorithm itself. Nevertheless, it is worth addressing the question of how this parameter could be set in practice.</Paragraph>
      <Paragraph position="6"> \[Kohavi &amp; John 95\] describe a &amp;quot;wrapper&amp;quot; method for learning algorithms that automatically selects appropriate parameters. In their system, the set of parameters is treated as a vector space that can be searched for an optimal setting. The sets of parameters are evaluated using 10-fold cross-validation on the training data, and a best-first search strategy is employed to search for the parameter set that minimizes the average error rate. This system</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="49" end_page="49" type="metho">
    <SectionTitle>
3 Some of these terms do appear in WordNet 1.6
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> could easily be adapted to include a parameter such as h that modifies the feature set. Indeed \[Kohavi &amp; .Iohn 97\] have already extended their method to the related problem of finding optimal feature subsets for learning.</Paragraph>
  </Section>
class="xml-element"></Paper>