<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1116"> <Title>A Maximum Entropy Approach to HowNet-Based Chinese Word Sense Disambiguation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. An Introduction to HowNet </SectionTitle> <Paragraph position="0"> HowNet is a bilingual general knowledge base that encodes inter-concept and inter-attribute semantic relations. In contrast to WordNet (Miller, 1990), HowNet adopts a constructive approach to meaning representation (Miller, 1993). Basic meaning units called sememes, which cannot be decomposed further, combine to construct concepts in HowNet. So far, there are 65,000 Chinese concepts and 75,000 English equivalents defined with a set of 1503 sememes.</Paragraph> <Paragraph position="1"> NO.=the record number of the lexical entry
W_X=concept of language X
E_X=example of W_X
G_X=part-of-speech of W_X
DEF=definition, which is constructed from sememes and shared across languages; each language has its three specific items: W_X, E_X and G_X.
The current version of HowNet has entries in two languages (Chinese and English), with the possibility of extending it to other languages. Therefore, W_C, E_C and G_C are the entries for the words, the examples and the parts-of-speech respectively in Chinese, whereas W_E, E_E and G_E are the corresponding entries for English.</Paragraph> <Paragraph position="3"> entered in HowNet. As mentioned in Miller (1993), the definition of a common noun typically consists of (i) its immediate superordinate term and (ii) some distinguishing features. HowNet represents this with pointers3 and the order of the sememes in concept definitions. In the example above, the sememe appearing in the first position, 'human|a14 ', is called the categorical attribute. It names the hypernym or superordinate term, which gives a general classification of the concept. 
The sememes appearing in other positions: 'occupation|a15a17a16 ', 'gather|a18a20a19 ', 'compile|a21a20a22 ', 'news|a23a25a24 ' are additional attributes, which provide more specific, distinguishing features. Two types of pointers are used in this concept. The pointer &quot;#&quot; means &quot;related&quot; and thus '#occupation|a15a26a16 ' shows that there is a relation between the word &quot;journalist&quot; and occupations. The pointer &quot;*&quot; means 'agent', and thus, '*gather|a18a27a19 ' and '*compile|a21a20a22 ' tell us that &quot;journalist&quot; is the agent of 'gather|a18a17a19 ' and 'compile|a21a17a22 '. The sememe '#news|a23a28a24 ' that follows tells us that the function of &quot;journalist&quot; is to compile and gather news.</Paragraph> </Section> <Section position="4" start_page="0" end_page="8" type="metho"> <SectionTitle> 3 The function of pointers is to describe various </SectionTitle> <Paragraph position="0"> inter-concept and inter-attribute relations. Please refer to HowNet's homepage (http://www.keenage.com) or Gan and Wong (2000) for details.</Paragraph> <Section position="1" start_page="0" end_page="8" type="sub_section"> <SectionTitle> 2.1. Classification of content words </SectionTitle> <Paragraph position="0"> Concepts of content words in HowNet are classified into six categories: Entity, Event, Attribute, Quantity, Attribute Value and Quantity value. The sememes in each category are organized hierarchically in an ontology tree. The six categories can be grouped into four main types: (i) Entity, (ii) Event, (iii) Attribute and Quantity, (iv) Attribute Value and Quantity Value. Most nominal concepts, such as &quot;journalist&quot;, belong to the Entity category and some of them belong to the Attribute category. Verbal concepts always belong to the Event category whereas adjectives are Attribute Values.</Paragraph> <Paragraph position="1"> 2.1.1. 
Convention of meaning representation of content words The first sememe in concept definitions indicates which of the four categories the concept belongs to, and it is therefore called the categorical attribute. For Attribute, Quantity, Attribute Value and Quantity Value, the first sememe clearly names the category, as illustrated in (iii) and (iv) of Table 1. Table 2 shows an example entry: the category of &quot;a29a31a30 &quot; (brightness) is indicated by the first sememe 'Attribute|a32a34a33 '. The second sememe is a node in the hierarchy of Attribute or Quantity that names the subcategory. For example, 'brightness| a35a37a36 ' is a node under the ontological hierarchy of 'Attribute|a32a38a33 ' 4 , and can be viewed as a subcategory of Attribute.</Paragraph> <Paragraph position="2"> For the categories of Entity and Event, it is not necessary to name the main categories, because this information is conveyed by their subcategories. Table 3 shows two examples. The first sememe of &quot;a23a25a24 &quot; (letter paper) is 'paper|a26a25a27 ', a node in the Entity hierarchy whose function is to indicate the subcategory of Entity. 'a28a30a29 |SetAside', as the first sememe of the concept &quot;a29a32a31 &quot; (deposit money), names the subcategory of Event.</Paragraph> <Paragraph position="3"> The categories of Attribute and Attribute Value share parallel subcategories. As an example, Table</Paragraph> <Paragraph position="5"> It is optional in some cases.</Paragraph> <Paragraph position="6"> identify only the subcategory when dealing with Attributes or Attribute Values. That is why these two categories (along with Quantity and Quantity Value) use the first two sememes for the subcategorization of concepts, whereas Entity and Event can achieve this by using the first sememe only. 
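The convention just described can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not HowNet's actual file syntax: a concept DEF is given as a Python list of sememes (English mnemonics only, pointer symbols such as '#' and '*' attached), and the set of category-naming sememes is reduced to the four names used in this paper.

```python
# Hypothetical data format: a DEF is a list of sememe strings.
# Entity/Event concepts: the first sememe alone is the categorical attribute.
# Attribute, Quantity, Attribute Value (aValue) and Quantity Value (qValue)
# concepts: the first sememe names the category, so the first two sememes
# together form the categorical attribute.
CATEGORY_NAMING = {"Attribute", "Quantity", "aValue", "qValue"}

def categorical_attribute(sememes):
    first = sememes[0].lstrip("#*&$")  # drop pointer symbols such as '#' or '*'
    if first in CATEGORY_NAMING:
        return tuple(sememes[:2])   # category + subcategory
    return (sememes[0],)            # subcategory alone identifies the category

# "journalist" is an Entity concept: the first sememe suffices.
print(categorical_attribute(["human", "#occupation", "*gather", "*compile", "#news"]))
# "brightness" is an Attribute concept: the first two sememes are needed.
print(categorical_attribute(["Attribute", "brightness", "&physical"]))
```

The helper names and the reduced sememe inventory are invented for the sketch; the real HowNet DEF strings also carry the Chinese half of each sememe.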
We call such types of sememes &quot;categorical attributes&quot;.</Paragraph> </Section> <Section position="2" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 2.2. Function Words </SectionTitle> <Paragraph position="0"> Unlike WordNet, HowNet has a sense inventory for function words, and thus our WSD system includes both content words and function words.</Paragraph> <Paragraph position="1"> For function words such as prepositions, pronouns and conjunctions, the sememes in the definitions are marked by curly brackets in order to distinguish senses of function words from those of content words. For example, the pronoun &quot;a51 &quot; (he) is defined as {ThirdPerson|a51 ,male|a52 }.</Paragraph> </Section> </Section> <Section position="5" start_page="8" end_page="8" type="metho"> <SectionTitle> 3. Task Description </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 3.1. Preprocessing of the corpus </SectionTitle> <Paragraph position="0"> The HowNet corpus is written in XML format, and contains the part-of-speech, sense and semantic dependency relation information for each word.</Paragraph> <Paragraph position="1"> There are 30,976 word tokens and 3,178 sentences 9 in the HowNet corpus, which is divided into two sets in the experiment: 2,400 sentences (23,191 word tokens) are reserved for training, and 778 sentences (7,785 word tokens) for testing. Since off-the-shelf software systems usually have a default cut-off value that may not be appropriate for such a small corpus, we create a larger corpus by concatenating 3 copies of the training data. As a result, the final training corpus consists of 7,200 sentences (69,573 words).</Paragraph> </Section> <Section position="2" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 3.2. 
Experiments </SectionTitle> <Paragraph position="0"> The goal of this work is to investigate the possibility of applying standard POS taggers to identify word sense tags. For this work, an off-the-shelf maximum entropy tagger 10 (Ratnaparkhi, 1996) was used. Each word is therefore tagged with a sememe (its categorical attribute), which the tagger treats exactly as it would a POS tag; a sense tag dictionary is generated from the training data. In the following subsections, we will first explain the semantic tags used in the current research, their limitations and a suggestion for resolving the problem, and then illustrate how to build the tag dictionary for the MaxEnt sense tagger.</Paragraph> <Paragraph position="1"> 3.2.2. Using categorical attributes as semantic tags As illustrated in section 2, there are about 65,000 concepts in the HowNet dictionary, defined by 17216 sense definitions. The number of definitions will continue to increase in the future, but the closed set of 1503 sememes is not likely to expand. Definitions are represented by a sequence of sememes in HowNet. It is possible to use the whole sequence of sememes as a semantic tag, but the complexity can be greatly reduced by using the 1503 sememes as semantic tags.</Paragraph> <Paragraph position="2"> As illustrated earlier, in HowNet, the category for a particular word concept is determined by the first sememe (for Entities and Events) or the first two sememes (for Attributes, Quantities or Attribute Values). These sememes are thus referred to as categorical attributes. On observation, it became apparent that just picking the categorical attribute would be enough to differentiate one sense from another. For example, none of the 27 senses of the polysemous word &quot;a0 &quot; (hit) in Chinese share the same first sememe.</Paragraph> <Paragraph position="3"> Using sememes as semantic tags has an advantage over using a simple sense id. 
Assigning a sense id such as a0 1, a0 2, ..., a0 27 to each sense of the word &quot;a0 &quot; can distinguish different senses but will not give us any idea of the meanings of the ambiguous word. Sememes convey meanings while helping to differentiate senses. For example, the first sense is 'associate|a1a3a2 ', which indicates an association with friends or partners. The second sense is 'build|a4a6a5 ', which is self-explanatory.</Paragraph> <Paragraph position="4"> 3.2.3. Limitation of the semantic tags There is a limitation to this strategy. It is found that this strategy can discriminate the senses of about 90% of the words in the corpus. The remaining 10% of the words are still ambiguous. Table 5 shows the senses of the word &quot;a7 &quot; (one). Since all the senses are of the Quantity (qValue|a8a6a9a6a10 ) and Attribute Value (aValue|a11a13a12a14a10 ) types, the categorical attribute is defined as the first two sememes. However, there is still ambiguity to be resolved for two of the senses.</Paragraph> <Paragraph position="6"> 3.2.4. Mapping categorical attributes to sense definitions In this work, the ambiguity problem is solved by building a mapping table which maps (word ; categorical attribute) pairs to sense definitions. First a frequency table is built, which records how often a (word ; categorical attribute) pair is mapped to each sense in the training corpus. Table 5 shows the categorical attributes for the word &quot;a7 &quot; (one). The</Paragraph> <Paragraph position="8"> times. In this work, we simply disregard the second sense in this situation, and assume that when the word &quot;a7 &quot; (one) is tagged with the categorical attribute 'qValue|a8a6a9a6a10 ,amount|a37a3a38 ', it corresponds to the 'qValue|a8a14a9a14a10 ,amount|a37 a38 ,cardinal|a40 ' sense in all contexts. 
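The frequency-based mapping table described above can be sketched as follows. This is a toy illustration with invented sense strings, not the authors' code or the real corpus; it shows only the mechanism of counting (word, categorical attribute, sense) triples and keeping the most frequent sense for each pair, disregarding rarer senses as the paper does.

```python
from collections import Counter, defaultdict

def build_mapping_table(tagged_corpus):
    """tagged_corpus: iterable of (word, categorical_attribute, sense_definition)."""
    counts = defaultdict(Counter)
    for word, cat_attr, sense in tagged_corpus:
        counts[(word, cat_attr)][sense] += 1
    # Keep only the most frequent sense for each (word, cat_attr) pair.
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}

# Toy data (sense strings are invented for illustration).
toy_corpus = [
    ("one", "qValue,amount", "qValue,amount,cardinal"),
    ("one", "qValue,amount", "qValue,amount,cardinal"),
    ("one", "qValue,amount", "qValue,amount,single"),    # rarer sense, dropped
    ("one", "aValue,frequency", "aValue,frequency,once"),
]
table = build_mapping_table(toy_corpus)
print(table[("one", "qValue,amount")])  # most frequent sense wins
```

For pairs with only one observed sense, the count is trivially one-to-one, matching the paper's remark that frequency information is not needed in those cases.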
There is a direct one-to-one mapping of the categorical attributes to the 3rd, 4th and 5th senses, so frequency information is not needed for them.</Paragraph> <Paragraph position="9"> 3.2.5. Sense tag dictionary for the MaxEnt tagger Section 3.2.4 illustrates the mapping of a sense tag to a sense definition, and this section will briefly describe the building of the tag dictionary. There are two sources for the sense tag dictionary: one is the training corpus and the other is the HowNet dictionary. The MaxEnt tagger automatically creates a tag dictionary from the training corpus. By default, this dictionary only includes words that appear more than four times in the training corpus (753 word types in total). 11 The other source is the HowNet dictionary, which provides semantic tags for 51275 word types. The two sources of information are combined in the sense tag dictionary for the maximum entropy tagger.</Paragraph> </Section> <Section position="3" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 3.3. Testing results </SectionTitle> <Paragraph position="0"> The input to the testing component is the testing corpus, which is already segmented. The output is the most likely sense of each word, as given by the WSD systems.</Paragraph> <Paragraph position="1"> 3.3.1. Baseline system As a baseline system, the most frequent sense (MFS) of a word is chosen as the correct sense. The frequency of word senses is calculated from the occurrences of the word senses in the training corpus, with ties broken randomly. For all instances of unknown words, the baseline system simply tags them with the most frequent sense for rare words (that is, 'human|a0 ,ProperName|a1 ' as shown in Table 7).</Paragraph> <Paragraph position="2"> 3.3.2. Maximum entropy The model first checks if the word in context can be found in the HowNet dictionary. 
If the word has only one sense in the dictionary, there is no need to perform disambiguation and the system returns this sense as the answer.</Paragraph> <Paragraph position="3"> For words with more than one sense, the maximum entropy model chooses one categorical attribute from the closed set of sememes. The categorical attribute is then mapped to the corresponding sense according to the mapping table.</Paragraph> <Paragraph position="4"> 11 Words occurring fewer than 5 times in the training corpus are treated as rare words. The tagging of rare words is illustrated in section 3.3.</Paragraph> <Paragraph position="5"> Table 6 shows the results for both the baseline and the maximum entropy system. It can be seen that the MaxEnt tagger achieves an accuracy of 88.94%, which outperforms the baseline system. An upper bound can also be calculated by imagining that we could employ an oracle system that would indicate, for each ambiguous semantic tag (described in Section 3.2.4), the correct sense of the word. In that case, the performance of the maximum entropy tagger would improve to 89.73%.</Paragraph> <Paragraph position="6"> Even though the maximum entropy tagger does not appear to outperform the baseline system by much, it should be noted that the nature of the corpus makes the task simple for the baseline system. Since the corpus is composed of a collection of news stories, certain senses of polysemous words tend to appear more often in the corpus; indeed, it was observed that more than half of the word tokens appearing in the training and testing corpus have only one sense. The average number of senses per word token is 1.14 and 1.09 in the training and the testing sets, respectively. 
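The decision procedure of section 3.3.2 can be sketched as follows. All interfaces here are hypothetical stand-ins, not the authors' implementation: `sense_dict` plays the role of the HowNet dictionary lookup, `tagger` stands in for the trained MaxEnt tagger, and `mapping_table` is the table of section 3.2.4.

```python
# Hypothetical interfaces: sense_dict maps word -> list of sense definitions,
# tagger(word, context) -> chosen categorical attribute,
# mapping_table maps (word, categorical attribute) -> full sense definition.
def tag_word(word, context, sense_dict, tagger, mapping_table):
    senses = sense_dict.get(word, [])
    if len(senses) == 1:
        return senses[0]              # monosemous: no disambiguation needed
    cat_attr = tagger(word, context)  # MaxEnt chooses a categorical attribute
    # Map (word, categorical attribute) back to a full sense definition;
    # fall back to the categorical attribute itself if no mapping exists.
    return mapping_table.get((word, cat_attr), cat_attr)

# Toy usage with stub components (all data invented for illustration).
sense_dict = {
    "letter-paper": ["paper"],
    "one": ["qValue,amount,cardinal", "aValue,frequency,once"],
}
mapping_table = {("one", "qValue,amount"): "qValue,amount,cardinal"}
stub_tagger = lambda word, context: "qValue,amount"
print(tag_word("letter-paper", [], sense_dict, stub_tagger, mapping_table))
print(tag_word("one", ["book"], sense_dict, stub_tagger, mapping_table))
```

The short-circuit for monosemous words matters in practice here, since more than half of the word tokens in this corpus have only one sense.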
However, it should be noted that the MaxEnt model performs much better on polysemous words and unknown words, which bodes well for using the MaxEnt model with more diverse corpora.</Paragraph> <Paragraph position="7"> One of the strengths of maximum entropy lies in its ability to use contextual information to disambiguate polysemous words and predict the senses of unknown words. The following shows an unknown word &quot; a0a2a1 &quot; with its context information: The MaxEnt tagger defines a set of feature patterns including the previous word, the next word, the previous tag, and the prefix and suffix of the current word. In this example, the features extracted from the context are shown above. Accordingly, the MaxEnt tagger predicts 'qValue|a28a15a29a15a30 ,sequence| a17a15a18 ' as the most likely sense tag for the word &quot;a31</Paragraph> </Section> </Section> <Section position="6" start_page="8" end_page="8" type="metho"> <SectionTitle> 4. Previous Work </SectionTitle> <Paragraph position="0"> To our knowledge, there are three previous studies of word sense disambiguation using HowNet. Yang et al. (2000) pioneered this work by using sememe co-occurrence information in sentences from a large corpus to achieve an accuracy of 71%. Yang and Li (2002), collecting sememe co-occurrence information from a large corpus, converted the information into restricted rules for sense disambiguation. They reported precision rates of 92% and 82% for lexical disambiguation and structural disambiguation, respectively.</Paragraph> <Paragraph position="1"> Wang (2002) pioneered the work of sense pruning using the hand-coded knowledge base of HowNet.</Paragraph> <Paragraph position="2"> Unlike sense disambiguation, sense pruning seeks to narrow down the possible senses of a word in a text. 
Using databases of features such as information structure and object-attribute relations, which were compiled from HowNet, Wang reported a recall rate of 97.13% and a per-sentence complexity reduction rate of 47.63%.</Paragraph> <Paragraph position="3"> The current study and Wang (2002) both used the sense-tagged HowNet corpus, but with different approaches.</Paragraph> <Paragraph position="4"> There is one similarity between our work and Wang (2002), though. Wang applied a sense pruning method to reduce the complexity of word senses. The strategy of the current study reduces the complexity of sense tagging by using the categorical attributes (the first or the first two sememes) as semantic tags. About 10% of the words are still ambiguous, but the ambiguity can be reduced in future studies that extend tagging to the sememes in the third and subsequent positions of concept definitions. It would also be interesting to see if the ambiguity can be resolved by integrating a diverse set of knowledge sources, such as the HowNet knowledge bases, the sememe co-occurrence database and the tagged corpus.</Paragraph> </Section> <Section position="7" start_page="8" end_page="8" type="metho"> <SectionTitle> 5. Conclusion </SectionTitle> <Paragraph position="0"> This paper has presented a maximum entropy method for performing word sense disambiguation in Chinese with HowNet senses. The closed set of sememes is treated as a set of semantic tags, making the model analogous to a part-of-speech tagger. Our system performs better than the baseline system that chooses the most frequent sense. Our strategy of sememe tagging reduces the complexity of semantic tagging in spite of some limitations.</Paragraph> <Paragraph position="1"> Some possible ways to resolve the limitations are also suggested in the paper. 
Unlike the work of Yang et al. (2000) and Wang (2002), which applied unsupervised methods using sense definitions in HowNet, this paper is the first study to use a supervised learning method, made possible by the availability of the HowNet sense-tagged corpus. Much research remains to be done on the corpus and the HowNet knowledge base to further improve performance on the WSD task.</Paragraph> </Section> <Section position="8" start_page="8" end_page="8" type="metho"> <SectionTitle> 6. Acknowledgement </SectionTitle> <Paragraph position="0"> Our thanks go to Dr. Grace Ngai for her helpful comments. This work was supported and funded</Paragraph> </Section> </Paper>