<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1098"> <Title>Combining a Chinese Thesaurus with a Chinese Dictionary</Title> <Section position="2" start_page="0" end_page="600" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Both ((TongYiCi CiLin)) (Mei et al., 1983) and ((XianDai HanYu CiDian)) (1978) are important Chinese resources and have been widely used in various Chinese processing systems (e.g., Zhang et al., 1995). As a thesaurus, ((TongYiCi CiLin)) defines semantic categories for words; however, it does not specify which sense of a polysemous word is involved in a semantic category. On the other hand, ((XianDai HanYu CiDian)) is an ordinary dictionary, which provides definitions of senses but gives no information about their semantic classification.</Paragraph> <Paragraph position="1"> A manual effort has been made to build such a resource for English, namely WordNet, which contains both definition and classification information (Miller et al., 1990), but comparable resources are not available for many other languages, e.g., Chinese. This paper presents an automatic method for combining the Chinese thesaurus with the Chinese dictionary into such a resource, by tagging the entries in the thesaurus with the appropriate senses in the dictionary while assigning the appropriate semantic codes, which stand for semantic categories in the thesaurus, to the senses in the dictionary.</Paragraph> <Paragraph position="2"> D. Yarowsky considered a similar problem: linking the categories in Roget's, an English thesaurus, with the senses in COBUILD, an English dictionary (Yarowsky, 1992). He treats the problem as one of sense disambiguation, taking the definitions in the dictionary as a kind of context in which the headwords occur, and addresses it with a statistical model of Roget's categories trained on a large corpus. 
In our opinion, this method neglects the difference between a word's definitions and its ordinary contexts: definitions generally contain its synonyms, hyponyms, or hypernyms, while ordinary contexts generally contain its collocations. A model trained on ordinary contexts may therefore not be appropriate for disambiguation in definition contexts.</Paragraph> <Paragraph position="3"> A seemingly reasonable approach to the problem would be the common word strategy, which has been extensively studied (e.g., Knight, 1993; Lesk, 1986). For a given category, the solution would be to select those senses whose definitions share the largest number of words with the definitions of the category's other member words. However, the words in a category of the Chinese thesaurus may not be strictly similar, although they are similar to some extent, so their definitions may at most contain some similar words rather than share many identical ones. As a result, the common word strategy may not be appropriate for the problem we study here.</Paragraph> <Paragraph position="4"> In this paper, we extend the common word strategy to a similar word method, based on the intuition that definitions of similar senses generally contain similar words, if not the same ones. Since the words in a category of the thesaurus are similar to some extent, some of their definitions should contain similar words. We regard these words as marks of the category; the correct sense of a word in the category can then be identified by checking whether its definition contains such marks. The key to the method is therefore to determine the marks for a category.</Paragraph> <Paragraph position="5"> Since the marks may be different word tokens, it may be difficult to identify them from their frequencies alone. But because they are similar words, they would belong to the same category in the thesaurus, that is, hold the same semantic code, so we can locate them by checking their semantic codes. 
In the implementation, for any category, we first compute a salience value for each code with respect to it, which in effect provides information about the marks of the category; we then compute the distances between the category and the senses of its member words, which reflect whether their definitions contain the marks and how many; finally, we select as tags those senses whose distances from the category fall within a threshold.</Paragraph> <Paragraph position="6"> The remainder of this paper is organized as follows: in section 2, we give a formal setting of the problem and present the tagging procedure; in section 3, we explore threshold estimation for the distances between senses and categories, based on an analysis of the distances between the senses and categories of univocal words; in section 4, we report our experimental results and their evaluation; in section 5, we discuss our methodology; finally, in section 6, we give some conclusions.</Paragraph> </Section> </Paper>