<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1098"> <Title>Combining a Chinese Thesaurus with a Chinese Dictionary</Title> <Section position="3" start_page="600" end_page="601" type="metho"> <SectionTitle> 2. Problem Setting </SectionTitle> <Paragraph position="0"> The Chinese dictionary provides sense distinctions for 44,389 Chinese words. The Chinese thesaurus, on the other hand, divides 64,500 word entries into 12 major, 94 medium and 1,428 minor categories, which is in effect a semantic classification of the words 1. Intuitively, there should be a correspondence between the senses and the entries. The main task in combining the two resources is to locate this correspondence.</Paragraph> <Paragraph position="1"> Suppose $X$ is a category 2 in the thesaurus. For any word $w \in X$, let $S_w$ be the set of its senses in the dictionary, and $S_X = \bigcup_{w \in X} S_w$. For any $s \in S_X$, let $DW_s$ be the set of the definition words in its definition, $DW_w = \bigcup_{s \in S_w} DW_s$, and $DW_X = \bigcup_{w \in X} DW_w$. For any word $w$, let $CODE(w)$ be the set of its semantic codes that are given in the thesaurus 3.</Paragraph> <Paragraph position="3"> 1 The electronic versions of the two resources we use now contain only part of the words in them; see section 4.</Paragraph> <Paragraph position="4"> 2 We generally use &quot;category&quot; to refer to minor categories in the following text, if no confusion is involved. Furthermore, we also use a semantic code to refer to a category.</Paragraph> <Paragraph position="5"> 3 A category is given a semantic code; a word may belong to several categories and hold several codes.</Paragraph> <Paragraph position="6"> For a semantic code $c$, we define its definition salience with respect to $X$ in 1).</Paragraph> <Paragraph position="7"> 1) $Sal_1(c, X) = \frac{|\{w \mid w \in X,\ c \in CODE_w\}|}{|X|}$ For example, 2) lists a category Ea02 in the thesaurus, whose members are the synonyms or antonyms of the word 高大 (/gaoda/; high and big) 4.
2) [Chinese member words of Ea02; characters lost in extraction]</Paragraph> <Paragraph position="8"> 3) lists some semantic codes and their definition salience with respect to the category.</Paragraph> <Paragraph position="10"> To define a distance between a category $X$ and a sense $s$, we first define a distance between any two categories according to the distribution of their member words in a corpus of 80 million Chinese characters.</Paragraph> <Paragraph position="11"> For any category $X$, suppose its members are $w_1, w_2, \ldots, w_n$. For each $w_i$, we first compute its mutual information with each semantic code according to their co-occurrence in the corpus 5, then select as its environmental codes 6 the 10 semantic codes that hold the highest mutual information with $w_i$. Let $NC_i$ be the set of $w_i$'s environmental codes and $C_T$ be the set of all the semantic codes given in the thesaurus. For any $c \in C_T$, we define its context salience with respect to $X$ in 4).</Paragraph> </Section> <Section position="4" start_page="601" end_page="601" type="metho"> <SectionTitle> 4) $Sal_2(c, X) =$ </SectionTitle> <Paragraph position="0"> 4 &quot;/gaoda/&quot; is the Pinyin of the word, and &quot;high and big&quot; is its English translation.</Paragraph> <Paragraph position="1"> 5 We treat each occurrence of a word in the corpus as one occurrence of each of its codes. A word and a code count as co-occurring only when they fall within a 5-word distance of each other.</Paragraph> <Paragraph position="2"> 6 The intuition behind the parameter selection (10) is that the words which can be combined with a specific word to form collocations fall in at most 10 categories in the thesaurus. We build a context vector for $X$ in 5), where $k = |C_T|$.</Paragraph> <Paragraph position="3"> 5) $CV_X = \langle Sal_2(c_1, X), Sal_2(c_2, X), \ldots, Sal_2(c_k, X) \rangle$ Given two categories $X$ and $Y$ with context vectors $CV_X$ and $CV_Y$ respectively, we define their distance $dis(X, Y)$ in 6), based on the cosine of the two vectors.</Paragraph> <Paragraph position="4"> 6) $dis(X, Y) = 1 - \cos(CV_X, CV_Y)$ Let $c \in CODE_X$; we define a distance between $c$ and a sense $s$ in 7).</Paragraph> <Paragraph position="5"> 7) $dis(c, s) = \min_{c' \in CODE_s} dis(c, c')$ Now we define a distance between a category $X$ and a sense $s$ in 8).</Paragraph> <Paragraph position="7"> Intuitively, if $CODE_s$ contains the codes that are salient with respect to $X$, i.e., those with higher salience with respect to $X$, $dis(X, s)$ will be smaller, because the contribution of a semantic code to the distance increases with its salience; $s$ then tends to be a correct sense tag of some word.</Paragraph> <Paragraph position="8"> For any category $X$, let $w \in X$ and $s \in S_w$. If $dis(X, s) &lt; T$, where $T$ is some threshold, we tag $w$ with $s$ and assign the semantic code $X$ to $s$.</Paragraph> </Section>
<Section position="5" start_page="601" end_page="602" type="metho"> <SectionTitle> 3. Parameter Estimation </SectionTitle> <Paragraph position="0"> Now we consider the problem of estimating an appropriate threshold for $dis(X, s)$ to distinguish between the senses of the words in $X$. To do so, we first extract the words which hold only one code in the thesaurus and have only one sense in the dictionary 7, then check the distances between these senses and categories. The number of such words is 22,028.</Paragraph> <Paragraph position="1"> 7 This means that the words are regarded as univocal by both resources.</Paragraph> <Paragraph position="2"> Tab. 1 lists the distribution of the words with respect to the distance in 5 intervals.</Paragraph> <Paragraph position="3"> Tab. 1.
The distribution of univocal words with respect to $dis(X, s)$. From Tab. 1, we can see that for most univocal words, the distance between their senses and categories lies in $[0, 0.4]$.</Paragraph> <Paragraph position="4"> Let $W_U$ be the set of the univocal words we consider here. For any univocal word $w \in W_U$, let $s_w$ be its unique sense and $X_w$ be its unique category. We define $DEN_{[t_1, t_2]}$, the point density in the interval $[t_1, t_2]$, in 9), where $0 \le t_1 &lt; t_2 \le 1$.</Paragraph> <Paragraph position="6"> We define 10) as an objective function, and take the $t^*$ that maximizes it as the threshold.</Paragraph> <Paragraph position="8"> The objective function is built on the following inference. For the words regarded as univocal by both Chinese resources, the two resources tend to agree in how they explain them. This means that for most univocal words, their senses should be the correct tags of their entries; that is, the distance between their categories and senses should be small, falling within the as yet unspecified threshold. So it is reasonable to suppose that the intervals within the threshold hold a higher point density, and furthermore that the difference between the point density in $[0, t^*]$ and that in $[t^*, 1]$ takes its largest value.</Paragraph> <Paragraph position="9"> With $t$ ranging over its value set $\{dis(X, s)\}$, we get $t^* = 0.384$. At this threshold, the unique entries of 18,653 (84.68%) of the univocal words are tagged with their unique senses, while the entries of the remaining univocal words are not.</Paragraph> </Section> </Paper>