File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/w00-1209_metho.xml
Size: 8,493 bytes
Last Modified: 2025-10-06 14:07:29
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1209"> <Title>The Research of Word Sense Disambiguation Method Based on Co-occurrence Frequency of Hownet*</Title> <Section position="4" start_page="0" end_page="60" type="metho"> <SectionTitle> 2. A Brief Introduction Of Hownet </SectionTitle> <Paragraph position="0"> Hownet is a knowledge base that was recently released on the Internet. Hownet describes the concepts represented by Chinese or English words and reveals the relations between concepts and between the attributes of concepts. In this paper, we use the Chinese knowledge base, which is an important part of Hownet, as the resource for our disambiguation.</Paragraph> <Paragraph position="1"> The format of this file is as follows: W_X = word; E_X = some examples of this word; G_X = the part of speech of this word; DEF = the definition of this word. An important concept used in Hownet that we must introduce is the sememe. In Hownet, sememes are the basic units of sense. (* This research project is supported by a grant from the Shanxi Natural Science Foundation of China.)</Paragraph> <Paragraph position="2"> They are used to describe all the entries in Hownet, and there are more than 1,500 sememes altogether.</Paragraph> </Section> <Section position="5" start_page="60" end_page="63" type="metho"> <SectionTitle> 3. Sense Co-occurrence Frequency </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="60" end_page="60" type="sub_section"> <SectionTitle> Database </SectionTitle> <Paragraph position="0"> It is well known that some words tend to co-occur more frequently with certain words than with others [6]. Similarly, some word meanings tend to co-occur more often with certain word meanings than with others. If we can capture the relations between word meanings quantitatively, this should help word sense disambiguation. In Hownet, all words are defined with a limited set of sememes, and the combination of sememes is fixed. 
If we compute statistics on the co-occurrence frequency of sememes so as to reflect the co-occurrence of words, the problem of data sparseness is solved to a large degree. Based on this idea, we built a sense co-occurrence frequency database to disambiguate word senses.</Paragraph> </Section> <Section position="2" start_page="60" end_page="60" type="sub_section"> <SectionTitle> 3.1 The Preprocessing Of Hownet </SectionTitle> <Paragraph position="0"> The Hownet we downloaded from the Internet is in plain-text form. This is not convenient for a computer to use, so it must be converted into a database. In the database, each lexical entry is converted into a record. The formal description of a record is as follows: &lt;lexical entry&gt; ::= &lt;NO.&gt; &lt;morphology&gt; &lt;part-of-speech&gt; &lt;definition&gt;, where NO. is the number of this lexical entry in Hownet and the definition is composed of several sememes (SU for short) separated by commas. In addition, we deleted the English sememes to save space and speed up processing.</Paragraph> <Paragraph position="1"> Here are some examples after preprocessing:</Paragraph> </Section> <Section position="3" start_page="60" end_page="61" type="sub_section"> <SectionTitle> 3.2 The Creation Of Sememe Co-occurrence Frequency Database </SectionTitle> <Paragraph position="0"> The sememe co-occurrence frequency database (SCFD) is the basis of sense disambiguation. We now introduce it briefly.</Paragraph> <Paragraph position="1"> The sememe co-occurrence frequency database is a two-dimensional table. Each cell holds the co-occurrence frequency of a pair of sememes.</Paragraph> <Paragraph position="2"> Before introducing the sememe co-occurrence frequency database, we give the following definition. Definition: suppose word W has m sense items in Hownet, and the corresponding definitions of the sense items are $y_{11}, y_{12}, \ldots, y_{1(n_1)}$; $y_{21}, y_{22}, \ldots, y_{2(n_2)}$; ...; $y_{m1}, y_{m2}, \ldots, y_{m(n_m)}$ respectively. We call $\{y_{i1}, y_{i2}, \ldots, y_{i(n_i)}\}$ a sememe set of W (SS for short), and call $\{\{y_{11}, y_{12}, \ldots, y_{1(n_1)}\}, \{y_{21}, y_{22}, \ldots, y_{2(n_2)}\}, \ldots, \{y_{m1}, y_{m2}, \ldots, y_{m(n_m)}\}\}$ the sememe expansion of W (SE for short).</Paragraph> <Paragraph position="3"> For example, in the above-mentioned example, the word &quot;~fl'&quot; has only one sense item. The corresponding sememe set of this sense item is {\]~'\]~i,~.l.l:,~,~} and the sememe expansion of &quot;~1&quot;&quot; is {()~'l~i, ~.1.1:,@, ~ } } . The word &quot;~&quot; has four sense items, and the corresponding sememe sets of the items are {)~i~,~,~,~}, {~.~-}, {~ } and { ~3~,,'~ } respectively. The sememe expansion of the word &quot;~&quot; is {{)~'l~,~ff~; ~,~}, {~:}, {:~}, {C/,,~,:~}}.</Paragraph> <Paragraph position="4"> When building the sememe co-occurrence frequency database, the corpus is first segmented and each word is tagged with its sememe expansion in Hownet. Then, for each unique pair of words co-occurring in a sentence (here a sentence is a string of characters delimited by punctuation), we collect co-occurrence data for the sememes that belong to the definitions of the two words. When collecting co-occurrence data, we adopt the principle that every pair of words co-occurring in a sentence should contribute equally to the sememe co-occurrence data, regardless of the number of sense items of each word and the length of each definition. Moreover, the contribution of a word should be evenly distributed among all of its senses, and the contribution of a sense should be evenly distributed among all of its sememes. The algorithm is as follows: It can be concluded from the above algorithm that the SCFD is symmetric. 
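
The algorithm listing itself is not present in this text, so the following Python fragment is only a minimal sketch of the counting principle described above, not the authors' procedure. The function name build_scfd, the input format, and the placeholder words and sememes ("w1", "A", and so on) are all hypothetical. It assumes that a word pair contributes one unit per direction, split evenly over the senses of each word and then over the sememes of each sense.

```python
from collections import defaultdict
from itertools import combinations, product


def build_scfd(sentences, sememe_expansion):
    """Accumulate a symmetric sememe co-occurrence table.

    sentences: list of word lists (already segmented; assumed input format).
    sememe_expansion: dict mapping a word to a list of non-empty sememe
        tuples, one tuple per sense item.
    """
    scfd = defaultdict(float)
    for words in sentences:
        # Assumption: each unique pair of known words in a sentence counts once.
        known = sorted({w for w in words if w in sememe_expansion})
        for w1, w2 in combinations(known, 2):
            senses1 = sememe_expansion[w1]
            senses2 = sememe_expansion[w2]
            for sense1, sense2 in product(senses1, senses2):
                # Spread the pair's unit contribution evenly over the senses
                # of each word, then over the sememes of each sense.
                weight = 1.0 / (
                    len(senses1) * len(sense1) * len(senses2) * len(sense2)
                )
                for su1, su2 in product(sense1, sense2):
                    # Update both directions so the table stays symmetric.
                    scfd[(su1, su2)] += weight
                    scfd[(su2, su1)] += weight
    return scfd


# Toy data with placeholder names (not real Hownet entries).
expansion = {
    "w1": [("A", "B")],          # one sense with two sememes
    "w2": [("C",), ("D", "E")],  # two senses
}
table = build_scfd([["w1", "w2"]], expansion)
```

Under this weighting every word pair adds exactly 1 to the table in each direction, however many senses or sememes the two words have, so the sum over all cells grows by 2 per word pair.
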
To save space and speed up processing, we store only those cells (SUi, SUj) that satisfy SUi ≤ SUj.</Paragraph> </Section> <Section position="4" start_page="61" end_page="62" type="sub_section"> <SectionTitle> 3.3 The Sememe Co-occurrence Frequency Database Based Disambiguation Method 3.3.1 The Sememe Co-occurrence Frequency Based Scoring Method </SectionTitle> <Paragraph position="0"> When disambiguating a polysemous word, we use the following equation as the score of a sense item of the polysemous word with respect to the context containing this word. The context of a word is the sentence containing it.</Paragraph> <Paragraph position="2"> the corresponding sememe set of S, C' is the set of sememe expansions of the words in C, and GlobalSS is the sememe set containing all of the sememes defined in Hownet.</Paragraph> <Paragraph position="4"> for any sememe set SS and sememe SU'.</Paragraph> <Paragraph position="5"> score(SU, SU') = I(SU, SU') (6) for any sememes SU and SU'.</Paragraph> <Paragraph position="7"> where f(SU, SU') is the co-occurrence frequency of the sememe pair (SU, SU') in the SCFD. For g(SU) and N, we have the following equations:</Paragraph> <Paragraph position="9"> In equation (7), the mutual-information-like measure deviates from the standard mutual-information measure by an extra multiplicative factor N. This is because the corpus is not large enough: without the extra multiplicative factor N, the mutual information of some sememe pairs would be negative. 
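
To make the concern about negative values concrete, here is a small sketch in Python. It computes the standard pointwise mutual information from a symmetric co-occurrence table. It is not equation (7), whose exact form (including where the extra factor N enters) is not available in this text. The function names, the toy table, and the choice of row sums as marginals are assumptions made for illustration.

```python
import math
from collections import defaultdict


def symmetric(pairs):
    """Expand unordered pair counts into a full symmetric table."""
    table = {}
    for (a, b), freq in pairs.items():
        table[(a, b)] = freq
        table[(b, a)] = freq
    return table


def row_sums(table):
    """Marginal count of each sememe, taken here as its row sum (assumption)."""
    g = defaultdict(float)
    for (su, _other), freq in table.items():
        g[su] += freq
    return g


def pmi(table, su, su_prime):
    """Standard pointwise mutual information, log2(p(x, y) / (p(x) p(y)))."""
    f = table.get((su, su_prime), 0.0)
    if f == 0.0:
        return float("-inf")  # the pair was never observed together
    g = row_sums(table)
    total = sum(table.values())
    # With p(x, y) = f / total and p(x) = g(x) / total, the ratio
    # simplifies to f * total / (g(x) * g(y)).
    return math.log2(f * total / (g[su] * g[su_prime]))


# Toy counts with placeholder sememe names (not real Hownet data).
toy = symmetric({
    ("A", "B"): 2.0,
    ("A", "C"): 1.0,
    ("B", "C"): 1.0,
    ("C", "C"): 4.0,
})
high = pmi(toy, "A", "B")  # positive: A and B co-occur more than chance
low = pmi(toy, "A", "C")   # negative: C is frequent but rarely seen with A
```

In this toy table the pair (A, C) gets a negative score even though it was observed, which is the kind of outcome the extra factor N is said to avoid.
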
In equation (9), the sum of f(SU, SU') is divided by 2 because, for each pair of sememes, $\sum_{\forall SU, \forall SU'} f(SU, SU')$ is increased by 2.</Paragraph> <Paragraph position="10"> When disambiguating, we tag the polysemous word W with the sememe T that satisfies the following equation.</Paragraph> <Paragraph position="12"/> </Section> <Section position="5" start_page="62" end_page="63" type="sub_section"> <SectionTitle> 3.3.2 The Creation Of Mutual Information Database </SectionTitle> <Paragraph position="0"> We have created a mutual information database according to equations (7), (8) and (9). Here are some examples. The examples in Table 1 have high mutual information, and the sememe pairs in that table have certain semantic relations. The examples in Table 2 have low mutual information, and the sememe pairs in that table have no obvious semantic relations.</Paragraph> <Paragraph position="1"> Table 1: Examples of sememe pairs which have a high mutual information. Columns: Sememe 1, Sememe 2, Mutual information. [...] the tightness of semantic relations.</Paragraph> </Section> </Section> </Paper>