File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/c96-1069_metho.xml
Size: 17,722 bytes
Last Modified: 2025-10-06 14:14:12
<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1069"> <Title>An Automatic Clustering of Articles Using Dictionary Definitions</Title> <Section position="4" start_page="406" end_page="408" type="metho"> <SectionTitle> 3 Framework </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="406" end_page="406" type="sub_section"> <SectionTitle> 3.1 Word-Sense Disambiguation </SectionTitle> <Paragraph position="0"> Every sense of words in artMes which should be (:lustered is automatically disambiguated in advance. Word-sense dismnl)iguation (WSD in short) is a serious problem for NLP, and a wlri('ty of al)l)roaches have been 1)roposed for solving it (Ih'own, 1991), (Yarowsky, 1992).</Paragraph> <Paragraph position="1"> Our disalnbiguation method is based on Niwa's method which used the similarity 1)etween a sentenee containing a t)olysemous noun and a sen= tence of dictionary-definition. Let x be a t)olysemous noun and a sentence X be</Paragraph> <Paragraph position="3"> Here, Mu(x, y) is the v',due of mutual information proposed by (Church, 1991). oj,...,om (We call them basic words) are selected the 1000th most frequent words in the reference Collins English Dictionary (Lil)erman, 1990).</Paragraph> <Paragraph position="4"> Let word x have senses sl,s2,...,sp and the dictionary-definition of si be Ysi: &quot;&quot;,Y-n,'&quot;,Y-I,Y, Yt,&quot; &quot;',Yn,&quot; &quot;&quot; The similarity of X and }~i is measured t)y the imter l)roduct of their normalised vectors and is detined as follows:</Paragraph> <Paragraph position="6"> We infer that the sense of word x in X is si if Hi're(X, };i) is maximnm alnong t'~ ,...,}~p.</Paragraph> <Paragraph position="7"> Giw:n ml article, the procedure for WSD is applied to each word (noun) in an article, i.e. the sense of each noun is estimated using formula (1) and the word is rel)laced 1)y its sense. Tat)le 1 shows samI)le of the results of our disambiguation nn'thod.</Paragraph> <Paragraph position="8"> Tabh~ 1: The results of the WSD lnethod Input A munber of major aMines adopted continental aMines' .-.</Paragraph> <Paragraph position="9"> Output A number5 of major airlinesl adopted continental2 airlines2 ... _ In Tal)le I, underline signifies polysenmus nolln. '()utlmt.' shows that ea('h noun is rel)laced l)y a syml)ol word which corresl)onds to each sense of a word. We call 'Inlmt' and ~()utput' in Table 1, mt (rriginal artMe and a new artMe, respectively. position iv a sequence.</Paragraph> <Paragraph position="10"> He was lieu (HIe of Ollr nllllll)er.</Paragraph> <Paragraph position="11"> A telel)hOnC numl)er.</Paragraph> <Paragraph position="12"> ~h(&quot; was nuInber seVelt ill tit(, ra(,c. A large nmnber of people.</Paragraph> <Paragraph position="13"> Table 2 shows the definition of hmml)er' in the Collins English, Dictionary. 'numl)erl' ~ 'nuntl)er5' are symbol words and show different senses Of 'llunlber'.</Paragraph> </Section> <Section position="2" start_page="406" end_page="407" type="sub_section"> <SectionTitle> 3.2 Linking Nouns with their Semantically Similar Nouns </SectionTitle> <Paragraph position="0"> Our method for classification of articles uses the results of dismnbiguation method. The problems here are: 1. The frequency of ewwy disambiguated noun in new articles is lower than that of every polysemous noun in oriqinal articles. For exalnple, the frequency of 'nulnber5' in Table 1 is lower than that of 'number 't. Furthermore, some nouns in articles may be semantically similar with each other. 
</Section>
<Section position="2" start_page="406" end_page="407" type="sub_section"> <SectionTitle> 3.2 Linking Nouns with their Semantically Similar Nouns </SectionTitle>
<Paragraph position="0"> Our method for the classification of articles uses the results of the disambiguation method. The problems here are: 1. The frequency of every disambiguated noun in new articles is lower than that of every polysemous noun in original articles; for example, the frequency of 'number5' in Table 1 is lower than that of 'number'1. Furthermore, some nouns in articles may be semantically similar to each other. For example, 'number5' in Table 2 and 'sum4' in Table 3 have almost the same sense. 2. A phrasal lexicon, which Walker suggested in his method, has a negative influence on classification.</Paragraph>
<Paragraph position="1"> 1 If every occurrence of 'number' is used in the 'number5' sense, the frequency of 'number' is the same as that of 'number5'.</Paragraph>
<Paragraph position="2"> Table 3: The definitions of 'sum' in the Collins English Dictionary. sum1: the result of the addition of numbers. sum2: one or more columns or rows of numbers to be added. sum3: the limit of the first n terms of a converging infinite series as n tends to infinity. sum4: "He borrows enormous sums." sum5: the essence or gist of a matter.</Paragraph>
<Paragraph position="3"> In order to cope with these problems, we linked nouns in new articles with their semantically similar nouns. The procedure for linking consists of the following five stages.</Paragraph>
</Section>
<Section position="3" start_page="407" end_page="408" type="sub_section"> <SectionTitle> Stage One: Calculating Mu </SectionTitle>
<Paragraph position="0"> The first stage of linking nouns with their semantically similar nouns is to calculate Mu between each noun pair x and y in the new articles. In order to obtain reliable statistical data, we merged all new articles into one and used the result to calculate Mu. The results are used in the following stages.</Paragraph>
<Paragraph position="1"> Stage Two: Representing every noun as a vector. The goal of this stage is to represent every noun in a new article as a vector. Using a term weighting method, a noun v in a new article is represented by a vector of the form</Paragraph>
<Paragraph position="2"> \vec{v} = (w_1, w_2, \ldots, w_n)   (2)</Paragraph>
<Paragraph position="3"> where w_i corresponds to the weight of the i-th noun co-occurring with v. In our method, the weight w_i is the value of Mu between v and the i-th noun, as calculated in Stage One.</Paragraph>
<Paragraph position="4"> Stage Three: Measuring similarity between vectors. Given the vector representation of nouns in new articles as in formula (2), the dissimilarity between two words (nouns) v_1, v_2 in an article is obtained using formula (3). The dissimilarity measure is the degree of deviation of the group in an n-dimensional Euclidean space, where n is the number of nouns which co-occur with v_1 and v_2, and |v| is the length of the vector v.</Paragraph>
<Paragraph position="5"/>
<Paragraph position="6"> A group with a smaller value of (3) is considered semantically less deviant.</Paragraph>
<Paragraph position="7"> Stage Four: Clustering method. For the set of nouns w_1, w_2, ..., w_n of a new article, we calculate the semantic deviation value of all possible pairs of nouns. Table 4 shows a sample of nouns with their semantic deviation values.</Paragraph>
<Paragraph position="8"> In Table 4, 'BBK' shows the topic of the article as tagged in the WSJ, i.e. 'Buybacks'. The values in Table 4 are the semantic deviation values of pairs of nouns2.</Paragraph>
<Paragraph position="9"> The clustering algorithm is applied to the sets shown in Table 4 and produces a set of semantic clusters, which are ordered in ascending order of their semantic deviation values. We adopted the non-overlapping group average method as our clustering technique (Jardine, 1991). Sample results of the clustering are shown in Table 5.</Paragraph>
<Paragraph position="10"> We selected 49 different articles from the 1988 and 1989 WSJ and applied Stages One ~ Four to them. From these results, we manually selected the clusters which were judged to be semantically similar.</Paragraph>
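<Paragraph> The paper gives no pseudo-code for Stage Four; the following is a minimal sketch of non-overlapping group-average clustering over the pairwise deviation values, under the assumption that deviation is a dict keyed by unordered noun pairs and that merging stops at a caller-chosen deviation threshold. Names are illustrative.

    def pair_key(a, b):
        # Deviation values are stored once per unordered noun pair.
        return (a, b) if a <= b else (b, a)

    def group_average_clusters(nouns, deviation, threshold):
        # Agglomerative, non-overlapping clustering with group-average
        # linkage: repeatedly merge the two clusters whose average pairwise
        # deviation is smallest, stopping once it exceeds the threshold.
        clusters = [[n] for n in nouns]
        while len(clusters) > 1:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    total = sum(deviation[pair_key(a, b)]
                                for a in clusters[i] for b in clusters[j])
                    avg = total / (len(clusters[i]) * len(clusters[j]))
                    if best is None or avg < best[0]:
                        best = (avg, i, j)
            avg, i, j = best
            if avg > threshold:
                break
            clusters[i].extend(clusters[j])
            del clusters[j]
        return clusters
</Paragraph>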
<Paragraph position="11"> For the selected clusters, if there is a noun which belongs to several clusters, these clusters are grouped together. As a result, each cluster is assigned a sequential number. A sample of the results is shown in Table 6.</Paragraph>
<Paragraph position="13"/>
<Paragraph position="14"> 2 In Table 4, there are some nouns which are not assigned a number '1' ~ '5', e.g. 'giorgio', 'di'. This shows that these words have only one meaning in the dictionary.</Paragraph>
<Paragraph position="15"> 'Seq. num' in Table 6 shows the sequential number 'word_1', ..., 'word_m' which is assigned to each group of semantically similar nouns3. Table 6 shows, for example, that 'new2' and 'york2' are semantically similar and form a phrasal lexicon.</Paragraph>
<Paragraph position="16"> 3 In our experiments, m equals 238.</Paragraph>
</Section>
<Section position="4" start_page="408" end_page="408" type="sub_section"> <SectionTitle> 3.3 Clustering of Articles </SectionTitle>
<Paragraph position="0"> According to Table 6, the frequency of every word in the new articles is counted, i.e. if a word in a new article belongs to a group shown in Table 6, the word is replaced by its representative number 'word_i' and the frequency of 'word_i' is counted. For example, 'bank3' and 'banks3' in a new article are both replaced by 'word_i', and the frequency of 'word_i' equals the total frequency of 'bank3' and 'banks3'.</Paragraph>
<Paragraph position="1"> Using a term weighting method, an article is represented by a vector of the form</Paragraph>
<Paragraph position="2"> A_i = (W_1, W_2, \ldots, W_m)   (4)</Paragraph>
<Paragraph position="3"> where W_i corresponds to the weight of the noun group i; the frequency of the noun group is used as its weight.</Paragraph>
<Paragraph position="4"> Given the vector representations of articles as in formula (4), the similarity between A_i and A_j is calculated using formula (1). The greater the value of Sim(A_i, A_j), the more similar the two articles are. The clustering algorithm described in Stage Four is applied to each pair of articles and produces a set of clusters, which are ordered in descending order of their semantic similarity values.</Paragraph>
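<Paragraph> As an illustration of this subsection, the sketch below builds frequency-weighted article vectors over the representative groups and compares them with the normalised inner product of formula (1). The mapping group_of (e.g. from 'bank3' and 'banks3' to their shared label) is an assumed input, not a structure defined in the paper.

    import math
    from collections import Counter

    def article_vector(nouns, group_of):
        # Replace each sense-tagged noun by its representative group label
        # 'word_i' from Table 6; nouns outside any group keep their own label.
        # Raw frequencies serve as the weights W_i of formula (4).
        return Counter(group_of.get(n, n) for n in nouns)

    def article_similarity(a, b):
        # Formula (1) applied to two article vectors: inner product of the
        # normalised frequency vectors.
        dot = sum(c * b.get(t, 0) for t, c in a.items())
        na = math.sqrt(sum(c * c for c in a.values()))
        nb = math.sqrt(sum(c * c for c in b.values()))
        return dot / (na * nb) if na and nb else 0.0
</Paragraph>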
</Section> </Section>
<Section position="5" start_page="408" end_page="409" type="metho"> <SectionTitle> 4 Experiments </SectionTitle>
<Paragraph position="0"> We have conducted four experiments, i.e. 'Freq', 'Dis', 'Link', and 'Method', in order to examine how the WSD method and the linking of words with their semantically similar words (the linking method for short) affect the clustering results. 'Freq' is the frequency-based experiment, i.e. we use word frequency for weighting and use neither the WSD nor the linking method. 'Dis' is the disambiguation-based experiment, i.e. the clustering algorithm is applied to new articles. 'Link' is the linking-based experiment, i.e. we applied the linking method to original articles. 'Method' is our proposed method.</Paragraph>
<Section position="1" start_page="408" end_page="408" type="sub_section"> <SectionTitle> 4.1 Data </SectionTitle>
<Paragraph position="0"> The training corpus we have used is the 1988 and 1989 WSJ in the ACL/DCI CD-ROM, which consists of about 280,000 part-of-speech tagged sentences (Brill, 1992). From this corpus, we selected at random 49 different articles as test data, each of which consists of 3,500 sentences and has a different topic name as tagged in the WSJ. We classified the 49 articles into eight categories, e.g. 'market news', 'food · restaurant', etc. The dictionary we have used is the Collins English Dictionary in the ACL/DCI CD-ROM.</Paragraph>
<Paragraph position="1"> In the WSD method, a co-occurrence of x and y for calculating Mu means that the two words (x, y) appear in the training corpus in this order within a window of 100 words, i.e. x is followed by y within a 100-word distance. This is because larger window sizes can be considered useful for extracting semantic relationships between nouns. The basic words are the 1,000 most frequent words in the reference Collins English Dictionary. The length of a sentence X which contains a polysemous noun, and the length of a sentence of a dictionary definition, are at most 20 words. For each polysemous noun, we selected the first 5 definitions in the dictionary.</Paragraph>
<Paragraph position="2"> In the linking method, the window size for the co-occurrence of x and y when calculating Mu is the same as in the WSD method, i.e. a window of 100 words. We selected 969 ~ 9,128 different (noun, noun) pairs and 377 ~ 1,259 different nouns for each article, on condition that the frequencies and Mu values were not low (f(x, y) >= 5, Mu(x, y) >= 3), to permit a reliable statistical analysis4. As a result of Stage Four, we manually selected the clusters which were judged to be semantically similar, on condition that the threshold value for similarity was 0.475. For the selected clusters, if there is a noun which belongs to several clusters, these clusters are grouped together. As a result, we obtained 238 clusters in all.</Paragraph>
<Paragraph position="3"> 4 Here, f(x, y) is the total number of co-occurrences of the words x and y in this order within a window size of 100 words.</Paragraph>
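<Paragraph> The paper does not spell out its estimator for Mu; the sketch below uses a common relative-frequency normalisation of the Church &amp; Hanks association ratio, log2(P(x, y) / (P(x) P(y))), with f(x, y) counted over a 100-word window and the f(x, y) >= 5, Mu(x, y) >= 3 cut-offs of this section. Treat it as an assumed reading rather than the authors' exact computation.

    import math
    from collections import Counter

    def mutual_information(tokens, window=100, min_freq=5, min_mu=3.0):
        # f(x, y): how often x is followed by y within `window` words.
        n = len(tokens)
        unigram = Counter(tokens)
        pair = Counter()
        for i, x in enumerate(tokens):
            for y in tokens[i + 1:i + 1 + window]:
                pair[(x, y)] += 1
        mu = {}
        for (x, y), f in pair.items():
            if f < min_freq:
                continue
            # Association ratio with simple relative-frequency estimates.
            score = math.log2((f / n) / ((unigram[x] / n) * (unigram[y] / n)))
            if score >= min_mu:
                mu[(x, y)] = score
        return mu
</Paragraph>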
</Section>
<Section position="2" start_page="408" end_page="409" type="sub_section"> <SectionTitle> 4.2 Results of the experiments </SectionTitle>
<Paragraph position="0"> The results are shown in Table 7. In Table 7, 'Article' means the number of articles selected from the test data, and 'Num' means the number of sets for each 'Article', i.e. we selected 10 sets for each 'Article'. 'Freq', 'Link', 'Dis', and 'Method' show the number of sets which are clustered correctly in each experiment.</Paragraph>
<Paragraph position="1"> The sample results of 'Article = 20' for each experiment are shown in Figures 1, 2, 3, and 4, where the X-axis is the similarity value. The abbreviations used in the figures and the corresponding categories are shown in Table 8, e.g. HEA: health care providers, medicine; MTC: medical and biotechnology; CMD: commodity news, farm products.</Paragraph>
</Section> </Section>
<Section position="6" start_page="409" end_page="409" type="metho"> <SectionTitle> 5 Discussion 1. WSD method </SectionTitle>
<Paragraph position="0"> According to Table 7, there are 24 sets which could be clustered correctly in 'Dis', but only 21 sets in 'Freq'. Examining the results shown in Figure 3, 'BVG' and 'HRD' are correctly classified into 'food · restaurant' and 'market news', respectively, whereas the results of 'Freq' (Figure 1) show that they are classified incorrectly. Table 9 shows different senses of words in 'BVG' and 'HRD' which could be discriminated in 'Dis'.</Paragraph>
<Paragraph position="1"> In Table 9, for example, 'security' is highly frequent and is used in the 'being secure' sense in the 'BVG' article, while in 'HRD' it has the 'certificate of creditorship' sense. One possible cause of 'Freq' performing worse than 'Dis' is that such high-frequency polysemous words are not recognised as polysemous in 'Freq'.</Paragraph>
</Section>
<Section position="7" start_page="409" end_page="410" type="metho"> <SectionTitle> 2. Linking method </SectionTitle>
<Paragraph position="0"> As shown in Table 7, there are 23 sets which could be clustered correctly in 'Link', but only 21 sets in 'Freq'. For example, 'ERN' and 'HRD' are both concerned with 'market news'. In Figure 2, they are clustered with a high similarity value (0.943), while in Figure 1 they are not (0.260).</Paragraph>
<Paragraph position="1"> Examining the results, there are 811 nouns in the 'ERN' article and 714 nouns in 'HRD', and these include 'shares', 'stock', and 'share', which are semantically similar. In the linking method, 251 nouns in 'ERN' and 492 nouns in 'HRD' are replaced by representative words. In 'Freq', however, each noun corresponds to a different coordinate and is regarded as having a different meaning. As a result, these topics are clustered with a low similarity value.</Paragraph>
</Section>
<Section position="8" start_page="410" end_page="410" type="metho"> <SectionTitle> 3. Our method </SectionTitle>
<Paragraph position="0"> The results of 'Method' show that 31 out of 40 sets are classified correctly, and the percentage attained was 77.5%, while the 'Freq', 'Link', and 'Dis' experiments attained 52.5%, 57.5%, and 60.0%, respectively. This shows the effectiveness of our method. In Figure 4, the articles are judged to be classified into the eight categories. Examining 'ERN', 'CEO', and 'CMD' in Figure 1, 'CEO' and 'CMD' are grouped together although they belong to different categories. On the other hand, in Figure 3, 'ERN' and 'CEO' are grouped together correctly. Examining the nouns belonging to 'ERN' and 'CEO', the high-frequency polysemous nouns 'plant' (factory and food senses), 'oil' (petroleum and food), 'order' (command and demand), and 'interest' (debt and curiosity) are correctly disambiguated. Furthermore, in Figure 4, 'ERN' and 'CEO' are classified into 'market news', and 'CMD' is classified into 'farm', correctly. For example, 'plant' used in the 'factory' sense is linked with the semantically similar words 'manufacturing', 'factory', 'production', 'job', etc. In a similar way, 'plant' used in the 'food' sense is linked with 'environment' and 'forest'. As a result, the articles are classified correctly.</Paragraph>
<Paragraph position="1"> As shown in Table 7, there are 9 sets which could not be clustered correctly by our method. A possible improvement is to use all the definitions of words in the dictionary. We selected the first 5 definitions in the dictionary for each noun and used them in the experiment; however, the meanings of some words are not included in these selected definitions, which makes it hard to achieve a higher percentage of correct clustering.</Paragraph>
<Paragraph position="2"> Another interesting possibility is to use an alternative weighting policy, such as the widf (weighted inverse document frequency) (Tokunaga, 1994). The widf is reported to have a marked advantage over the idf (inverse document frequency) for the text categorisation task.</Paragraph>
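<Paragraph> For reference, one common formulation of widf weights a term in a document by its in-document frequency divided by the term's total frequency over the whole collection; whether this matches the exact variant intended here is an assumption, so the sketch below is illustrative only.

    from collections import Counter

    def widf_weights(doc_term_freq):
        # doc_term_freq: {doc_id: {term: count}}.
        # Assumed form: widf(d, t) = f(d, t) / sum over documents d' of f(d', t).
        totals = Counter()
        for counts in doc_term_freq.values():
            totals.update(counts)
        return {d: {t: c / totals[t] for t, c in counts.items()}
                for d, counts in doc_term_freq.items()}
</Paragraph>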
</Section> </Paper>