<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1087"> <Title>A Probabilistic Approach to Compound Noun Indexing in Korean Texts</Title> <Section position="4" start_page="514" end_page="525" type="metho"> <SectionTitle> 3 Probabilistic Compound Noun </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="514" end_page="525" type="sub_section"> <SectionTitle> Indexing </SectionTitle> <Paragraph position="0"> In this section, we describe the algorithm to recognize and evaluate candidate index terms from compound nouns. Figure 1 summarizes the algorithm. The tokenizer produces a list of simple and compound nouns by utilizing the noun dictionary and the basic stemming rules. The noun dictionary is used to identify whether a noun is simple or compound, and the basic stemming rules are used to differentiate nominal words from others such as function words and verbs. The noun dictionary is automatically constructed from the observation of the document set. The compound noun analyzer investigates whether the components of compound nouns are appropriate as indexes. The index terms that include simple nouns produced as a result of compound noun analysis are weighted, which finishes the indexing.</Paragraph> <Paragraph position="1"> Let S and C denote the sets of simple and compound nouns, respectively. Simple nouns are, by definition, those that do not have any of their substrings as a noun according to the dictionary.</Paragraph> <Paragraph position="2"> Compound nouns are those one or more substrings of which are recognized as nouns. Let</Paragraph> <Paragraph position="4"> S and C be the sets of all simple and compound nouns of a document set.</Paragraph> <Paragraph position="5"> Also, let D = {D1, D2, ..., DN} be the set of all documents. A document is represented as a list of term-weight (Ti, Wi) pairs.</Paragraph> <Paragraph position="6"> For a compound noun Ci of a document, a decomposition is a sequence of nouns (T1 T2 ... Tk).
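The notion of a decomposition can be made concrete with a short sketch (a hypothetical illustration, not the paper's code): it enumerates every way of splitting a compound noun into nouns found in the dictionary. The dictionary entries are invented Romanized Korean nouns.

```python
# Enumerate all decompositions of a compound noun into dictionary nouns.
# NOUN_DICT and the example word are invented for illustration.
NOUN_DICT = {"cengpo", "kemsaek", "cengpokemsaek", "sisutheym"}

def decompositions(word):
    """Return every sequence of dictionary nouns that concatenates to word."""
    if word == "":
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in NOUN_DICT:
            for rest in decompositions(word[i:]):
                results.append([head] + rest)
    return results

# "cengpokemsaek" (information retrieval) has two decompositions:
print(decompositions("cengpokemsaek"))
# [['cengpo', 'kemsaek'], ['cengpokemsaek']]
```

A real tokenizer would memoize this recursion, but the naive version is enough to show that several decompositions can arise from one compound.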
In many cases, there is more than one decomposition, but only a few of them are sensible with respect to the context of the document. Indiscreet use of the component nouns may bring about an improvement of recall, but can lead to a significant decrease of precision. In the following discussions, we describe the details of the algorithm to select useful component nouns from compound nouns.</Paragraph> </Section> <Section position="2" start_page="525" end_page="525" type="sub_section"> <SectionTitle> 3.1 Dictionary building </SectionTitle> <Paragraph position="0"> It is very difficult to provide an IR system with a sufficient list of nouns. Because the nominals outnumber and grow faster than other categories of words, it is more efficient to handle non-nominal words manually. We consider</Paragraph> <Paragraph position="1"> building the noun dictionary by identifying the remaining string as a noun after eliminating the non-nominal part of a word. The non-nominals are verbs, adverbs, adjectives, prefixes, and suffixes.</Paragraph> <Paragraph position="2"> The words in the non-nominal dictionaries do not include those that can also be used as nouns, which is not a problem since, unlike in English, multi-categorial words in Korean tend to be invariant in meaning. The non-nominal dictionaries are usually made by manual work.</Paragraph> <Paragraph position="3"> Those recognized neither as non-nominal words nor as function words are regarded as nouns. There can be multiple interpretations in segmenting a word due to the ambiguity of function words, as illustrated in the following example.</Paragraph> <Paragraph position="4"> wencalo: wenca (atom) + the function word -lo, or the single noun wencalo (reactor).</Paragraph> <Paragraph position="6"> One way to deal with the problem is to use the probability of each function word and choose the one with the highest value.
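As a minimal sketch of this simplest strategy (probabilities and candidates are invented; None marks the reading with no function word):

```python
# Choose, among candidate (noun, function word) segmentations of a word,
# the one whose function word is most probable. All values are assumed.
FUNC_PROB = {"lo": 0.09, "eyse": 0.12, None: 0.40}

def choose(candidates):
    """candidates: list of (noun, function_word) pairs for one word."""
    return max(candidates, key=lambda seg: FUNC_PROB.get(seg[1], 0.0))

# The two readings of "wencalo" from the example above:
print(choose([("wenca", "lo"), ("wencalo", None)]))
# ('wencalo', None)
```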
A more accurate measure can be obtained using a Hidden Markov Model of the stochastic process of function words. The function words are classified into 32 groups according to their roles and positions in sentences. In particular, each segmentation of a word is evaluated as follows.</Paragraph> <Paragraph position="8"> P(c|c') × P(n) × P(f|n), where P(c|c') is the probability of the category of the current word given the category of the previous word, P(n) is the probability of the candidate noun, and P(f|n) is the probability of a function word given the candidate noun. The best sequence of these segmentations for a sentence can be obtained. The candidate nouns n of the best sequence are then added to the noun dictionary.</Paragraph> <Paragraph position="9"> 3.2 Tokenizing and compound noun analysis. Tokenizing aims at recognizing simple and compound nouns from a text and reporting them as the final index terms. The method for dictionary making is also used for tokenizing. Since the dictionary making method gives a list of candidate nouns, we only need to check if a candidate is a compound noun and judge if the components of the candidate compound noun are consistent with the content of the document.</Paragraph> <Paragraph position="10"> To deal with the notion of consistency, we have to define the meaning of a term or a set of terms. It is a well-recognized practice to regard the discriminating power of a term as the value of the term. The quality of the discriminating power is given by the distribution of the term over a document set.</Paragraph> <Paragraph position="11"> We define the distribution of a term as the meaning of the term. Similarly, the meaning of a set of terms is the distribution of the terms over the document set.</Paragraph> <Paragraph position="12"> Let Mi be the distribution of a term Ti over the document set D = D1 ... DN. One definition of M(.)
may be as follows.</Paragraph> <Paragraph position="14"> The similarity between two terms (or sets of terms) can be defined by any of the vector similarity measures. Alternatively, the measurement of the relative information of the two distributions corresponding to the two terms gives the distance between the distributions. Given two distributions Mi and Mj for Ti and Tj respectively, the discrimination L(.) is defined as follows (Blahut, 1988).</Paragraph> <Paragraph position="15"> L(Mi, Mj) = Σn Mi(n) log(Mi(n) / Mj(n)), where n ranges over the documents D1, ..., DN.</Paragraph> <Paragraph position="17"> Since we want the dissimilarity between the two distributions, the divergence, which is a symmetric version of the discrimination, is more appropriate for our case.</Paragraph> <Paragraph position="18"> It is defined as follows (Blahut, 1988).</Paragraph> <Paragraph position="19"> J(Mi, Mj) = L(Mi, Mj) + L(Mj, Mi).</Paragraph> <Paragraph position="20"> Figure 2 illustrates the different distributions of terms over the same document set, suggesting the usefulness of the distributions as the representation of the terms. The divergence J(.) measures the information (uncertainty) of the two distributions as compared with each other, and has the following characteristics.</Paragraph> <Paragraph position="21"> * The more uniform the distribution is, the larger J(.) will be.</Paragraph> <Paragraph position="22"> * The more the two distributions agree, the smaller J(.)
will be.</Paragraph> <Paragraph position="24"> These characteristics are useful because good</Paragraph> <Paragraph position="25"> index terms should be less uniform and share similar contexts with other terms in a document.</Paragraph> <Paragraph position="26"> In this respect, the information-theoretic measure is more concrete and thus possibly more accurate than vector similarity measures.</Paragraph> <Paragraph position="27"> For each decomposition (T1, ..., Tk) of a compound noun Ck, what we want to see is how different the decomposed terms and the document terms are. That is, J({T1, ..., Tk}, D) becomes the score of the particular decomposition. What we select here is the decomposition with the lowest divergence. Letting τ and τ' denote a decomposition and the best decomposition respectively, τ' = argminτ J(τ, D).</Paragraph> <Paragraph position="29"> The following summarizes the procedure of extracting simple nouns from compound nouns.</Paragraph> <Paragraph position="30"> 1. Remove non-nominal words using the method for dictionary making.</Paragraph> <Paragraph position="31"> 2. Identify compound nouns using the noun dictionary.</Paragraph> <Paragraph position="32"> 3. For each decomposition τi of a compound noun Ci, compute J(τi, D).</Paragraph> <Paragraph position="33"> 4. Select the τi with the lowest J(τi, D).</Paragraph> </Section> <Section position="3" start_page="525" end_page="525" type="sub_section"> <SectionTitle> 3.3 Index weighting </SectionTitle> <Paragraph position="0"> There are three well-known methods for weighting index terms. They are based on the inverse document frequency, the discrimination value, and the probabilistic value (Salton, 1988).
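As a rough sketch, the inverse document frequency variant can be written in its standard tf × idf form (the token lists are invented, and this form is an assumption for illustration, not a quotation of the paper's exact formula):

```python
# Standard tf-idf weighting: w_ij = tf_ij * log(N / df_i).
# The toy "documents" below are invented token lists.
import math

docs = [
    ["wencalo", "enceyci"],
    ["wencalo", "palcen"],
    ["enceyci", "palcen", "palcen"],
]

N = len(docs)
df = {}                      # document frequency of each term
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

def weight(term, doc):
    """tf x idf weight of a term in one document."""
    tf = doc.count(term)
    return tf * math.log(N / df[term])

print(weight("palcen", docs[2]))
# 2 * log(3/2), i.e. about 0.81
```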
It turned out that these methods lead to similar performance, but inverse document frequency is by far the simplest of them in terms of time complexity and required resources (Salton, 1988; Harman, 1987).</Paragraph> <Paragraph position="1"> The inverse document frequency method is also shown to work with little performance variation across different domains. For this reason, we adopted inverse document frequency in the experiments. It is defined as follows.</Paragraph> <Paragraph position="2"> wij = tfij × log(N / dfi), where wij is the weight of the i'th term in the j'th document, tfij is the number of occurrences of the i'th term in the j'th document, dfi is the number of documents in which the i'th term occurs, and N is the total number of documents.</Paragraph> </Section> </Section> <Section position="5" start_page="525" end_page="525" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> The goal of the experiments is to validate the proposed algorithm for analyzing compound nouns by comparing it with the manual analysis and the bigram method.</Paragraph> <Paragraph position="1"> The test data set consists of 1000 science abstracts written in Korean (Kim, 1994). All nominals were manually identified and compound nouns were decomposed into appropriate simple nouns by an expert indexer. In the first experiment, our proposed algorithm is asked to do the same thing over the test data, and retrieval performances on the two different outcomes (manually indexed and automatically indexed abstracts) are compared. In the second experiment, the performances of the proposed method and the bigram method are compared to observe how the precision is affected.</Paragraph> <Paragraph position="2"> As is shown in Table 1, the portion of compound nouns is about 9% of the total nouns found in the test set, but it can have critical effects on the retrieval performance because compound nouns, carrying more specific information, often become a more accurate index to the documents.</Paragraph> <Paragraph position="3"> Figure 3 and Table 2 summarize the performance of the indexing methods: manual analysis, the proposed probabilistic method, and the bigram method. The proposed method showed a slightly better performance (around 3%-4%) than manual indexing or bigram indexing. Moreover, our method was more efficient than bigram indexing in terms of the number of index terms and the average number of retrieved documents per query.</Paragraph> <Paragraph position="4"> The average ambiguity of a compound noun is 1.43, and this low ambiguity must have contributed to the high agreement ratio of the proposed indexing method with manual indexing.</Paragraph> <Paragraph position="5"> The low ambiguity is partly attributed to the noun dictionary, which has no unnecessary entries not found in the documents.</Paragraph> </Section> </Paper>