<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2161"> <Title>Positioning Unknown Words in a Thesaurus by Using Information Extracted from a Corpus</Title> <Section position="2" start_page="0" end_page="956" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Thesauruses are among the most useful knowledge resources for natural language processing. For example, English thesauruses such as Roget's Thesaurus and WordNet \[4\] are widely used for tasks in this area \[,5, 6, a\]. However, most existing thesauruses are compiled by hand, and consequently, the following three problems occur when they are used for NLP systems.</Paragraph> <Paragraph position="1"> First, existing thesauruses have insufficient vocabularies, especially in languages other than English. In Japan, there are no free thesauruses that can be shared by researchers. Furthermore, general-domain thesauruses do not cover domain-specific terms.</Paragraph> <Paragraph position="2"> Second, the human intuition used in constructing thesauruses is not explicit. Most existing thesauruses are hand-crafted by observing huge amounts of data on the usage of words. The data and human judgements used in constructing thesauruses would be very useful in NLP systems; unfortunately, however, this information is not represented in the thesauruses. Third, the structure of thesauruses is subjective. The depth and density of nodes in (tree-like) thesauruses directly affect the calculated distances between words. For example, nodes for biological words have many levels, while abstract words are classified in relatively shallow levels. However, existing thesauruses only represent uniform relationships between words.</Paragraph> <Paragraph position="3"> This paper describes a way of overcoming the problems, using a medium-size Japanese thesaurus and a large corpus. 
The main goal of our work is to expand the thesaurus automatically, explicitly including distinguishing features (viewpoints), and to construct a domain-sensitive thesaurus system.</Paragraph> <Paragraph position="4"> To expand the vocabulary of the thesaurus, it is important to position new words in it automatically. In this paper, words that are not contained in the thesaurus but that appeared in the corpus more than once are called unknown words. 1 The proper positions of the unknown words in the thesaurus are estimated by using word-to-word relationships extracted from a large-scale corpus. This task may be similar to word-sense disambiguation, which determines the correct sense of a word from several pre-defined candidates. However, in positioning a word whose sense is unknown, a suitable position must be selected from thousands of nodes (words) in the thesaurus, and therefore it is very difficult to position the word with pinpoint accuracy. Instead, in this paper, we give a method for determining the area to which an unknown word belongs. For example, suppose the word &quot;SENTOUKI&quot; (fighter) 2 is not contained in a thesaurus. Calculation of the similarity between the word and those in the thesaurus assigns it to the area \[flying vehicle \[air plane, helicopter\]\]. Viewpoints are features that distinguish a node from other nodes in the thesaurus, and are good clues for estimating the area to which an unknown word should be assigned. The area can be efficiently estimated by extracting viewpoints.</Paragraph> <Paragraph position="5"> Several systems have used WordNet and statistical information from large corpora \[3, 5, 6\]. However, there are two common problems: noisy co-occurrence of words and data sparseness. In WordNet, since each node in the thesaurus is a set of words that have synonym relationships (SynSet), various methods for similarity calculation using the SynSet classes have been proposed. 
In this paper, ISAMAP \[8\], a hand-crafted Japanese thesaurus, is used as a core. To overcome the problems of noise and data sparseness, relationships of connected nodes in the thesaurus are used. Resnik proposed a class-based approach, in which sets of words are used instead of words \[5\]. In his approach, each synset is used as a class. In our approach, on the other hand, an area that contains connected nodes in the thesaurus is used as a class. The nodes are connected by IS-A relationships as well as synonym relationships, and therefore large areas represent strong similarities to unknown words.</Paragraph> <Paragraph position="6"> 1 That is, unknown words do not mean very low-frequency words.</Paragraph> <Paragraph position="8"> 2 A Japanese word in ISAMAP is represented by a pair of capital Roman letters and the word's English translation.</Paragraph> </Section></Paper>