<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1083">
<Title>[Figure residue in place of title: parse tree (NP/AP/N structure) for "stenose serree de le tronc commun gauche"]</Title>
<Section position="3" start_page="0" end_page="490" type="metho">
<SectionTitle> 1 Identifying word classes in medium-size corpora </SectionTitle>
<Paragraph position="0"> In companies with a wide range of activities, such as EDF, the French electricity company, the rapid evolution of technical domains, the huge amount of textual data involved, and its variation in length and style imply building or updating numerous terminologies as NLP resources. In this context, terminology acquisition is defined as a twofold process. On the one hand, a terminologist must identify the essential entities of the domain and their relationships, that is, its ontology. On the other hand, (s)he must relate these entities and relationships to their linguistic realizations, so as to isolate the lexical entries to be considered as certified terms for the domain.</Paragraph>
<Paragraph position="1"> In this paper, we concentrate on the first issue. Automatic exploration of a sublanguage corpus constitutes a first step towards identifying the semantic classes and relationships which are relevant for this sublanguage.</Paragraph>
<Paragraph position="2"> In the past five years, important research on the automatic acquisition of word classes based on lexical distribution has been published (Church and Hanks, 1990; Hindle, 1990; Smadja, 1993; Grefenstette, 1994; Grishman and Sterling, 1994). Most of these approaches, however, need large or even very large corpora in order for word classes to be discovered,[1] whereas it is often the case that the data to be processed are insufficient to provide reliable lexical information. In other words, it is not always possible to resort to statistical methods.</Paragraph>
<Paragraph position="3"> On the other hand, medium-size corpora (between 100,000 and 500,000 words: typically a reference manual) are already too complex and too long to rely on reading only, even with concordances. For this range of corpora, a pure symbolic approach, which recycles and simplifies analyses produced by robust parsers in order to classify words, offers a viable alternative to statistical methods. We present this approach in section 2. Section 3 describes the results on two technical corpora with two different robust parsers.
Section 4 compares our results to Hindle's (Hindle, 1990).</Paragraph> </Section>
<Section position="1" start_page="0" end_page="490" type="sub_section">
<SectionTitle> 2.1 Elementary contexts </SectionTitle>
<Paragraph position="0"> As Hindle's work shows, among others (Grishman and Sterling, 1994; Grefenstette, 1994), the mere existence of robust syntactic parsers makes it possible to parse large corpora in order to automate the discovery of syntactic patterns in the spirit of Harris's distributional hypothesis. However, Harris's methodology also implies simplifying and transforming each parse tree,[2] so as to obtain so-called "elementary sentences" exhibiting the main conceptual classes for the domain (Sager et al., 1987).</Paragraph>
<Paragraph position="1"> [1] For instance, Hindle (Hindle, 1990) needs a six million word corpus in order to extract noun similarities from predicate-argument structures.</Paragraph>
<Paragraph position="2"> [2] Changing passive into active sentences, using a verb instead of a nominalization, and so on.</Paragraph>
<Paragraph position="3"> In order to automate this normalization, we propose to post-process parse trees so as to emphasize the dependency relationships among the content words and to infer semantic classes. Our approach can be opposed to the a priori one, which consists in building simplified representations while parsing (Basili et al., 1994; Metzler and Haas, 1989; Smeaton and Sheridan, 1991).</Paragraph> </Section>
<Section position="2" start_page="490" end_page="490" type="sub_section">
<SectionTitle> 2.2 Recycling the results of robust parsers </SectionTitle>
<Paragraph position="0"> For the sake of reusability, we chose to add a generic post-processing treatment to the results of robust parsers. This implies transducing the trees resulting from different parsers to a common format. We have experimented so far with two parsers, AlethGram and LEXTER, which are being used at DER-EDF for terminology acquisition and updating. They both analyze corpora of arbitrary length. AlethGram has been developed within the GRAAL project.[3] LEXTER has been developed at DER-EDF (Bourigault, 1993). In this experiment, we focused on noun phrases, as they are central in most terminologies.</Paragraph> </Section>
<Section position="3" start_page="490" end_page="490" type="sub_section">
<SectionTitle> 2.3 The simplification algorithm </SectionTitle>
<Paragraph position="0"> The objective is then to reduce automatically the numerous and complex nominal phrases provided by AlethGram and LEXTER to elementary trees, which more readily exhibit the fundamental binary relations, and to classify words with respect to these simplified trees. For instance, from the parse tree for stenose serree de le tronc commun gauche[4] (cf. fig. 2, in which non-terminal nodes are indexed for reference purposes), the algorithm[5] yields the set of elementary trees of figure 1. The trees a and c correspond to contiguous words in the original sequence, whereas b and d only appear after modifier removal (see below).</Paragraph>
<Paragraph position="1"> Two types of simplification are applied when possible to a given tree (a sketch of both operations is given at the end of this section):
1. Splitting: Each sub-tree immediately dominated by the root is extracted and possibly further simplified. For instance, removing node NP0 yields two sub-trees: NP1, which is elementary (see below), and PP2, which needs further simplification.
2. Modifier removal: Within the whole tree, every phrase which represents a modified constituent is replaced by the corresponding non-modified constituent. For example, in NP0, the adjectival modifier serree is removed, as well as the determiner and the adjectives.</Paragraph>
<Paragraph position="2"> [4] Tight stenosis of the left common mainstem. In both parsers, the accents are removed during the analysis, and the lemmas are used instead of inflected forms. Additionally, for simplification purposes, a contracted word like du is considered as a preposition-determiner sequence.</Paragraph>
<Paragraph position="3"> [5] See (Habert et al., 1995) for a detailed presentation. The corresponding software, SYCLADE, has been developed by the first author.</Paragraph> </Section>
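To make the two operations concrete, here is a minimal sketch in Python. It is not the SYCLADE implementation: the tuple-based tree encoding, the AP/Det modifier labels, the tiny function-word list and the rule that a tree is elementary when it carries exactly two content words are all assumptions made for illustration.

```python
# Minimal sketch of the simplification algorithm (splitting + modifier
# removal). Trees are (label, children) tuples, leaves are plain strings.
# All label conventions below are assumptions, not the SYCLADE format.

FUNCTION_WORDS = {"de", "le"}        # assumed closed list of function words
MODIFIER_LABELS = {"AP", "Det"}      # assumed labels of modifier phrases

def leaves(tree):
    """Left-to-right word sequence of a tree."""
    if isinstance(tree, str):
        return [tree]
    _label, children = tree
    return [w for child in children for w in leaves(child)]

def is_elementary(tree):
    """Assumption: elementary = exactly two content words (a binary relation)."""
    return len([w for w in leaves(tree) if w not in FUNCTION_WORDS]) == 2

def is_modifier(child):
    return not isinstance(child, str) and child[0] in MODIFIER_LABELS

def remove_modifiers(tree):
    """Replace every modified constituent by its non-modified counterpart."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    return (label, [remove_modifiers(c) for c in children if not is_modifier(c)])

def simplify(tree, out):
    """Accumulate elementary trees (as word tuples) into the set `out`."""
    if isinstance(tree, str):
        return
    if is_elementary(tree):
        out.add(tuple(leaves(tree)))
        return
    _label, children = tree
    for child in children:             # 1. splitting
        simplify(child, out)
    stripped = remove_modifiers(tree)  # 2. modifier removal
    if stripped != tree:
        simplify(stripped, out)

# Crude rendering of the parse tree for "stenose serree de le tronc commun gauche"
np0 = ("NP", [("NP", [("N", ["stenose"]), ("AP", [("A", ["serree"])])]),
              ("PP", ["de", "le",
                      ("NP", [("N", ["tronc"]), ("AP", [("A", ["commun"])]),
                              ("AP", [("A", ["gauche"])])])])])
result = set()
simplify(np0, result)
print(result)  # {('stenose', 'serree'), ('stenose', 'de', 'le', 'tronc')}
```

On this example the sketch recovers two of the four elementary trees of figure 1; producing the remaining ones would require removing modifiers one at a time rather than all at once, which figure 1 suggests the actual algorithm does.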
<Section position="4" start_page="490" end_page="492" type="metho">
<Paragraph position="0"> [Figure residue: the elementary trees for "stenose serree de le tronc commun gauche" (figs. 1 and 2) and an extract of the graph linking coronarien, gauche, atteinte, diametre and tronc (cf. fig. 3).]</Paragraph>
<Paragraph position="1"> For each elementary tree, two classes are created. The first one, <stenose _>, in which _ stands for the pivot word, contains serree, whereas the second one, <_ serree>, contains stenose. At the end of the simplification process, these classes have been completed and other ones created. We claim that the semantic similarity between two lexical entries is in proportion to the number of shared contexts. For instance, in one of our corpora, stenose shares 8 contexts with lesion.</Paragraph>
<Paragraph position="2"> In order to get a global vision of the similarities relying on elementary contexts, a graph is computed. The words constitute the nodes. A link corresponds to a certain number of shared contexts (according to a chosen threshold). The edges are labelled with the shared contexts. The strongly connected components[6] and the cliques[7] are computed as well, as they are the most relevant parts of the graph on topological grounds. The underlying intuition is that a connected component relates neighboring words (Houser and Savitch, 1995) and that the cliques tend to isolate similarity classes. An extract of a connected component, with 3 as a threshold, appears in figure 3.</Paragraph>
<Paragraph position="3"> [6] The sub-graphs in which there is a path between every pair of distinct nodes.</Paragraph>
<Paragraph position="4"> [7] The sub-graphs in which there is an edge between each node and every other node of the graph.</Paragraph> </Section>
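The graph construction just described can be sketched as follows. The networkx library is an assumption (the paper does not name its tooling), and the threshold value and toy context sets are invented; note that on an undirected graph the "strongly connected components" of footnote 6 reduce to plain connected components.

```python
# Sketch of the shared-context graph: nodes are words, an edge is drawn
# when two words share at least `threshold` elementary contexts, and the
# edge is labelled with those contexts. networkx is an assumed choice.
from itertools import combinations
import networkx as nx

def build_graph(contexts_by_word, threshold=3):
    g = nx.Graph()
    for (w1, c1), (w2, c2) in combinations(contexts_by_word.items(), 2):
        shared = c1 & c2
        if len(shared) >= threshold:
            g.add_edge(w1, w2, contexts=sorted(shared))
    return g

# Invented toy data: each word maps to the set of elementary contexts
# (pivot slots) in which it was observed.
contexts = {
    "stenose":     {"_ serree", "_ de tronc", "atteinte de _", "_ de lesion"},
    "lesion":      {"_ serree", "_ de tronc", "atteinte de _", "occlusion de _"},
    "obstruction": {"_ serree", "_ de tronc", "atteinte de _"},
}
g = build_graph(contexts, threshold=3)
# Connected components gather neighboring words; maximal cliques tend to
# isolate similarity classes (the paper computes both).
print(list(nx.connected_components(g)))
print(list(nx.find_cliques(g)))
```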
<Section position="5" start_page="492" end_page="493" type="metho">
<SectionTitle> 3 Results </SectionTitle>
<Paragraph position="0"> Our experiments bear on two corpora. The first one, the NTC, is of about 52,000 words. The second one, the Coronary Medicine Corpus (CMC), is of about 80,000 words. It was built for the European MENELAS project (Zweigenbaum, 1994) and is used for pilot studies in terminology extraction.[8]</Paragraph>
<Section position="1" start_page="492" end_page="493" type="sub_section">
<SectionTitle> 3.2 A visual map of concepts and relationships </SectionTitle>
<Paragraph position="0"> Even if no ontology can be fully automatically derived from a corpus (Habert and Nazarenko, 1996), the SYCLADE graphs can be used to bootstrap the building of the ontology of a domain. The SYCLADE network gives a global view over the corpus which enables one to alternate between paradigmatic and syntagmatic exploration of the context of a word. The graph enables one to identify the concepts, their possible typical properties, and also the relationships between the selected concepts.</Paragraph>
<Paragraph position="1"> The cliques bring out small paradigmatic sets of forms which, in a first step, can be interpreted as ontological classes reflecting concepts. The arc labels then help to refine those classes by adding some of the surrounding words which are not part of the clique but which nevertheless share the most significant or some similar contexts. From the clique {stenose, lesion, obstruction, atteinte} (cf. fig. 3), one can build the class of affections which are located in the body, as {occlusion, stenose, lesion, calcification, obstruction, atteinte}. Similarly, from the graph one can isolate names of arteries ({carotide, interventriculaire}) and adjectives related to a specific artery ({coronaire, coronarien, diagonal, circonflexe}). The attribute degree of the affection is also revealed through the arc labels.</Paragraph>
<Paragraph position="2"> Last, relationships between concepts can be extracted, such as the "part-of" relation between tronc and artere, and segment and artere (fig. 3).</Paragraph> </Section>
<Section position="2" start_page="493" end_page="493" type="sub_section">
<SectionTitle> 3.3 Distinguishing word meanings </SectionTitle>
<Paragraph position="0"> Polysemy and quasi-synonymy often make the ontological reading of linguistic data difficult. However, through cliques and edge labels, the SYCLADE structured and documented map of the words helps to capture the word-meaning level. Among a set of connected words where w is similar to wi and wj, cliques bring out coherent sub-sets where wi and wj are also similar to each other. We argue that the various cliques in which a word appears represent different axes of similarity and help to identify the different senses of that word.</Paragraph>
<Paragraph position="1"> For instance, in the whole set of words connected to etude (study) in a strongly connected component of the NTC graph (analyse, evaluation, resultat, presentation, principe, calcul, travail...), some subsets form cliques with etude. Two of those cliques (resp. a and b in fig. 4 - threshold of 7) bring out a concrete and a more theoretical use of etude.</Paragraph>
<Paragraph position="2"> The network also enables one to distinguish the uses of quasi-synonyms such as coronaire and coronarien in the CMC corpus. Even if they are among the most similar adjectives (7 shared contexts) and if they belong to the same clique {coronaire, coronarien, diagonal, circonflexe}, the fact that coronarien alone is connected to evaluation adjectives (severe, significatif and important) shows that they cannot always be substituted for each other.</Paragraph> </Section> </Section>
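The sense-separation heuristic of section 3.3, reading the maximal cliques through a word as distinct axes of similarity, can be phrased over the same kind of networkx graph. The toy edges around etude are invented to mimic cliques a and b of figure 4, not taken from the NTC corpus.

```python
# Sense separation: the maximal cliques a word belongs to are read as
# distinct axes of similarity, hence candidate senses of that word.
import networkx as nx

def similarity_axes(g, word):
    """Return each maximal clique of g containing `word`, minus the word
    itself: one candidate sense per clique."""
    return [sorted(set(clique) - {word})
            for clique in nx.find_cliques(g) if word in clique]

# Invented toy graph around "etude": one concrete and one theoretical axis,
# in the spirit of cliques a and b of figure 4.
g = nx.Graph()
g.add_edges_from([("etude", "calcul"), ("etude", "analyse"),
                  ("calcul", "analyse"),                      # concrete axis
                  ("etude", "principe"), ("etude", "travail"),
                  ("principe", "travail")])                   # theoretical axis
for axis in similarity_axes(g, "etude"):
    print(axis)   # -> ['analyse', 'calcul'] and ['principe', 'travail']
```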
<Section position="6" start_page="493" end_page="494" type="metho">
<SectionTitle> 4 Towards an adequate similarity estimation for the building of ontologies </SectionTitle>
<Paragraph position="0"> The comparison with the similarity score of (Hindle, 1990) shows that the SYCLADE similarity indicator is specifically relevant for ontology bootstrapping and tuning. Hindle uses the observed frequencies within a specific syntactic pattern (subject/verb and verb/object) to derive a cooccurrence score which is an estimate of mutual information (Church and Hanks, 1990). We adapted this score to noun phrase patterns.[9] However, the similarity measures based on cooccurrence scores and nominal phrase patterns are less relevant for an ontological analysis. The subgraph of the surgical act words, which is easy to identify in the SYCLADE graph (fig. 5a), is split into different parts in the similarity graph (fig. 5b). This difference stems from the fact that this cooccurrence score overestimates rare events and underlines the collocations specific to each form.[10] For instance, it appears that the relationship between stenose and lesion, which was central in figure 3 with 8 shared contexts, almost disappears if one considers the number of shared cooccurrences.</Paragraph>
<Paragraph position="1"> Therefore, similarity measures based on cooccurrences and similarity estimation based on shared contexts must not be used in place of each other.</Paragraph>
<Paragraph position="2"> As opposed to Hindle's lists of similar words, which are centered on pivot words whose neighbors are all on the same level, in SYCLADE graphs a word is represented by its role in a whole syntactic and conceptual network. The graph enables one to distinguish the various meanings of words, a crucial feature in the ontological perspective, since the meaning level is closer to the concept level than the word level is. In addition, the results are clearer and more easily interpretable than those given by a statistical method, because the reader does not have to supply the explanation as to why and how the words are similar.</Paragraph>
<Paragraph position="3"> The building of an ontology, which is a time-consuming task and which cannot be achieved automatically, can nevertheless be guided. The SYCLADE graphs based on shared contexts can facilitate this process.</Paragraph>
<Paragraph position="4"> [9] Cooc(N1 P N2) = log2(k * f(N1PN2) / (f(N1) * f(N2))), where f(N1PN2) is the frequency of noun N1 occurring with N2 in a noun-preposition-noun pattern, f(N1) is the frequency of N1 as head of any N1PNy sequence, f(N2) is the frequency of N2 in modifier/argument position of any NxPN2 sequence, and k is the count of NxPNy elementary trees in the corpus. CoocNAdj and CoocAdjN are similarly defined.</Paragraph>
<Paragraph position="5"> [10] The various cooccurrence scores retrieve sets of collocations which are sharply different from the contexts shown by SYCLADE connected components. The collocations which get the greatest cooccurrence scores seem to characterize medicine phraseology (facteur (de) risque, milieu hospitalier) but not the coronary diseases as such.</Paragraph> </Section>
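To close, a sketch of the adapted cooccurrence score of footnote 9, using the log2 mutual-information form reconstructed there from the surviving definitions; the noun-preposition-noun counts are invented for illustration.

```python
# Reconstruction of the cooccurrence score of footnote 9 (an estimate of
# mutual information in the sense of Church & Hanks 1990), adapted to
# N1-preposition-N2 patterns. The corpus counts below are invented.
import math
from collections import Counter

def cooc_npn(pairs):
    """pairs: one (N1, N2) head/modifier couple per NxPNy elementary tree
    found in the corpus."""
    k = len(pairs)                                # count of NxPNy trees
    f_pair = Counter(pairs)                       # f(N1 P N2)
    f_head = Counter(n1 for n1, _ in pairs)       # f(N1) as head
    f_mod = Counter(n2 for _, n2 in pairs)        # f(N2) as modifier/argument
    return {(n1, n2): math.log2(k * f / (f_head[n1] * f_mod[n2]))
            for (n1, n2), f in f_pair.items()}

# Invented toy counts: "stenose de tronc" occurs twice, etc.
trees = [("stenose", "tronc"), ("stenose", "tronc"),
         ("lesion", "tronc"), ("stenose", "artere")]
for pair, score in sorted(cooc_npn(trees).items()):
    print(pair, round(score, 2))
```

</Paper>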