<?xml version="1.0" standalone="yes"?> <Paper uid="C88-1019"> <Title>Entry # Homograph # Pronunciation Paradigm Label POS Syntactic Codes Usage Label Pointers to the base-lemma and/or to all derivatives Pointers to graphical variants Sense# Field Label Syntactic Codes Figurative, extended, etc. Definitions Pointers to Synonyms Pointers to Antonyms Pointers to Hyponyms, Hyperonyms Pointers to other Entries through other Relations Semantic (inherent) Features Formalized Word-sense Representation Examples # Example Figurative, rare, .. Definitions of a particular contextual usage Idioms Citations Proverbs</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> ACQUISITION OF SEMANTIC INFORMATION FROM AN ON-LINE DICTIONARY </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> After the first work on machine-readable dictionaries (MRDs) in the seventies, and with the recent development of the concept of a lexical database (LDB) in which interaction, flexibility and multidimensionality can be achieved, but everything must be explicitly stated in advance, a new possibility which is now emerging is that of a procedural exploitation of the full range of semantic information implicitly contained in MRDs. The dictionary is considered in this framework as a primary source of basic general knowledge. In the paper we describe a project to develop a system which has word-sense acquisition from information contained in computerized dictionaries and knowledge organization as its main objectives. The approach consists in a discovery procedure technique operating on natural language definitions, which is recursively applied and refined. We start from free-text definitions, in natural language linear form, analyzing and converting them into informationally equivalent structured forms. 
This new approach, which aims at reorganizing free text into elaborately structured information, could be called the Lexical Knowledge Base (LKB) approach.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. Background </SectionTitle> <Paragraph position="0"> For a considerable period in theoretical and computational linguistics, there was a predominant lack of interest in lexical problems, which were regarded as being of minor importance with respect to &quot;core&quot; issues concerning linguistic phenomena, mainly of a syntactic nature. During the last few years, however, this trend has been almost reversed. The role of the lexicon in both linguistic theories and computational applications is now being greatly revalued and one aspect on which a number of research groups are now focussing their attention is the possibility of reusing the large quantity of data contained in already existing machine-readable lexical sources, mainly dictionaries prepared for photocomposition, as a short cut in the construction of extensive NLP-oriented lexicons.</Paragraph> <Paragraph position="1"> This position was formulated very clearly in a number of papers presented at a recent workshop organized in Grosseto (Italy) and sponsored by the European Community (see Walker, Zampolli, Calzolari, forthcoming), and can be found in the set of recommendations which was one of the results of this workshop (Zampolli 1987, pp.332-335).</Paragraph> <Paragraph position="2"> After the first work on machine-readable dictionaries (MRDs) in the seventies (see Olney 1972, Sherman 1974), and with the recent development of the concept of a lexical database (LDB) in which interaction, flexibility and multidimensionality can be achieved, but everything must be explicitly stated in advance (see e.g. 
Amsler 1980, Byrd 1983, Calzolari 1982, Michiels 1980), a new possibility which is now emerging is that of a procedural exploitation of the full range of semantic information implicitly contained in MRDs (see Wilks 1987, Binot 1987, Alshawi forthcoming, Calzolari forthcoming).</Paragraph> <Paragraph position="3"> The dictionary is now considered as a primary source not only of lexical knowledge but also of basic general knowledge (ranging over the entire &quot;world&quot;), and some of the dictionary systems which are being developed have knowledge acquisition and knowledge organization as their principal objectives (see also Lenat and Feigenbaum 1987). In this paper we describe a project which we are now conducting on the acquisition of semantic information from computerized dictionaries.</Paragraph> </Section> <Section position="4" start_page="0" end_page="89" type="metho"> <SectionTitle> 2. Data and established methods for hierarchical semantic classifying </SectionTitle> <Paragraph position="0"> The data we use in our research include the lexical information contained in the Italian Machine Dictionary (DMI), which is already structured as an LDB and is mainly based on the Zingarelli Italian dictionary (1970); the DMI-DB has different types of linguistic information already accessible on-line. A morphological module generates and analyzes the inflected word-forms: approximately 1 million from 120,000 lemmas. Lemmas, word-forms, derivatives/suffixes, POS, usage codes, and specialized terminology codes can be used as direct access search keys through which the user can query the database dictionary. On the semantic side, synonyms, hyponyms, and hypernyms constitute already implemented access paths covering all of the approximately 200,000 definitions contained in the dictionary. 
Examples of possible queries are the following: give me all the nouns defined as names of vehicles, of sounds, of games, all the verbs defined by a particular genus term, for example 'MUOVERE' (to move), 'TAGLIARE' (to cut), etc. The procedures used to find hypernyms in definitions and to create taxonomies are similar to those used by other groups (see Chodorow 1985, Calzolari 1983, Amsler 1981).</Paragraph> <Paragraph position="1"> We have now begun work on restructuring another dictionary available in machine-readable form (MRF), the Garzanti Italian dictionary (1984). A parser has been implemented which, on the basis of the typesetting codes for photocomposition, identifies the rough structure of each lexical entry. Fig. 1 displays the output of a parsed entry of the Garzanti dictionary. Fig. 2 represents the provisional model for a monolingual lexical entry as we have defined it so far. Fig. 3 gives the projection of the first interpretation of the typesetting codes into this model; other kinds of information will be added afterwards (for example, that obtained by the inductive procedures described in the paper).</Paragraph> <Paragraph position="2"> ............................................................</Paragraph> <Paragraph position="4"> = 2 [3] = qualsiasi oggetto che non si sappia o non si voglia determinare: {a che serve quell'-?} / {quell'uomo e' un pessimo} -, e' un tipo poco raccomandabile</Paragraph> <Paragraph position="6"> / {essere bene}, {male in} -, trovarsi in buone, cattive condizioni fisiche o economiche.</Paragraph> <Paragraph position="7"> ............................................................ Fig. 1 - Output of the photocomposition codes.</Paragraph> <Paragraph position="8"> (the number in the first column identifies the type ...) The merging of part of the data of the DMI and the Garzanti dictionary into a single LDB has already been completed, e.g. for lemmas, POSs, usage codes, etc. 
We now have to tackle the problem of reorganizing the semantic data (definitions and examples). Here our strategy is to design a new procedural system which is able to gradually &quot;learn&quot; and acquire semantic information from dictionary definitions, going well beyond the IS-A hierarchies constructed so far, in order to attempt to also capture what is present in the &quot;differentia&quot; part of the definition. This can be achieved with some success given the particular nature of lexicographic definitions, with: a) a generic (and perhaps over-simplistic) description of the &quot;world&quot;; b) a rather lexically and syntactically constrained and somewhat regular natural language text (Calzolari 1984, Wilks 1987).</Paragraph> <Paragraph position="9"> After having mapped the codes for photocomposition into linguistically relevant codes, all the preliminarily parsed data of the Garzanti have been organized on a PC in the form of a Textual Database (DBT), a full-text Information Retrieval (IR) system in which all occurrences of any word-form or lemma can be directly accessed (Picchi 1983). The DBT has been found to be a very powerful tool in evidencing lexical units and particular syntagms which can then be exploited in our &quot;pattern-matching&quot; procedure. With the text in DBT form it is possible to search occurrences of single word-forms in definitions and examples, lemmas, codes of various types (POS, specialized languages, usage labels, etc.), and also cooccurrences of any of these items throughout the entire dictionary. In addition, structures composed of combinations of the above elements connected by the logical operators &quot;and&quot; and &quot;or&quot; to any degree of complexity can also be searched. The results of such queries are returned together with the pertinent dictionary entries.</Paragraph> <Paragraph position="10"> Obviously frequencies can also be obtained. 
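A query of this kind can be sketched in a few lines of Python. This is only an illustration and not the DBT implementation: the names build_index and search, the tuple-based query syntax and the choice of three sample entries (definitions quoted from the Garzanti output in Fig. 6) are assumptions made for the example, and no morphological expansion is attempted.

```python
from collections import defaultdict


def build_index(entries):
    """Map each lower-cased word-form to the headwords whose definition contains it."""
    index = defaultdict(set)
    for headword, definition in entries.items():
        for token in definition.lower().split():
            index[token.strip(".,;:")].add(headword)
    return index


def search(index, query):
    """Evaluate a nested query: a word, ("and", q1, q2, ...) or ("or", q1, q2, ...)."""
    if isinstance(query, str):
        return set(index.get(query.lower(), set()))
    op, *operands = query
    results = [search(index, q) for q in operands]
    if op == "and":
        return set.intersection(*results)
    if op == "or":
        return set.union(*results)
    raise ValueError(f"unknown operator: {op}")


# Sample entries (assumed subset, for illustration only).
ENTRIES = {
    "astrofisica": "scienza che studia la natura fisica degli astri.",
    "onomastica": "ramo della linguistica che studia i nomi propri di persona o di luogo.",
    "papirologia": "scienza che si occupa dello studio e dell'interpretazione degli antichi papiri.",
}

INDEX = build_index(ENTRIES)
# Cooccurrence of (scienza OR ramo) AND studia.
HITS = search(INDEX, ("and", ("or", "scienza", "ramo"), "studia"))
# Frequency of a word-form, counted as the number of entries containing it.
FREQ_SCIENZA = len(INDEX["scienza"])
```

With these three entries the query selects 'astrofisica' and 'onomastica' but not 'papirologia', whose definition contains 'studio' rather than 'studia'; handling such variants is exactly what the lemma-based search described above is for.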
All this information can be retrieved with fast interactive access.</Paragraph> <Paragraph position="11"> We have therefore already implemented two types of organization for dictionary data: 1) DB-type organization with the DMI (we have not used a standard DBMS, but an ad hoc designed relational DB system); 2) a full-text IR system for the Garzanti dictionary.</Paragraph> <Paragraph position="12"> Although both types of organization have proved to be very powerful tools for different purposes, each also presents certain drawbacks and difficulties, due to the particular nature of dictionary data, which it has not been possible to fully exploit in either case. Dictionary data is in fact of a very particular nature, consisting of a combination of free text in a highly organized structure. The DB approach copes well with the second characteristic, while the IR approach is successful in handling free text. However, neither is capable of fully exploiting the two features in combination. A new method must be envisaged, capable of reorganizing free text into elaborately structured information: this could be called the Lexical Knowledge Base (LKB) approach, and is the aim of the project described here.</Paragraph> <Paragraph position="13"> 3. Techniques for word-sense acquisition Discovery procedure techniques prove to be useful in extracting semantic information from definition texts. In general, our approach consists in starting from free-text definitions, in natural language linear form, analyzing and converting them into informationally equivalent structured forms. The preliminary step of the work consisted in applying the morphological analyzer to the definitions; the result of this process for one definition appears in Fig. 4. A program designed for homograph disambiguation was then run on the output produced by this morphological processor. 
The disambiguator consists partly in rules generally valid for Italian, based on the immediate right and left context, and partly in ad hoc rules written for the particular syntax used in lexicographic definitions. Fig. 5 shows the result of applying this disambiguation procedure to all the homographs shown in the preceding example. We then had to implement a set of discovery procedures acting on dictionary definitions.</Paragraph> <Paragraph position="14"> The first analysis of the definitional data was performed manually for single definitions, and quantitatively for the most frequently occurring words and syntagms. From this analysis we have established a number of broadly defined and simplified Categories of knowledge and Relations, which on the one hand intuitively reflect basic &quot;conceptual categories&quot; and on the other represent attested lexicographic definitional categories. They also rely on past experience of similar work (both on Italian and on English), or of AI research.</Paragraph> <Paragraph position="16"> ......................................................</Paragraph> <Paragraph position="17"> Fig. 5 - Output of the disambiguation procedure. In order to allow the inductive pattern-matching rules to perform the successive phases correctly and so that more coherent retrieval operations are possible, a &quot;basic vocabulary&quot; has been established (both for the &quot;Categories&quot; and for the &quot;Relations&quot;) mainly on the basis of quantitative and intuitive considerations, and is constituted by words acting as Labels. As an example, the following lemmas: 'arnese, attrezzo, dispositivo, strumento, congegno', which altogether appear in dictionary definitions 761 times, have been grouped under the Label 'INSTRUMENT'. 
Other examples of Labels belonging to the basic vocabulary which have been established for hypernyms are the following: SET, PART, SCIENCE, HUMAN, ANIMAL, PLACE, ACT, I I-I'ISCI', LIQUID, PLANT, INHABITANT, SOUND, GAME, TEXTILE, MOVE, BECOME, LOSE, etc.</Paragraph> <Paragraph position="18"> This is, therefore, our approach. We begin with a system which has simple and general purpose pattern-matching capabilities, designing it as an incremental system. To cope with the fact that there are variations in the way the same conceptual category or the same relation is linguistically (lexically and/or syntactically) rendered in natural language definitions, each such category or relation is associated with a list of specified lexical units and/or syntactic features which give the variant forms. The search is then driven by these lists of patterns to handle the grammatical and lexical variations.</Paragraph> <Paragraph position="19"> The &quot;pattern-matching&quot; strategy has obviously been integrated with the Italian morphological analyzer to handle inflectional variation. The patterns may contain either Labels, or Lemmas, or Word-forms. For the Labels, the system searches for all the associated lemmas and all their word-forms (unless otherwise specified); in the same way Lemmas are automatically expanded to cover their inflected word-forms. Generally, we look for recurring patterns in the definitions and attempt to associate them with corresponding relations or conceptual categories. Fig. 6 lists some of the entries and definitions obtained when querying the dictionary in DBT form for cooccurrences of items such as 'science, discipline, branch,...' 
together with 'studies, concerns,...'. Analyzing the results of similar queries to the dictionary we are able to better identify a number of patterns to be used in the semantic scanning of the definitions.</Paragraph> <Paragraph position="20"> Textual Data Base Dizionario Garzanti ................................................................</Paragraph> <Paragraph position="21"> 3) ANATOMIA : PoS s.f. S#1 scienza che mediante la dissezione e altri metodi di ricerca studia gli organismi viventi nella loro forma esteriore e ...</Paragraph> <Paragraph position="22"> 6) ARALDICA : PoS s.f. scienza del blasone, che studia e regola la composizione degli stemmi gentilizi.</Paragraph> <Paragraph position="23"> 9) ASTROFISICA : PoS s.f. scienza che studia la natura fisica degli astri.</Paragraph> <Paragraph position="24"> 18) BIOLOGIA : PoS s.f. scienza che studia i fenomeni della vita e le leggi che li governano.</Paragraph> <Paragraph position="25"> 35) ETIMOLOGIA : PoS s.f. S#1 scienza che studia le origini delle parole di una lingua.</Paragraph> <Paragraph position="26"> 37) FISICA : PoS s.f. scienza teorico-sperimentale che studia i fenomeni naturali e le leggi relative 56) MERCEOLOGIA : PoS s.f. scienza applicata che studia le merci secondo la loro origine, i caratteri fisici, gli usi, la produzione e ...</Paragraph> <Paragraph position="27"> ................................................................ searching for ... BRANCA &amp; STUDIA 3) DIETETICA : PoS s.f. branca della medicina che studia la composizione dei cibi necessari a un'alimentazione razionale.</Paragraph> <Paragraph position="28"> 8) FARMACOLOGIA : PoS s.f. branca della medicina che studia i farmaci e la loro azione terapeutica sull'organismo. 21) TOSSICOLOGIA : PoS s.f. branca della medicina che studia la natura e gli effetti delle sostanze velenose e dei loro antidoti.</Paragraph> <Paragraph position="29"> ................................................................ searching for ... 
SPECIALITA' &amp; STUDIA 1) CARDIOLOGIA : PoS s.f. ({med.}) la specialita' che studia le funzioni e le malattie del cuore.</Paragraph> <Paragraph position="30"> ................................................................ searching for ... RAMO &amp; STUDIA 3) ONOMASTICA : PoS s.f. ramo della linguistica che studia i nomi propri di persona o di luogo.</Paragraph> <Paragraph position="31"> ................................................................ searching for ... SCIENZA &amp; OCCUPA 4) PAPIROLOGIA : PoS s.f. scienza che si occupa dello studio e dell'interpretazione degli antichi papiri.</Paragraph> <Paragraph position="32"> 1) AUXOLOGIA : PoS s.f. disciplina delle scienze biologiche che si occupa dell'accrescimento degli organismi, in particolare di quello umano.</Paragraph> <Paragraph position="33"> 2) NEUROPSICHIATRIA : PoS s.f. disciplina medica che si occupa delle malattie nervose e mentali.</Paragraph> <Paragraph position="34"> ................................................................ searching for ... DISCIPLINA &amp; STUDIA 1) ALGOLOGIA : PoS s.f. disciplina medica che studia le cause e le terapie del dolore.</Paragraph> <Paragraph position="35"> 13) IMMUNOLOGIA : PoS s.f. disciplina biologica che studia i fenomeni immunitari.</Paragraph> <Paragraph position="36"> ................................................................ Fig. 
6 - Some examples of queries to the dictionary in DBT form.</Paragraph> <Paragraph position="37"> This is an example of a pattern where the Labels SCIENCE and STUDY appear: [Det/Adj] SCIENCE [di NP/*Adj/e NP] &quot;che&quot; (mediante NP)</Paragraph> </Section> <Section position="5" start_page="89" end_page="90" type="metho"> <SectionTitle> STUDY NP-OBJ </SectionTitle> <Paragraph position="0"> where the following are the lemmas associated to the Labels: SCIENCE = (scienza, disciplina, specialita', branca, ramo, parte) STUDY = (studia, si occupa di).</Paragraph> <Paragraph position="1"> NP-OBJ is the subject matter of the science.</Paragraph> <Paragraph position="2"> The results of a first run through the whole dictionary using an initial set of patterns can afterwards be recursively revised when new data are acquired. Our practical global research strategy is to develop a system which at the beginning has only a generalized expertise. This system obviously breaks down at many points on its first run; we can then evaluate all these points, and consider when and where measures must be taken to overcome specific difficulties. In this way, new capabilities can be added incrementally to the system so that gradually it is able to cope with increasingly difficult data. Thus we systematically add new &quot;knowledge&quot; to the system, prompted each time by a failure to cope with the given data. It seems to us that this is a practical research strategy for eliciting and modelling vague and fuzzy knowledge.</Paragraph> <Paragraph position="3"> Even though the methodological approach has been deliberately simplified at the beginning (in order to introduce problems gradually, a few at a time), the dimensions of the data have not been limited in any way.</Paragraph> </Section> <Section position="6" start_page="90" end_page="90" type="metho"> <SectionTitle> 4. The knowledge organization. 
</SectionTitle> <Paragraph position="0"> Although the body of knowledge with which we are dealing is at least partly based on intuition, on vague and not even coherent data (as lexicographic definitions often are), and on inductive empirical strategies, we must attempt to model the knowledge as the system acquires it. The formalism for the representation of word-senses is as follows.</Paragraph> <Paragraph position="1"> Each element is defined as a Function characterized by a Type and Arguments. The Type qualifies the function. The main types include: Hypernym, Relation, Qualifier, etc.</Paragraph> <Paragraph position="2"> Examples of the Type Relation are: USED, PRODUCED, IN-THE-FORM, STUDY, LACK, etc. The type Hypernym can be instantiated by: Hypernym proper, PART, SET, etc.</Paragraph> <Paragraph position="3"> Arguments may be either Terms, or Terms plus Function, or Functions. A Term can be a Label, a Word, or a combination of these with the logical operators 'and/or'. A Word can be either a Word-form, or a Lemma plus Grammatical Information (e.g. INpl means plural Noun).</Paragraph> <Paragraph position="4"> The following definitions: Batterio, s.m., microrganismo vegetale unicellulare privo di clorofilla.</Paragraph> <Paragraph position="5"> Batteriologia, s.f., parte della microbiologia che studia i batteri. are now represented as: As the metalanguage and the rules are declared separately from the pattern-matching parser, the system is incremental, flexible, portable (it can be used with other languages or other dictionaries), and testable. In fact, the system has been designed so that it is easy to test alternative strategies or sets of rules or constraints.</Paragraph> <Paragraph position="6"> This kind of organization will allow us to draw inferences, using part of the formal structure associated to an entry and inserting it in other structures in which that entry appears as an Argument. 
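The following Python sketch illustrates one way such an inference could work; it is an assumption-laden illustration, not the notation or the implementation used in the project. The class Function, the helper inherited_labels, the flat "HYP"/"REL" type codes and the hand-built LEXICON are all hypothetical; the three entries paraphrase the dictionary definitions of 'batteriologia', 'microbiologia' and 'biologia'.

```python
from dataclasses import dataclass, field


@dataclass
class Function:
    """One element of a word-sense: a Type, a Label and its Arguments."""
    type: str    # "HYP" (Hypernym) or "REL" (Relation)
    label: str   # e.g. "PART", "SCIENCE", "STUDY"
    args: list = field(default_factory=list)  # lemmas (str) or nested Functions


# Hypothetical hand-built structures; in the project they would be produced
# by the pattern-matching procedure.
LEXICON = {
    "batteriologia": [Function("HYP", "PART", ["microbiologia"]),
                      Function("REL", "STUDY", ["batterio"])],
    "microbiologia": [Function("HYP", "PART", ["biologia"]),
                      Function("REL", "STUDY", ["microrganismo"])],
    "biologia": [Function("HYP", "SCIENCE"),
                 Function("REL", "STUDY", ["fenomeno", "vita"])],
}


def inherited_labels(lemma, lexicon, seen=None):
    """Collect the hypernym Labels a lemma reaches through its hypernym Arguments."""
    seen = set() if seen is None else seen
    if lemma in seen or lemma not in lexicon:
        return set()
    seen.add(lemma)  # guard against circular definitions
    labels = set()
    for fn in lexicon[lemma]:
        if fn.type != "HYP":
            continue
        if fn.args:
            # The hypernym points at another entry: reuse that entry's structure.
            for arg in fn.args:
                if isinstance(arg, str):
                    labels |= inherited_labels(arg, lexicon, seen)
        else:
            labels.add(fn.label)
    return labels
```

With these entries, inherited_labels("batteriologia", LEXICON) follows the PART links through 'microbiologia' and 'biologia' and returns the single Label SCIENCE.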
For example, 'microbiologia' present in the second definition above is defined in its turn as 'parte della biologia che studia i microrganismi...', translated as (T.HYP-PART, f(T.REL-SPEC, IN[Biologia)), and 'biologia' which is 'scienza che studia i fenomeni della vita...' is finally defined as T.HYP-SCIENCE. This last Label SCIENCE is obviously also inherited by 'Batteriologia' and by 'Microbiologia'.</Paragraph> <Paragraph position="7"> 5. Some experimental results Already after just one run, by looking at cooccurrences of hypernyms and particular relations, we can identify those environments in which certain relations are most likely to appear, or in which certain ambiguous lexical and/or syntactic cues (e.g. the prepositions PER 'for', DI 'of', A 'to', etc.) can be disambiguated as referring to only one relation, or in which certain relations are never found, and so on. A set of constraining rules can be associated to certain conceptual units (Hypernyms or Relations, expanded automatically to all the pertinent lexical realizations) in order to disambiguate their immediate context. Some units therefore activate particular subroutines for an ad-hoc interpretation of what follows. These rules explicitly look for items to which a determined meaning is associated. In the following pattern, we have a rule which, after a USED relation, links the word &quot;in&quot; to a 'place' relation, the words &quot;per, a&quot; (for/to) to the purpose, &quot;da&quot; (by) to the agent, and &quot;come&quot; (as) to the way of usage. Other kinds of relations are not activated by a particular rule, but have a meaning in themselves, e.g. CONSTITUTED BY, SIMILAR TO, etc.</Paragraph> <Paragraph position="8"> HYPER .... USED come NP (= way) per/a Vinf/
NP (= purpose) in NP (= place) da NP (= agent) The analysis in some cases is therefore purposely delayed until more relevant information has been acquired, and will eventually be based on the results of definitions already successfully handled. This analysis of the first results will lead to an improvement of the system, adding other patterns or other surface realizations of already existing patterns to the first simple list of patterns, and also imposing constraints on given hypernyms or on given relations. Therefore, after the first stage, the system consists of patterns augmented with conditioning rules which will then drive subsequent runs of the procedure (for those cases which are lexically or grammatically conditioned). In this way, the system can be gradually refined. The analysis procedure is envisaged as a series of cycles which find relevant cooccurrences of categories and relations that can then be set as conditioning rules to further guide successive searches. An interactive phase is also foreseen so that, when necessary, definitions can be modified for a normalization in accordance with acceptable analysis structures. From successive passes through the data, applying different and increasingly refined sets of patterns and rules, the procedure builds up, as completely as possible with this methodology, a formal description of the structure of the lexical definitions. At the end, from a comparison of the different formalized structures generated, we will be able to associate structures which differ in only one element (a conceptual category or relation). In this way, we can construct something like &quot;minimal pairs&quot; of sense-definitions, which only differ in one conceptual or relational feature. It can be reasonably supposed that this feature is related to or realizes one of the differences between these words. 
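A minimal sketch of such a comparison, in Python, is given here purely as an illustration. It assumes that each formalized structure has been flattened into a dictionary of slots, which is a simplification of the Function/Argument formalism; the name minimal_pairs, the slot names HYP, OF and STUDY and the three sample senses (loosely based on the definitions in Fig. 6) are hypothetical.

```python
from itertools import combinations


def minimal_pairs(senses):
    """Return (lemma_a, lemma_b, slot) for senses that differ in exactly one slot.

    `senses` maps a lemma to a dict {slot: value}; two senses are compared
    only when they have the same set of slots.
    """
    pairs = []
    for (a, fa), (b, fb) in combinations(sorted(senses.items()), 2):
        if fa.keys() != fb.keys():
            continue
        differing = [slot for slot in fa if fa[slot] != fb[slot]]
        if len(differing) == 1:
            pairs.append((a, b, differing[0]))
    return pairs


# Hypothetical flattened senses, for illustration only.
SENSES = {
    "dietetica": {"HYP": "PART", "OF": "medicina", "STUDY": "cibo"},
    "farmacologia": {"HYP": "PART", "OF": "medicina", "STUDY": "farmaco"},
    "onomastica": {"HYP": "PART", "OF": "linguistica", "STUDY": "nome"},
}

PAIRS = minimal_pairs(SENSES)
```

Here only 'dietetica' and 'farmacologia' form a minimal pair, differing in the STUDY slot; 'onomastica' differs from both in two slots and is therefore not paired.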
It will also be possible to build hierarchies not only for hypernyms, but also, and more interestingly, for complex conceptual structures considered as a whole.</Paragraph> </Section> </Paper>