<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0314"> <Title>Inducing Terminology for Lexical Acquisition</Title> <Section position="4" start_page="126" end_page="129" type="metho"> <SectionTitle> 3 Integrating linguistic and statistical information for term discovery </SectionTitle> <Paragraph position="0"> The principled definition of the legal grammatical structures by which terms are expressed, and the description of their distributional properties in a sublanguage, are crucial for the automatic construction of a domain terminological dictionary. A number of methods for language-driven terminological extraction and for complex nominal parsing and recognition have been proposed to support NLP and lexical acquisition tasks. They differ mainly in the emphasis they give to syntactic versus statistical control of the induction process. In (Church,1988) a well-known purely statistical method for POS tagging is applied to the derivation of simple noun phrases that are relevant in the underlying corpus. More language-oriented methods, on the contrary, are those where specialized grammars are used. LEXTER (Bourigault,1992) extracts maximal length noun phrases (mlnp) from a corpus, and then applies a special-purpose noun phrase parser to them in order to focus on significant complex nominals. Although the reported recall of the mlnp extraction is very high (95%), the precision of the method is not reported.</Paragraph> <Paragraph position="1"> Voutilainen (1993) describes a noun phrase extraction tool (NPtool) based upon a lemmatizer for English (ENGTWOL) and on a Constraint Grammar parser. The set of potential well-formed noun phrases is selected according to two parsers working with different NP-hood heuristics. Very high NP recognition performance is reported (98.5% recall and 95% precision).</Paragraph> <Paragraph position="2"> A more statistically oriented approach is undertaken in (Daille et al.,1994), where a methodology for the syntactic recognition of complex nominals is described. Linguistic filters of a morphological nature are also applied. The corpus-driven analysis is mainly based on mutual information statistics, and the resulting system has been successfully applied to technical documentation, e.g. telecommunications.</Paragraph> <Paragraph position="3"> All these methods deal with the problem of NP recognition. As we are essentially interested in NPs that are actual terms in a domain, we need to decide which NPs are actual terms. We will define: 1. well-formedness principles for term denotations, together with a description of the different grammatical phenomena related to terms of a language; 2. distributional properties that distinguish terms from other (accidental) forms (e.g. non-terminological complex nominals).</Paragraph> <Section position="1" start_page="126" end_page="127" type="sub_section"> <SectionTitle> 3.1 Grammatical descriptions of terms in Italian </SectionTitle> <Paragraph position="0"> It is generally assumed that a terminologic dictionary is composed of a (possibly structured) list of nouns, or complex nominals. Nominal forms are in fact lexicalizations of domain concepts: proper nouns, acronyms as well as technical concepts are mostly expressed as nominal phrases of different length and complexity. For this reason we concentrated only on noun phrase analysis, as the main source of terminologic information 2.
A term is obtained by applying several mechanisms that add to a source word (generally a noun) a set of further specifications (i.e. additional constraints of a semantic nature). 2 In lexical acquisition the role of other syntactic categories (e.g. verbs, adjectives, ...) is also very important, but the set of phenomena related to them is very different, as also outlined by (Basili et al.,1996b). A detailed analysis of the role of syntactic modifiers and specifiers (De Rossi,1996) revealed that the legal structures for modifiers and specifiers in Italian are mainly of two types: 1. restrictive (or denotative) modifiers (postnominal participial, adjectival or prepositional phrases); 2. appositive (or connotative) modifiers (prenominal modifiers, i.e. adjectival phrases). Restrictive modifiers are generally used to constrain the semantic information related to the corresponding noun, via a further specification of a given type for that noun, as in scambi commerciali (*exchanges commercial): the referent noun is forced to belong to a restricted set of exchanges (those that are in fact of a commercial nature). On the contrary, appositive modifiers are used by the speaker/writer to add additional details: his own point of view or pragmatic information, as in la bianca cornice (the white frame) or la perduta gente (the lost people). Appositive modifiers do not correspond to any (shared) classification, but rather to the subjective speaker's point of view. Furthermore, prenominal modifications are rather infrequent in Italian. We thus decided to focus only on restrictive modifiers, the best candidates to carry terminological (i.e. assessed classificatory) information. The syntactic phenomena that have been studied as good candidates for restrictive forms are: 1. adjectival specification (via postnominal adjectives, as in inquinamento idrologico (*pollution hydrological)); 2. nominal specification (postnominal appositions, as in vagone letto (wagon-lit) or Fiat Auto (Fiat Cars)); 3. locative phenomena (postnominal proper nouns indicating locations, as in IBM Italia); 4. verbal specification (via postnominal past participles, as in siti inquinati (*sites polluted)); 5. prepositional specification (via a particular set of postnominal prepositional structures, as in Istituto di Matematica (Institute of Mathematics) or barca a vela (sailing-boat)) 3. 3 The set of prepositions that have been selected to introduce typical restrictive descriptions is: di, a, per, da. Only postnominal prepositional phrases introduced by one of these prepositions have been allowed in term expressions.</Paragraph> <Paragraph position="1"> Given the above linguistic principles, a special-purpose grammar for potential terminological structures can be sketched. With a simple language of regular expressions, the grammar of adjectival, prepositional and participial restrictions can be expressed as:
Term ← noun A_P*
Term ← noun A_P (Cong A_P)*
Term ← noun*
Term ← noun (- noun)*
Term ← noun , Term ,
A_P ← adjective | past_participle
Cong ← '-' | e
Prepositional postmodifiers are modeled according to the following rules:</Paragraph> <Paragraph position="3"> Note that the allowed structures are postnominal, due to the typical role of specifications in Italian.</Paragraph>
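<Paragraph> As an illustration of how the adjectival/participial pattern above (Term ← noun A_P*) can be applied, the following is a minimal Python sketch of a matcher over a POS-tagged token sequence. It is only an illustration of the rule, not the authors' implementation, and the tag names ("NOUN", "ADJ", "PPART") are assumed for the example rather than taken from the tagset actually used.

    # Minimal sketch of the pattern Term <- noun A_P*,
    # applied to a POS-tagged sentence. Tag names are illustrative assumptions.
    A_P = {"ADJ", "PPART"}            # adjective | past_participle

    def match_terms(tagged):
        """tagged: list of (token, pos) pairs; returns candidate term strings."""
        terms, i = [], 0
        while i < len(tagged):
            if tagged[i][1] == "NOUN":
                j = i + 1
                # absorb postnominal adjectives / past participles (A_P*)
                while j < len(tagged) and tagged[j][1] in A_P:
                    j += 1
                if j > i + 1:          # at least one restriction was found
                    terms.append(" ".join(tok for tok, _ in tagged[i:j]))
                i = j
            else:
                i += 1
        return terms

    # match_terms([("siti", "NOUN"), ("inquinati", "PPART")]) -> ["siti inquinati"]
</Paragraph>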
</Section> <Section position="2" start_page="127" end_page="129" type="sub_section"> <SectionTitle> 3.2 Distributional properties and term extraction </SectionTitle> <Paragraph position="0"> The recursive nature of some rules requires an iterative analysis of the corpus. The following algorithm is used: 1. Select the singleton nouns whose distributional properties are those of terms and insert them in the terminologic dictionary (TD). 2. Use the valid terms in TD to trigger the grammar and build complex nominals (cn). 3. Select those cn whose distributional properties are those of terms and insert them in TD. 4. Iterate steps 2 and 3 to build longer cn 4. Note that the newly found complex terms, added to TD in step 3, force a re-estimation of term probabilities, obtained by a further corpus scan, so that their heads are not counted twice. 4 As terminological units longer than 5 words are very infrequent in any sublanguage, we decided to stop after the second iteration.</Paragraph> <Paragraph position="1"> The validation of a limited set of potential surface forms as actual terms is crucial for lowering the complexity of the above algorithm. Given the grammar, we need criteria to decide which surface forms, among those that reflect the typical structure of a potential term, are actual lexicalizations of relevant concepts of the corpus. The kinds of observations that are available from the corpus are: (i) the set of lemmas met in the texts, (ii) the set of their well-formed restrictions (i.e. complex nominals), and (iii) the distributional properties of the entries in (i) and (ii). We first establish when a singleton lemma is a relevant concept, by using distributional properties of nouns. Then we characterize which restrictions of those terms are valid lexicalizations of more specific concepts. We proceed as follows, as also sketched below: 1. Select the set of lemmas that by themselves are markers of relevant concepts in the corpus. Lemmas are detected according to their frequency in the observed language sample as well as to their selectivity, i.e. how they partition the set of documents. This phase produces an early TD dictionary of simple terminological elements.</Paragraph> <Paragraph position="2"> 2. Extend TD also with those (well-formed) restrictions cn(l) of any l ∈ TD, according to the mutual information they exchange with l.</Paragraph>
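<Paragraph> The interaction of the Select and Extend steps can be summarized by the following Python sketch of the control flow. It is only a schematic rendering of the algorithm above, under the assumption of helper functions (passed in as parameters) that stand in for the statistical tests defined in the remainder of this section; none of these names come from the original system.

    # Sketch of the iterative Select / Extend loop. The helpers are assumptions:
    #   is_singleton_term(noun, corpus)  -> specificity test, eq. (1)
    #   grammar_candidates(term, corpus) -> well-formed restrictions of a term
    #   mi_score(cn, corpus)             -> generalized mutual information, eq. (2)
    #   mi_threshold(head, corpus)       -> head-dependent threshold tau(H)
    def build_term_dictionary(nouns, corpus, is_singleton_term, grammar_candidates,
                              mi_score, mi_threshold, iterations=2):
        # Step 1 (Select): singleton nouns that behave as terms seed TD.
        TD = {n for n in nouns if is_singleton_term(n, corpus)}
        for _ in range(iterations):            # terms longer than 5 words are rare
            new_terms = set()
            for term in TD:
                # Step 2: trigger the grammar on already validated terms.
                for cn in grammar_candidates(term, corpus):
                    # Step 3 (Extend): keep restrictions with high mutual information.
                    if cn not in TD and mi_score(cn, corpus) >= mi_threshold(term, corpus):
                        new_terms.add(cn)
            if not new_terms:
                break
            # Step 4: iterate on longer cn; counts are re-estimated so that the
            # heads of accepted complex terms are not counted twice.
            TD |= new_terms
        return TD
</Paragraph>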
<Paragraph position="3"> Select and Extend depend on the distributional properties of simple lemmas and complex nominals, respectively. The distributional property needed for the Select step is term specificity. Specific nouns are those that occur frequently in a corpus but whose selectivity over sets of documents is very high, that is: they are very frequent in a (possibly small) set of documents and very rare in the rest. In order to capture this behavior we use two scores: the frequency t_ij of a term i in a document j, and the inverse document frequency of a term (Salton,1989). Given a term i, its inverse document frequency is defined as follows: idf_i = log_2 (N / df_i), where df_i is the number of documents of the corpus that include term i, while N is the total number of documents in the collection. The following criterion is defined to capture singleton terms: if there exists at least one document where a noun i is required as an index (because it is relevant for that document and selective with respect to the other documents), then such a noun denotes a relevant domain term (i.e. a specific concept). In order to decide, we rely on idf_i and t_ij as follows.</Paragraph> <Paragraph position="4"> DEF (Singleton term). A noun i is a term if at least one document j exists for which: w_ij = t_ij log_2 (N / df_i) ≥ τ (1). w_ij captures exactly the notion of specificity required in the Select step of our algorithm. Potential heads of terminological entries are thus selected according to their selective power in the corpus. Even very rare words of the corpus can be captured by (1). In the Extend step of the algorithm we need to evaluate the mutual information of phrase structures: MI(x, y) = log_2 (N freq(x, y) / (freq(x) freq(y))), where freq(x, y) is the frequency of the joint event (x, y), and freq(x), freq(y) and N are the frequencies of x and y and the corpus size, respectively. In order to apply this standard definition of mutual information we need to extend it to capture the specific nature of the joint event head-modifier (H M1). Note that M1 denotes postnominal adjectives or past participles, but also prepositional phrases like dello Stato in territorio dello Stato. We decided to estimate the mutual information of such structures in a left-to-right fashion. The rightmost modifier (i.e. M1 in (H M1) structures, or Mn in (H M1 ... Mn)) is considered as the right event y, and every left incoming sub-structure (i.e. H, or H M1 ... Mn-1) is represented as a single event x. The generalized evaluation of mutual information for cn = ((H, M1, M2, ..., Mn-1), Mn) is thus: MI(cn) = log_2 (N freq((H M1 ... Mn-1), Mn) / (freq(H M1 ... Mn-1) freq(Mn))) (2). As an example, a term like debito pubblico (public debt) receives a mutual information score according to the following ratio: MI(debito pubblico) = log_2 (N freq(debito, pubblico) / (freq(debito) freq(pubblico))), while debito pubblico estero (foreign public debt) gives rise to the following ratio: MI(debito pubblico estero) = log_2 (N freq(debito pubblico, estero) / (freq(debito pubblico) freq(estero))). A complex nominal cn headed by H is then retained in TD if its mutual information exceeds a head-dependent threshold: MI(cn) ≥ τ(H) (3). The threshold τ(H) depends on the noun H, as it is evaluated according to the statistical distribution of the complex nominals headed by H 5. 5 In the experimental tests the best values for τ(H) have been obtained as a function of the mean and variance of the MI distribution over the set of cn headed by H. The set of singleton terms is exactly the same set that a classical indexing model (Salton,1989) obtains from the document collection (i.e. the corpus). The Extend phase allows us to capture all the relevant specifications of the singleton terms, to compile a more appropriate dictionary (for the corpus), and to structure it in hierarchically organized entries.</Paragraph>
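<Paragraph> Both filters can be stated compactly in code. The following Python sketch computes the specificity weight of eq. (1) and the generalized mutual information of eq. (2); the data structures (per-document counts, a table of token-tuple frequencies) and the threshold estimate in mi_threshold (mean plus standard deviation of the MI distribution, one plausible reading of footnote 5) are illustrative assumptions, not the original implementation.

    import math

    def specificity(term, doc_term_counts, doc_freq, n_docs):
        """w_ij = t_ij * log2(N / df_i), eq. (1); a noun is kept as a singleton
        term if this weight reaches tau in at least one document."""
        return doc_term_counts[term] * math.log2(n_docs / doc_freq[term])

    def generalized_mi(cn_tokens, freq, corpus_size):
        """Eq. (2): MI of ((H, M1, ..., Mn-1), Mn), grouping left to right.
        `freq` maps tuples of tokens to corpus counts (an assumed structure)."""
        left, right = tuple(cn_tokens[:-1]), (cn_tokens[-1],)
        joint = freq[tuple(cn_tokens)]
        return math.log2(corpus_size * joint / (freq[left] * freq[right]))

    def mi_threshold(candidate_cns, freq, corpus_size):
        """tau(H): mean + standard deviation of the MI values of the complex
        nominals headed by H (one plausible reading of footnote 5)."""
        scores = [generalized_mi(cn, freq, corpus_size) for cn in candidate_cns]
        mean = sum(scores) / len(scores)
        var = sum((s - mean) ** 2 for s in scores) / len(scores)
        return mean + math.sqrt(var)

For instance, for debito pubblico estero the left event is the pair (debito, pubblico) and the right event is estero, exactly as in the ratio given above.
</Paragraph>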
</Section> </Section> <Section position="5" start_page="129" end_page="129" type="metho"> <SectionTitle> 4 Implementation Issues </SectionTitle> <Paragraph position="0"> The model described in the previous section has been used to implement a system for terminology derivation from a corpus. The system relies upon the POS tagging activity as it is carried out within a lexical acquisition (LA) framework (e.g. the ARIOSTO system (Basili et al.,1996)) and extracts a full terminologic dictionary TD of: 1. simple terms (i.e. nouns), as seeds of a terminological structured dictionary (selected according to (1)); 2. complex nominal forms of some of those seeds, generated by the grammar and filtered according to (3).</Paragraph> <Paragraph position="1"> Terminology extraction is triggered after POS tagging. Morphologic analysis is then rerun according to the compiled TD. This feedback allows the system to exploit complex term extraction before activating syntactic recognition, in order to prune out significant components of grammatical ambiguity. This improves the overall ability of the linguistic processor and supports term-oriented rather than lemma-oriented lexical acquisition.</Paragraph> <Paragraph position="2"> A dedicated subsystem has been developed to support the manual validation of single terms. Figure 1 shows a screen dump of the graphical interface that supports the interactive validation (or removal) of terms in TD. TD is hierarchically organized in separate sections, where singleton terms dominate all their specified subconcepts. A section is the set of terms that share the same term head. A term like smaltimento dei rifiuti (garbage collection) has the noun "smaltimento" (disposal) as its term head. The corresponding section includes terms like smaltimento dei rifiuti, smaltimento di materiale tossico, smaltimento di gas di scarico, ... . In Figure 1 the head noun debito (debt) is reported: the section related to debito includes all its validated specifications (e.g. debito pubblico (public debt), debito pubblico estero (foreign public debt), ...).</Paragraph> </Section> <Section position="6" start_page="129" end_page="130" type="metho"> <SectionTitle> 5 Experimental Set-Up </SectionTitle> <Paragraph position="0"> In this section we describe the experimental set-up used to evaluate and assess the described model of terminological derivation.</Paragraph> <Paragraph position="1"> The method has been tested over two corpora of Italian documents. The first corpus (ENEA) is a collection of scientific abstracts on the environment, made of about 350,000 words. The second corpus (Sole24Ore) is an excerpt of financial news from the Sole 24 Ore economic newspaper, of about 1,300,000 words. Terminology extraction has been run over both corpora. From the ENEA corpus we derived a dictionary of about 2828 terms; from the Sole24Ore corpus 5639 terms have been extracted.</Paragraph> <Paragraph position="4"> In order to carry out the experiments we used a subset of the ENEA corpus, so as to measure performance over manually validated documents. The specific nature of our tests required the definition of particular performance evaluation measures. In fact, together with the classical notions of recall and precision, we also used data compression, defined as the percentage of incorrect syntactic data that are no longer produced when specific terminology is used. A further index is the average ambiguity, defined according to the notion of collision set (Basili et al., 1994). In order to accomplish the task, further reference information has been used: two standard domain-specific thesauri have been employed for comparing the results of the terminology extraction in the environmental domain (ENEA corpus).</Paragraph> <Section position="1" start_page="129" end_page="130" type="sub_section"> <SectionTitle> 5.1 Linguistic analysis of corpus data </SectionTitle> <Paragraph position="0"> Table 1 shows the section headed by attività, as derived from the ENEA corpus. The specific nature of the corpus is well reproduced by the data.
Here two specific senses of the lemma attività are captured: natural and biological activity, as in attività entropica, and human activities (like attività produttiva (productive activity) or attività di costruzione (building activity)). The latter have specific implications (as far as artificial pollution is concerned) for the environment.</Paragraph> <Paragraph position="1"> Table 1 also reports the distribution of the term over a set of 106 documents. In method RI, terms have been selected by classical inverse document frequency (Salton,1989) applied to singleton lemmas (i.e. attività). In method TI, we ran inverse document frequency after a terminology-driven lemmatization of the documents (i.e. using complex terms as source lemmas). The two sections of the table show that no index has been lost by the TI method (all of the 15 indexes have been found). This result is more general: the TI method produces more indexes. Over the 106 documents, RI extracts 476 simple indexes while TI extracts 732 (terminological) indexes.</Paragraph> <Paragraph position="2"> Again in Table 1, 5 of the fifteen indexes found by the TI method are complex nominals. In the set of documents from 1 to 20 (column 1to20) these allow us to discriminate between attività and attività antropica.</Paragraph> <Paragraph position="3"> Such a higher discriminating power is required not only for document classification/retrieval but, first of all, for lexical acquisition: in this technical domain it seems in fact necessary to rely on the information that attività antropica is typically carried out by humans, while attività in general is not necessarily so. We are convinced that these are the typical selectional constraints to be captured by corpus-driven lexical acquisition methods. Finer lexicalizations (like attività antropica) are the only way to provide a better input to the target acquisition tasks.</Paragraph>
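<Paragraph> The RI/TI comparison above amounts to running a classical tf-idf indexer (Salton,1989) twice: once on raw lemmas and once after re-lemmatizing the documents with the term dictionary TD. The following Python sketch illustrates this setup; the greedy retokenization strategy and all function names are assumptions made for the example, not the implementation used in the experiments.

    import math
    from collections import Counter

    def retokenize_with_terms(tokens, term_dictionary, max_len=5):
        """Greedily replace known multi-word terms (left to right, longest first)
        so that complex terms become single lemmas, as in the TI method."""
        out, i = [], 0
        while i < len(tokens):
            for n in range(min(max_len, len(tokens) - i), 1, -1):
                candidate = " ".join(tokens[i:i + n])
                if candidate in term_dictionary:
                    out.append(candidate)
                    i += n
                    break
            else:
                out.append(tokens[i])
                i += 1
        return out

    def tfidf_indexes(docs, threshold):
        """docs: list of token lists. For each document, return the lemmas whose
        weight t_ij * log2(N / df_i) reaches the threshold (as in eq. (1))."""
        n_docs = len(docs)
        df = Counter(lemma for doc in docs for lemma in set(doc))
        return [{lemma for lemma, tf in Counter(doc).items()
                 if tf * math.log2(n_docs / df[lemma]) >= threshold}
                for doc in docs]

    # RI: tfidf_indexes(raw_docs, tau)
    # TI: tfidf_indexes([retokenize_with_terms(d, TD) for d in raw_docs], tau)
</Paragraph>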
</Section> <Section position="2" start_page="130" end_page="130" type="sub_section"> <SectionTitle> 5.2 Experiment 1: Effectiveness of the terminology extraction </SectionTitle> <Paragraph position="0"> The aim of this experiment was to test the ability of the method to capture relevant concepts in the sublanguage. We ran this test on the environmental domain (ENEA corpus). The reference term dictionary was manually compiled by a team of three, culturally heterogeneous, domain experts. We thus obtained a complete list of terms (simple nouns as well as complex nominals) to be used as a test set (RT). The reference document set was a collection of 106 documents. The experts compiled a set of 482 terms organized in 155 sections (i.e. relevant head nouns); each section thus includes 3.12 terms on average. For the sake of completeness we also selected two large hand-coded thesauri for the environment: the CNR dictionary and the AIB dictionary (AIB,1995).</Paragraph> </Section> </Section> <Section position="7" start_page="130" end_page="131" type="metho"> <Paragraph position="0"> Both these dictionaries, as well as the automatically generated dictionary TD, have been compared with the reference RT. The comparison has been carried out over the different aligned sections. The alignment of the section related to the head smaltimento is reported in Table 2, where "X" marks the presence of the term in the corresponding dictionary and "." denotes its absence:
                               RT   CNR   AIB   TD
    smaltimento dei fanghi      X    .     .    X
    smaltimento dei rifiuti     X    X     X    X
    smaltimento delle scorie    X    .     .    X
Any dictionary D can thus be evaluated by measuring precision, i.e. precision = |RT_terms ∩ D_terms| / |D_terms|, and recall, i.e. recall = |RT_terms ∩ D_terms| / |RT_terms|.</Paragraph> <Paragraph position="1"> For example, within the section related to the head smaltimento we have 3 RT terms, of which 1 is in CNR and in AIB respectively, while 3 are in TD. Applying the recall and precision definitions to every section of the RT dictionary, we obtained the average performance scores reported in Table 3 for the three dictionaries.</Paragraph> <Section position="1" start_page="130" end_page="131" type="sub_section"> <SectionTitle> 5.3 Experiment 2: Shallow parsing with terminological knowledge </SectionTitle> <Paragraph position="0"> Consulting a terminologic dictionary before activating a shallow syntactic analyzer helps to solve several morphological and syntactic ambiguities. For example, given the sentence 6 L'ufficiale della Guardia di Finanza visitò l'aeroporto di Fiumicino (The officer of the Finance Guard visited the Fiumicino airport), a typical shallow syntactic analyzer (SSA) (Basili et al., 1992) produces, due to the syntactic ambiguity of prepositional phrases (PP), a set of elementary syntactic links (esl) in which the same PP is attached to several possible heads (e.g. both (ufficiale di finanza) and (guardia di finanza) are generated). As each sentence reading cannot assign more than a single referent to each PP, we can partition the set of esl into several collision sets, i.e. sets of esl that cannot belong to the same sentence reading according to (Basili et al., 1994). The sample sentence gives rise, among others, to the following collision set: { (ufficiale di finanza), (guardia di finanza) }.</Paragraph> <Paragraph position="2"> When terminology is available, many complex nominals are retained as single tokens and several ambiguities disappear. In the Sole24Ore corpus our method produced both the terms guardia di finanza and aeroporto di Fiumicino, so that the final list of esl reduces to links such as N-P-N ufficiale della guardia-di-finanza, and no ambiguous (i.e. non-singleton) collision set remains. This has two positive effects on the parsing activity. The first is data compression: the overgeneration typically due to the shallow grammatical approach is significantly limited. In our example, the 7 elementary syntactic groups obtained in the absence of terminology reduce to 4, with an overall data compression of (7-4)/7 = 42.8%. An extended experimentation has been carried out on a subset of 500 sentences of the corpus: the use of terminology reduces the number of elementary syntactic links from 500 to 403, with a corresponding overall data compression of about 20%. Furthermore, the detection of a term over single tokens that are morphologically ambiguous also improves morphological recognition. In fact, the detection of a chain of tokens that are part of the same term implies a specific choice of the grammatical category of each token, thus augmenting the selectivity of POS tagging.</Paragraph>
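<Paragraph> The compression and ambiguity figures used in these experiments can be computed directly from the parser output. The following is a minimal Python sketch, under the assumption that the output of a run is available as a list of collision sets (sets of mutually exclusive esl); it mirrors the measures used here (data compression, average ambiguity, and the degree of ambiguity described below), not the evaluation code actually used.

    def data_compression(n_esl_without_terms, n_esl_with_terms):
        """Fraction of elementary syntactic links no longer produced once
        terminology is used, e.g. (7 - 4) / 7 = 0.428 in the example above."""
        return (n_esl_without_terms - n_esl_with_terms) / n_esl_without_terms

    def ambiguity_scores(collision_sets):
        """collision_sets: list of sets of mutually exclusive esl.
        Returns (degree of ambiguity, average ambiguity): the ratio of ambiguous
        esl over all derived esl, and the average cardinality of the collision sets."""
        total_esl = sum(len(c) for c in collision_sets)
        ambiguous_esl = sum(len(c) for c in collision_sets if len(c) > 1)
        degree = ambiguous_esl / total_esl if total_esl else 0.0
        average = total_esl / len(collision_sets) if collision_sets else 0.0
        return degree, average

    # Without terminology: [{"(ufficiale di finanza)", "(guardia di finanza)"}, ...];
    # with terminology the same set collapses to {"(ufficiale della guardia-di-finanza)"}.
</Paragraph>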
<Paragraph> Over the same subset of the corpus we measured a decrement of 4% in the number of morphological derivations produced with terminology, against the recognition carried out in the absence of any terminological knowledge.</Paragraph> <Paragraph position="5"> A second positive aspect of having a domain-specific terminology available is the reduction of the underlying syntactic ambiguity and the increase in parser precision. As shown in the example, many PP ambiguities disappear as soon as a set of complex nominals is detected. This has strong implications for shallow (or robust, as widely accepted in the literature) parsing. We conducted a systematic analysis of correct parsing results by contrasting a parser with and without access to the domain terminology. The analysis of the results has been performed by comparing the collision sets obtained by the two runs over a set of 100 sentences. Four performance scores have been evaluated: the degree of ambiguity (i.e. the ratio between the number of ambiguous esl's and the total number of derived esl's); the average ambiguity (expressed by the average cardinality of the collision sets, i.e. the number of reciprocally ambiguous esl's); and, finally, precision and recall, measured according to a hand validation of the derived syntactic material 7. The analysis has been carried out specifically for prepositional esl's (i.e. noun-preposition-noun, verb-preposition-noun and adjective-preposition-noun links). Results are reported in Table 4, where separate columns express the scores for the different runs: a simple parser (SP) and a terminology-driven parser (TP). As a result, the simple parser obtains several complex nominals, but only as syntactic structures, so that it fails in detecting higher-order syntactic links (i.e. syntactic relations between complex nominals and other sentence segments). In these cases we also penalized the recall of the SP method, so that the difference between the two methods lies not only in the amount of persisting ambiguity (i.e. precision) but also in coverage (better captured by recall).</Paragraph> </Section> </Section> </Paper>