<?xml version="1.0" standalone="yes"?> <Paper uid="C92-3150"> <Title>SURFACE GRAMMATICAL ANALYSIS FOR THE EXTRACTION OF TERMINOLOGICAL NOUN PHRASES Didier BOURIGAULT École des Hautes Études en Sciences Sociales</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> SURFACE GRAMMATICAL ANALYSIS FOR THE EXTRACTION OF TERMINOLOGICAL NOUN PHRASES </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> LEXTER is a software package for extracting terminology. A corpus of French-language texts on any subject field is fed in, and LEXTER produces a list of likely terminological units to be submitted to an expert for validation. To identify the terminological units, LEXTER takes their form into account and proceeds in two main stages: analysis and parsing. In the first stage, LEXTER uses a base of rules designed to identify frontier markers in order to analyse the texts and extract maximal-length noun phrases. In the second stage, LEXTER parses these maximal-length noun phrases to extract subgroups which, by virtue of their grammatical structure and their place in the maximal-length noun phrases, are likely to be terminological units. In this article, the type of analysis used (surface grammatical analysis) is highlighted, as is the methodological approach adopted to work out the rules (an experimental approach).</Paragraph> <Paragraph position="1"> 1) Constituting a terminology
Constituting a terminology of a subject field, that is to say establishing a list of the terminological units that represent the concepts of this field, is an oft-encountered problem. For the Research Development Division of Électricité de France (French Electricity Board), this problem arose in the information documentation sector. An automatic indexing system, using different thesauri according to the application, has been operational for three years or more [Monteil 1990]. The terminologists and information scientists need a terminology extraction tool in order to keep these thesauri up to date in constantly changing fields and to create "ex nihilo" thesauri for new fields. This is the reason why the terminology extraction software, LEXTER, was developed, forming the first link in the chain that goes to make up the thesaurus. A corpus of French-language texts is fed into LEXTER, which outputs a list of likely terminological units; these are then passed on to an expert for validation.</Paragraph> <Paragraph position="2"> 2) What is a terminological unit?
The main aim here is not to provide a rigorous definition of what a terminological unit is, but rather to outline its essential features, and thus to justify the hypotheses (concerning the form of terminological units) on which LEXTER is based.</Paragraph> <Paragraph position="3"> Semantic function: the representation of the concept
The first characteristic of the terminological unit is its function as the representation of a concept. The terminological unit plays this role of representation in the framework of a terminology, which is the linguistic evidence of the organisation of a field of knowledge in the form of a network of concepts; the terminological unit represents a concept, uniquely and completely, taken out of any textual context.
The existence of this one-to-one relationship between a linguistic expression and an extra-linguistic object is, as we shall see, a situation which particularly concerns terminological units.</Paragraph> <Paragraph position="4"> The appearance of a new terminological unit is most often a process parallel to the birth of the concept which it represents. This "birth" is marked by the consensus of a certain scientific community. This consensus is attested only when the occurrences of this linguistic expression, or term-to-be, show a stable correlation with the same object in the subject field, uniquely and completely, in the writings of the agents of this scientific community.</Paragraph> <Paragraph position="5"> When this is the case, the object in question takes its place in the network describing the subject field, and the expression takes on the status of a terminological unit. This referential function is, for E. Benveniste, the "synaptic" mark of a syntagm [Benveniste 1966].</Paragraph> <Paragraph position="6"> It is thus because occurrences in text of a terminological unit systematically refer to a concept that a relationship of representation is established, out of any textual context, between the terminological unit and the concept. This underpins the specific status of the terminological unit as opposed to that of the word in language, a status close to that of a descriptor in Information Science ([Le Guern 1984]).</Paragraph> <Paragraph position="7"> Syntactic form: synaptic composition
We put forward the hypothesis that this function of representing the concept out of context puts a certain number of constraints on the form that terminological units may take on. It has been seen that the construction of terminological units obeys well-known rules of syntactic formation, called synaptic composition ([Benveniste 1966]). For example: terminological units are noun phrases, generally made up of nouns and adjectives, and practically never containing conjugated verbs; the prepositions used most often are "de" and "à", rarely followed by a determiner.</Paragraph> <Paragraph position="8"> To illustrate this, take the concept of a "screen belonging to a portable computer". Without going into the linguistic phenomena behind this, it can be said that, in context, both the syntagms "l'écran d'un ordinateur portable" ("the screen of a portable computer") and "un écran d'ordinateur portable" ("a portable computer screen") can refer generically to the concept. However, if one wished to represent this concept out of any textual context, the chances are that one would reject the expression "écran d'un ordinateur portable", for which the interpretation of the article "un" may be ambiguous, and accept the expression "écran d'ordinateur portable", which is more naturally used in isolation and thus more suitable to go into a terminology.</Paragraph>
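By way of illustration only (the pattern below is a rough approximation introduced for this article, not a rule taken from LEXTER), the synaptic-composition constraints can be summed up as a pattern over part-of-speech tags. In this minimal Python sketch, the tag names (N, ADJ, DET, PREP_DE, PREP_A) and the pattern itself are assumptions:

import re

# Hypothetical shape of a terminological unit under the synaptic-composition
# hypothesis: a noun, optional adjectives, optionally extended by "de"/"à"
# plus another noun group, with no conjugated verb and no determiner after
# the preposition. Tag names are invented for this sketch.
SYNAPTIC_SHAPE = re.compile(r"^N(\sADJ)*(\s(PREP_DE|PREP_A)\sN(\sADJ)*)*$")

def looks_terminological(tags):
    """Return True if a sequence of POS tags fits the synaptic shape."""
    return SYNAPTIC_SHAPE.match(" ".join(tags)) is not None

# "écran d'ordinateur portable" -> N PREP_DE N ADJ: fits the shape
print(looks_terminological(["N", "PREP_DE", "N", "ADJ"]))         # True
# "écran d'un ordinateur portable" -> N PREP_DE DET N ADJ: does not
print(looks_terminological(["N", "PREP_DE", "DET", "N", "ADJ"]))  # False

On this reading, the determiner after the preposition is precisely what disqualifies "écran d'un ordinateur portable" from standing alone in a terminology.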
<Paragraph position="9"> From these considerations on the form and the function of terminological units, two main ideas are relevant to developing a computer-based system of terminology constitution: 1) It is possible to devise an extraction program based solely on syntactic data, since the grammatical form of terminological units is relatively predictable; 2) It is not possible to expect this program to extract terminological units and nothing else, given the basically referential semantic function of occurrences of terminological units: this means that the results obtained can only be considered, a priori, as likely terminological units.</Paragraph> <Paragraph position="10"> 3) How LEXTER works: analysis and parsing
To detect terminological units, LEXTER takes the form of these units into consideration, and works in two phases: analysis and parsing.</Paragraph> <Paragraph position="11"> LEXTER treats categorized texts, which have been submitted to a morphological analysis: each word is tagged with its grammatical category (noun, verb, adjective, etc.).</Paragraph> <Paragraph position="12"> At this stage, LEXTER takes advantage of "negative" knowledge about the form of terminological units, by identifying those grammatical patterns which never go to make up these units and which can thus be considered potential boundaries of terminological units. Such patterns are made up of, say, conjugated verbs, pronouns, conjunctions, certain strings of preposition + determiner, etc.</Paragraph> <Paragraph position="13"> The LEXTER analysis module is thus set up with a base of rules for identifying frontier markers, which it uses to analyse the texts. This analysis phase produces a series of text sequences, most often noun phrases. The way the rules are worked out is presented in section 5.</Paragraph> <Paragraph position="14"> These noun phrases may well be likely terminological units themselves (as is the case with TRAITEMENT DE TEXTE, in the example in figure 1), but more often still they contain subgroups which are also likely units (such as DISQUE DUR DE LA STATION DE TRAVAIL, which contains DISQUE DUR and STATION DE TRAVAIL). That is why it is preferable, at the analysis stage, to refer to the noun phrases identified as "maximal-length noun phrases".</Paragraph> <Paragraph position="15"> Second stage: parsing the maximal-length noun phrases
It is thus necessary, in the second stage, to parse these maximal-length noun phrases in order to obtain subgroups which are likely terminological units by virtue of their grammatical structure and their position in the maximal-length noun phrase. The LEXTER parsing module is made up of parsing rules which indicate which subgroups to extract on the basis of grammatical structure. An example of a rule is given in figure 1. In its present form, the parsing module can recognize up to 800 different structures, enabling it to treat around 95% of the maximal-length noun phrases obtained from our test corpus on completion of the analysis stage, that is around 43,500 groups out of 46,000. This module, the core of LEXTER, is described more fully in [Bourigault 1992b].</Paragraph>
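Since figure 1 is not reproduced here, the following minimal Python sketch illustrates the two stages just described (the actual modules are rule-based C programs, cf. section 5; the frontier set and the single decomposition rule below are simplified stand-ins for LEXTER's actual rule bases):

FRONTIER_TAGS = {"V", "PRO", "CONJ", "PUNCT"}  # conjugated verbs, pronouns, ...

def extract_maximal_nps(tagged_text):
    """Stage 1 (analysis): cut the categorized text at every frontier
    marker; the remaining stretches are maximal-length noun phrases."""
    phrases, current = [], []
    for word, tag in tagged_text:
        if tag in FRONTIER_TAGS:
            if current:
                phrases.append(current)
                current = []
        else:
            current.append((word, tag))
    if current:
        phrases.append(current)
    return phrases

def split_at_prep_det(np):
    """One toy parsing rule: a preposition immediately followed by a
    determiner separates a head group from an embedded noun phrase."""
    for i in range(len(np) - 1):
        if np[i][1] == "PREP" and np[i + 1][1] == "DET":
            return np[:i], np[i + 2:]
    return None

def likely_units(np):
    """Stage 2 (parsing): keep the maximal noun phrase and, recursively,
    the subgroups extracted from it as candidate terminological units."""
    units = [" ".join(word for word, _ in np)]
    parts = split_at_prep_det(np)
    if parts:
        for part in parts:
            if part:
                units += likely_units(part)
    return units

np = [("DISQUE", "N"), ("DUR", "ADJ"), ("DE", "PREP"), ("LA", "DET"),
      ("STATION", "N"), ("DE", "PREP"), ("TRAVAIL", "N")]
print(likely_units(np))
# ['DISQUE DUR DE LA STATION DE TRAVAIL', 'DISQUE DUR', 'STATION DE TRAVAIL']

Applied to DISQUE DUR DE LA STATION DE TRAVAIL, the sketch thus returns the maximal-length noun phrase together with DISQUE DUR and STATION DE TRAVAIL, as in the example above.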
<Paragraph position="16"> 4) Surface grammatical analysis versus complete syntactic analysis
At the beginning of the conceptual phase in the development of LEXTER, it was hypothesized that a complete syntactic analysis of the sentences of the corpus could be forgone, given the limited aim of extracting terminological noun phrases with their characteristic grammatical structure.</Paragraph> <Paragraph position="17"> In LEXTER, the basic linguistic data is the grammatical category of the lexical units which make up the sentences, and the analysis and parsing which make use of this data take into account the surface form of utterances, considered as sequences of categorized units: only the "place" of the units in the surface sequence is taken into account, and not their position (cf. [Milner 1989]) in a syntactic structure. This is why it is more accurate to speak of a surface grammatical analysis than of a complete syntactic analysis.</Paragraph> <Paragraph position="18"> The quality of the results obtained by the present prototype justifies dispensing with a complete syntactic analysis. The advantages of restricting the analysis to surface structure are obvious: the program can deal with texts written in styles that are not necessarily academic; it is sturdy and quick, non-negligible virtues when it comes to the development and extension stages.</Paragraph> <Paragraph position="19"> Although it is not necessary to go into a complete syntactic analysis of the sentences to extract the terminology from a corpus, it would seem highly likely that a syntactic analyser (parser) would be much more efficient if it could use a glossary of the terminological units of the subject area. The syntactic structures of a natural-language text and the syntactic structures of terminological units, which represent, out of context and in a terminology, the concepts of a subject field, are to be placed on two different organisational levels. It is thus advisable to dissociate these two analyses, though using the results of one for the other.</Paragraph> <Paragraph position="20"> Since the terminological unit, as its name suggests, is always a semantic unit, which is at the basis of its status (cf. §1), it should be treated as such on the syntactic level as well. This makes it possible to envisage a text analysis in two phases, the first identifying terminological units, the second using these results to analyse the sentences syntactically, with a view to constructing a semantic representation. This is the principle which we intend to adopt to make LEXTER a text analysis tool to aid knowledge acquisition (cf. [Bourigault 1992a]).</Paragraph> <Paragraph position="21"> 5) An experimental approach to working out rules of analysis
To analyse texts, LEXTER uses an analysis rule base which detects frontier markers.
Some of these rules are simple: one of them detects all punctuation marks; another, all the words belonging to certain grammatical categories: verbs, conjunctions, pronouns, etc.</Paragraph> <Paragraph position="22"> As well as these simple rules, it is necessary to add more complex rules which examine sequences of lexical units to find frontiers, in particular to spot the boundaries between noun phrases that are complements of the same verb or the same noun.</Paragraph> <Paragraph position="23"> The constraints imposed by the choice of a surface grammatical analysis make it difficult to base the detection of frontiers between noun phrases on reliable theoretical morphosyntactic hypotheses (even though the work of F. Debili showed that this choice is, for the French language, pertinent for computer processing [Debili 1982]). This is particularly so for the semantic-syntactic lexical information on the subcategorization of verbs (and of nouns and adjectives as well), which must be forgone, making the tricky task of identifying what prepositional noun phrases are attached to even more difficult.</Paragraph> <Paragraph position="24"> The alternative is then to rely on intuitive ideas and to compensate for the absence of theoretical justification by adopting an empirical approach based on large-scale corpus experimentation.</Paragraph> <Paragraph position="25"> Before any rule is put into one of LEXTER's modules (rules of analysis, rules of parsing), it must pass the test of the results it produces every time it is applied to the test corpus.</Paragraph> <Paragraph position="26"> This is why it is necessary to work on a test corpus of sufficient volume to be representative of possible cases of analysis and parsing, and to produce sufficiently fast software to make this experimental approach worthwhile.</Paragraph> <Paragraph position="27"> For this, a test corpus of 2,500 pages (around 1,200,000 words) was used, gathered from 1,700 texts in which the scientific officers of the Research Development Division of Électricité de France describe in a short paper (1 or 2 pages) each of their medium-term research projects. The analysis and parsing modules of LEXTER were programmed in C, using the lex and yacc tools in a Unix environment.</Paragraph> <Paragraph position="28"> Each of the analysis and parsing stages takes less than 2 minutes on a Sun workstation, making very frequent tests easy and thus enabling far-reaching updating and adjustment.</Paragraph> <Paragraph position="29"> It is through this experimental approach that the analysis (and parsing) rules were worked out.</Paragraph> <Paragraph position="30"> By way of illustration, the analysis rules treating the sequences preposition + determiner are presented in figure 2.</Paragraph>
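Figure 2 is likewise not reproduced in this version; the Python sketch below therefore shows, under invented conventions, how one rule of the preposition + determiner family and its empirical vetting might look. Only the rule "À + LE, LA or LES = frontier", discussed in the next paragraph, is taken from the text; the corpus format and the acceptance threshold are assumptions:

def frontier_a_plus_det(window):
    """Candidate frontier rule: the preposition "à" followed by a definite
    article (le, la, les) is taken to close the current noun phrase."""
    (w1, t1), (w2, t2) = window
    return (t1, t2) == ("PREP", "DET") and w1.lower() == "à" \
        and w2.lower() in ("le", "la", "les")

def accept_rule(rule, validated_corpus, max_error_rate=0.05):
    """Empirical vetting: fire the rule on every two-token window of a
    corpus whose true frontier positions are known, count correct and
    erroneous firings, and accept the rule only if its residual error
    rate stays under a threshold (the 5% figure and the corpus format,
    pairs of tagged text and frontier-index sets, are invented here)."""
    correct = wrong = 0
    for tagged_text, true_frontiers in validated_corpus:
        for i in range(len(tagged_text) - 1):
            if rule(tagged_text[i:i + 2]):
                if i in true_frontiers:
                    correct += 1
                else:
                    wrong += 1
    fired = correct + wrong
    return fired > 0 and wrong / fired <= max_error_rate

A productive rule that fails such a test is dropped, which is the trade-off described below as relative strictness.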
<Paragraph position="31"> It is true too that these rules, as all the rules in LEXTER, have their limits, and that there are cases where they apply (or do not apply) "wrongly". These limitations come from the strong hypotheses and the methodological choices which have already been outlined. But in the field of Linguistic Engineering, exceptions do not have the same status as in Linguistic Science; it is here a question of compromise.</Paragraph> <Paragraph position="32"> In the experimental approach adopted, this risk of error is taken into account and kept under control, as each rule is tested separately against the corpus, and it is the count of the cases to which it applies that decides whether it gets into LEXTER or not. The principle is not to include rules of analysis which are too strict; it is preferable to drop a rule which is productive in many cases (as for the rule À + LE, LA or LES = frontier) if the number of residual cases of erroneous analysis is too high. This principle, called "of relative strictness", is justified in that it will be easier for the terminologist to eliminate certain likely units than to find real terminological units that escaped detection by LEXTER.</Paragraph> </Section> </Paper>