<?xml version="1.0" standalone="yes"?> <Paper uid="C92-4188"> <Title>CONVERTING LARGE ON-LINE VALENCY DICTIONARIES FOR NLP APPLICATIONS: FROM PROTON DESCRIPTIONS TO METAL FRAMES</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> CONVERTING LARGE ON-LINE VALENCY DICTIONARIES FOR NLP APPLICATIONS: FROM PROTON DESCRIPTIONS TO METAL FRAMES GEERT ADRIAENS [1,2] GERT DE BRAEKELEER [1] </SectionTitle> <Paragraph position="0"> In this paper, we report on a large-scale conversion experiment with on-line valency dictionaries. A linguistically motivated valency dictionary in Prolog is converted into a valency dictionary for a large-scale machine translation system. Several aspects of the two dictionaries and their background projects are discussed, as well as the way their representations are mapped. The results of the conversion are looked at from an economic perspective (fast coding for NLP), and also from a computational-lexicographic perspective (requirements for conversions and for standardization of lexicon information).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> One of the major bottlenecks for large-scale NLP applications such as the METAL® MT system 1 is the acquisition of their lexicons 2. Whereas the development and fine-tuning of the grammars of such systems reaches its saturation point after a few years of R&amp;D, the extension of their lexicons is a constant and ever-growing concern. 
In order to speed up the lexical acquisition process, coding tools are developed to increase the human lexicographer's productivity, and existing electronic dictionaries are sought that can be converted and integrated with the particular NLP application at hand.</Paragraph> <Paragraph position="1"> In this paper we report on a large-scale conversion effort with an eye to enhancing the METAL verb dictionaries with several thousands of entries. While the system is capable of defaulting the necessary morphological information for verbs on the basis of their surface appearance (cp. Adriaens &amp; Lemmens 1990), it cannot automatically create the complex syntactic-semantic valency information, i.e. the quantitative and qualitative characterization of the arguments of a verb. Still, this information is of crucial importance for the system to parse and translate correctly. Valency characterizations can be used to discriminate different readings of a sentence during analysis (cp. e.g. the different usages of hail: it is hailing, she hailed curses at me, he hailed me from the window, the people hailed him king).</Paragraph> <Paragraph position="2"> Moreover, they are often useful for disambiguation with an eye to translation. Boguraev &amp; Briscoe 1989, Zernik 1989.</Paragraph> <Paragraph position="3"> For Dutch, for instance, to reach for something is a usage that needs a different translation from to reach somebody something (pakken versus overhandigen). (For a detailed discussion of the importance of valency for NLP and MT in particular, we refer to Gebruers 1991.) To recognize the need for detailed valency descriptions in NLP applications is one thing; to acquire them is less self-evident. In a system like METAL, the valency feature on verbs represents the most complex and hard-to-code element in its lexical representations. 
Hence, to automate and speed up the acquisition process, we used electronic valency dictionaries for Dutch and French as coded by the PROTON project (see van den Eynde et al.</Paragraph> <Paragraph position="4"> 1988, Eggermont &amp; van den Eynde 1990, Eggermont et al. forthcoming) as our starting point. The conversion was a non-trivial exercise in computational lexicography for several reasons. First, the PROTON databases are mainly descriptive and exhaustive in nature; they were not conceived with particular NLP applications in mind.</Paragraph> <Paragraph position="5"> METAL, on the other hand, seeks parsimony for efficient computational treatment within a machine translation application. More in particular, PROTON codes one entry per valency frame of a verb, whereas METAL merges valency patterns into &quot;superframes&quot;, storing these only once for each verb. Second, their representation formalism is based on a particular distributional linguistic approach (the Pronominal Approach, see 2.2) not completely alien to the METAL representation, but not straightforwardly convertible either. And third, the PROTON databases take the form of Prolog clauses, whereas METAL uses Lisp lists.</Paragraph> <Paragraph position="6"> Besides the purely practical goal of fast lexicon extension, there are a few interesting questions to be asked that may be relevant beyond that goal: - Is such a conversion worth the effort of defining a &quot;waterproof&quot; mapping between the source and target formalisms, and of developing the programs to do the mapping? In other words, could we not simply have coded the several thousand verbs by hand instead of spending months on the conversion? - To what extent are these conversion experiments useful for an attempt at defining a theory-neutral standard for the representation of valency information in verb dictionaries for NLP applications? 
Or, less ambitiously, can we come up with a set of requirements for convertibility of lexical resources?</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. Verbal valency descriptions 2.1 General considerations ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 1182 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 </SectionTitle> <Paragraph position="0"> In linguistic terms, verbal valency can be characterized as the lexically controlled structural potential of a verb; in artificial intelligence terms, one would say that the verb has a frame structure with different role slots to be filled by constituents in the sentence. Since the verb is often the nucleus of information around which the different sentential elements are organized, it is important for an NLP system to contain this valency information. What then are the aspects of representation one has to take into account, in particular with an eye to NLP applications? The first problem to be solved is what falls within the scope of the verb's valency (i.e. the number and kind of valency-bound elements) and what falls outside of it (i.e.</Paragraph> <Paragraph position="1"> the free adjuncts of the sentence). An answer to this question leads to a quantitative classification of verbs as monovalent (only one valency element), bivalent (two) etc., and a qualitative classification of verbs as intransitive (subject, no object), transitive (subject and object), etc. Next, one faces the problem of the distinction between obligatory and optional valency-bound elements (a distinction that is of particular importance to a role assignment algorithm). And finally, one must name, categorize and subcategorize these elements, defining legal fillers for a certain slot. 
If a verb has several valencies (corresponding to different syntactic/semantic readings), an additional representational matter to be handled (at a higher level of lexicon organization) is the way to store the different valencies. Are patterns stored separately with a repetition of the verb for each pattern? Can patterns be collapsed and stored just once with the verb? Decisions on these matters influence the database organization and consultation for NLP applications. In the next two subsections, we will show how the two formalisms between which the conversion was made try to provide answers to the representation questions formulated here, in particular for large sets of French and Dutch verbs.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 PROTON 2.2.1 The PROTON project </SectionTitle> <Paragraph position="0"> The Proton (Prolog en taalonderzoek, Prolog and linguistic research) project started in 1986 with as one of its major objectives the construction of on-line valency dictionaries for French and Dutch. The starting point was not a particular NLP application, but rather a linguistic concern for descriptive correctness and completeness.</Paragraph> <Paragraph position="1"> Still, computational concerns were present right from the start, which led to the choice of Prolog as the declarative language for storing and processing the verbs (with processing ranging from simple retrieval of specific subsets of verbs to NLP applications in computer-aided language learning and parsing). Paper dictionaries, both general (Le Petit Robert for French, Van Dale Basiswoordenboek for Dutch) and valency dictionaries (Busse &amp; Dubost 1983 for French) were used as background material. For the actual coding of the verbs, a particular distributional framework formed the basis, viz. the Pronominal Approach 3. Although there are many interesting sides to this approach (e.g. 
the exact methodology followed to determine reading distinctions in verbs), we are mainly interested here in the actual output of the lexicographic work, both quantitatively and for representation issues. 3 See Blanche-Benveniste et al. 1984 or Eggermont et al. 1990 for full accounts of the Pronominal Approach.</Paragraph> <Paragraph position="2"> As far as numbers are concerned, the current status of the valency dictionaries of Dutch and French is the following. The Dutch valency dictionary contains about 4500 verbs; since each syntactic/semantic reading is coded separately, there are actually about 6300 valency patterns coded. For French the two figures are 4000 and 8500 4. (Note in passing that the frame/verb ratio is 1.3 for Dutch and 2.1 for French.) A rough estimate of the effort spent in doing this coding is 2 man-years for French, 1 man-year for Dutch. The difference is mainly due to the fact that French was the first language Proton started out with; by the time Dutch was handled, coding experience and coding tools were available.</Paragraph> <Paragraph position="3"> Proton database entries are Prolog facts, consisting of a three-place v predicate; the three arguments are an identification number, the verb's infinitive, and a list structure containing the information related to one valency realization. 
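As a rough illustration of the entry shape just described, the following Python sketch splits one such fact into its three arguments. The concrete fact, its identifier and the contents of its list are invented for illustration; only the three-place v predicate with an identification number, an infinitive and a list comes from the description above, and the regular expression is an assumption about the surface syntax.

```python
import re

# Assumed surface form: v(Id, Infinitive, [ ... ]).
# The fact below is invented; the real list structure is defined in
# De Braekeleer 1991 and is deliberately left unparsed here.
FACT = "v(1001, geven, [p0, p1, p2])."

# Hypothetical pattern: identifier, optionally quoted infinitive, raw list text.
FACT_RE = re.compile(r"^v\(\s*(\d+)\s*,\s*'?([^',]+?)'?\s*,\s*(\[.*\])\s*\)\.\s*$")


def parse_fact(text):
    """Return (identifier, infinitive, raw_list_text), or None if no match."""
    match = FACT_RE.match(text.strip())
    if match is None:
        return None
    ident, infinitive, raw_list = match.groups()
    return int(ident), infinitive, raw_list


print(parse_fact(FACT))
```

The list argument is kept as raw text on purpose, since its internal format is not specified here.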
Due to space limitations, we have to refer to De Braekeleer 1991 for a formal account of this list structure; for examples, we refer the reader to section 4. For clarity's sake, we informally give the meaning of important abbreviated notations: p0 relates to the notion of subject, p1 to that of direct object, p2 to indirect object, p3 to a specific prepositional object with de (related to French en), pprep to other prepositional objects, ploc/pmanner/ptemp/pqt to adverbial of location/manner/time/quantity respectively.</Paragraph> <Paragraph position="4"> In general, it can be said that Proton valency entries are dense in information, but on the other hand somewhat loosely structured. We will see below that NLP applications like Metal have a more rigid structure that is not so dense in information. For a conversion experiment this difference is both an advantage and a disadvantage: the advantage is that one can go from structures that contain more than one needs; the disadvantage is that the determination of what maps to what is not straightforward.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2.3 METAL </SectionTitle> <Paragraph position="0"> 2.3.1 The METAL system In contrast to Proton, Metal is a specific NLP application, viz. a machine translation system. Its German-English, English-German, German-Spanish, Dutch-French and French-Dutch systems are commercially available; French-English, German-Danish, English-Spanish, Spanish-English and Russian-German are under development. Full descriptions of the system can be found elsewhere 5. A brief account of 4 In the course of 1991 the French valency database will be commercially available in electronic format (Eggermont et al. 
forthcoming).</Paragraph> <Paragraph position="1"> 5 See Bennett &amp; Slocum 1988, Thurmair 1989, Adriaens &amp; Caeyers 1990 for general overviews; a general description of the lexicon format can be found in Adriaens &amp; Lemmens 1990.</Paragraph> <Paragraph position="2"> valency in Metal can be found in Gebruers 1988; an in-depth study of valency and machine translation bringing together work in the Proton and Metal projects is the topic of Gebruers 1991. Here, we will just give a general idea of the place of valency information in the Metal system and of how this information is used in the translation process. Valency patterns are stored as a feature-value pair on verbs in the monolingual dictionaries, in such a way that all patterns are coded only once with the verb; reading distinctions can give rise to different valency patterns, but even then they are all stored together with the verb. During analysis by an augmented context-free grammar (handled by a chart parser), rules at sentence level call a procedure for role assignment to the constituents of the sentence. This process is an intricate combination of general pattern matching algorithms and linguistically defined procedures (triggered by the valency information) for determining the best fitting valency pattern. In fact, the role assignment process can be said to consist of a grammar within the grammar, and a parser within the parser; it takes up a substantial proportion of the total time spent on sentence analysis. During transfer, valency information is again used (in the transfer dictionary) to disambiguate among different verb readings. For mapping into the target language, there are two approaches within Metal that have implications for the amount of valency-related information in the transfer dictionary. 
One approach tries to build a minimal hypothetical target language frame on the basis of the source role assignments and some crucial mapping information (e.g. for to like -> plaire, the subject is mapped into an indirect object, and the direct object becomes the subject: I like you -> Tu me plais). It then searches the monolingual target dictionary for a valency pattern that fits best with its hypothesis. The other approach tries to build the target frame without using the target dictionary at all: on the basis of the source role assignments and mapping information in the transfer dictionary, it builds the valency information for the target (see Gebruers 1991 for a detailed comparison of these approaches, with their advantages and disadvantages). In short, valency plays an important role in all phases of the translation process 6, involving complicated grammar and coding work. We conclude this brief sketch of valency in Metal by adding some figures on the size of the monolingual dictionaries. At the time of the conversion (March 1990), Metal contained 1600 Dutch verbs with 2050 valency patterns (a frame/verb ratio of 1.3) and 1055 French verbs with 1600 patterns (a frame/verb ratio of 1.5). Let us add right away that partly thanks to the conversion effort we were able to increase these figures drastically in a short period of time (see section 4). Currently, there are 3000 Dutch verbs with 3700 valency patterns (frame/verb ratio = 1.2) and 2130 French verbs with 2850 valency patterns (frame/verb ratio = 1.3). In general, all other monolingual dictionaries of the commercially available systems (i.e. 
English, Spanish and German) also contain over 2000 verbs (2500, 2300 and 4000 respectively).</Paragraph> <Paragraph position="3"> 6 See Gebruers 1991, 206-221 for an overview of valency treatment in other MT systems (TAUM, SUSY, GETA-ARIANE, VAPRE, EUROTRA).</Paragraph> <Paragraph position="4"> In METAL, valency is coded as one of the feature-value pairs on the lexicon entries for verbs (along with other information about morphology, syntax and semantics).</Paragraph> <Paragraph position="5"> Since the system is written in Lisp, its elements show the typical Lisp list structure. As for Proton, we have to refer to De Braekeleer 1991 for a full formal account of the METAL valency format; examples can be found in section 4. The meaning of some important abbreviations is the following: $SUBJ stands for subject, $DOBJ for direct object, $IOBJ for indirect object, $ADV for adverbial complement, $POBJ for prepositional object, $SCOMP for subject complement, and $OCOMP for object complement. N1, NO, IMPS and ADJ indicate nominal, sentential, impersonal and adjectival subcategorizations respectively. Adverbial complements are further divided into LOC(ative), MAN(ner), MOV(ement), R(a)NG(e), T(e)MP(oral) and MEA(sure).</Paragraph> <Paragraph position="6"> Further subcategorization information is rendered as feature-value pairs, e.g. (TYPE P1) roughly corresponds to +human role fillers. Metal further uses the &quot;OPT&quot; atom in its valency patterns to indicate the optional valency-bound elements. Obligatory elements come first; those following the &quot;OPT&quot; atom are optional. Finally, the valency pattern contains General Frame Tests (after the &quot;GFT&quot; atom). These tests are executed before the role assigning mechanism tries to find fillers; they concern features that, if present at the clause level, should have specific values: the auxiliary (values are H/Z, hebben/zijn for Dutch; A/E, avoir/être for French) and the sentence's voice (VC; A/P, active/passive). 
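The pattern layout just described can be illustrated with a minimal Python sketch. It is not the Metal implementation (which is written in Lisp): the flat list encoding, the feature name AUX, and the function names split_pattern and passes_frame_tests are assumptions made for illustration only. The sketch shows how obligatory elements, optional elements after the OPT atom, and General Frame Tests after the GFT atom could be separated, and how a pattern can be rejected on clause-level features alone.

```python
# Hypothetical flat encoding of a Metal-style valency pattern. The real Lisp
# format is documented in De Braekeleer 1991 and is not reproduced here.
PATTERN = ["$SUBJ", "$DOBJ", "OPT", "$IOBJ", "GFT", ("AUX", "H"), ("VC", "A")]


def split_pattern(pattern):
    """Split a pattern into obligatory roles, optional roles and frame tests."""
    obligatory, optional, tests = [], [], {}
    target = obligatory
    in_tests = False
    for item in pattern:
        if item == "OPT":
            target = optional          # everything after OPT is optional
        elif item == "GFT":
            in_tests = True            # everything after GFT is a frame test
        elif in_tests:
            feature, value = item
            tests[feature] = value
        else:
            target.append(item)
    return obligatory, optional, tests


def passes_frame_tests(tests, clause):
    """A test fails only if the clause carries the feature with another value."""
    return all(clause.get(feature, value) == value
               for feature, value in tests.items())


obligatory, optional, tests = split_pattern(PATTERN)
print(obligatory, optional, tests)
print(passes_frame_tests(tests, {"VC": "P"}))
```

With this encoding, a clause marked VC = P fails the frame tests of the pattern above, while a clause that carries no voice feature passes them, so the role fillers need to be checked only in the second case.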
It is interesting to note how in an application like Metal this kind of information (also present in the Proton descriptions) receives a special status with an eye to an efficient role assignment algorithm: if a valency pattern can be found not to apply because some restriction at the clause level is not satisfied, the pattern is discarded and no computation is wasted on checking the potential role fillers.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. Mapping PROTON to METAL </SectionTitle> <Paragraph position="0"> It was already noted in 2.2.2 that the different origin of the two formalisms accounts for certain differences between them. Proton codes in an application-neutral fashion, exhaustively (aiming at descriptive adequacy), on a one-entry one-pattern basis, and in a relatively free format. Metal codes with an eye to a specific application (MT), pragmatically (what do we need for the application to run?), on a one-entry all-patterns basis (even collapsing some patterns in a superframe), and in a relatively rigid format easily digestible by software and lingware. Since the goal of the conversion was to derive the information needed in Metal, a first step was to link all the Metal specifications to the corresponding Proton ones. Given the detailed nature of the Proton valency schemes, there were very few gaps in this mapping. One is worth mentioning, though. Proton does not go as far as Metal in the subcategorization of the adverbial complements (Metal's $ADVs); range and movement complements are not treated in a consistent way. Below, we show part of the resulting mapping table (not all subcategorization details are shown; see De Braekeleer 1991, 61-62). It organizes the valency information from the Metal point of view: the relevant items are optionality, naming of roles, categorization, subcategorization and general frame tests.</Paragraph> <Paragraph position="1"> 4. Aspects of the conversion software Ideally, the conversion should be a fully automatic process that takes the Proton database as input and delivers a Metal monolingual verb lexicon. Given that the Proton database also contains a field with several translations for each verb reading, we could even envisage creating transfer entries for the verbs as well. Yet, there are several reasons why we could only achieve a semi-automatic conversion. As to the automatic generation of transfer entries, this idea had to be abandoned altogether, because it was too hard to pinpoint the distinctive information among the different patterns and translate that into contextual tests and actions in the Metal transfer dictionaries. Still, the translation field was preserved in the conversion output, so that lexicographers coding the transfer entries already find the translations on-line. As to the fully automatic generation of a monolingual lexicon, several problems could not be overcome. First, we already noted in the previous section that not all information needed for Metal was present in the Proton database; this implies that manual checks for completeness of the frames had to be made in any case.</Paragraph> <Paragraph position="2"> Second (and most important), we could find no satisfactory algorithmic solution to the problem of mapping the one-entry one-valency-pattern organization of Proton into the one-entry all-patterns organization of Metal. Note that this is not a simple matter of collecting all the separately coded valency patterns for the same verb, and storing them once as a long list with the verb in the target database. For one thing, Metal does not need all possible valency patterns for its purpose of machine translation; the number of patterns is kept as small as possible for efficient storage and computation reasons. 
Moreover, the patterns that remain are merged into &quot;superpatterns&quot; or &quot;superframes&quot; as much as possible; where relevant for translation, the transfer dictionaries take them apart again. The way Metal lexicographers decide on distinguishing valency patterns (verb readings) monolingually proved hard to translate into a foolproof algorithm; there are at the most some intuitions, heuristics or rules of thumb. Hence, it was decided to convert on a per pattern basis, and leave the merging of patterns to the human lexicographer.</Paragraph> <Paragraph position="3"> The conversion software itself is written in Common Lisp (about 1000 lines of code). It works in two phases.</Paragraph> <Paragraph position="4"> First, the Proton Prolog clauses pass through a finite-state transducer interpreting them as plain character strings. The output of this pass is &quot;lispified Prolog&quot;: Prolog clauses are turned into Lisp lists. At the same time, the necessary conversions at the character level are taken care of: characters that would have a special meaning to the &quot;Lisp reader&quot; software (such as a comma or a backquote) are neutralized, and the extended ASCII character sequences for accented characters are turned into Metal's ISO-8859-1 characters. The second pass parses the lists and converts them into structures whose most important field is the Metal frame. Additional software takes care of putting the Metal frames in their canonical order (i.e. a subject is coded before an object, etc.), and provides tools for lexicographers to manipulate the conversion output. As an illustration, we give one simple example of what the input and the output of the</Paragraph> </Section> </Paper>