<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1099"> <Title>A TOOL FOR COLLECTING DOMAIN DEPENDENT SORTAL CONSTRAINTS FROM CORPORA</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> A TOOL FOR COLLECTING DOMAIN DEPENDENT SORTAL CONSTRAINTS FROM CORPORA </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> SRI International, Menlo Park, CA *CAP GEMINI Innovation, Boulogne Billancourt, France </SectionTitle> <Paragraph position="0"> Internet: andry@capsogeti.fi: Topical paper: Tools for NL Understanding (Portability).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 ABSTRACT </SectionTitle> <Paragraph position="0"> In this paper, we describe a tool designed to generate semi-automatically the sortal constraints specific to a domain to be used in a natural language (NL) understanding system. This tool is evaluated using the SRI Gemini NL understanding system in the ATIS domain.</Paragraph> <Paragraph position="1"> of work we put into the first domain application 1.</Paragraph> <Paragraph position="2"> In this paper, we describe the results of using this semi-automatic tool to port the Gemini NL system to the ATIS domain, a domain that Gemini had already been ported to, and for which it had achieved high performance and grammatical coverage using hand-written sortal constraints. Choosing a known domain, rather than a new one, allowed us to compare the performance of the derived sorts to the hand-written ones, holding the domain, grammar, and lexicon constant. 
It also allowed us to evaluate the semi-automatically obtained coverage using the evaluation tools provided for the ATIS corpus.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 INTRODUCTION </SectionTitle> <Paragraph position="0"> The construction of a knowledge base related to a specific domain for a NL understanding system is time-consuming. In the Gemini system, the domain-specific knowledge base includes a sort hierarchy and a set of sort rules that provide (largely domain-specific) selectional restrictions for every predicate invoked by the lexicon and the grammar. The selectional restrictions provide a source of constraints over and above syntactic constraints for choosing the correct analysis of a sentence. The sort rules are generally entered by a linguist, by hand, from the study of a corpus and while tuning the grammar.</Paragraph> <Paragraph position="1"> However, the use of an interactive tool that can help the linguist to acquire this knowledge from a corpus [a][5] can drastically reduce the time dedicated to this task, and also improve the quality of the knowledge base in terms of both accuracy and completeness. The reduction in the amount of effort to develop the knowledge base becomes obvious when porting an existing system to a new domain. At SRI, our main concern was to port Gemini, our NL understanding system, to other domains without investing the same amount</Paragraph> </Section> <Section position="5" start_page="0" end_page="598" type="metho"> <SectionTitle> 3 PARSING WITH SORTS </SectionTitle> <Paragraph position="0"> Gemini [2] implements a clear separation between syntactic and semantic information. Each syntactic node invokes a set of semantic rules which result in the building of a set of logical forms for that node. 
Selectional restrictions are enforced on the logical forms through the sorts mechanism: all predications in a candidate logical form must be licensed by some sorts rule. The sorts are located in a conceptual hierarchy of approximately 200 concepts and are implemented as Prolog terms such that more general sorts subsume more specific sorts [6]. Failure to match any available sorts rule can thus be implemented as unification failure.</Paragraph> <Paragraph position="1"> The Gemini parser creates logical form expressions like the following one:</Paragraph> <Paragraph position="3"> In these logical form expressions, every sub-expression is assigned a sort, represented as the [Footnote 1: The actual domain is Air Transportation (ATIS), used as a benchmark in the ARPA community.]</Paragraph> <Paragraph position="4"> right-hand-side of a ';' operator [1]. Sorts rules for predicates are declared with sor/2 clauses: sor('BOSTON', [city]). sor(to, ([[flight], [city]], [prop])). The above declarations license the use of 'BOSTON' as a zero-ary predicate with &quot;resulting&quot; sort [city] and 'to' as a two-place predicate relating flights and cities with resulting sort [prop] (or proposition).</Paragraph> <Paragraph position="5"> In the ATIS application domain, for example, the subject (or actor) of the verb depart, as in 'the morning flights departing for denver', can be a flight. For this, we use the following set of sort definitions: sor(depart, ([[departure]], [prop])) sor(flight, ([[flight]], [prop])) sor(actor, ([[departure], [flight]], [prop])) The first two definitions make depart and flight predicates compatible with departure and flight events respectively, returning a proposition; the third makes actor a relation that can hold between departures and flights, also returning a proposition. 
A simple example of a logical form licensed by these rules follows (with the result sort [prop] suppressed): qterm( ....... ( ( X ; [flight]), [and, [flight, (X; [flight])], exists( (Y; [flight]), [and, [depart, (Y; [departure])], (Y; [departure]), [actor, (Y; [departure]), (X; [flight])]])]) which would be roughly the logical form for 'a departing flight'.</Paragraph> </Section> <Section position="6" start_page="598" end_page="600" type="metho"> <SectionTitle> 4 SORT ACQUISITION </SectionTitle> <Paragraph position="0"> The approach we have taken is to start from an initial &quot;schematic&quot; sorts file we call the signature file (explained below), which essentially allows all predicate argument combinations. We then harvest a set of preliminary sort rules by parsing a large corpus. The logical forms that induce these preliminary rules come from parses that essentially incorporate only syntactic constraints. The resulting sorts rules are filtered by hand and the process is iterated with an increasingly accurate sorts file, converging rapidly on the sorts file specific to the application domain (fig. 1).</Paragraph> <Section position="1" start_page="598" end_page="598" type="sub_section"> <SectionTitle> 4.1 Signature and Restrictions </SectionTitle> <Paragraph position="0"> If we started the above iteration process with no sortal information, then the logical forms resulting</Paragraph> <Paragraph position="2"> from a parse would contain no sortal information, and only vacuous sortal rules would be harvested.</Paragraph> <Paragraph position="3"> The first step is thus to build an initial sort file we call the signature file. The idea is to assign lexical predicates inherent sorts, but not to assign any rules which constrain which lexical items can combine with which. The signature file, then, is not just domain-independent. 
It has no information at all about semantic combinatorial</Paragraph> <Paragraph position="4"> possibilities, not even those determined by the language (for example, that the verb break does not allow propositional subjects). The reason for this is so that it can be generated largely automatically from the lexicon.</Paragraph> </Section> <Section position="2" start_page="598" end_page="599" type="sub_section"> <SectionTitle> 4.2 The Signature </SectionTitle> <Paragraph position="0"> Let's begin with certain inherently relational predicates, for which the signature file gives only an arity and the result sort. For example, the signature for the predicates at (corresponding to the preposition) and actor (corresponding to logical subject) would be the same: signature(at, ([X, Y], [prop])) signature(actor, ([X, Y], [prop])) This signature is used as the sort rule for at and actor in the sorts tool's first iteration. The effect is to limit the choice of sorts rules for these predicates to rules which are further instantiations of their signatures, that is, to rules licensing them to take two arguments of any sort to make a proposition. 
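The notion of a rule being a "further instantiation" of a signature can be illustrated with a minimal sketch. It is written in Python rather than Gemini's Prolog, and the encoding is an assumption made only for illustration: sorts are tuples, signature variables are capitalised strings, and the names instantiates, licensed_by_signature and SIGNATURE_AT are hypothetical. The sort hierarchy (subsumption between sorts) is deliberately ignored here.

```python
# Hypothetical encoding, not Gemini's: a sort is a tuple such as ("flight",),
# a signature variable is a capitalised string such as "X".

def is_var(term):
    """A term counts as a variable if it is a capitalised string."""
    return isinstance(term, str) and term[:1].isupper()

def instantiates(rule_args, sig_args):
    """Return True if rule_args is a further instantiation of sig_args."""
    if len(rule_args) != len(sig_args):
        return False
    bindings = {}
    for rule_term, sig_term in zip(rule_args, sig_args):
        if is_var(sig_term):
            # A variable accepts any sort, but must be bound consistently.
            if bindings.setdefault(sig_term, rule_term) != rule_term:
                return False
        elif rule_term != sig_term:
            return False
    return True

# signature(at, ([X, Y], [prop])) in the assumed encoding.
SIGNATURE_AT = (["X", "Y"], ("prop",))

def licensed_by_signature(rule, signature=SIGNATURE_AT):
    """Check argument list and result sort of a rule against a signature."""
    (rule_args, rule_result), (sig_args, sig_result) = rule, signature
    return rule_result == sig_result and instantiates(rule_args, sig_args)
```

Under these assumptions, a rule relating an airport and a city with result sort prop is accepted for at, while a one-place rule or a rule with a different result sort is rejected. In Prolog the same check falls out of ordinary unification, which is presumably why no separate matching procedure is described. 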
The object in successive iterations will be to assign these relational predicates substantive sortal constraints, thus constraining head modifier relations and the parse possibilities.</Paragraph> <Paragraph position="1"> Verbs, nouns, some adjectives and adverbs, on the other hand, have signatures with fully or partially instantiated arguments. For example, in the ATIS domain, the verbs depart, get_in, and the nouns data, flight have the signatures: signature(depart, ([[departure]], [prop])) signature(get_in, ([[arrival]], [prop])) signature(data, ([[information]], [prop])) signature(flight, ([[flight]], [prop])) These declarations have no effect on the combinatorial possibilities of these words (they tell us nothing about what can be the subject of the verb depart or what verbs the noun flight can be subject of), but when a logical form is built up from a syntactically licensed parse (like the one given above for a departing flight), these sortal declarations will &quot;fill in&quot; the sorts for the connecting predicate actor, generating the sort rule: signature(actor, ([[departure], [flight]], [prop])) Thus in the signature file, lexical predicates have their own &quot;inherent&quot; sort rules, which then help build up the sort rules for the relational predicates. The inherent sort rules for adjectives like cheap and late will constrain only their first argument. The reason for this is that it is this first argument that modifiers (such as intensifying adverbs and specifiers) will hook on to.</Paragraph> <Paragraph position="2"> signature(cheap, ([[cost_soa], A, B], [prop])) signature(late, ([[temporal_stage], A, B], [prop])) At the same time the argument position filled in by what the adjectives modify is left unconstrained. The signature file thus makes no commitment about what sorts of things can be late or cheap; it just needs to say there is such a thing as lateness and cheapness. 
This is why for a new domain the signature file can be generated largely automatically, using a new inherent sort for each new lexical item, assigning the type of predicate appropriate to its grammatical category.</Paragraph> <Paragraph position="3"> All zero-arity predicates (names) need to have inherent sorts. Certain general 'tool words', which include numbers, dates, times, and common words, will receive the same signatures in any domain.</Paragraph> <Paragraph position="5"> In addition to this, however, there is a whole list of words specific to the domain which need to be inherently sorted. This part of creating a signature file will need to be done by hand: signature('NASHVILLE', ([city])) signature('AIR_CANADA', ([airline])) signature('LA_GUARDIA', ([airport]))</Paragraph> </Section> <Section position="3" start_page="599" end_page="600" type="sub_section"> <SectionTitle> 4.3 Extracting the Sorts </SectionTitle> <Paragraph position="0"> We now give a more detailed example of how sort rules are extracted from logical forms (LFs) built by the parser. For 'the morning flights flying to denver', we obtain roughly the following Logical</Paragraph> <Paragraph position="2"> The extraction process consists of a recursive exploration of the logical form and retrieval of each predicate and its arguments. For example, from the LFs above, our tool would extract the following set of sort definitions: sor(flight, [[flight]], [prop]) sor(morning, [[day_part]], [prop]) sor(n_n_rel, [([[day_part]], [prop]), [flight]], [prop]) sor(fly, [[flight]], [prop]) sor(actor, [[flight], [flight]], [prop]) sor(to, [[flight], [city]], [prop]) sor(frag_np, [[flight]], [prop]) sor(np_frag, [[prop]], [prop]) When constrained only by signatures, the parser typically finds a large number of logical forms. 
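The recursive exploration just described can be sketched as follows. This is a minimal illustration in Python, not the tool's Prolog implementation, and the logical form encoding is an assumption: a predication is a list [name, arg, ...], a sorted term is a (term, sort) pair, and the function name extract_sort_rules and the example data are hypothetical. Result sorts are suppressed, as in the earlier example.

```python
from collections import Counter

def extract_sort_rules(lf, rules=None):
    """Walk a logical form and count (predicate, argument sorts) pairs."""
    rules = Counter() if rules is None else rules
    if isinstance(lf, list) and lf and isinstance(lf[0], str):
        name, args = lf[0], lf[1:]
        # Only predications whose arguments all carry a sort yield a rule;
        # connectives such as "and" take nested predications instead.
        if args and all(isinstance(a, tuple) for a in args):
            rules[(name, tuple(sort for _term, sort in args))] += 1
        for arg in args:
            extract_sort_rules(arg, rules)
    elif isinstance(lf, tuple):
        # A (term, sort) pair: the term itself may be a nested predication.
        extract_sort_rules(lf[0], rules)
    return rules

# Assumed toy logical form for "flights to denver".
EXAMPLE_LF = ["and",
              ["flight", ("X", "flight")],
              ["to", ("X", "flight"), ("denver", "city")]]

EXAMPLE_RULES = extract_sort_rules(EXAMPLE_LF)
```

Under these assumptions the walk yields one rule for flight over a flight argument and one for to over a flight and a city, each with an occurrence count that could feed the statistics computed by the tool. 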
The sorts tool provides the option of harvesting sort rules in one of two ways, either from all generated logical forms, or only from the Preferred Logical Form (PLF). The parse preference component implemented in Gemini chooses the best interpretation from the chart, based on syntactic heuristics [2], and provides a set of PLFs. In addition to the extraction of the sort rules, we also calculate the occurrence $\Theta_i$ of each sort rule for all the sentences of the corpus. We then normalize $\Theta_i$ by the number of logical forms that include the sort rule ($N_i$). Each value $\Theta_i$ is stored along with its sort rule and used to calculate the probabilities related to the sort rule: $P(R_i) = \Theta_i / \sum_{j} \Theta_j$. In fact three sets of probabilities are calculated for each rule R: (1) Global probability of sort rule R: the number of invocations of rule R normalized by the number of LFs containing R and divided by the total number of rule invocations in the corpus; (2) Conditional probability of rule R given a particular predicate; (3) Conditional probability of R given the predicate in R and an argument of the same sort as the first argument of R. Also, associated with each sort definition, we keep the list of the indexes of a small set of sentences which contain the corresponding sort definition in their logical form. This set is used as a sample for the sort editor tool.</Paragraph> </Section> <Section position="4" start_page="600" end_page="600" type="sub_section"> <SectionTitle> 4.4 The Argument Restrictions </SectionTitle> <Paragraph position="0"> The argument restrictions are instantiated versions of the signatures for each predicate. 
For example, after parsing and extraction from the logical forms, the arguments X and Y of the signature associated with the preposition at will help to generate a list of several sort definitions such as: sor(at, ([[airport], [city]], [prop])) as in 'the airport at Dallas', sor(at, ([[domain_event], [time_point]], [prop])) as in 'departure at 9pm'.</Paragraph> </Section> </Section> <Section position="7" start_page="600" end_page="601" type="metho"> <SectionTitle> 5 SORT EDITING </SectionTitle> <Paragraph position="0"> At each step of the process, after parsing, the linguist, using the interactive sort editor, can examine the new sort file which has been generated and choose which sortal definitions need to be eliminated. Statistical information associated with each sort definition helps him decide which ones are relevant or not. We have also included the possibility of adding a sort definition, although this kind of operation should be very rare. In fact the main activity of the linguist using the sort editor tool will be to filter the sort definitions generated by the parsing of the corpus.</Paragraph> <Section position="1" start_page="600" end_page="601" type="sub_section"> <SectionTitle> 5.1 Description of the tool </SectionTitle> <Paragraph position="0"> The sort editor tool is an interactive, window-based program. It has a main window for displaying and editing the sorts and a set of buttons that help the user to either display additional information or perform actions such as: * load or save a sort file, * select a functor among the list 
of all functors and display the list of its possible arguments, result and probabilities, * deletion and insertion of a sort definition, * display a sample of sentences associated with a specific sort definition, * mapping between the sort definitions and a reference sort file (for evaluation), * changing the way the sort definitions are displayed (result or not, mapping or not, global probability, conditional to a functor, or relative to the first argument of a definition), * use of a threshold on the probabilities to filter the sort definitions,</Paragraph> <Paragraph position="2"> * display the sentences associated with a sort definition, * display the list of predicates which have been excluded from the extraction, * specification of a sortal hierarchy to be used with the sort definitions for the next iteration, * use of a whiteboard to save specific sentences and information during a session.</Paragraph> <Paragraph position="3"> The tool uses ProXT, the Quintus Prolog interface to the MOTIF widget set and the X-Toolkit.</Paragraph> </Section> </Section> </Paper>