<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-4186">
  <Title>EBL2: AN APPROACH TO AUTOMATIC LEXICAL ACQUISITION</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The elements of the scheme
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The Core Language Engine, CLE
</SectionTitle>
      <Paragraph position="0"> The Core Language Engine is a general purpose natural language processing system for English developed by SRI Cambridge. It is intended to be used as a building block in a broad range of applications, e.g. data-b~.se query systems, machine translation systems, text-to-speecb/speech-to-text systems, etc.</Paragraph>
      <Paragraph position="1"> The object of the CLE is to map certain natural language expressions into appropriate predicates in logical form (or Quasi-Logical Form \[Alshawi ,(.: van Eijck 89\]). The system is based completely on tmi lication and facilitates a reversible phrase-structure type grammar.</Paragraph>
      <Paragraph position="2"> The Swedish Institute of Computer ,'qci(m(e has with support from 8RI generalized the fi'anwwork and developed all equivahmt system for Swedish (the S-CLE, \[Gamback &amp; Rayner 92\]). The two copies of the CLE have been used together to form a machine translation system \[Alshawi et a191\]. The S-('LE has a fairly large gramnmr covering most of the common constructions in Swedish. There is a good treatment of inflectional morphology, covering all main inflectional closes of nouns, verbs and adjectives.</Paragraph>
      <Paragraph position="3"> The wide range of l)ossihle applications have put severe restrictions on the type of lexicon that can be used. The S-CLE h~ a function-word lexico~J containing about 400 words, including most Swedish pronouns, conjllnctlous, prepositions, determiners, particles and &amp;quot;special&amp;quot; verbs. In addition, there is a &amp;quot;core&amp;quot; content-word lexicon (with common nouns, verbs and adjectives) and domain specitic h'xica.</Paragraph>
      <Paragraph position="4"> This part of tbe system is still under development and all these content-word lexica together haw, about 750 entries.</Paragraph>
      <Paragraph position="5"> The lexical entries contain information about il~flectional morphology, syntactic and semantic subcategorization, anti sortal (selectional) restrictions. Information abont the linguistic properties of an entry is represented by complex categories that include a principal category symbol and specifications of constraints on the values of syntactic/semantic features. Such categories also appear in the C.LF,'s grammar and matching and merging of the information encoded in them is carried out by unification during parsing. Two categories can be unified if the constraints on their feature values are compatible In the actual &amp;quot;core&amp;quot; and domain Icxica, this information is kept implicit and represented as pointers to entries in a &amp;quot;paradigm&amp;quot; lexicon with a number of words representing basic word usages and inflections.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Vocabulary EXpander, VEX
</SectionTitle>
      <Paragraph position="0"> In the English CLE, new lexicon entries can be added by tile users with a tool developed for the purpose.</Paragraph>
      <Paragraph position="1"> q'his lexicon acquisition tool, the Vocabulary EXpander, is fully described in \[Carter 89\]. In parallel with the development of the S-CLE, a Swedish version of the VEX system was designed \[Gamback 92\].</Paragraph>
      <Paragraph position="2"> VEX allows for the creation of lexical entries by users with knowledge both of a natural language and of a Sl)ecilic application domain, but not of linguistic theory or of tile way lexical entries are represented in the CLE. It presents examl)le sentences to the user and asks lor information on tile grammaticality of the sentences, and for selcctional restrictions on arguments of predicates VEX adopts a copy and edit strategy in colmtrnctiug Icxical entries. It builds on the &amp;quot;paradigm&amp;quot; lexicon and sentence patterns, that is, declarative knowledge of the range of sentential contexts ill which the word usages in that lexicon Call OCCUI'.</Paragraph>
      <Paragraph position="3"> In the present work we want to investigate to what extent snch creation of lexicon entries can be performed with a minimum of user interaction, lnstead of presenting exaruple sentences to the user we are allowing the program to use a very large text where hopefully unknown words will occur in several ditlbrenl sentence patterns. This strategy will he filrther described i~, the following sections.</Paragraph>
      <Paragraph position="4"> First, however, we will define what we mean by the notion of (subcategorization) &amp;quot;paradigm&amp;quot;. Tile definition we adopt here is based on the one used in \[Carter 89\], namely that Definition 1 a paradigm zs any minimal non.empty intersection of Icxical entries. Every category in a pa,'adlgm will occur in czaclly the same set of entries in the lexicon as every other category Of auy) in that paradigm.</Paragraph>
      <Paragraph position="5"> Every ent,y consists of a dis3o2ul union of paradigms.</Paragraph>
      <Paragraph position="6"> lh're, we assume that a lexicon can be described in terms of (a small set of) sucb paradigms, relying on ttle fact. that the open-class words exhibit at least approximate regularities)</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 The Lexicon Learning system, L2
</SectionTitle>
      <Paragraph position="0"> Previous experiments in automatic lexical acquistilion at. S1CS (L ~ - Lexicon Learning) used a set of 1 The system does not attempt to cope with c|oaed-categc)ry words. '\['hey have to be entered into a apecific function-word lexicon by a skilled linguist.</Paragraph>
      <Paragraph position="1"> ACTES DE COLING-92, NANTES, 23-28 AO~r 1992 1 1 7 3 I'gOC. OF COLING-92, NAN'rES, AUG. 23-28, 1992 sentences and a formal grammar to infer the lexical categorit.'-s of the words in the sentences. The original idea wa.q to start with an empty lexicon, assuming that the grammar would place restrictions on the words in the sentences sufficient to determine an assignment of lexical categories to them \[Rayner el al 88\]. This can I)e viewed as solving a set of equations where the words are variables that are Io be assigned lexical categories and the constraints that all sentences parse with respect to the grammar are the equations.</Paragraph>
      <Paragraph position="2"> Unfortunately, it proved almost impossit,le to parse sentenees containing several nnknown words.</Paragraph>
      <Paragraph position="3"> For this reason the scheme was revised in several ways \[tlgrmander 88\]; instead of starting with an eu/pty lexicon, the starting point bccanw, a lexicon coutaining clnsed-cl;kss words snell ;L~ l)FOllOIlnS~ prepositions and determiners. The system would then at each stage only process sentences that coil rained exactly one unknown word, the hop,, I)eing that tlie words learned from these sentences would reduce the number of unknown words in the other ones. In addition to this, a rnorphologicat component w~s included to guide the assignments. Although the project proved the femsibility of the scheme, it also revealed some of its inherent problems, especially the need for fa.ster parsing methods.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2.4 Explanation-based learning, EBL
</SectionTitle>
    <Paragraph position="0"> A problem with all natural language grammars is that they allow a vemt number of possible constructions that very rarely, if ever, occur in real sentences. The application of explanation-based learning ~ (EBL) to natural language processing allows us to reduce tim set of possible analyses and provides a solution to the parsing inefficiency problem mentioned above (Subsection 2.3).</Paragraph>
    <Paragraph position="1"> The original idea \[Rayner 88\] was t.o bypass llOl'lna\] processing and instead use a set of learlled rules that perh)rmed the t.~qks of the normal parsing component, l:ly indexing the learned rules eflicieutly, analysing an input sentence using the learned rules is w~ry much faster than normal processing \[Samuelsson &amp; Rayner 9t\]. The learned rules can be viewed as templates for grammatically correct phrases which are extracted from the. granmmr and a set of training sentences using explanatiou-bmqed learning, llere, we assume the following definition: Definition 2 a ten'tplate ts a generalization constrvcted from lhe parse tree for a successfidly processed phrase, .,1 template is a tree spanning the parse with a mother category as root and a collection of its ancestor nodes 2t~xplanation-lmsed learning is n machine learning tech-Illqlle closely related to tllaCro-operator learllil|g, chtlllkillg, and parliM evaluation and is described in e.g.. \[I)e.long &amp; Mooney 8~';, Mitchell et at 86\].</Paragraph>
    <Paragraph position="2"> (at arbitrary, but pre-defined, deep levels of nesting) as I~a~les.</Paragraph>
    <Paragraph position="3"> The fact that the templates are derived from the original gramlnar guarantees that they represent correct phrlLses and the fact that they are extracted from real senteuces ensnres Ihat they represent constructions that actually occur.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Explanation-based lexical learning, EBL2
</SectionTitle>
    <Paragraph position="0"> The basic algorithm goes as follows:  1. Using a large corpus from the domain, extract templates from the sentences containing no unknown words.</Paragraph>
    <Paragraph position="1"> 2. Analyse the remaining sentences (the ones contaiuing unknown words) using the templates, while maintaining an interim lexicon for the unknown words.</Paragraph>
    <Paragraph position="2"> 3. Compare the restrictions placed on the unknown  words by the analyses obtained with other hand-coded phrase templates specific for the paradigms m the lexicon d. (2reate &amp;quot;rear' lexical entries from the mformati&lt;m m the intcrhn lcxicon when a full set of such templates \[covering a paradigm) has been found.</Paragraph>
    <Paragraph position="3"> In the following subsections, we will address these issues in turn.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Extracting templates from a domain-specific corpus
</SectionTitle>
      <Paragraph position="0"> A typical situation where we think that this method is well suited is when a general purpose NL system with a core lexicon (such as the S-CLE) is to be customized to a specific application domain. The vocabulary used in the domain will include e.g. technical terms that are not present in the core lexicon. Also, the use of the words in the core lexicon may differ between domains. In addition to this, some types of grammatical constructs may be more common in one domain than in another. We will try to &amp;quot;get the flavour of the language&amp;quot; in a particular application environment from domain-specific texts.</Paragraph>
      <Paragraph position="1"> The corpus is divided into two parts: one with seatellces containing ilnknown words, all(\] another where all the words are known, The latter group is used to extract plmme templates that capture tile grammatical constructions occurring in tile domain. rFhe process of extracting phrase templates from training sentences is outlined in Subsection 2.4.</Paragraph>
      <Paragraph position="2"> AcrEs nl~ COTING-92, NAt,rl~s, 23-28 Ao(rr 1992 1 ! 7 4 PRec. OF COLING-92, NAmV:s, AUG. 23-28, 1992</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Analysing the remaining sentences
</SectionTitle>
      <Paragraph position="0"> Assuming that a particular set of phrase templates is applicable to a sentence containing an unknown word will associate a set of constraints with the word. Naturally, the constraints on the known words of the sentence should be satisfied if this template is to be considered.3 This will correspond to a particular parse or analysis of the sentence. Thus a set of constraints is associated with each different parse. A number of entries in the prototype lexicon will match the set of constraints associated with a sentence. Each prototype is an incarnation of a paradigm. Thus we can associate a word with a set of paradigms. (Note that the paradigms may be non-exclusive.) All such associations (corresponding to different parses of the same sentence) are collected, and used to update the interim lexicon.</Paragraph>
      <Paragraph position="1"> '\['h(! IllOSt obvious conslraiuts colnt! frolll syllt{ic tie considerations. If, in Ihe sentence John loves a ca( the word loves were unknown, while the other words did indeed have the obvious lexicai entries, the gram mar will require loves to be a transitive verb of third person singular agreement. Since the prototYl)eS of verbs are iu tl,e imperative form, we nmst associate a finite verb form with the imperatiw~, This is done by applying a omrphologieal rule that strips the '-s' from the word loves, reinforcing the hypothesis and gaining the tense information in the process.</Paragraph>
      <Paragraph position="2"> Now, this ntorphological information lnay seem uniml)ortant in Fnglish, but it definitely is +lol it, Swedish: a word with more that+ one sy\]lal,h+ ending with '-or' has to be an in(h.finite common gel,der noun. If it is not of latin origin it lnusl, be a phi ral form an(I thus ils entire morl)hology is kJvm, n The odds that it is a countabh&amp;quot; noun (like d.ck), as (}\[)posed tO 1t lllaSS IIOIln (such {IS walev), ;ll'C ()vet&amp;quot; whehning.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Constructing lexical entries
</SectionTitle>
      <Paragraph position="0"> During tile analysis of the set of sentences conlaining unknown words, an interim lexicon for these unknown words is kept. The interim lexieon is imlexed on word sterns and updated each titlie a IWW Sell fence is i)roeessed. \[&amp;quot;or each word sI, eul+ t'e/o pieces of information are retained in this lexicon: a hypothesis about which paradigm or set. of paradigms lhe word is assumed to belong to, and a justifieat.ion Ihat encodes all evidence relevant to the word. The jnstifieation is used to make the hypothesis aml is main tained so that the entry may be Ul)(lat, ed whett new inlbrmation about tim word arrives. When all the l)hrase templates (sentence patterns) for lhlfilhnent</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 UldeSs tile)' Ih) ill fact COll't!sp(lltd to othtT llr)ll lexicaliz,:d
</SectionTitle>
    <Paragraph position="0"> SlC/llSeS of tile word, in' to hO|llO~l.&amp;l)hS, of a Sl)ecilic para(ligm have been found, an entry for the word is made in the domaimspecifie lexicon that is bcmg constructed. This is done while still keeping the justilication reformation, since this might contaht evidence indicating other word-senses or holnographs null</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Implementation status
</SectionTitle>
    <Paragraph position="0"> A prelimiuary versi(~u of the lexieal acquisition sys tern has been implemented in Prolog. &amp;quot;File mealtile extracting telnplates froln Selltences with knowll words is \[uily operational. The parser for sentences witil unkuown words has also been tested, while tile iaterim lexicon still is subject to experimentatiolL Presenl.ly, a w'ry siml)le strategy for the interiln lexicon has been tesled. This version uses the set of all hypotheses ns the justification and use their dis.itmetion as the era'rent hypothesis. We are currently working Oll extending this sdlenle to one incorporating the full algorithm deseril)ed above.</Paragraph>
    <Paragraph position="1"> Unknowu wor(l~ are matched with tim subcalegorizatiou paradigms of the S-CLE. In total 62 differenl synl.aet.ic/semantic paradigms are known by the present systmn: 5 for Swedish nmms, l0 for adjectives, aud all tim others for verbs. Tim morphological inflections are subdivided into 14 different inflectional cbLsses of nouns, 3 classes of adjectives, and 24 classes of verbs.</Paragraph>
  </Section>
class="xml-element"></Paper>