File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/c00-1008_metho.xml
Size: 10,263 bytes
Last Modified: 2025-10-06 14:07:10
<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1008"> <Title>Incremental Identification of Inflectional Types</Title> <Section position="4" start_page="49" end_page="50" type="metho"> <SectionTitle> 4 Processing unknown words </SectionTitle> <Paragraph position="0"> In our approach linguistic properties of unknown words are inferred from their sentential context as a byproduct of parsing. After parsing, which requires only a slight modification of standard lexical lookup, lexical entries are appropriately updated. One of our key ideas is a gradual, information-based concept of &quot;unknownness&quot;, where lexical entries are not unknown as a whole, but may contain unknown, i.e. potentially revisable, pieces of information (cf.</Paragraph> <Paragraph position="1"> Barg and Walther (1998)). This allows a uniform treatment for the full range of lexical entries from completely known to maximally unknown. As discussed in (Barg and Walther, 1998), our system has been implemented in MicroCUF, a derivative of the formalism CUF of (Dörre and Dorna, 1993).</Paragraph> <Paragraph position="2"> Revisable information is further classified as specializable or generalizable, where the former can only become more special, and the latter only more general, with further contexts. Specializable kinds of information include semantic type of nouns, gender, and inflectional class. Among the generalizable kinds of information are the selectional restrictions of verbs and adjectives as well as the case of nouns. Both kinds of information together with nonrevisable (i.e. strict) information can cooccur in a single entry.</Paragraph> <Paragraph position="3"> The overall approach is compatible with standard constraint-based analyses and makes only a few extra demands on the grammar. Here, the revisable information must be explicitly marked as such. 
Since our model is situated within the framework of typed feature-based formalisms (cf. Carpenter (1992)), revisable information is expressed in terms of formal types. The initial values for revisable information are specified with two distinguished types u_s and u_g for specializable and generalizable information, respectively. Type unification can be employed for the combination of specializable information, whereas generalizable information requires type union.</Paragraph> <Paragraph position="4"> The direct combination of revisable information during parsing is infeasible for various reasons discussed in (Barg and Walther, 1998). It is consequently carried out in a separate step after the current sentence has been parsed. The grammatical analysis itself thus remains completely declarative and only makes use of unification. In order to achieve this separation of analysis and revision we introduce two attributes for generalizable information, namely gen and ctxt, where ctxt receives the information inferred from the sentential context, and gen the potentially revisable information with the initial value u_g. Parsing thus proceeds in an entirely conventional manner, except that lexical look-up for a word with unknown orthography or phonology does not fail but instead yields an underspecified canonical lexical entry. The updating after parsing compares the feature structure of the original lexical entry with that inferred contextually. The specializable information of the former is replaced with the corresponding values of the latter. Moreover, using the attributes gen and ctxt introduced above, the new gen value for generalizable information is computed by the type union of the gen value from the old lexical entry (initially u_g) with the ctxt value resulting from the parse. 
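The update step just described can be illustrated with a small sketch. It is not MicroCUF code: types are approximated by sets of atomic values, u_g is encoded as the empty set, and the names Entry, update, gend and case_gen are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass

U_G = frozenset()  # assumed encoding of the initial type u_g: nothing attested yet


@dataclass(frozen=True)
class Entry:
    """Toy lexical entry; field names are illustrative, not MicroCUF syntax."""
    gend: frozenset            # specializable: genders still possible
    case_gen: frozenset = U_G  # generalizable: the gen value for case


def update(old: Entry, ctx_gend: frozenset, ctx_case: frozenset) -> Entry:
    """Post-parse revision of one entry.

    ctx_gend is the gender value inferred by the parse (unification has
    already narrowed it); ctx_case plays the role of the ctxt attribute.
    """
    new = Entry(
        gend=ctx_gend,                     # specializable: take the inferred value
        case_gen=old.case_gen | ctx_case,  # generalizable: type union of gen and ctxt
    )
    # Keep the old entry when the context contributed nothing new.
    return old if new == old else new


sund = Entry(gend=frozenset({"masc", "fem", "neut"}))
sund = update(sund, frozenset({"masc"}), frozenset({"nom"}))
sund = update(sund, frozenset({"masc"}), frozenset({"dat"}))
print(sorted(sund.case_gen))  # ['dat', 'nom']
```

Under this encoding the gender value can only shrink, while the case value can only grow, which mirrors the specializable/generalizable distinction.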
Actual revision is naturally only carried out when a context in fact provides new information.</Paragraph> </Section> <Section position="5" start_page="50" end_page="52" type="metho"> <SectionTitle> 5 Incremental inference of inflectional information </SectionTitle> <Paragraph position="0"> In order to process unknown word forms, we postulate canonical lexical entries which are returned by lexical lookup if a word is not recorded in the lexicon. For nouns, this entry corresponds to an underspecified basic lexical sign in which the inflectional class, case, number, and gender are specified with revisable types, i.e. the information can be acquired and updated. Figure 5 shows the basic lexical sign for German nouns (with the omission of feature specifications that are irrelevant for this discussion). Whereas inflectional class (ftype), number (num),</Paragraph> <Paragraph position="2"> and gender (gend) are specializable, case is generalizable and hence contains the features gen and ctxt.</Paragraph> <Paragraph position="3"> Note that the initial values for specializable information consist of a disjunction (;) of the value u_s and the most general appropriate value for the corresponding feature. 
This ensures the identification of specializable information (via u_s) on the one hand, and the correct specializations on the other.</Paragraph> <Paragraph position="4"> When a sentence containing an unknown noun is parsed, information about the noun comes from different sources: while the surrounding context may supply agreement information, the word form itself together with morphophonological constraints may restrict the possible inflectional class.</Paragraph> <Paragraph position="5"> As an example we can suppose that the rather infrequent noun Sund 'sound', 'strait', which like Hund 'dog' belongs to class NS but is unfamiliar to many German speakers, is not recorded in a given lexicon.</Paragraph> <Paragraph position="6"> The class NS contains both masculine and neuter nouns, and these differ in none of their inflected forms. Thus, only agreement information from a context, such as der enge Sund 'the narrow strait', can determine the gender. Even in isolation, the form Sund must be singular since its final shape is not compatible with any plural inflection (i.e. it ends neither in -s nor in a schwa syllable). Moreover, the morphophonological constraints on stems allow only three possibilities: Sund is * feminine (and then the class is NA, NU, or NM and the case is underspecified) * nonfeminine and weak (i.e. class NWN or NWS) (and then the case must be nominative) * nonfeminine and nonweak (and then the case is not genitive) These hypotheses are captured in the three feature structures depicted in figure 6.</Paragraph> <Paragraph position="7"> As we have seen, when a word is parsed in context, this provides additional information. If we know, for example, that Sund is masculine, the first hypothesis is excluded, and the gender specification of the remaining two hypotheses can be specialized to masc. If we additionally encounter Sund in dative singular, which is impossible for weak nouns (which must have a final -n), then only the third hypothesis remains. 
Finally, if the plural form Sunde occurs the system can specialize the inflectional class exactly to the type NS. The other morphological information cannot be further generalized or specialized, and we have the final lexical entry for Sund.</Paragraph> <Paragraph position="8"> Things are not always this easy. In particular, there may be a number of alternatives both for the segmentation of a form into a stem and an inflectional ending and for the assignment of a stem to a lexeme. Moreover, these alternatives may depend on each other. Thus, the form Leinen may be assigned to any of the lexemes Lein 'flax' (masc, NS), Leine 'rope' (fem, NM), or Leinen 'linen' (neut, NS); even in a context, e.g. Fritz verkauft Leinen 'Fritz sells ropes/linen', it may be impossible to disambiguate the form. While the nouns Band 'book volume' (masc, NU), Band 'strip' (neut, NR), Band 'bond' (neut, NS, archaic and rare in singular), Band 'music band' (fem, NA), and Bande 'gang' (fem, NM) may be unlikely to occur all in the same context, they illustrate the dimension of the problems of segmentation and lexical assignment, which in turn constitute part of the more general problem of disambiguation in natural language processing. We have no magic solution for the latter, but in our approach such examples must be handled with disjunctive representations until the context provides the necessary disambiguating information.</Paragraph> </Section> <Section position="6" start_page="52" end_page="52" type="metho"> <SectionTitle> 6 An alternative model using finite-state techniques </SectionTitle> <Paragraph position="0"> Alternatively, the incremental identification of inflectional types can be modelled within the framework of finite-state automata (cf. Sproat (1992)) without recourse to unification-based grammar formalisms. 
An FSA can be defined that has an alphabet consisting of vectors specifying the stem shape and ending (and thus the segmentation) as well as the agreement information of possible word forms.</Paragraph> <Paragraph position="1"> Starting in an initial state corresponding to the constraints that apply to all unknown words, the FSA is moved by successive forms of an unknown lexeme together with their agreement information into successor states that capture the incrementally accrued inflectional information. The FSA may reach a final state, in which case the inflectional class has been uniquely identified, or it may remain in a nonfinal state. A lexicon would simply record the latest state reached for each noun.</Paragraph> <Paragraph position="2"> Implementation of this model is greatly complicated by the problems of disambiguation just discussed in §5. In general, the states of the FSA must capture disjunctions not only of inflectional classes, but also of segmentation and gender alternatives.</Paragraph> <Paragraph position="3"> The application of automatic induction techniques to corpora appears to be essential, and we are currently pursuing possibilities for this.</Paragraph> </Section> </Paper>
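The finite-state view of Section 6 can be sketched as follows, replaying the Sund example from Section 5. This is a simplified illustration rather than the authors' implementation: each state is encoded as the set of (class, gender) hypotheses still open, the input symbols are reduced to three hypothetical features, and the class/gender pairs in the start state are an assumption limited to combinations mentioned in the text. The names Hyp, Obs, step and is_final are invented for the example.

```python
from typing import FrozenSet, NamedTuple, Optional


class Hyp(NamedTuple):
    """One open hypothesis about an unknown noun."""
    cls: str   # inflectional class label, e.g. "NS"
    gend: str  # "masc", "fem" or "neut"


class Obs(NamedTuple):
    """One input symbol: features of an observed word form (toy subset)."""
    gend: Optional[str] = None    # gender known from agreement, if any
    bare_dative_sg: bool = False  # dative singular without final -n
    plural_in_e: bool = False     # plural formed with plain -e


State = FrozenSet[Hyp]
WEAK = {"NWN", "NWS"}

# Assumed start state for the bare form 'Sund' (not an exhaustive class list).
START: State = frozenset(
    [Hyp(c, "fem") for c in ("NA", "NU", "NM")]
    + [Hyp(c, g) for c in ("NWN", "NWS", "NS") for g in ("masc", "neut")]
    + [Hyp("NU", "masc"), Hyp("NR", "neut")]
)


def step(state: State, obs: Obs) -> State:
    """Transition: keep only the hypotheses compatible with the observation."""
    def compatible(h: Hyp) -> bool:
        if obs.gend is not None and h.gend != obs.gend:
            return False
        if obs.bare_dative_sg and h.cls in WEAK:
            return False  # weak nouns would need a final -n here
        if obs.plural_in_e and h.cls != "NS":
            return False  # toy assumption: only NS forms this plural
        return True
    return frozenset(h for h in state if compatible(h))


def is_final(state: State) -> bool:
    """A state is final once exactly one inflectional class remains."""
    return len({h.cls for h in state}) == 1


lexicon = {}
state = START
for obs in (Obs(gend="masc"), Obs(bare_dative_sg=True), Obs(plural_in_e=True)):
    state = step(state, obs)
    lexicon["Sund"] = state  # the lexicon records the latest state reached

print(is_final(lexicon["Sund"]))  # True
```

Representing states as hypothesis sets keeps the disjunctions over classes and genders explicit; segmentation alternatives, which the text also requires, are left out of this sketch.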