File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/02/j02-2002_metho.xml
Size: 38,585 bytes
Last Modified: 2025-10-06 14:07:49
<?xml version="1.0" standalone="yes"?> <Paper uid="J02-2002"> <Title>© 2002 Association for Computational Linguistics. The Combinatory Morphemic Lexicon</Title> <Section position="3" start_page="147" end_page="151" type="metho"> <SectionTitle> I.NOM M-GEN book-ACC give-REL.OP child-ACC/*DAT see-TENSE-PERS1 </SectionTitle> <Paragraph position="0"> 'I saw the child to whom Mehmet gave the book.' The morphological/phrasal scope conflict of affixes is not particular to morphologically rich languages. Semantic composition of affixes in morphologically simpler languages poses problems with the word (narrow) scope of inflections. For instance, fake trucks needs the semantics plu(fake truck), which corresponds to the surface bracketing [fake truck]-s, because it denotes the nonempty nonsingleton sets of things that are not trucks but fake trucks (Carpenter 1997). Four trucks, on the other hand, has the semantics four(plu truck), which corresponds to four [truck]-s, because it denotes the subset of nonempty nonsingleton sets of trucks with four members.</Paragraph> <Paragraph position="1"> The status of inflectional morphology among theories of grammar is far from settled, but, starting with Chomsky (1970), there seems to be agreement that derivational morphology is internal to the lexicon. Lexical Functional Grammar (LFG) (Bresnan 1995) and earlier Government and Binding (GB) proposals (e.g., Anderson 1982) consider inflectional morphology to be part of syntax, but it has been delegated to the lexicon in Head-Driven Phrase Structure Grammar (HPSG) (Pollard and Sag 1994, page 35) and in the Minimalist Program (Chomsky 1995, page 195).</Paragraph> <Paragraph position="2"> The representational status of the morpheme is even less clear. Parallel developments in computational studies of HPSG propose lexical rules to model inflectional morphology (Carpenter and Penn 1994). Computational models of LFG (Tomita 1988) and GB (Johnson 1988; Fong 1991), on the other hand, have been noncommittal regarding inflectional morphology. Finally, morphosyntactic aspects have always been a concern in Categorial Grammar (CG) (e.g., Bach 1983; Carpenter 1992; Dowty 1979; Heylen 1997; Hoeksema 1985; Karttunen 1989; Moortgat 1988b; Whitelock 1988), but the issues of constraining the morphosyntactic derivations and resolving the apparent mismatches have been relatively untouched in computational studies.</Paragraph> <Paragraph position="3"> We briefly look at Phrase Structure Grammars (PSGs), HPSG, and Multimodal CGs (MCGs) to see how word-based alternatives for morphosyntax would deal with the issues raised so far. For convenience, we call a grammar that expects words from the lexicon a lexemic grammar and a grammar that expects morphemes a morphemic grammar. A lexemic PSG provides a lexical interface for inflected words (X″ → Stem, as a regular grammar). Assuming a syncategorematic coordination schema, that is, X → X and X, the N in the left and right conjuncts of this example would not be of the same type. Revising the coordination schema such that only the root features coordinate would not be a solution either. In (4e), the relation of possession that is marked on the right conjunct must be carried over to the left conjunct as well.
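To make the failure concrete, here is a toy sketch (ours, not a fragment of any PSG implementation; the stems kalem and kitap and the feature names are invented for illustration, not the paper's example (4e)) of why a root-features-only coordination schema cannot distribute the right conjunct's inflections over the left one, and of the morphemic alternative taken up next:

    # Toy illustration: lexemic coordination vs. morphemic coordination.

    def coordinate_lexemic(left, right):
        """Syncategorematic X -> X and X: conjuncts must have identical types."""
        if left["feats"] != right["feats"]:
            raise TypeError("conjuncts are not of the same type")
        return {"root": (left["root"], "and", right["root"]), "feats": left["feats"]}

    def coordinate_morphemic(left_stem, right_stem, affixes):
        """Morphemic alternative: coordinate bare stems, then let the bound
        morphemes take the coordinated constituent in their domain."""
        coord = {"root": (left_stem, "and", right_stem), "feats": set()}
        for affix in affixes:             # e.g., ["PLU", "POSS", "DAT"]
            coord["feats"].add(affix)     # the affix scopes over both conjuncts
        return coord

    noun1 = {"root": "kalem", "feats": set()}                   # bare stem
    noun2 = {"root": "kitap", "feats": {"PLU", "POSS", "DAT"}}  # inflected word

    try:
        coordinate_lexemic(noun1, noun2)
    except TypeError as e:
        print("lexemic schema:", e)       # mixed conjuncts are rejected

    print(coordinate_morphemic("kalem", "kitap", ["PLU", "POSS", "DAT"]))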
What is required for these examples is that the syntactic constituent X in the schema be analyzed as X-PLU(-POSS)-DAT, after N and N coordination.</Paragraph> <Paragraph position="4"> What we need, then, is not a lexemic but a morphemic organization in which the bracketing of free and bound morphemes is regulated in syntax. The lexicon, of course, must now supply the ingredients of a morphosyntactic calculus. This leads to a theory in which semantic composition parallels morphosyntactic combination by virtue of bound morphemes' being able to pick their domains just like words (above X, if needed). A comparison of English and Turkish in this regard is noteworthy. The English relative pronouns that/whom and the Turkish relative participle -diğ-i would have exactly the same semantics when the latter is granted a representational status in the lexicon (see Section 6).</Paragraph> <Paragraph position="5"> Furthermore, rule-based PSGs project a rigid notion of surface constituency. Steedman (2000) argued, however, that syntactic processes such as identical-element deletion under coordination call for flexible constituency, such as SO (subject-object) constituency in the SVO & SO gapping pattern of English and SV (subject-verb) constituency in the OSV & SV pattern of Turkish. Nontraditional constituents are also needed in specifying the semantically transparent constituency of words, affixes, clitics, and phrases.</Paragraph> <Paragraph position="6"> Constraint-based PSGs such as HPSG appeal to coindexation and feature passing via unification, rather than movement, to deal with such processes. HPSG also makes the commitment that inflectional morphology is internal to the lexicon, handled either by lexical rules (Pollard and Sag 1994) or by lexical inheritance (Miller and Sag 1997). We look at (5c) to highlight a problem with the stem-and-inflections view. As words enter syntax fully inflected, the sign of the verb ver-diğ-i in the relative clause (5c) would be as in (6a), in which the SUBCAT list of the verb stem is, as specified in the lexical entry for ver, unsaturated. The participle adds coindexation in MOD|...|INDEX. The HPSG analysis of this example would be as in Figure 1. Passing the agreement features of the head separately (Sehitoglu 1996) solves the case problem alluded to in (5c); however, structure sharing of the NP_dat with the SLASH, INDEX, and CONTENT features of ver-diğ-i is needed for the semantics (GIVEE), and this conflicts with the head features of the topmost NP_acc in the tree. The relative participle as a lexical entry (e.g., (6b)) would resolve the problem with subcategorization because its SUBCAT list is empty (like that of the relative pronoun that in English); hence there would be no indirect dependence of the nonlocal SLASH feature and the local SUBCAT feature via semantics (CONTENT). Such morphemic alternatives are not considered in HPSG, however, and require a significant revision of the theory. Furthermore, HPSG's lexical assignment for trace introduces phonologically null elements into the lexicon, which, as we show later, is not necessary. [Footnote 2: But see Creider, Hankamer, and Wood (1995), which argues that the morphotactics of human languages is not regular but linear context-free.]</Paragraph>
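A minimal sketch of the intended lexical organization (our own schematic category strings and lambda-term strings, not the paper's exact assignments): the bound participle receives a first-class lexical entry whose semantics is identical to that of the free relative pronoun, the two differing only in attachment and directionality.

    # Sketch: a morphemic lexicon grants -dig-i an entry parallel to "that".

    LEXICON = {
        # prosodic form: (attachment, category, semantics)
        "that":   ("word",  r"(N\N)/(S|NP)", "lambda p. lambda n. lambda x. n(x) & p(x)"),
        "-diğ-i": ("affix", r"(N/N)\(S|NP)", "lambda p. lambda n. lambda x. n(x) & p(x)"),
        "book":   ("word",  "N", "book'"),
    }

    def lookup(form):
        att, cat, sem = LEXICON[form]
        return {"form": form, "attachment": att, "cat": cat, "sem": sem}

    # The two relativizers differ in attachment and directionality
    # but share one and the same semantic type.
    print(lookup("that")["sem"] == lookup("-diğ-i")["sem"])   # True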
<Paragraph position="8"> MCGs (Hepple 1990a; Morrill 1994; Moortgat and Oehrle 1994) allow different modes of combination in the grammar. In addition to binary modes such as wrapping and commutative operations, unary modalities provide finer control over the categories. Heylen (1997, 1999) uses unary modalities as a way of regulating morphosyntactic features such as case, number, and person for economy in lexical assignments. For instance, Frau has the category N, which underspecifies it for case and declension. Underspecification is dealt with in the grammar using inclusion postulates (e.g., (7)). The interaction of different modalities is regulated by distribution postulates.</Paragraph> <Paragraph position="9"> Lexical assignments to inflected words carry unary modalities: boys has the type pl N, in contrast to sg N for boy. Although such regulation of inflectional features successfully mediates, for example, subject-verb agreement or NP-internal case agreement (as in German), it is essentially word based, because type assignments are to inflected forms; morphemes do not carry types. This reliance on word types necessitates a lexical rule-based approach to some morphosyntactic processes that create indefinitely long words, such as ki-relativization in Turkish (see Section 6.5). But lexical rules for such processes risk nontermination (Sehitoglu and Bozsahin 1999). Our main point of departure from MCG accounts is the morphemic versus lexemic nature of the lexicon: The morphosyntactic and attachment modalities originate from the lexicon; they are not properties of the grammar (we elaborate on this later). This paves the way to the morphemic lexicon by licensing type assignments to units smaller than words.</Paragraph> <Paragraph position="10"> Besides problems with lexical rules, the automata-theoretic power of MCGs is problematic: Unrestricted use of structural modalities and postulates leads to Turing completeness (Carpenter 1999). Indeed, one of the identifiable fragments of Multimodal languages that is computationally tractable is Combinatory Categorial languages (Kruijff and Baldridge 2000), which we adopt as the basis for the framework presented here. We propose a morphosyntactic Combinatory Categorial Grammar (CCG) in which the grammar and the morphemic lexicon refer to morphosyntactic types rather than syntactic types. We first introduce the syntactic CCG in Section 2. Morphosyntactic CCG is described in Section 3. In Section 4, we look at the computational aspects of the framework. We then show its realization for some aspects of English (Section 5) and Turkish (Section 6).</Paragraph> <Paragraph position="11"> [Figure 1: HPSG analysis of (5c).]</Paragraph> </Section> <Section position="4" start_page="151" end_page="156" type="metho"> <SectionTitle> 2. Syntactic Types </SectionTitle> <Paragraph position="0"> CG is a theory of grammar in which the form-meaning relation is conceived as a transparent correspondence between the surface-syntactic and semantic combinatorics (Jacobson 1996). A CCG sign can be represented as a triplet p ⊢ s:u, where p is the prosodic element, s is its syntactic type, and u its semantic type. For instance, the lexical assignment for read is (8).</Paragraph> <Paragraph position="1"> The classical Ajdukiewicz/Bar-Hillel (AB) CG is weakly equivalent to Context-Free Grammars (Bar-Hillel, Gaifman, and Shamir 1960). It has function application rules, defined originally in a nondirectional fashion.
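Before stating the directional rules, the sign triplet can be rendered schematically as follows (a Python sketch for exposition only; the paper's implementation is in Prolog, and semantics is kept here as an uninterpreted string):

    # Schematic encoding of CCG categories and signs: p |- s : u, as in (8).
    from dataclasses import dataclass
    from typing import Union

    @dataclass(frozen=True)
    class Atom:
        name: str                      # e.g., "S", "NP", "N"
        def __str__(self): return self.name

    @dataclass(frozen=True)
    class Slash:
        result: "Cat"
        direction: str                 # "/" (argument right) or "\\" (argument left)
        argument: "Cat"
        def __str__(self):
            return f"({self.result}{self.direction}{self.argument})"

    Cat = Union[Atom, Slash]

    @dataclass
    class Sign:
        prosody: str
        cat: Cat
        sem: str                       # semantics as a lambda-term string

    S, NP = Atom("S"), Atom("NP")
    read = Sign("read", Slash(Slash(S, "\\", NP), "/", NP),
                "lambda x. lambda y. read'(x)(y)")
    print(read.prosody, "|-", read.cat, ":", read.sem)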
The directional variants and their associated semantics are as follows:

(9) Forward Application (>): X/Y: f    Y: a  ⇒  X: fa
    Backward Application (<): Y: a    X\Y: f  ⇒  X: fa

CCG (Steedman 1985, 1987, 1988; Szabolcsi 1983, 1987) is an extended version of AB that includes function composition (10), substitution, and type raising (11). These extensions make CCGs mildly context sensitive.</Paragraph> <Paragraph position="2"> [Footnote 4: In full, the rules also combine the prosodic elements; the result of an application is s1 * s2 ⊢ X: fa, where * is prosodic combination and fa is the application of f to a. The * will play a crucial role in the lexicalization of attachment later on.] [Footnote 5: The lambda term f[a] denotes internal one-step β-reduction of f on a. In parsing, we achieve the same effect by partial execution (Pereira and Shieber 1987). λf.f[a] is encoded as (a^F)^F in Prolog, where ^ is lambda abstraction. We opted for the explicit f[a] notation mainly for ease of exposition (cf. the semantics of raising verbs, relative participles, etc., in Section 6). Moreover, as Pereira and Shieber noted, (a^F)^F is not a lambda term in the strict sense, because a is not a variable.] A type-raised subject NP, for instance, becomes a functor looking for a VP (= S\NP) to the right to become S. The reversal of directionality, as in topicalization (e.g., This book, I recommend), requires another schema. The reversal is with respect to the position of the verb, which we shall call contraposition and formulate as in (12).</Paragraph> <Paragraph position="3"> (<XP) is leftward extraction of a right constituent, and (>XP) is rightward extraction of a left constituent, both of which are marked constructions. Directionally insensitive types such as T|(T|X) cause the collapse of directionality in surface grammar (Moortgat 1988a).</Paragraph> <Paragraph position="4"> (12) Leftward Contraposition (<XP): X: a ⇒ S ... The semantics of contraposition depends on discourse properties as well. We leave this issue aside by (1) noting that it is related to type raising in changing the function-argument relation and (2) categorizing the sentence as a marked S (topicalized or contraposed), which is not discourse equivalent to S. Syntactic characterization as such also helps a discourse component do its work on syntactic derivations. [Footnote 6: In fact, topicalization of nonperipheral arguments (This book, I would give to Mary) requires that (12) be finitely schematized over valencies, such as S, S/NP, S/PP (Steedman 1985).]</Paragraph> <Paragraph position="7"> CCG's notion of interpretation is represented in the Predicate-Argument Structure (PAS). Its organization is crucial for our purposes, since the bracketing in the PAS is the arbitrator for reconciling the bracketings in morphology and syntax via proper lexical type assignments. It is the sole level of representation in CCG (Steedman 1996, page 89). [Footnote 7: We will not elaborate on the theoretical consequences of having this level of representation; see, for instance, Dowty (1991) and Steedman (1996).]</Paragraph> <Paragraph position="8"> It is the level at which the conditions on objects of interpretation, such as binding and control, are formulated. For instance, Steedman (1996) defines c-command and binding conditions A, B, and C over the PAS. The PAS also reflects the obliqueness order of the arguments:

    Predicate ... Tertiary-Term Secondary-Term Primary-Term

Assuming left associativity for juxtaposition, this representation yields the bracketing in (13) for the PAS. Having the primary argument as the outermost term is motivated by the observations on binding asymmetries between subjects and complements in many languages (e.g., *Himself saw John, *heself).</Paragraph> <Paragraph position="9">

(13) (((Predicate Tertiary-Term) Secondary-Term) Primary-Term)
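The application and composition rules can be transcribed directly (a simplified sketch, reusing the Atom, Slash, and Sign classes of the previous sketch; feature matching and prosodic modalities are omitted at this point):

    # Directional rules of (9)-(10), recognition plus semantics-as-strings.

    def forward_apply(f, a):
        """(>): X/Y: f  Y: a  =>  X: fa"""
        if isinstance(f.cat, Slash) and f.cat.direction == "/" and f.cat.argument == a.cat:
            return Sign(f.prosody + " " + a.prosody, f.cat.result, f"({f.sem} {a.sem})")
        return None

    def backward_apply(a, f):
        """(<): Y: a  X\\Y: f  =>  X: fa"""
        if isinstance(f.cat, Slash) and f.cat.direction == "\\" and f.cat.argument == a.cat:
            return Sign(a.prosody + " " + f.prosody, f.cat.result, f"({f.sem} {a.sem})")
        return None

    def forward_compose(f, g):
        """(>B): X/Y: f  Y/Z: g  =>  X/Z: B f g"""
        if (isinstance(f.cat, Slash) and f.cat.direction == "/" and
                isinstance(g.cat, Slash) and g.cat.direction == "/" and
                f.cat.argument == g.cat.result):
            cat = Slash(f.cat.result, "/", g.cat.argument)
            return Sign(f.prosody + " " + g.prosody, cat, f"(B {f.sem} {g.sem})")
        return None

    N = Atom("N")
    fake = Sign("fake", Slash(N, "/", N), "fake'")
    truck = Sign("truck", N, "truck'")
    print(forward_apply(fake, truck).sem)     # (fake' truck')

    toy = Sign("toy", Slash(N, "/", N), "toy'")
    print(forward_compose(toy, fake).cat)     # (N/N), semantics (B toy' fake')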
3. Morphosyntactic Types

A syntactic type such as N does not discriminate morphosyntactically. A finer distinction can be made as singular nouns, plural nouns, case-marked nouns, and so on.</Paragraph> <Paragraph position="10"> [Figure 2: The lattice of diacritics for (a) Turkish and (b) English.]</Paragraph> <Paragraph position="11"> For instance, the set of number-marked nouns can be represented as n=N, where = is a morphosyntactic modality ("equals") and n is a diacritic (for number). Books is of type n=N, but book is not. The type for books can be obtained morphosyntactically by assigning -s (-PLU) the functor type n=N\b=N, where b stands for base. A purely syntactic type such as N\N overgenerates.</Paragraph> <Paragraph position="14"> Another modality, ≤ ("up to and equals"), allows wider domains in morphosyntactic typing. For instance, n≤N represents the set of nouns marked on number or any other diacritic that is lower than number in a partial order (e.g., Figure 2). The inflectional paradigm of a language can be represented as a partial ordering using the modalities. For instance, if the paradigm is Base-Number-Case, we have the ordering b < n < c. Let u be the mapping from a morphosyntactic type t to the set of strings that have the type t. The = modality is more strict than ≤, to provide finer control; u(n=N) and u(c=N) are incomparable, because a noun can be number marked but not case marked or vice versa. Also, u(c≤N), everything up to and including case, includes case-marked, number-marked, and unmarked nouns.</Paragraph> <Paragraph position="22"> The lattice consistency condition is imposed on the set of diacritics to ensure category unity. In other words, the syntactic type X can be viewed as an abbreviation for the morphosyntactic type ⊤≤X, where ⊤ is the universal upper bound of the lattice. It is the most underspecified category of X, which subsumes all morphosyntactically decorated versions of X. Figure 2 shows the lattices for English and Turkish. [Footnote 8: See Heylen (1997) on the use of unary modalities for a similar purpose in lexemic MCG.] [Footnote 9: In a lattice L, x ⊑ y (morphosyntactically, x ≤ y) is equivalent to the consistency properties x ⊓ y = x and x ⊔ y = y. We use the join operator for this check; thus it suffices to have a join semilattice.]</Paragraph> <Paragraph position="24"> A type i=X or i≤X is a morphosyntactic type if i ∈ D and X ∈ A, for D the set of diacritics and A the set of atomic syntactic types.</Paragraph> <Paragraph position="26"> For instance, the infinitive marker -ma in (14a) can be lexically specified to look for untensed VPs (functions onto a ≤S) to yield a complex noun base (14b), which, as a consequence of nominalization (result type N), receives case to become an argument of the matrix verb. The adjective in fake trucks can be restricted to modify unmarked Ns to get the bracketing [fake truck]-s (14c).</Paragraph>
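The modalities and the lattice check can be made concrete with a small sketch (assuming, for illustration, the chain b < n < c < ⊤ from the Base-Number-Case paradigm above; a real diacritic lattice need not be a chain, and only a join semilattice is required, as footnote 9 notes):

    # Sketch of the two morphosyntactic modalities over a diacritic ordering.
    ORDER = {"base": 0, "number": 1, "case": 2, "top": 3}   # a chain, for simplicity

    def leq(x, y):
        """Lattice consistency check: x <= y iff join(x, y) == y."""
        return max(ORDER[x], ORDER[y]) == ORDER[y]

    def matches(modality, diacritic, marked):
        """Does a form marked up to 'marked' satisfy the typed diacritic?"""
        if modality == "=":          # strict: exactly this diacritic
            return marked == diacritic
        if modality == "<=":         # up to and including this diacritic
            return leq(marked, diacritic)
        raise ValueError(modality)

    print(matches("=", "number", "number"))   # True:  books is n=N
    print(matches("=", "number", "base"))     # False: book is not n=N
    print(matches("<=", "case", "number"))    # True:  <=case covers number-marked nouns
    print(matches("<=", "number", "case"))    # False: case-marked noun exceeds <=number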
<Paragraph position="27"> Different attachment characteristics of words, affixes, and clitics must be factored into the prosodic domain as a counterpart of refining the morphosyntactic description. In Montague Grammar, every syntactic rule is associated with a certain mode of attachment, and this tradition is followed in MCG; attachment types are related with the slash (e.g., a slash modality for wrapping), which is a grammatical modality. In the present framework, however, attachment is projected from the lexicon to the grammar as a prosodic property of the lexical items. The grammar is unimodal in the sense that / and \ simply indicate the function-argument distinction in adjacent prosodic elements. The lexical projection of attachment further complements the notion of morphemic lexicon so that bound morphemes are no longer parasitic on words but have an independent representational status of their own. [Footnote 10: See Dowty (1996) and Steedman (1996) for a discussion of bringing nonconcatenative combination into grammar.] [Footnote 11: There is a precedent of associating attachment characteristics with the prosodic element rather than the slash in CG (Hoeksema and Janda 1988). In Hoeksema and Janda's notation, arguments can be constrained on phonological properties and attachment. For instance, the English article a has its NP/N category spelled out as </CX/N,NP,Pref>, indicating a consonantal first segment for the noun argument and concatenation to the left.] We write i*s to denote the attachment modality i (affixation, syntactic concatenation, cliticization) of the prosodic element s. Table 1 shows some lexical assignments for Turkish (e.g., the sign a*s ⊢ X\Y:u characterizes a suffix). The morphosyntactic calculus of CCG is defined with the addition of morphosyntactic types and attachment modalities as in (15) (similarly for the other combinatory rules):

(15) Forward Application (>): i*s1 ⊢ X/Y: f    j*s2 ⊢ Z: a  ⇒  k*(s1 * s2) ⊢ X: fa, where Z ⊑ Y and i • j → k

Hence the morphosyntactic decoration in lexical assignments propagates its lattice condition to the grammar, as in (15) (cf. Heylen [1997], in which the grammar rule imposes a fixed partial order, e.g., X/Y combines with Z if Z ⊑ Y). [Footnote 12: This coincides with Steedman's (1991b) observation that directionality of the main functor's slash is also a property of the same argument. The main functor is the one whose result type determines the overall result type (i.e., X/Y in (15)).] This is another prerequisite that must be fulfilled for the morphemic lexicon to project the lexical specification of scope.</Paragraph> <Paragraph position="36"> The grammar is not fixed on the attachment modality either (unlike a lexemic grammar, which is fixed on combination of words). Hence another requirement is the propagation of attachment to grammar. This is facilitated by the lexical types m*p ⊢ s:u, where m is an attachment type. The attachment calculus i • j → k in (15), which reads "attachment types i and j yield type k," relates attachment to prosodic combination in the grammar. It can be attuned to language-particular properties. We can specify some prosodic properties of the attachment calculus for Turkish ('x indicates stress on the prosodic element x); for instance, syntactic concatenation combines two independently stressed elements 'x and 'y.</Paragraph> </Section>
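Schematically, the attachment calculus i • j → k can be tabulated and consulted at each combination. The particular table below is an invented illustration, not the paper's specification for Turkish:

    # Sketch: a lexicalized attachment calculus consulted by the grammar.
    ATTACH = {
        ("a", "a"): "a",   # affix + affix: still word-internal
        ("s", "a"): "a",   # word + affix: affixation
        ("a", "s"): None,  # an affix cannot attach to a following free word here
        ("s", "s"): "s",   # word + word: syntactic concatenation
    }

    def combine(i, prosody1, j, prosody2):
        k = ATTACH.get((i, j))
        if k is None:
            raise ValueError(f"attachment types {i} and {j} do not combine")
        sep = "" if k == "a" else " "   # affixation concatenates without a break
        return k, prosody1 + sep + prosody2

    print(combine("s", "kitap", "a", "-lar"))     # ('a', 'kitap-lar')
    print(combine("s", "kitaplar", "s", "geldi")) # ('s', 'kitaplar geldi')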
<Section position="5" start_page="156" end_page="164" type="metho"> <SectionTitle> 4. Morpheme-Based Parsing </SectionTitle> <Paragraph position="0"> To contrast lexemic and morphemic processing, consider the Turkish example in (16a). We show some stages of the derivation to highlight prosodic combination (*) as well. Every item in the top row is a lexical entry. Allomorphs, such as those of tense, have the same category in the lexicon (16b). Vowel harmony, voicing, and other phonological restrictions are handled as constraints on the prosodic element. Constraint checking can be switched off during parsing to obtain purely morphosyntactic derivations. [Footnote 13: Clearly, much more needs to be done to incorporate intonation into the system. The motive for attachment types is to provide the representational ingredients on behalf of the morphemic lexicon. As one reviewer noted, the CCG formulation of the syntax-phonology interface moved from autonomous prosodic types (Steedman 1991a) to syntax-directed prosodic features (Steedman 2000b). The present proposal for attachment modality is computationally compatible with both accounts: Combinatory prosody can match prosodic types with morphosyntactic types. Prosodic features are associated with the basic categories of a syntactic type in the latter formulation; hence they become part of the featural inference that goes along with the matching of categories in the application of combinatory rules.]</Paragraph> <Paragraph position="3"> The lexicalization of attachment modality helps to determine the prosodic domain of postconditions. For instance, for Turkish, vowel harmony does not apply over word boundaries, which can be enforced by applying the harmony constraint only when the attachment modality is affixation. Basic categories carry agreement features of fixed arity (e.g., tense and person for S, and case, number, person, and gender for N and NP). Positional encoding of such information as in Pulman (1996) allows efficient term unification for the propagation of these features. [Footnote 14: Mediating agreement via unification, type subsumption, or set-valued indeterminacy has important consequences for underspecification, the domain of agreement, and the notion of "like categories" in coordination (see Johnson and Bayer 1995; Dalrymple and Kaplan 2000; Wechsler and Zlatić 2000). Rather than providing an elaborate agreement system, we note that Pulman's techniques provide the mechanism for implementing agreement as atomic unification, subsumption hierarchies represented as lattices, or set-valued features. The categorial ingredient of phrase-internal agreement can be provided by endotypic functors when necessary (see Sections 5 and 6).]</Paragraph> <Paragraph position="7"> Apart from the matching of syntactic types (with their modalities in {≤, =}) and agreement, unification does no linguistic work in this framework, in contrast to structure sharing in HPSG and slash passing in Unification CG (Calder, Klein, and Zeevat 1988).</Paragraph> <Paragraph position="9"> CCG is worst-case polynomially parsable (Vijay-Shanker and Weir 1993). This result depends on the finite schematization of type raising and bounded composition. Assuming a maximum valence of four in the lexicon (Steedman 2000a), composition (B^n) is bounded by n ≤ 3. The refinement of the type raising schema (11) for finite schematization is shown in (17).</Paragraph> <Paragraph position="11">

(17) a. Revised Forward Type Raising (>T): NP: a ⇒ T/(T\NP): λf.f[a]
     b. Revised Backward Type Raising (<T): NP: a ⇒ T\(T/NP): λf.f[a]

</Paragraph> <Paragraph position="13"> The finite schematization of type raising suggests that it can be delegated to the lexicon, for example, by a lexical rule that value-raises all functions onto NP to their type-raised variety, such as NP/N to (S/(S\NP))/N. But this move presupposes the presence of such functions in the lexicon, that is, a language with determiners. To be transparent with respect to the lexicon, we make type raising and the other unary schema (contraposition) available in the grammar. Since both are finite schemas in the revised formulation, the complexity result of Vijay-Shanker and Weir still holds.
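Returning briefly to the prosodic constraints mentioned at the start of this section, here is a sketch of one of them: backness harmony in the selection of the Turkish plural allomorphs -lar/-ler. In the framework the check fires only under the affixation modality, so harmony never crosses a word boundary; the check_constraints flag mirrors the constraint switch described above (the function and its names are ours):

    # Sketch: Turkish backness harmony selecting the plural allomorph.
    BACK_VOWELS = set("aıou")
    FRONT_VOWELS = set("eiöü")

    def plural(stem, check_constraints=True):
        # In the full system this applies only when the attachment modality
        # is affixation, keeping harmony word-internal.
        vowels = [c for c in stem if c in BACK_VOWELS | FRONT_VOWELS]
        if check_constraints and not vowels:
            raise ValueError("no vowel to harmonize with")
        suffix = "lar" if vowels and vowels[-1] in BACK_VOWELS else "ler"
        return stem + suffix

    print(plural("kitap"))   # kitaplar (back harmony)
    print(plural("ev"))      # evler    (front harmony)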
Checking the lattice condition as in (15) incurs a constant factor with a finite lattice.</Paragraph> <Paragraph position="14"> Type raising and composition cause the so-called spurious-ambiguity problem (Wittenburg 1987): Multiple analyses of semantically equivalent derivations are possible in parsing. This has been shown to be desirable from the perspective of prosody; for example, different bracketings are needed to match intonational phrasing with syntactic structure (Steedman 1991). From the parsing perspective, the redundancy of analyses can be controlled by (1) grammar rewriting (Wittenburg 1987), (2) checking the chart for PAS equivalence (Karttunen 1989; Komagata 1997), (3) making the processor parsimonious in its use of long-distance compositions (Pareschi and Steedman 1987), or (4) parsing into normal forms (Eisner 1996; Hepple 1990b; Hepple and Morrill 1989; König 1989; Morrill 1999). We adopt Eisner's method, which eliminates chains of compositions in O(1) time via tags in the grammar, before derivations are licensed. There is a switch that can be turned off during parsing to obtain all surface bracketings. There is also a switch for checking the PAS equivalence, with the caveat that the equivalence of two lambda expressions is undecidable.</Paragraph> <Paragraph position="17"> The parser is an adaptation of the Cocke-Kasami-Younger (CKY) algorithm (Aho and Ullman 1972, page 315), modified to handle unary rules as well: In the kth iteration of the CKY algorithm, which builds constituents of length k, the unary rules apply to the newly completed table entries, so that derived constituents of length k can serve as input to potential unary constituents of length k. In practice, this allows, for instance, a nominalized clause to be type-raised after it is derived as a category of type N.</Paragraph> <Paragraph position="19"> The remaining combinatory schemas are already in Chomsky Normal Form, as required by CKY. The finite schematization of CCG rules and the constant costs incurred by the normal form and lattice checking provide a straightforward extension of CKY-style context-free parsing to CCG. Komagata (1997) claims that the average-case complexity of CCG parsing remains low-order polynomial even without the finite schematization of type raising (based on the parsing of 22 sentences consisting of around 20 words, with a lexicon of 200 entries and no derivation of semantics in the grammar; a morphological analyzer provided five analyses per second to the parser). Statistical techniques developed for lexicalized grammars (e.g., Collins 1997) readily apply to CCG to improve average parsing performance in large-scale practical applications (Hockenmaier, Bierner, and Baldridge 2000).
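A schematic rendering of the unary-rule extension to CKY just described (recognition only, over a two-word toy grammar; the category strings, lexicon, and rules are placeholders, not the paper's grammar):

    # CKY with a unary pass at each span length, so that, e.g., a derived
    # constituent can be type-raised at the same span length.
    from collections import defaultdict

    LEX = {"John": {"NP"}, "sleeps": {"S\\NP"}}

    def binary(x, y):                  # backward application only, for brevity
        return {"S"} if y == f"S\\{x}" else set()

    def unary(x):                      # toy type raising: NP => S/(S\NP)
        return {"S/(S\\NP)"} if x == "NP" else set()

    def parse(tokens):
        n = len(tokens)
        T = defaultdict(set)
        for i, w in enumerate(tokens):
            T[i, i + 1] |= LEX[w]
            for c in list(T[i, i + 1]):
                T[i, i + 1] |= unary(c)
        for k in range(2, n + 1):              # span length
            for i in range(n - k + 1):
                for m in range(i + 1, i + k):  # split point
                    for x in T[i, m]:
                        for y in T[m, i + k]:
                            T[i, i + k] |= binary(x, y)
                for c in list(T[i, i + k]):    # unary pass at length k
                    T[i, i + k] |= unary(c)
        return T[0, n]

    print(parse(["John", "sleeps"]))   # {'S'}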
Both Collins (1997) and Hockenmaier, Bierner, and Baldridge (2000) used sections 02-21 of the Wall Street Journal corpus of the Penn Treebank for training, which contains 40,886 words (70,151 lexical entries). A recent initiative (Oflazer et al. 2001) aims to provide such a resource of around one million words for Turkish. The treebank encodes surface-syntactic relations and the morphological breakdown of words. The latter is invaluable for training morphemic grammars and lexicons.</Paragraph> <Paragraph position="21"> In morpheme-based parsing, lattice conditions help eliminate the permutation problem in endotypic categories. Such categories are typical of inflectional morphemes. For instance, assume that three morphemes m1, m2, and m3 have endotypic categories (say N\N), that they can appear only in this order, and that they are all optional. With purely syntactic N\N categories, all permutations would be derivable. The categorization of each mi over the lattice-ordered diacritics, as in (18), makes the argument of mi cover only forms marked lower than mi's own diacritic, so that mi can precede mk only if i < k. The lattice and its consistency condition on derivability offer varying degrees of flexibility. A lattice with only ⊤ and the subsumption relation ⊑ would undo all the effects of parameterization; it would be equivalent to a syntactic grammar in which every basic category X stands for ⊤≤X. To enforce a completely lexemic syntax, a lattice with ⊤ and free would define all functional categories as functions over free forms. Morphological processing seems inevitable for languages like Turkish, and morphological and lexical ambiguity such as that shown in (19) must be passed on to syntax irrespective of how inflectional morphology is processed (isolated from or integrated with syntax). For the verbal paradigm, Jurafsky and Martin (2000) report Oflazer's estimate that inflectional suffixes alone create around 40,000 word forms per root. In the nominal paradigm, iterative processes such as ki-relativization (Section 6.5) can create millions of word forms per nominal root (Hankamer 1989).</Paragraph> <Paragraph position="23"> [Figure 3: The processing of kazmalari in three different architectures (see Example (19) for glosses): (a) lexemic syntax and lexicon; (b) morphemic syntax and split lexicon; (c) morphemic syntax and lexicon.] With so many word forms, a lexemic grammar (e.g., Figure 3a) is computationally nontransparent when interpretation is a component of an NLP system.</Paragraph> <Paragraph position="24"> Regarding the first question, let us consider two architectures from the perspective of the lexicon for the purpose of the morphology, morphemic syntax, and semantics interface. The architecture in Figure 3b incorporates the current proposal as an interpretive front end to a morphological analyzer such as Oflazer's (1994), which delivers the analyses of words as a stream of morphemes, out of which the bound morphemes have to be matched with their semantics from the affix lexicon to be interpretable in the grammar. The advantage of this model is its efficiency; morphological parsing of words is, in principle, linear context-free; hence finite-state techniques and their computational advantages readily apply. But the uninterpretable surface forms of bound morphemes must be matched with those of the affix lexicon, and this is not necessarily a one-to-one mapping, because of multiple lexical assignments capturing syntactic-semantic distinctions (e.g., dative case as a direct object, indirect object, or adjunct marker, or -i as a possessive and/or compound marker). Surface form-semantics pairing is not a trivial task, particularly in the case of lexically composite affixes, which require semantic composition as well as tokenization.
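The one-to-many character of this matching step can be seen in a small sketch (the label-to-entry table is invented for illustration; in the actual system the pairing is constrained further by the syntactic contexts discussed next):

    # Sketch: pairing a morphological analyzer's label stream (Figure 3b)
    # with categorized affix-lexicon entries; labels map to several entries.
    from itertools import product

    AFFIX_LEXICON = {
        "-DAT":  ["dat:direct-object", "dat:indirect-object", "dat:adjunct"],
        "-POSS": ["poss:possession", "poss:agreement"],
        "-PLU":  ["plu:plural"],
    }

    def interpretations(morpheme_stream):
        """All candidate pairings for a stream like ['kazma', '-PLU', '-POSS']."""
        choices = [AFFIX_LEXICON.get(m, [m]) for m in morpheme_stream]
        return [list(seq) for seq in product(*choices)]

    for reading in interpretations(["kazma", "-PLU", "-POSS"]):
        print(reading)   # two candidate pairings; the grammar must filter them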
The matching process needs to be aware of all the syntactic contexts in which certain affix sequences act as a unit, for example, relative participles and agreement markers (the -diğ-i relative participle as -OP-POSS or -OP-AGR), possessive and compound markers, etc., for Turkish. The factorization of syntactic issues into a morphological analyzer would also make the separate morphological component nonmodular, or expand its number of states to factor in these concerns (e.g., treating the -OP-POSS sequence as a state different from -OP followed by -POSS, in which -POSS is interpreted not with the semantics of possession but with that of agreement marking). Not knowing how many of the syntactic distinctions are handled by the morphological analyzer, a subsequent interpreter may need to reconsult the grammar if scoping problems arise.</Paragraph> <Paragraph position="26"> The architecture in Figure 3c describes the current implementation of the proposal. Bound morphemes are fed to the parser along with their interpretation. This model is preferred over that of Figure 3b for its simplicity in design and its extensibility. The price is reduced efficiency, due to context-free processing of inflectional morphology. By one estimate (Oflazer, Gocmen, and Bozsahin 1994), Turkish has 59 inflectional morphemes out of a total of 166 bound morphemes, and Oflazer (personal communication) notes that the average number of bound morphemes per word in unrestricted corpora is around 2.8, including derivational affixes. In a news corpus of 850,000 words, the average number of inflections per word is less than two (Oflazer et al. 2001). This is tolerable for sentences of moderate length in terms of the extra burden it puts on the context-free parser. Table 2 shows the results of our tests with a Prolog implementation of the system on different kinds of constructions. The test cases included 10 lexical items on average, with an average parsing time of 0.32 seconds per sentence. A relatively long sentence (12 words, 21 morphemes) took 2.9 seconds to parse. The longest sentence (20 words, 37 morphemes) took 40 seconds. The lexicon for the experiment included 700 entries; 139 were free morphemes and 561 were bound morphemes compiled out of 105 allomorphic representations (including all the ambiguous interpretations of bound morphemes and the results of lexical rules). For a rough comparison with an existing NLP system with no disambiguation aids, Güngördü and Oflazer (1995) reported average parsing times of around 10 seconds per sentence for a lexicon of 24,000 free morphemes, and their morphological analyzer delivered around two analyses per second to a lexemic grammar. [Footnote 16: The morphological analyzer would be in no better position to handle morpheme-semantics pairing if the architecture in Figure 3b were implemented with an integrated lexicon of roots and affixes. For instance, -POSS would still require distinct states because of the difference in the semantics of possession and agreement marking coming from the lexicon.]
Oflazer's later (1996) morphological analyzer contained an abstract morphotactic component of around 50 states for inflections, which grew on compilation to 30,000 states and 100,000 transitions when the morphophonemic rules were added to the system.</Paragraph> <Paragraph position="29"> In conclusion, we note that the current proposal for a morphemic lexicon and grammar is compatible both with a separate morphological component (Figure 3b) and with syntax-integrated inflectional morphology (Figure 3c). The architecture in Figure 3b may in fact be more suitable for inflecting languages (e.g., Russian) in which the surface forms of bound morphemes are difficult to isolate (e.g., méste, the locative singular of mésto) but can be delivered as a sequence of morpheme labels by a morphological analyzer (e.g., mésto-SING-LOC), to be matched with the lexical type assignments to -SING and -LOC for grammatical interpretation.</Paragraph> <Paragraph position="30"> It might be argued that in computational models of the type in Figure 3b, the lattice is not necessary, because the morphological analyzer embodies the tactical component. But the lattice in type assignments regulates not only tactical problems (cf. Example (18) and its discussion) but also transparent scoping in syntax and semantics, and the latter is our main concern. We show examples of such cases in the remainder of the article. Thus the nonredundant role of the lattice decouples the morphemic grammar-lexicon from the kind of morphological analysis performed in the back end.

5. Case Study: The English Plural

In this section, we present a morphosyntactic treatment of the English plural morpheme. The lattice for English is shown in Figure 2b. We follow Carpenter (1997) in categorizing numerical modifiers and intersective adjectives as plural-noun modifiers: four boys is interpreted as four(plu boy) and green boxes as green(plu box). This bracketing reflects the "set of sets" interpretation of the plural noun; four(plu boy) denotes the set of nonempty nonsingleton sets of boys with four members. The type assignments in (20) correctly interpret the interaction of the plural and these modifiers (cf. (21a-b)). The endotypic category of the plural also allows phrase-internal number agreement for languages that require it; the agreement can be regulated over the category N before the specifier is applied to the noun group to obtain NP.</Paragraph> <Paragraph position="31"> Carpenter (1997) points out that nonintersective adjectives (e.g., toy, fake, alleged) are unlike numerical modifiers and intersective adjectives in that their semantics requires phrasal (wide) scope for -PLU, corresponding to the "set of things" interpretation of the plural noun. Thus, toy guns is interpreted as plu(toy gun), because the plural outscopes the modification. It denotes a nonempty nonsingleton set of things that are not really guns but toy guns. *toy(plu gun) would wrongly interpret toy as modifying sets of real guns. The situation is precisely the opposite of (21); we need the second derivational pattern to go through and the first one to fail.
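The two scopes can be checked extensionally in a toy model (the domain, the entity names, and the toy-counterpart map are invented; plu follows the "nonempty nonsingleton sets" definition used throughout this section):

    # Sketch: "set of sets" plural and the two scopes of modification.
    from itertools import combinations

    def plu(prop):
        """Nonempty nonsingleton subsets of a property's extension."""
        return {frozenset(c) for r in range(2, len(prop) + 1)
                for c in combinations(sorted(prop), r)}

    def four(sets):        # numerical modifier over plural denotations
        return {s for s in sets if len(s) == 4}

    def green(sets):       # intersective adjective as a plural-noun modifier
        return {s for s in sets if all("green" in x for x in s)}

    TOY_COUNTERPARTS = {"gun": {"toy-g1", "toy-g2"}}
    def toy(prop_name, prop):          # nonintersective: shifts the property
        return TOY_COUNTERPARTS[prop_name] - prop

    box = {"green-b1", "green-b2", "green-b3", "green-b4", "red-b5"}
    gun = {"g1", "g2", "g3"}

    print(len(four(green(plu(box)))))  # four green boxes: four(green(plu box)) -> 1
    print(plu(toy("gun", gun)))        # toy guns: plu(toy gun), wide-scope plural
    # *toy(plu gun) is incoherent here: toy maps properties, not sets of sets.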
In our formulation, there is no lexical entry for inflected forms and no phonologically null type assignment to account for the distinction in different types of plural modification; there is only one (phonologically realized) category for -PLU.</Paragraph> <Paragraph position="32"> The modifiers differ only in the kind and degree of morphosyntactic control. Strict control ()onfour disallows four boy, and flexible control Computational Linguistics Volume 28, Number 2 not as *four(plu(greenbox)), and four toy guns is interpreted as four(plu(toygun)), not as *plu(four(toygun)). These derivations preserve the domain of the modifiers and the plural without rebracketing.</Paragraph> </Section> class="xml-element"></Paper>