<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1022">
<Title>ṢEMḤE: A Generalised Two-Level System</Title>
<Section position="4" start_page="0" end_page="160" type="metho">
<SectionTitle> 2 Linguistic Descriptions </SectionTitle>
<Paragraph position="0"> The linguist provides SemHe with three pieces of data: a lexicon, two-level rules and a word formation grammar. All entries take the form of Prolog terms. (Identifiers starting with an uppercase letter denote variables; otherwise they are instantiated symbols.) A lexical entry is described by the term synword(<morpheme>, <category>). </Paragraph>
<Paragraph position="1"> Categories are of the form <type> : <feature structure>, e.g. root : [measure = M], a notational variant of the PATR-II category formalism (Shieber, 1986). </Paragraph>
<Paragraph position="4"> The name SemHe is not an acronym, but the title of a grammatical treatise written by the Syriac polymath (inter alia mathematician and grammarian) Bar 'Ebrāyā (1225-1286), viz. ktābā d-ṣemḥē 'The Book of Rays'. </Paragraph>
<Paragraph position="5"> We describe here the terms which are relevant to this paper; for a full description, see (Kiraz, 1996).
tl_alphabet(0, [k, t, b, a, e]).      % surface alphabet
tl_alphabet(1, [c1, c2, c3, v, ~]).
tl_alphabet(2, [k, t, b, ~]).
tl_alphabet(3, [a, e, ~]).            % lexical alphabets
tl_set(radical, [k, t, b]).
tl_set(vowel, [a, e]).
tl_set(c1c3, [c1, c3]).               % variable sets
[radical(C)], [[], [root:[measure=pa''el]], []]).
Listing 1 </Paragraph>
<Paragraph position="8"> A two-level rule is described using a syntactic variant of the formalism described by (Ruessink, 1989; Pulman and Hepple, 1993), including the extensions by (Kiraz, 1994c): tl_rule(<id>, <LLC>, <Lex>, <RLC>, <Op>, <LSC>, <Surf>, <RSC>, <variables>, <features>). </Paragraph>
<Paragraph position="9"> The arguments are: (1) a rule identifier, id; (2) the left-lexical-context, LLC, the lexical centre, Lex, and the right-lexical-context, RLC, each in the form of a list-of-lists, where the ith list represents the ith lexical tape; (3) an operator, => for optional rules or <=> for obligatory rules; (4) the left-surface-context, LSC, the surface centre, Surf, and the right-surface-context, RSC, each in the form of a list; (5) a list of the variables used in the lexical and surface expressions, each member in the form of a predicate indicating the set identifier (see infra) and an argument indicating the variable in question; and (6) a set of features (i.e. category forms) in the form of a list-of-lists, where the ith item must unify with the feature-structure of the morpheme affected by the rule on the ith lexical tape. </Paragraph>
<Paragraph position="10"> A lexical string maps to a surface string iff (1) they can be partitioned into pairs of lexical-surface subsequences, where each pair is licenced by a rule, and (2) no partition violates an obligatory rule. </Paragraph>
<Paragraph position="11"> Alphabet declarations take the form tl_alphabet(<tape>, <symbol_list>), and variable sets are described by the predicate tl_set(<id>, <symbol_list>). Word formation rules take the form of unification-based CFG rules, synrule(<identifier>, <mother>, [<daughter1>, ..., <daughtern>]). </Paragraph>
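<Paragraph> To make the two notations above concrete, the following is a minimal, hypothetical sketch of a word-formation rule and a two-level rule written out as synrule and tl_rule terms. The rule identifiers, contexts, operators and feature values shown here are illustrative assumptions only; they are not the rules of Listing 1, nor the paper's actual word grammar.
% Hypothetical word-formation rule: a stem is built from a pattern, a root
% and a vocalism whose 'measure' features are assumed here to unify.
synrule(stem_rule,
        stem : [measure = M],
        [ pattern  : [],
          root     : [measure = M],
          vocalism : [measure = M] ]).

% Hypothetical two-level rule over the three lexical tapes (pattern, root,
% vocalism): a pattern-tape vowel slot v, paired with nothing on the root
% tape and a vowel V on the vocalism tape, surfaces as V in any context.
tl_rule(vowel_example,          % id
        [[], [], []],           % LLC: left lexical contexts, unspecified
        [[v], [], [V]],         % Lex: lexical centre on the three tapes
        [[], [], []],           % RLC: right lexical contexts, unspecified
        (=>),                   % optional rule (parenthesised in case => is an operator)
        [],                     % LSC: left surface context, unspecified
        [V],                    % Sur: surface centre
        [],                     % RSC: right surface context, unspecified
        [vowel(V)],             % V ranges over the declared set 'vowel'
        [[], [], []]).          % features: no constraints assumed
</Paragraph>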
<Paragraph position="12"> The following example illustrates the derivation of Syriac /ktab/ 'he wrote' (in the simple p'al measure) from the pattern morpheme {cvcvc} 'verbal pattern', the root {ktb} 'notion of writing', and the vocalism {a}. The three morphemes produce the underlying form */katab/, which surfaces as /ktab/ since short vowels in open unstressed syllables are deleted. The process is illustrated in (1).
(1)   vocalism:        a
      pattern:     c   v   c   v   c   =   */katab/   =>   /ktab/
      root:        k       t       b
The pa''el measure of the same verb, viz. /katteb/, is derived by geminating the middle consonant (i.e. t) and applying the appropriate vocalism {ae}. The two-level grammar (Listing 1) assumes three lexical tapes. Uninstantiated contexts are denoted by an empty list. R1 is the morpheme boundary (= ~) rule. R2 and R3 sanction stem consonants and vowels, respectively. R4 is the obligatory vowel deletion rule. R5 and R6 map the second radical, [t], for p'al and pa''el forms, respectively. In this example, the lexicon contains the entries in (2).
(2)   synword(c1vc2vc3, pattern : []).
      synword(ktb, root : [measure = M]).
      synword(aa, vocalism : [measure = p'al]).
      synword(ae, vocalism : [measure = pa''el]).
</Paragraph>
<Paragraph position="13"> (Spirantization is ignored here; for a discussion on Syriac spirantization, see (Kiraz, 1995).) </Paragraph>
<Paragraph position="14"> Note that the value of 'measure' in the root entry is uninstantiated; it is determined from the feature values in R5, R6 and/or the word grammar (see infra, §4.3). </Paragraph>
</Section>
<Section position="5" start_page="160" end_page="161" type="metho">
<SectionTitle> 3 Implementation </SectionTitle>
<Paragraph position="0"> There are two current methods for implementing two-level rules (both implemented in SemHe): (1) compiling rules into finite-state automata (multi-tape transducers in our case), and (2) interpreting rules directly. The former provides better performance, while the latter facilitates the debugging of grammars (by tracing and by providing debugging utilities along the lines of (Carter, 1995)). Additionally, the interpreter facilitates the incremental compilation of rules by simply allowing the user to toggle rules on and off. </Paragraph>
<Paragraph position="1"> The compilation of the above formalism into automata is described by (Grimley-Evans et al., 1996). </Paragraph>
<Paragraph position="2"> The following is a description of the interpreter. </Paragraph>
<Section position="1" start_page="160" end_page="160" type="sub_section">
<SectionTitle> 3.1 Internal Representation </SectionTitle>
<Paragraph position="0"> The word grammar is compiled into a shift-reduce parser. In addition, a first-and-follow algorithm, based on (Aho and Ullman, 1977), is applied to compute the feasible follow categories for each category type. The set of feasible follow categories, NextCats, of a particular category Cat is returned by the predicate FOLLOW(+Cat, -NextCats). Additionally, FOLLOW(bos, NextCats) returns the set of category symbols at the beginning of strings, and eos ∈ NextCats indicates that Cat may occur at the end of strings. </Paragraph>
<Paragraph position="1"> The lexical component is implemented as character tries (Knuth, 1973), one per tape. </Paragraph>
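<Paragraph> As an illustration of this data structure, the following is a minimal sketch of one way a per-tape lexicon could be stored as a character trie in Prolog. The term representation (trie/2) and the predicate names empty_trie, trie_insert and trie_step are assumptions introduced for this sketch, not SemHe's actual predicates; trie_step merely plays the role that LEXICAL-TRANSITIONS plays in the description below.
:- use_module(library(lists)).            % select/3, memberchk/2

% trie(Cats, Edges): Cats lists the categories of morphemes ending at this
% node; Edges is a list of Symbol-Subtrie pairs.
empty_trie(trie([], [])).

% trie_insert(+Symbols, +Cat, +Trie0, -Trie): add one morpheme with its category.
trie_insert([], Cat, trie(Cats, Edges), trie([Cat|Cats], Edges)).
trie_insert([S|Ss], Cat, trie(Cats, Edges0), trie(Cats, Edges)) :-
    (   select(S-Sub0, Edges0, Rest)      % an edge on S already exists
    ->  trie_insert(Ss, Cat, Sub0, Sub),
        Edges = [S-Sub|Rest]
    ;   empty_trie(Empty),                % otherwise grow a new branch
        trie_insert(Ss, Cat, Empty, Sub),
        Edges = [S-Sub|Edges0]
    ).

% trie_step(+Symbol, +Trie, -SubTrie, -Cats): one transition; Cats are the
% categories of any morpheme ending at the node reached.
trie_step(S, trie(_, Edges), Sub, Cats) :-
    memberchk(S-Sub, Edges),
    Sub = trie(Cats, _).

% Example query over a hypothetical root-tape lexicon:
% ?- empty_trie(T0), trie_insert([k,t,b], root:[], T0, T1),
%    trie_step(k, T1, T2, _), trie_step(t, T2, T3, _), trie_step(b, T3, _, Cats).
% Cats = [root:[]].
</Paragraph>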
<Paragraph> Given a list of lexical strings, Lex, and a list of lexical pointers, LexPtrs, the predicate LEXICAL-TRANSITIONS(+Lex, +LexPtrs, -NewLexPtrs, -LexCats) succeeds iff there are transitions on Lex from LexPtrs; it returns NewLexPtrs and the categories, LexCats, at the end of morphemes, if any. </Paragraph>
<Paragraph position="2"> Two-level predicates are converted into an internal representation: (1) every left-context expression is reversed and appended to an uninstantiated tail; (2) every right-context expression is appended to an uninstantiated tail; and (3) each rule is assigned a 6-bit 'precedence value' where every bit represents one of the six lexical and surface expressions. If an expression is not an empty list (i.e. context is specified), the relevant bit is set. In analysis, surface expressions are assigned the most significant bits, while lexical expressions are assigned the least significant ones. In generation, the opposite state of affairs holds. Rules are then reasserted in the order of their precedence value. This ensures that rules which contain the most specified expressions are tested first, resulting in better performance. </Paragraph>
</Section>
<Section position="2" start_page="160" end_page="161" type="sub_section">
<SectionTitle> 3.2 The Interpreter Algorithm </SectionTitle>
<Paragraph position="0"> The algorithms presented below are given in terms of Prolog-like non-deterministic operations. A clause is satisfied iff all the conditions under it are satisfied. The predicates are depicted top-down in (3). (SemHe makes use of an earlier implementation by (Pulman and Hepple, 1993).) </Paragraph>
<Paragraph position="2"> In order to minimise accumulator-passing arguments, we assume the following initially-empty stacks: ParseStack accumulates the category structures of the morphemes identified, and FeatureStack maintains the rule features encountered so far. ('+' indicates concatenation.) PARTITION partitions a two-level analysis into sequences of lexical-surface pairs, each licenced by a rule. The base case of the predicate is given in Listing 2, and the recursive case in Listing 3. </Paragraph>
<Paragraph position="3"> The recursive COERCE predicate ensures that no partition is violated by an obligatory rule. It takes three arguments: Result is the output of PARTITION (usually reversed by the calling predicate; hence, COERCE deals with the last partition first), PrevCats is a register which keeps track of the last morpheme category encountered, and Partition returns selected elements from Result. The base case of the predicate is simply COERCE([], _, []) - i.e., no more partitions. The recursive case is shown in Listing 4. </Paragraph>
<Paragraph position="4"> CurrentCats keeps track of the category of the morpheme which occurs in the current partition. The invalidity of a partition is determined by INVALID-PARTITION (Listing 5). </Paragraph>
<Paragraph position="5"> TWO-LEVEL-ANALYSIS (Listing 6) is the main predicate. It takes a surface string or lexical string(s) and returns a list of partitions and a </Paragraph>
<Paragraph position="6"> (For efficiency, variables appearing in left-context and centre expressions are evaluated after LEXICAL-TRANSITIONS since they will be fully instantiated then; only right-contexts are evaluated after the recursion.) </Paragraph>
</Section>
</Section>
</Paper>
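Returning to the rule ordering described in Section 3.1, the following is a minimal sketch of how the 6-bit 'precedence value' could be computed. The predicate names and the exact bit layout are assumptions made for this illustration; the text only fixes the general idea that each of the six lexical and surface expressions contributes one bit, that a non-empty expression sets its bit, and that surface expressions are most significant in analysis (lexical ones in generation). Unspecified expressions are represented as empty lists, as elsewhere in the paper.
% expression_bit(+Expr, -Bit): 1 if the expression is specified (non-empty).
expression_bit([], 0) :- !.
expression_bit(_, 1).

% precedence_value(+Mode, +LexExprs, +SurfExprs, -Value)
% LexExprs = [LLC, Lex, RLC] and SurfExprs = [LSC, Sur, RSC] are the six
% expressions of a rule; Mode is analysis or generation.
precedence_value(analysis, LexExprs, SurfExprs, Value) :-
    bits_value(SurfExprs, LexExprs, Value).     % surface bits most significant
precedence_value(generation, LexExprs, SurfExprs, Value) :-
    bits_value(LexExprs, SurfExprs, Value).     % lexical bits most significant

% bits_value(+HighExprs, +LowExprs, -Value): pack the six bits into an integer.
bits_value(HighExprs, LowExprs, Value) :-
    bits_value_(HighExprs, 0, Acc),
    bits_value_(LowExprs, Acc, Value).

bits_value_([], Value, Value).
bits_value_([Expr|Exprs], Acc0, Value) :-
    expression_bit(Expr, Bit),
    Acc1 is Acc0 * 2 + Bit,
    bits_value_(Exprs, Acc1, Value).

% Example: in analysis, a rule with both surface contexts specified outranks
% any rule whose specificity lies only on the lexical side.
% ?- precedence_value(analysis, [[], [[t]], []], [[a], [t], []], V).
% V = 50.     % binary 110010
Rules would then be reasserted in descending order of this value, so that the most specific rules are tried first, as described above.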