<?xml version="1.0" standalone="yes"?> <Paper uid="P89-1017"> <Title>How to cover a grammar René Leermakers</Title> <Section position="4" start_page="0" end_page="135" type="metho"> <SectionTitle> 2 C-Parser </SectionTitle> <Paragraph position="0"> The simplest parser that is applicable to all context-free languages is the well-known Cocke-Younger-Kasami (CYK) parser. It requires the grammar to be cast in Chomsky normal form. The CYK parser constructs, for the sentence $x_1 \ldots x_n$, a parse matrix $T$. To each part $x_{i+1} \ldots x_j$ of the input corresponds the matrix element $T_{ij}$, the value of which is a set of non-terminals from which one can derive $x_{i+1} \ldots x_j$. The algorithm can easily be generalized to work for any grammar, but its complexity then increases with the number of non-terminals at the right hand side of grammar rules. Bilinear grammars have the lowest complexity, disregarding linear grammars, which do not have the generative power of general context-free grammars. Below we list the recursion relation $T$ must satisfy for general bilinear grammars. We write the grammar as a four-tuple $(N, \Sigma, P, S)$, where $N$ is the set of non-terminals, $\Sigma$ the set of terminals, $P$ the set of production rules, and $S \in N$ the start symbol. We use variables $I, J, K, L \in N$, $\beta_1, \beta_2, \beta_3 \in \Sigma^*$, and $i, j, k_1 \ldots k_4$ as indices of the matrix $T$ (see footnote 1).</Paragraph> <Paragraph position="2"> The relation can be solved for the diagonal elements $T_{ii}$, independently of the input sentence. They are equal to the set of non-terminals that derive $\epsilon$ in one or more steps. Footnote 1: Throughout the paper we identify a grammar rule $I \to \xi$ with the boolean expression '$I$ directly derives $\xi$'.</Paragraph> <Paragraph position="3"> Algorithms that construct $T$ for given input will be referred to as C-parsers. The time needed for constructing $T$ is at most a cubic function of the input length $n$, while it takes an amount of space that is a quadratic function of $n$. The sentence is successfully parsed if $S \in T_{0n}$. 
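As a rough illustration of the parse matrix described above, the following Python sketch builds $T$ for the special case of a grammar in Chomsky normal form. The names cyk_matrix, recognizes, unary and binary are assumptions made for this example and do not occur in the paper; the general bilinear recursion is not reproduced here.

```python
from itertools import product


def cyk_matrix(words, unary, binary):
    """Build the CYK parse matrix T for a grammar in Chomsky normal form.

    unary maps a terminal to the set of non-terminals that rewrite to it.
    binary maps a pair (J, K) to the set of non-terminals I with a rule I -> J K.
    T[i][j] holds the non-terminals that derive words[i:j].
    """
    n = len(words)
    T = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    # Spans of length one come straight from the terminal rules.
    for i, word in enumerate(words):
        T[i][i + 1] = set(unary.get(word, ()))
    # Longer spans combine two adjacent shorter spans (cubic in n overall).
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for J, K in product(T[i][k], T[k][j]):
                    T[i][j] |= binary.get((J, K), set())
    return T


def recognizes(words, unary, binary, start="S"):
    """The sentence is accepted when the start symbol covers the whole input."""
    T = cyk_matrix(words, unary, binary)
    return start in T[0][len(words)]
```

For a toy grammar with rules S -> A B, A -> a and B -> b, one would call recognizes(["a", "b"], {"a": {"A"}, "b": {"B"}}, {("A", "B"): {"S"}}). The matrix uses quadratic space and the triple loop over (span, i, k) gives the cubic time bound mentioned in the text.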
From $T$, one can simply deduce an output grammar $O$, which represents the set of parse trees. Its non-terminals are triples $\langle I, i, j \rangle$, where $I$ is a non-terminal of the original bilinear grammar, and $i, j$ are integers between 0 and $n$.</Paragraph> <Paragraph position="5"> The grammar rules of $O$ are such that they generate only the sentence that was parsed. The parse trees according to the output grammar are isomorphic to the parse trees generated by the original grammar. The latter parse trees can be obtained from the former by replacing the triple non-terminals by their first element.</Paragraph> <Paragraph position="6"> Matrix elements of $T$ are such that their members cover part of the input. This does not imply that all members are useful for constructing a possible parse of the input as a whole. In fact, many are useless for this purpose. Depending on the grammar, knowledge of part of $T$ may give restrictions on the possibly useful contents of the rest of $T$. Making use of these restrictions, one may get more efficient parsers, with the same functionality. As an example, one has the generalized Earley prediction. It involves functions $\mathrm{predict}_k : 2^N \to 2^N$ ($N$ is the set of non-terminals), such that one can prove that the useful contents of the $T_{ij}$ are contained in the elements of a matrix $\Theta$ related to $T$ by $\Theta_{00} = \Theta^\epsilon$, $\Theta_{ij} = \mathrm{predict}_{j-i}(\bigcup_{k=0}^{i} \Theta_{ki}) \cap T_{ij}$ if $j > 0$, where $\Theta^\epsilon$, called the initial prediction, is some constant set of non-terminals that derive $\epsilon$. It follows that $T_{ij}$ can be calculated from the matrix elements $\Theta_{kl}$ with $i \le k$, $l \le j$, i.e. the occurrences of $T$ at the right hand side of the recurrence relation may be replaced by $\Theta$. Hence $\Theta_{ij}$, $j > 0$, can be calculated from the matrix elements</Paragraph> <Paragraph position="8"> The algorithm that creates the matrix $\Theta$ in this way, scanning the input from left to right, is called a restricted C-parser. 
The above relation does not determine the diagonal elements of $\Theta$ uniquely, and a restricted C-parser is to find the smallest solution. Concerning the gain of efficiency, it should be noted that this is very grammar-dependent. For some grammars, restriction of the parser reduces its complexity, while for others predict functions may even be counter-productive [4].</Paragraph> </Section> <Section position="5" start_page="135" end_page="136" type="metho"> <SectionTitle> 3 Bilinear covers </SectionTitle> <Paragraph position="0"> A grammar $G$ is said to be covered by a grammar $C(G)$ if the language generated by both grammars is identical, and if for each sentence the set of parse trees generated by $G$ can be recovered from the set of parse trees generated by $C(G)$. The grammar $C(G)$ is called a cover for $G$, and we will be interested in covers that are bilinear, and can thus be parsed by C-parser. It is rather surprising that at the heart of most parsing algorithms for context-free languages lies a method for deriving a bilinear cover.</Paragraph> <Section position="1" start_page="135" end_page="136" type="sub_section"> <SectionTitle> 3.1 Earley's method </SectionTitle> <Paragraph position="0"> Earley's construction of items is a clear example of a construction of a bilinear cover $C_E(G)$ for each context-free grammar $G$. The terminals of $C_E(G)$ and $G$ are identical, the non-terminals of $C_E(G)$ are the items (dotted rules [1]) $I_i^k$, defined as follows. Let the non-terminal defined by rule $i$ of grammar $G$ be given by $N_i$, then $I_i^k$ is $N_i \to \alpha . \beta$, with $|\beta| + 1 = k$ ($\alpha$, $\beta$ are used for sequences of terminals and non-terminals). We assume that only one rule, rule 0, of $G$ rewrites the start symbol $S$. The length of the right-hand side of rule $i$ is given by $M_i - 1$.</Paragraph> <Paragraph position="1"> The rules of $C_E(G)$ are derived as follows.</Paragraph> <Paragraph position="2"> * Let $I_i^k$ be an item of the form $A \to \alpha . B\beta$, and hence $I_i^{k-1}$ be $A \to \alpha B . \beta$. 
Then if $B$ is a terminal, $I_i^{k-1} \to I_i^k B$, and if $B$ is a non-terminal then $I_i^{k-1} \to I_i^k I_j^0$, for all $j$ such that $N_j = B$.</Paragraph> <Paragraph position="3"> * Initial items of the form $N_i \to . \alpha$ rewrite to $\epsilon$: $I_i^{M_i} \to \epsilon$. * For each $i$ one has the final rule $I_i^0 \to I_i^1$.</Paragraph> <Paragraph position="4"> In [4] a similar construction was given, leading to a grammar in canonical two-form for each context-free grammar. Among other things it differs from the above in the appearance of the final rules, which are indeed superfluous. We have introduced them to make the extension to RTN's, in section 4, more immediate.</Paragraph> <Paragraph position="5"> The description just given yields a set of production rules consisting of sections $P_i$ that have the following structure: $P_i = \bigcup_{k=2}^{M_i} \{I_i^{k-1} \to I_i^k x_i^k\} \cup \{I_i^{M_i} \to \epsilon\} \cup \{I_i^0 \to I_i^1\}$, where $x_i^k \in \bigcup_j \{I_j^0\} \cup \Sigma$. Note that the start symbol of the cover is $I_0^0$. The construction of parse matrices $T$ by C-parser yields the Earley algorithm, without its prediction part. By restricting the parser by the $\mathrm{predict}_0$ function satisfying</Paragraph> <Paragraph position="7"> the initial prediction $\Theta^\epsilon$ being the smallest solution of $\mathcal{S} = \mathrm{predict}_0(\mathcal{S} \cup \{I_0^{M_0}\})$, one obtains a conventional Earley parser ($\mathrm{predict}_k \equiv \bigcup_{i,j} \{I_i^j\}$ for $k > 0$). The cover is such that usually the predict action speeds up the parser considerably. There are many ways to define covers with dotted rules as non-terminals. For example, from recent work by Kruseman Aretz [6], we learn a prescription for a bilinear cover for $G$, which is smaller in size compared to $C_E(G)$, at the cost of rules with longer right hand sides. The prescription is as follows ($\alpha$, $\beta$, $\gamma$, $\xi$ are sequences of terminals and non-terminals, $\delta$ stands for sequences of terminals only, and $A$, $B$, $C$ are non-terminals): * Let $I$ be an item of the form $A \to \alpha . B\xi$, and $K$ an item $B \to \gamma .$, then $J \to I K \delta$, where either $J$ is item $A \to \alpha B \delta . C\beta$ and $\xi = \delta C \beta$, or $J$ is item $A \to \alpha B \delta .$ 
and $\xi = \delta$.</Paragraph> <Paragraph position="8"> * Let $I$ be an item of the form $A \to \delta . B\alpha$ or $A \to \delta .$, then $I \to \delta$.</Paragraph> </Section> <Section position="2" start_page="136" end_page="136" type="sub_section"> <SectionTitle> 3.2 Lang grammar </SectionTitle> <Paragraph position="0"> In a similar fashion the items used by Lang [2] in his algorithm for non-deterministic pushdown automata (NPDA) may be interpreted as non-terminals of a bilinear grammar, which we will call the Lang grammar.</Paragraph> <Paragraph position="1"> We adopt restrictions on NPDA's similarly to [2], the main one being that one or two symbols be pushed on the stack in a single move, and each stack symbol is removed when it is read. If two symbols are pushed on the stack, the bottom one must be identical to the symbol that is removed in the same transition. Formally we write an NPDA as a 7-tuple $(Q, \Sigma, \Gamma, \delta, q_0, C_0, F)$, where $Q$ is the set of state symbols, $\Sigma$ the input alphabet, $\Gamma$ the pushdown symbols, $\delta : Q \times (\Gamma \cup \{\epsilon\}) \times (\Sigma \cup \{\epsilon\}) \to 2^{Q \times (\{\epsilon\} \cup \Gamma \cup (\Gamma \times \Gamma))}$ the transition function, $q_0 \in Q$ the initial state, $C_0 \in \Gamma$ the start symbol, and $F \subseteq Q$ the set of final states. 
If the automaton is in state $p$, and $\alpha$ is the top of the stack, and the current symbol on the input tape is $y$, then it may make the following eight types of moves: if $(r, \epsilon) \in \delta(p, \epsilon, \epsilon)$: go to state $r$; if $(r, \epsilon) \in \delta(p, \alpha, \epsilon)$: pop $\alpha$, go to state $r$; if $(r, \gamma) \in \delta(p, \alpha, \epsilon)$: pop $\alpha$, push $\gamma$, go to state $r$; if $(r, \epsilon) \in \delta(p, \epsilon, y)$: shift input tape, go to state $r$; if $(r, \gamma) \in \delta(p, \epsilon, y)$: push $\gamma$, shift tape, go to $r$; if $(r, \epsilon) \in \delta(p, \alpha, y)$: pop $\alpha$, shift tape, go to $r$; if $(r, \gamma) \in \delta(p, \alpha, y)$: pop $\alpha$, push $\gamma$, shift tape, go to $r$; if $(r, \gamma\alpha) \in \delta(p, \alpha, y)$: push $\gamma$, shift tape, go to $r$. We do not allow transitions such that $(r, \gamma) \in \delta(p, \epsilon, \epsilon)$, or $(r, \gamma\alpha) \in \delta(p, \alpha, \epsilon)$, and assume that the initial state can not be reached from other states.</Paragraph> <Paragraph position="2"> The non-terminals of the Lang grammar are the start symbol $S$ and four-tuple entities (Lang's 'items') of the form $\langle q, \alpha, p, \beta \rangle$, where $p$ and $q$ are states, and $\alpha$ and $\beta$ stack symbols. The idea is that iff there exists a computation that consumes input symbols $x_i \ldots x_j$, starting at state $p$ with a stack $\beta\omega$ (the leftmost symbol is the top), and ending in state $q$ with stack $\alpha\beta\omega$, and if the stack $\beta\omega$ does not re-occur in intermediate configurations, then $\langle q, \alpha, p, \beta \rangle \to^* x_i \ldots x_j$. The rewrite rules of the Lang grammar are defined as follows (universal quantification over $p, q, r, s \in Q$; $\alpha, \beta, \gamma \in \Gamma$; $x \in \Sigma \cup \epsilon$, $y \in \Sigma$ is understood):</Paragraph> <Paragraph position="4"> grammars that generate the same language [5]. The above construction yields such a grammar in bilinear form.</Paragraph> <Paragraph position="5"> It only works for automata that have transitions of the kind we use above. Lang grammars are rather big, in the rough form given above. Many of the non-terminals do not occur, however, in the derivation of any sentence.</Paragraph> <Paragraph position="6"> They can be removed by a standard procedure [5]. 
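The standard procedure is not spelled out in the text. A minimal sketch of one common textbook variant (first keep only non-terminals that derive some terminal string, then only those reachable from the start symbol) might look as follows. The function name remove_useless and the encoding of rules as (lhs, rhs) pairs are assumptions for illustration, not the procedure of [5] verbatim.

```python
def remove_useless(rules, start):
    """Drop rules that cannot take part in deriving a sentence from start.

    rules is a list of (lhs, rhs) pairs, rhs being a tuple of symbols.
    Assumption: a symbol is a non-terminal exactly when it occurs as some lhs.
    """
    nonterminals = {lhs for lhs, _ in rules}

    def usable(rhs, productive):
        # Every symbol is either a terminal or an already productive non-terminal.
        return all(s in productive or s not in nonterminals for s in rhs)

    # Step 1: non-terminals that derive at least one string of terminals.
    productive = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in productive and usable(rhs, productive):
                productive.add(lhs)
                changed = True
    kept = [(lhs, rhs) for lhs, rhs in rules
            if lhs in productive and usable(rhs, productive)]

    # Step 2: non-terminals reachable from the start symbol via kept rules.
    reachable = {start}
    pending = [start]
    while pending:
        current = pending.pop()
        for lhs, rhs in kept:
            if lhs == current:
                for s in rhs:
                    if s in nonterminals and s not in reachable:
                        reachable.add(s)
                        pending.append(s)
    return [(lhs, rhs) for lhs, rhs in kept if lhs in reachable]
```

The order of the two steps matters: removing unproductive symbols first can make further symbols unreachable, whereas the reverse order may leave unreachable rules behind.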
In addition, during parsing, predict functions can be used to limit the number of possible contents of parse matrix elements. The following initial prediction and predict functions render the restricted C-parser functionally equivalent to Lang's original algorithm, albeit that Lang considered a class of NPDA's which is slightly different from the class we alluded to above:</Paragraph> <Paragraph position="8"> $\cup\ \{S \mid k = n\}$ ($n$ is sentence length).</Paragraph> <Paragraph position="9"> The Tomita parser [3] simulates an NPDA, constructed from a context-free grammar via LR-parsing tables. Within our formalism we can implement this idea, and arrive at an Earley-like version of the Tomita parser, which is able to handle general context-free grammars, including cyclic ones.</Paragraph> </Section> </Section> <Section position="6" start_page="136" end_page="137" type="metho"> <SectionTitle> 4 Extension to RTN's </SectionTitle> <Paragraph position="0"> In the preceding section we discussed various ways of deriving bilinear covers. Reversely, one may try to discover what kinds of grammars are covered by certain bilinear grammars.</Paragraph> <Paragraph position="1"> A bilinear grammar $C_E(G)$, generated from a context-free grammar by the Earley prescription, has peculiar properties. In general, the sections $P_i$ defined above constitute regular subgrammars, with the $x_i^k$ as terminals. Alternatively, $P_i$ may be seen as a finite state automaton with states $I_i^k$. Each rule $I_i^{k-1} \to I_i^k x_i^k$ corresponds to a transition from $I_i^k$ to $I_i^{k-1}$ labeled by $x_i^k$. This correspondence between regular grammars and finite state automata is in fact a special instance of the correspondence between Lang bilinear grammars and NPDA's. The $P_i$ of the above kind are very restricted finite state automata, generating only one string. It is a natural step to remove this restriction and study covers that are the union of general regular subgrammars. 
Such a grammar will cover a grammar consisting of rules of the form $N_i \to \rho_i$, where $\rho_i$ is a regular expression of terminals and non-terminals. Such grammars go under the names of RTN grammars [8], or extended context-free grammars [9], or regular right part grammars [10]. Without loss of generality we may restrict the format of the finite state automata, and stipulate that it have one initial state $I_i^{M_i}$ and one final state $I_i^0$, and only the following type of rules: For future reference we define the set $I$ of non-terminals as $I = \bigcup_{i,j} \{I_i^j\}$, and its subset $I^0 = \bigcup_i \{I_i^0\}$. A covering prescription that turns an RTN into a set of such subgrammars reduces to $C_E$ if applied to normal context-free grammars, and will be referred to by the same name, although in general the above format does not determine the cover uniquely. For some example definitions of items for RTN's (i.e. the $I_i^j$), see [1,9].</Paragraph> </Section> <Section position="7" start_page="137" end_page="138" type="metho"> <SectionTitle> 5 The CNLR Cover </SectionTitle> <Paragraph position="0"> A different cover for RTN grammars may be derived from the one discussed in the previous section. So our starting point is that we have a bilinear grammar $C_E(G)$, consisting of regular subgrammars. We (approximately) follow the idea of Tomita, and construct an NPDA from an LR(0)-automaton, whose states are sets of items. In our case, the items are the non-terminals of $C_E(G)$. The full specification of the automaton is extracted from [9] in a straightforward way. Subsequently, the general prescription of chapter 3 yields a bilinear grammar. 
In this way we arrive at what we would like to call the canonical non-deterministic LR-parser (CNLR parser, for short).</Paragraph> <Section position="1" start_page="137" end_page="137" type="sub_section"> <SectionTitle> 5.1 LR(0) states </SectionTitle> <Paragraph position="0"> In order to derive the set $Q$ of LR(0) states, which are subsets of $I$, we first need a few definitions. Let $s$ be an</Paragraph> <Paragraph position="2"> Similarly, the sets $\mathrm{goto}_1(s, x)$, and $\mathrm{goto}_2(s, x)$, where $x \in$</Paragraph> <Paragraph position="4"> The automaton we look for can be constructed in terms of the LR(0) states. In addition to the goto functions, we will need the predicate reduce, defined by $\mathrm{reduce}(s, I_i^0) \equiv \exists_j ((I_i^0 \to I_i^j) \wedge I_i^j \in s)$.</Paragraph> <Paragraph position="5"> A point of interest is the possible existence of stacking conflicts [9]. These arise if for some $s$, $x$ both $\mathrm{goto}_1(s, x)$ and $\mathrm{goto}_2(s, x)$ are not empty. Stacking conflicts cause an increase of non-determinism that can always be avoided by removing the conflicts. One method for doing this has been detailed in [9], and consists of the splitting in parts of the right hand side of grammar rules that cause conflicts. Here we need not and will not assume anything about the occurrence of stacking conflicts.</Paragraph> <Paragraph position="6"> Grammars of which Earley covers do not give rise to stacking conflicts form a proper subset of the set of extended context-free grammars. It could very well be that natural language grammars, written as RTN's in order to produce 'natural' syntax trees, generally belong to this subset. 
For an example, see section 6.</Paragraph> </Section> <Section position="2" start_page="137" end_page="137" type="sub_section"> <SectionTitle> 5.2 The automaton </SectionTitle> <Paragraph position="0"> To determine the automaton we specify, in addition to the set of states $Q$, the set of stack symbols $\Gamma = Q \cup I^0 \cup \{C_0\}$, the initial state $q_0 = \mathrm{closure}(\{I_0^{M_0}\})$, the final states $F = \{s \mid \mathrm{reduce}(s, I_0^0)\}$, and the transition function</Paragraph> <Paragraph position="2"/> </Section> <Section position="3" start_page="137" end_page="138" type="sub_section"> <SectionTitle> 5.3 The grammar </SectionTitle> <Paragraph position="0"> From the automaton, which is of the type discussed in section 3.2, we deduce the bilinear grammar</Paragraph> <Paragraph position="2"> where $s, t, q, p \in Q$, $r \in Q \cup \{C_0\}$, $\alpha, \beta \in \Gamma$, $y \in \Sigma$.</Paragraph> <Paragraph position="3"> As was mentioned in section 3.2, this grammar can be reduced by a standard algorithm to contain only useful non-terminals.</Paragraph> <Paragraph position="4"> If the reduction algorithm of [5] is performed, it turns out that the structure of the above grammar is such that useful non-terminals $\langle p, \alpha, q, \beta \rangle$ satisfy $\alpha \in Q \Rightarrow \alpha = q$ and $\alpha \notin Q \Rightarrow p = q$. Furthermore, two non-terminals that differ only in their fourth tuple-element always derive the same strings of terminals. Hence, the fourth element can safely be discarded, as can the second if it is in $Q$ and the first if the second is not in $Q$. The non-terminals then become pairs $\langle \sigma, s \rangle$, with $\sigma \in \Gamma$ and $s \in Q$. For such non-terminals, the predict functions, mentioned in section 2, must be changed:</Paragraph> <Paragraph position="6"> Note that the non-terminal $\langle q_0, q_0 \rangle$ does not appear in this grammar, but will appear in the parse matrix because of the initial prediction $\Theta^\epsilon$. 
Of course, when the automaton is fully specified for a particular language, the corresponding CNLR grammar can be reduced still further, see section 6.4.</Paragraph> <Paragraph position="7"> Even the grammar in reduced form contains many non-terminals that derive the same set of strings. In particular, all non-terminals that only differ in their second component generate the same language. Thus, the second component only encodes information for the predict functions. The redundancy can be removed by the following means. Define the function $\phi : \Gamma \to 2^Q$, such that $\phi(\sigma) = \{s \mid \langle \sigma, s \rangle \text{ is a useful non-terminal of the above grammar}\}$.</Paragraph> <Paragraph position="8"> Then we may simply parse with the 'bare' grammar, the non-terminals of which are the automaton stack symbols</Paragraph> <Paragraph position="10"> The function $\phi$ can also be deduced directly from the bare grammar, see section 7.</Paragraph> </Section> <Section position="4" start_page="138" end_page="138" type="sub_section"> <SectionTitle> 5.4 Parse trees </SectionTitle> <Paragraph position="0"> Each parse tree $\tau$ according to the original grammar can be obtained from a corresponding parse tree $t$ according to the cover. Each subset of the set of nodes of $t$ is partially ordered by the relation 'is descendant of'. Now consider the set of nodes of $t$ that correspond to non-terminals $I_i^0$. The 'is descendant of' ordering defines a projected tree that contains, apart from the terminals, only these nodes. The desired parse tree $\tau$ is now obtained by replacing, in the projected tree, each node $I_i^0$ by a node labeled by $N_i$, the left hand side of grammar rule $i$ of the original grammar.</Paragraph> </Section> </Section> <Section position="8" start_page="138" end_page="139" type="metho"> <SectionTitle> 6 Example </SectionTitle> <Paragraph position="0"> The foregoing was rather technical and we will try to repair this by showing, very explicitly, how the formalism works for a small example grammar. 
In particular, we will, for a small RTN grammar, derive the Earley cover of section 4, and the two covers of sections 5.3.1 and 5.3.2.</Paragraph> <Section position="1" start_page="138" end_page="138" type="sub_section"> <SectionTitle> 6.1 The grammar </SectionTitle> <Paragraph position="0"> The following is a simple grammar for finite subordinate</Paragraph> </Section> <Section position="2" start_page="138" end_page="139" type="sub_section"> <SectionTitle> 6.2 The Earley cover </SectionTitle> <Paragraph position="0"> The above grammar is covered by four regular subgrammars</Paragraph> <Paragraph position="2"/> </Section> <Section position="3" start_page="139" end_page="139" type="sub_section"> <SectionTitle> 6.3 The automaton </SectionTitle> <Paragraph position="0"> The construction of section 5.1 yields the following set of states:</Paragraph> <Paragraph position="2"> The transitions are grouped into two parts. First we list the function $\mathrm{goto}_2$:</Paragraph> <Paragraph position="4"> goto~(qs, det) = qa; goto~(qs,prep) = qs; goto2(ql=, prep) &quot;J-- qs Likewise, we have the $\mathrm{goto}_1$ function, which gives the non-stacking transitions for our grammar: gotol (ql , ~) = q'a; gotol (q,, I~ ) = q,; gotol (q~, noun) = q~; gotol (qs, g) ---- qs; gotol(qs, verb) = ~,; goto~(qs, ~=) = qs; goto, (~, , Po ) = elo; goto, (es, ~) = q. ; go,o, (e., ~) = el=; go,o, (q,=, g) = e,, The predicate reduce holds for six pairs of states and non-terminals: redu~O,, Po); redu=O,o, ~); redffi~(q,, ~); reduce(q,l , \]~=); reduce(q,, g); reduce(ql=, l~a )</Paragraph> </Section> <Section position="4" start_page="139" end_page="139" type="sub_section"> <SectionTitle> 6.4 CNLR parser </SectionTitle> <Paragraph position="0"> Given the automaton, the CNLR grammar follows according to section 5.3. After removal of the useless non-terminals we arrive at the following grammar, which is of the format of section 5.3.1.</Paragraph> <Paragraph position="1"> From this grammar, the function $\phi$ can be deduced. 
It is given by</Paragraph> <Paragraph position="3"> Either by stripping the above cover, or by directly deducing it from the automaton, the bare cover can be obtained. We list it here for completeness.</Paragraph> <Paragraph position="5"> Together with the predict functions defined in section 5.3.2, this grammar should provide an efficient parser for our example grammar.</Paragraph> </Section> </Section> <Section position="9" start_page="139" end_page="140" type="metho"> <SectionTitle> 7 Tadpole Grammars </SectionTitle> <Paragraph position="0"> The function $\phi$ has been defined, in section 5, via a grammar reduction algorithm. In this section we wish to show that an alternative method exists, and, moreover, that it can be applied to the class of bilinear tadpole grammars. This class consists of all bilinear grammars without epsilon rules, and with no useless symbols, with non-terminals (the head) preceding terminals (the tail) at the right hand side of rules. Thus, rules are of the form $A \to \alpha\delta$, where we use the symbol $\delta$ as a variable over possibly empty sequences of terminals, and $\alpha$ denotes a possibly empty sequence of at most two non-terminals. Capital roman letters are used for non-terminals. Note that a CNLR cover is a member of this class of grammars, as are all grammars that are in Chomsky normal form.</Paragraph> <Paragraph position="1"> First we change the grammar a little bit by adding $q_0$ to the set of non-terminals of the grammar, assuming that it was not there yet. Next, we create a new grammar, inspired by the grammar of 5.3.1, with pairs $\langle A, C \rangle$ as non-terminals. The rules of the new grammar are such that (with implicit universal quantification over all variables, as before) The start symbol of the new grammar, which can be seen as a parametrized version of the tadpole grammar, is defined to be $\langle S, q_0 \rangle$. 
A non-terminal $\langle B, C \rangle$ is a useful one, whence $C \in \phi(B)$ according to the definition of $\phi$, if it occurs in a derivation of the parametrized grammar: $\langle S, q_0 \rangle \to^* \kappa \langle B, C \rangle \Lambda$, where $\kappa$ is an arbitrary sequence of non-terminals, and $\Lambda$ is a sequence of terminals and non-terminals. Then, we conclude that</Paragraph> <Paragraph position="3"> This definition may be rephrased without reference to the parametrized grammar. Define, for each non-terminal $A$, a set $\mathrm{firstnonts}(A)$, such that $\mathrm{firstnonts}(A) = \{B \mid A \to^* B\Lambda\}$.</Paragraph> <Paragraph position="4"> The predict set $\phi(A)$ then is obtainable as</Paragraph> <Paragraph position="6"> where $S$ is the start symbol. As in section 5.3.2, the initial prediction is given by $\Theta^\epsilon = \{q_0\}$.</Paragraph> </Section> <Section position="10" start_page="140" end_page="140" type="metho"> <SectionTitle> 8 An LL/LR-automaton </SectionTitle> <Paragraph position="0"> In order to illustrate the amount of freedom that exists for the construction of automata and associated parsers, we shall construct a non-deterministic LL/LR-automaton and the associated cover, along the lines of section 5.</Paragraph> <Section position="1" start_page="140" end_page="140" type="sub_section"> <SectionTitle> 8.1 The automaton </SectionTitle> <Paragraph position="0"> We change the goto functions, such that they yield sets of states rather than just one state, as follows: $\mathrm{goto}_1(s, x) = \{\mathrm{closure}(\{I_i^k\}) \mid I_i^j \in s \wedge (I_i^k \to I_i^j x) \wedge j \neq M_i\}$, $\mathrm{goto}_2(s, x) = \{\mathrm{closure}(\{I_i^k\}) \mid I_i^{M_i} \in s \wedge (I_i^k \to I_i^{M_i} x)\}$. The set $Q$ is changed accordingly to be the smallest one that satisfies</Paragraph> <Paragraph position="2"> Every state in this automaton is defined as a set $\mathrm{closure}(\{I_i^j\})$ and is, as a consequence, completely characterized by the one non-terminal $I_i^j$. The reason for calling the above an LL/LR-automaton lies in the fact that the states of LR(0) automata for LL(1) grammars have exactly this property. 
The predicate reduce is defined as in section 5.1.</Paragraph> </Section> <Section position="2" start_page="140" end_page="140" type="sub_section"> <SectionTitle> 8.2 The LL/LR-cover </SectionTitle> <Paragraph position="0"> The cover associated with the LL/LR-automaton just defined is a simple variant of the cover of section 5.3.2:</Paragraph> <Paragraph position="2"> As it is of the tadpole type, the predict mechanism works as explained in section 7.</Paragraph> <Paragraph position="3"> We just mentioned that each LL/LR-state, and hence each non-terminal of the LL/LR-cover, is completely characterized by one non-terminal, or 'item', of the Earley cover. This correspondence between their non-terminals leads to a tight connection between the two covers. Indeed, the cover we obtained from the LL/LR-automaton can be obtained from the cover of section 4 by eliminating the $\epsilon$-rules $I_i^{M_i} \to \epsilon$. Of course, the predict functions associated with the two covers differ considerably, as it is the non-terminals deriving $\epsilon$, the items beginning with a dot, that are the object of prediction in the Earley algorithm, and they are no longer present in the LL/LR-cover.</Paragraph> </Section> </Section> <Section position="11" start_page="140" end_page="141" type="metho"> <SectionTitle> 9 Efficiency </SectionTitle> <Paragraph position="0"> We have discussed a number of bilinear covers now, and we could add many more. In fact, the space of bilinear covers for each context-free grammar, or RTN grammar, is huge. The optimal one would be the one that makes C-parser spend the least time on the average sentence.</Paragraph> <Paragraph position="1"> In general, the least time will be more or less equivalent to the smallest content of the parse matrix. Naively, this content would be proportional to the size of the cover. Under this assumption, the smallest cover would be optimal. 
Note that the number of non-terminals of the CNLR cover is equal to the number of states of the LR-automaton plus the number of non-terminals of the original grammar. The size of the Earley cover is given by the number of items. In worst case situations the size of the CNLR cover is an exponential function of the size of the original grammar, whereas the size of the Earley cover clearly grows linearly with the size of the original grammar. For many grammars, however, the number of LR(0)-states may be considerably smaller than the number of items. This seems to be the case for the natural language grammars considered by Tomita [3]. His data even suggest that the number of LR(0) states is a sub-linear function of the original grammar size. Note, however, that predict functions may influence the relation between grammar size and average parse matrix content, as some grammars may allow more restrictive predict functions than others. Summarizing, it seems unlikely that a single parsing approach would be optimal for all grammars. A viable goal of research would be to find methods for determining the optimal cover for a given grammar. Such research should have a solid experimental backbone.</Paragraph> <Paragraph position="2"> The matter gets still more complicated when the original grammar is an attribute grammar. Attribute evaluation may lead to the rejection of certain parse trees that are correct for the grammar without attributes. Then the ease and efficiency of on-the-fly attribute evaluation becomes important, in order to stop wrong parses as soon as possible. In the Rosetta machine translation system [11,12], we use an attributed RTN during the analysis of sentences. The attribute evaluation is bottom-up only, and designed in such a way that the grammar is covered by an attributed Earley cover.</Paragraph> <Paragraph position="3"> Other points concerning efficiency that we would like to discuss are issues of precomputation. 
In the conventional Earley parser, the calculation of the cover is done dynamically, while parsing a sentence. However, it could just as well be done statically, i.e. before parsing, in order to increase parsing performance. For instance, set operations can be implemented more efficiently if the set elements are known non-terminals, rather than unknown items, although this would depend on the choice of programming language. The procedure of generating bilinear covers from LR-automata should always be performed statically, because of the amount of computation involved. Tomita has reported [3] that for a number of grammars, his parsing method turns out to be more efficient than the Earley algorithm. It is not clear whether his results would still hold if the creation of the cover for the Earley parser were being done statically.</Paragraph> <Paragraph position="4"> One might be inclined to think that if use is made of precomputed sets of items, as in LR-parsers, one is bound to have a parser that is significantly different from and probably faster than Earley's algorithm, which computes these sets at parse time. The question is much more subtle, as we showed in this paper. On the one hand, non-deterministic LR-parsing comes down to the use of certain covers for the grammar at hand, just like the Earley algorithm. Reversely, we showed that the Earley cover can, with minor modifications, be obtained from the LL/LR-automaton, which also uses precomputed sets of items.</Paragraph> </Section> </Paper>