<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1032">
  <Title>Efficient Tabular LR Parsing</Title>
  <Section position="4" start_page="239" end_page="239" type="metho">
    <SectionTitle>
$\delta$ to range over $Q^*$.
</SectionTitle>
    <Paragraph position="0"> Consider a fixed input string $v \in \Sigma^*$. A configuration of the automaton is a pair $(\delta, w)$ consisting of a stack $\delta \in Q^*$ and the remaining input $w$, which is a suffix of the input string $v$. The rightmost symbol of $\delta$ represents the top of the stack. The initial configuration has the form $(q_{in}, v)$, where the stack is formed by the initial stack symbol. The final configuration has the form $(q_{in} q_{fin}, \varepsilon)$, where the stack is formed by the final stack symbol stacked upon the initial stack symbol.</Paragraph>
    <Paragraph position="1"> Footnote 1: We dispense with the notion of state, traditionally incorporated in the definition of PDA. This does not affect the power of these devices, since states can be encoded within stack symbols and transitions.</Paragraph>
    <Paragraph position="2"> The application of a transition $\delta_1 \stackrel{z}{\mapsto} \delta_2$ is described as follows. If the top-most symbols of the stack are $\delta_1$, then these symbols may be replaced by $\delta_2$, provided that either $z = \varepsilon$, or $z = a$ and $a$ is the first symbol of the remaining input. Furthermore, if $z = a$ then $a$ is removed from the remaining input.</Paragraph>
    <Paragraph position="3"> Formally, for a fixed PDA $\mathcal{A}$ we define the binary relation $\vdash$ on configurations as the least relation satisfying $(\delta\delta_1, w) \vdash (\delta\delta_2, w)$ if there is a transition $\delta_1 \stackrel{\varepsilon}{\mapsto} \delta_2$, and $(\delta\delta_1, aw) \vdash (\delta\delta_2, w)$ if there is a transition $\delta_1 \stackrel{a}{\mapsto} \delta_2$. The recognition of a certain input $v$ is obtained if starting from the initial configuration for that input we can reach the final configuration by repeated application of transitions, or, formally, if $(q_{in}, v) \vdash^* (q_{in} q_{fin}, \varepsilon)$, where $\vdash^*$ denotes the reflexive and transitive closure of $\vdash$.</Paragraph>
    <Paragraph position="4"> By a computation of a PDA we mean a sequence $(q_{in}, v) \vdash (\delta_1, w_1) \vdash \cdots \vdash (\delta_n, w_n)$, $n \ge 0$. A PDA is called deterministic if for all possible configurations at most one transition is applicable. A PDA is said to be in binary form if, for all transitions $\delta_1 \stackrel{z}{\mapsto} \delta_2$, we have $|\delta_1| \le 2$.</Paragraph>
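    <Paragraph> As an illustration only, the step relation on configurations and the recognition condition can be sketched in Python. The toy automaton TOY, the stack symbol names and the function names steps and recognizes are assumptions made for this sketch; they do not come from the paper. The sketch also assumes that the left-hand side of every transition is non-empty.

```python
from typing import Iterator, List, Tuple

Stack = Tuple[str, ...]
Config = Tuple[Stack, str]                 # (stack, remaining input)
Transition = Tuple[Stack, str, Stack]      # (delta1, z, delta2); z == "" is epsilon

# Hypothetical toy PDA for { a^n b^n : n at least 1 }; not taken from the paper.
TOY: List[Transition] = [
    (("I",), "a", ("I", "A")),      # first a
    (("A",), "a", ("A", "A")),      # further a's
    (("A",), "b", ("B",)),          # first b replaces the top A by B
    (("A", "B"), "b", ("B",)),      # each further b pops one A
    (("I", "B"), "", ("I", "F")),   # all A's matched: push the final symbol
]


def steps(config: Config, transitions: List[Transition]) -> Iterator[Config]:
    """Yield every configuration reachable from config by one transition."""
    stack, rest = config
    for delta1, z, delta2 in transitions:
        # delta1 must equal the top-most symbols (right end) of the stack.
        if stack[-len(delta1):] != delta1:
            continue
        if z == "":
            yield stack[:-len(delta1)] + delta2, rest
        elif rest.startswith(z):
            yield stack[:-len(delta1)] + delta2, rest[1:]


def recognizes(v: str, transitions: List[Transition],
               q_in: str = "I", q_fin: str = "F") -> bool:
    """Explore all computations from (q_in, v), looking for (q_in q_fin, epsilon)."""
    final = ((q_in, q_fin), "")
    seen, todo = set(), [((q_in,), v)]
    while todo:
        config = todo.pop()
        if config == final:
            return True
        if config not in seen:
            seen.add(config)
            todo.extend(steps(config, transitions))
    return False
```

The exhaustive search terminates for this toy automaton, but it need not terminate for automata whose epsilon transitions can grow the stack without bound; it is meant to mirror the definition, not to be an efficient recognizer.</Paragraph>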
  </Section>
  <Section position="5" start_page="239" end_page="240" type="metho">
    <SectionTitle>
3 LR automata
</SectionTitle>
    <Paragraph position="0"> Let $G = (\Sigma, N, P, S)$ be a CFG. We recall the notion of LR automaton, which is a particular kind of PDA. We make use of the augmented grammar $G^\dagger = (\Sigma^\dagger, N^\dagger, P^\dagger, S^\dagger)$ introduced in Section 2.</Paragraph>
    <Paragraph position="1"> Let $I_{LR} = \{A \to \alpha \bullet \beta \mid (A \to \alpha\beta) \in P^\dagger\}$.</Paragraph>
    <Paragraph position="2"> We introduce the function closure from $2^{I_{LR}}$ to $2^{I_{LR}}$ and the function goto from $2^{I_{LR}} \times V$ to $2^{I_{LR}}$. For any $q \subseteq I_{LR}$, $\mathit{closure}(q)$ is the smallest set such that  (i) $q \subseteq \mathit{closure}(q)$; and (ii) $(B \to \alpha \bullet A\beta) \in \mathit{closure}(q)$ and $(A \to \gamma) \in P^\dagger$ together imply $(A \to \bullet \gamma) \in \mathit{closure}(q)$.</Paragraph>
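    <Paragraph position="3">  We then define $\mathit{goto}(q, X) = \{A \to \alpha X \bullet \beta \mid (A \to \alpha \bullet X\beta) \in \mathit{closure}(q)\}$.</Paragraph>
    <Paragraph position="5"> We construct a finite set $\mathcal{R}_{LR}$ as the smallest collection of sets satisfying the conditions:  (i) $\{S^\dagger \to \triangleright \bullet S \triangleleft\} \in \mathcal{R}_{LR}$; and (ii) for every $q \in \mathcal{R}_{LR}$ and $X \in V$, we have $\mathit{goto}(q, X) \in \mathcal{R}_{LR}$, provided $\mathit{goto}(q, X) \ne \emptyset$.</Paragraph>
    <Paragraph position="7"> Let $q_{in} = \{S^\dagger \to \triangleright \bullet S \triangleleft\}$, and let $q_{fin}$ be the unique set in $\mathcal{R}_{LR}$ containing $(S^\dagger \to \triangleright S \bullet \triangleleft)$; in other words, $q_{fin} = \mathit{goto}(q_{in}, S)$.</Paragraph>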
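    <Paragraph> The construction of the LR states can be sketched as follows, as a minimal illustration rather than the authors' implementation. The toy grammar S → S a | a, the tuple encoding of items and all identifiers are assumptions made for this sketch. It further assumes that V consists of the terminals and nonterminals of the unaugmented grammar, and that goto moves the dot over X in the items of closure(q).

```python
from typing import FrozenSet, List, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]          # (A, gamma) stands for A -> gamma
Item = Tuple[str, Tuple[str, ...], int]     # (A, rhs, position of the dot)

# Hypothetical toy grammar S -> S a | a, with the augmented rule S† -> ▷ S ◁.
RULES: List[Rule] = [("S†", ("▷", "S", "◁")), ("S", ("S", "a")), ("S", ("a",))]
SYMBOLS = ["S", "a"]                        # assumed V: symbols of the toy grammar


def closure(q: FrozenSet[Item]) -> FrozenSet[Item]:
    """Smallest superset of q that is closed under clause (ii)."""
    result: Set[Item] = set(q)
    todo = list(q)
    while todo:
        _, rhs, dot = todo.pop()
        if dot == len(rhs):
            continue                        # nothing follows the dot
        for lhs, gamma in RULES:
            item = (lhs, gamma, 0)
            if lhs == rhs[dot] and item not in result:
                result.add(item)
                todo.append(item)
    return frozenset(result)


def goto(q: FrozenSet[Item], x: str) -> FrozenSet[Item]:
    """Move the dot over x in every item of closure(q) that allows it."""
    return frozenset((lhs, rhs, dot + 1) for lhs, rhs, dot in closure(q)
                     if dot != len(rhs) and rhs[dot] == x)


def build_states(q_in: FrozenSet[Item]) -> Set[FrozenSet[Item]]:
    """Smallest collection containing q_in and closed under non-empty goto."""
    states, todo = {q_in}, [q_in]
    while todo:
        q = todo.pop()
        for x in SYMBOLS:
            nxt = goto(q, x)
            if nxt and nxt not in states:
                states.add(nxt)
                todo.append(nxt)
    return states


Q_IN = frozenset({("S†", ("▷", "S", "◁"), 1)})
Q_FIN = goto(Q_IN, "S")
```

For the toy grammar, build_states(Q_IN) yields four states, and Q_FIN contains the item with the dot placed between S and ◁.</Paragraph>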
    <Paragraph position="8">  For $A \in N$, an A-redex is a string $q_0 q_1 q_2 \cdots q_m$, $m \ge 0$, of elements from $\mathcal{R}_{LR}$, satisfying the following conditions: (i) $(A \to \alpha \bullet) \in \mathit{closure}(q_m)$, for some $\alpha = X_1 X_2 \cdots X_m$; and (ii) $\mathit{goto}(q_{k-1}, X_k) = q_k$, for $1 \le k \le m$.</Paragraph>
    <Paragraph position="9"> Note that in such an A-redex, $(A \to \bullet X_1 X_2 \cdots X_m) \in \mathit{closure}(q_0)$, and $(A \to X_1 \cdots X_k \bullet X_{k+1} \cdots X_m) \in q_k$, for $0 &lt; k \le m$.</Paragraph>
    <Paragraph position="10"> The LR automaton associated with G is now introduced. Definition 1 $\mathcal{A}_{LR} = (\Sigma, Q_{LR}, T_{LR}, q_{in}, q_{fin})$, where $Q_{LR} = \mathcal{R}_{LR}$, $q_{in} = \{S^\dagger \to \triangleright \bullet S \triangleleft\}$, $q_{fin} = \mathit{goto}(q_{in}, S)$, and $T_{LR}$ contains:  (i) $q \stackrel{a}{\mapsto} q\,q'$, for every $a \in \Sigma$ and $q, q' \in \mathcal{R}_{LR}$ such that $q' = \mathit{goto}(q, a)$; (ii) $q\delta \stackrel{\varepsilon}{\mapsto} q\,q'$, for every $A \in N$, A-redex $q\delta$, and $q' \in \mathcal{R}_{LR}$ such that $q' = \mathit{goto}(q, A)$.</Paragraph>
    <Paragraph position="11">  Transitions in (i) above are called shift, transitions in (ii) are called reduce.</Paragraph>
  </Section>
  <Section position="6" start_page="240" end_page="241" type="metho">
    <SectionTitle>
4 2LR Automata
</SectionTitle>
    <Paragraph position="0"> The automata $\mathcal{A}_{LR}$ defined in the previous section are deterministic only for a subset of the CFGs, called the LR(0) grammars (Sippu and Soisalon-Soininen, 1990), and behave nondeterministically in the general case. When designing tabular methods that simulate nondeterministic computations of $\mathcal{A}_{LR}$, two main difficulties are encountered: * A reduce transition in $\mathcal{A}_{LR}$ is an elementary operation that removes from the stack a number of elements bounded by the size of the underlying grammar. Consequently, the time requirement of tabular simulation of $\mathcal{A}_{LR}$ computations can be onerous, for reasons pointed out by Sheil (1976) and Kipps (1991).</Paragraph>
    <Paragraph position="1"> * The set $\mathcal{R}_{LR}$ can be exponential in the size of the grammar (Johnson, 1991). If in such a case the computations of $\mathcal{A}_{LR}$ touch upon each state, then time and space requirements of tabular simulation are obviously onerous.</Paragraph>
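    <Paragraph position="2"> The first issue above is solved here by recasting $\mathcal{A}_{LR}$ in binary form. This is done by considering each reduce transition as a sequence of "pop" operations which affect at most two stack symbols at a time. (See also Lang (1974), Villemonte de la Clergerie (1993) and Nederhof (1994a), and for LR parsing specifically Kipps (1991) and Leermakers (1992b).) The following definition introduces this new kind of automaton.  Definition 2 $\mathcal{A}'_{LR} = (\Sigma, Q'_{LR}, T'_{LR}, q_{in}, q_{fin})$, where $Q'_{LR} = \mathcal{R}_{LR} \cup I_{LR}$, $q_{in} = \{S^\dagger \to \triangleright \bullet S \triangleleft\}$, $q_{fin} = \mathit{goto}(q_{in}, S)$ and $T'_{LR}$ contains: (i) $q \stackrel{a}{\mapsto} q\,q'$, for every $a \in \Sigma$ and $q, q' \in \mathcal{R}_{LR}$ such that $q' = \mathit{goto}(q, a)$; (ii) $q \stackrel{\varepsilon}{\mapsto} q\,(A \to \alpha \bullet)$, for every $q \in \mathcal{R}_{LR}$ and $(A \to \alpha \bullet) \in \mathit{closure}(q)$; (iii) $q\,(A \to \alpha X \bullet \beta) \stackrel{\varepsilon}{\mapsto} (A \to \alpha \bullet X\beta)$, for every $q \in \mathcal{R}_{LR}$ and $(A \to \alpha X \bullet \beta) \in q$; (iv) $q\,(A \to \bullet \alpha) \stackrel{\varepsilon}{\mapsto} q\,q'$, for every $q, q' \in \mathcal{R}_{LR}$ and every rule $(A \to \alpha)$ such that $q' = \mathit{goto}(q, A)$.</Paragraph>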
    <Paragraph position="4"> Transitions in (i) above are again called shift, transitions in (ii) are called initiate, those in (iii) are called gathering, and transitions in (iv) are called goto. The role of a reduce step in $\mathcal{A}_{LR}$ is taken over in $\mathcal{A}'_{LR}$ by an initiate step, a number of gathering steps, and a goto step. Observe that these steps involve the new stack symbols $(A \to \alpha \bullet \beta) \in I_{LR}$ that are distinguishable from possible stack symbols $\{A \to \alpha \bullet \beta\} \in \mathcal{R}_{LR}$. We now turn to the second above-mentioned problem, regarding the size of set $\mathcal{R}_{LR}$. The problem is in part solved here as follows. The number of states in $\mathcal{R}_{LR}$ is considerably reduced by identifying two states if they become identical after items $A \to \alpha \bullet \beta$ from $I_{LR}$ have been simplified to only the suffix of the right-hand side $\beta$. This is reminiscent of techniques of state minimization for finite automata (Booth, 1967), as they have been applied before to LR parsing, e.g., by Pager (1970) and Nederhof and Sarbo (1993).</Paragraph>
    <Paragraph position="5"> Let $G^\dagger$ be the augmented grammar associated with a CFG G, and let $I_{2LR} = \{\beta \mid (A \to \alpha\beta) \in P^\dagger\}$. We define variants of the closure and goto functions from the previous section as follows. For any set $q \subseteq I_{2LR}$, $\mathit{closure}'(q)$ is the smallest collection of sets such that  (i) $q \subseteq \mathit{closure}'(q)$; and (ii) $(A\beta) \in \mathit{closure}'(q)$ and $(A \to \gamma) \in P^\dagger$ together imply $(\gamma) \in \mathit{closure}'(q)$.</Paragraph>
    <Paragraph position="6">  Also, we define $\mathit{goto}'(q, X) = \{\beta \mid (X\beta) \in \mathit{closure}'(q)\}$.</Paragraph>
    <Paragraph position="7"> We now construct a finite set $\mathcal{R}_{2LR}$ as the smallest set satisfying the conditions:  (i) $\{S\triangleleft\} \in \mathcal{R}_{2LR}$; and (ii) for every $q \in \mathcal{R}_{2LR}$ and $X \in V$, we have $\mathit{goto}'(q, X) \in \mathcal{R}_{2LR}$, provided $\mathit{goto}'(q, X) \ne \emptyset$. As stack symbols, we take the elements from $I_{2LR}$ and a subset of elements from $(V \times \mathcal{R}_{2LR})$:</Paragraph>
    <Paragraph position="9"> In a stack symbol of the form (X, q), the X serves to record the grammar symbol that has been recognized last, cf. the symbols that formerly were found immediately before the dots.</Paragraph>
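    <Paragraph> To make the effect of reducing items to suffixes concrete, the variant functions can be sketched in the same style. This is an illustration under stated assumptions: the toy grammar S → S a | a, the encoding of a suffix as a tuple of symbols, and the names closure2, goto2 and build_states2 are all hypothetical, and V is again assumed to hold the symbols of the unaugmented grammar.

```python
from typing import FrozenSet, List, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]      # (A, gamma) stands for A -> gamma
Suffix = Tuple[str, ...]                # a suffix beta of some right-hand side

# Hypothetical toy grammar S -> S a | a, with the augmented rule S† -> ▷ S ◁.
RULES: List[Rule] = [("S†", ("▷", "S", "◁")), ("S", ("S", "a")), ("S", ("a",))]
SYMBOLS = ["S", "a"]                    # assumed V: symbols of the toy grammar


def closure2(q: FrozenSet[Suffix]) -> FrozenSet[Suffix]:
    """Add (gamma) for every rule A -> gamma whenever some (A beta) is present."""
    result: Set[Suffix] = set(q)
    todo = list(q)
    while todo:
        beta = todo.pop()
        if not beta:
            continue                    # the empty suffix predicts nothing
        for lhs, gamma in RULES:
            if lhs == beta[0] and gamma not in result:
                result.add(gamma)
                todo.append(gamma)
    return frozenset(result)


def goto2(q: FrozenSet[Suffix], x: str) -> FrozenSet[Suffix]:
    """Strip a leading x from every suffix in the closure that starts with x."""
    return frozenset(beta[1:] for beta in closure2(q) if beta and beta[0] == x)


def build_states2(q_in: FrozenSet[Suffix]) -> Set[FrozenSet[Suffix]]:
    """Smallest set containing q_in and closed under non-empty goto2."""
    states, todo = {q_in}, [q_in]
    while todo:
        q = todo.pop()
        for x in SYMBOLS:
            nxt = goto2(q, x)
            if nxt and nxt not in states:
                states.add(nxt)
                todo.append(nxt)
    return states


Q_IN2 = frozenset({("S", "◁")})
```

On this toy grammar the sketch produces three states. The two LR states that hold only a completed item (for S → a and for S → S a) both collapse to the single state containing just the empty suffix.</Paragraph>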
    <Paragraph position="10"> The 2LR automaton associated with G can now be introduced.</Paragraph>
    <Paragraph position="12"> Note that in the case of a reduce/reduce conflict with two grammar rules sharing some suffix in the right-hand side, the gathering steps of $\mathcal{A}_{2LR}$ will treat both rules simultaneously, until the parts of the right-hand sides are reached where the two rules differ. (See Leermakers (1992a) for a similar sharing of computation for common suffixes.) An interesting fact is that the automaton $\mathcal{A}_{2LR}$ is very similar to the automaton $\mathcal{A}_{LR}$ constructed for a grammar transformed by the transformation $\tau_{two}$ given by Nederhof and Satta (1994) (see footnote 2).</Paragraph>
  </Section>
  <Section position="7" start_page="241" end_page="242" type="metho">
    <SectionTitle>
5 The algorithm
</SectionTitle>
    <Paragraph position="0"> This section presents a tabular LR parser, which is the main result of this paper. The parser is derived from the 2LR automata introduced in the previous section. Following the general approach presented by Leermakers (1989), we simulate computations of these devices using a tabular method, a grammar transformation and a filtering function.</Paragraph>
    <Paragraph position="1"> Footnote 2: For the earliest mention of this transformation, we have encountered pointers to Schauerte (1973). Regrettably, we have as yet not been able to get hold of a copy of this paper.</Paragraph>
    <Paragraph position="2"> We make use of a tabular parsing algorithm which is basically an asynchronous version of the CYK algorithm, as presented by Harrison (1978), extended to productions of the forms $A \to B$ and $A \to \varepsilon$ and with a left-to-right filtering condition. The algorithm uses a parse table consisting of a 0-indexed square array $U$. The indices represent positions in the input string. We define $U_i$ to be $\bigcup_{k \le i} U_{k,i}$.</Paragraph>
    <Paragraph position="3"> Computation of the entries of $U$ is moderated by a filtering process. This process makes use of a function pred from $2^N$ to $2^N$, specific to a certain context-free grammar. We have a certain nonterminal $A_{init}$ which is initially inserted in $U_{0,0}$ in order to start the recognition process.</Paragraph>
    <Paragraph position="4"> We are now ready to give a formal specification of the tabular algorithm.</Paragraph>
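    <Paragraph position="5"> Algorithm 1 Let $G = (\Sigma, N, P, S)$ be a CFG in binary form, let pred be a function from $2^N$ to $2^N$, let $A_{init}$ be the distinguished element from $N$, and let $v = a_1 a_2 \cdots a_n \in \Sigma^*$ be an input string. We compute the least $(n+1) \times (n+1)$ table $U$ such that $A_{init} \in U_{0,0}$ and  (i) $A \in U_{j-1,j}$ if $(A \to a_j) \in P$, $A \in \mathit{pred}(U_{j-1})$; (ii) $A \in U_{j,j}$ if $(A \to \varepsilon) \in P$, $A \in \mathit{pred}(U_j)$; (iii) $A \in U_{i,j}$ if $B \in U_{i,k}$, $C \in U_{k,j}$, $(A \to BC) \in P$, $A \in \mathit{pred}(U_i)$; (iv) $A \in U_{i,j}$ if $B \in U_{i,j}$, $(A \to B) \in P$, $A \in \mathit{pred}(U_i)$.</Paragraph>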
    <Paragraph position="6"> The string has been accepted when $S \in U_{0,n}$.</Paragraph>
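    <Paragraph> A naive realisation of such a filtered tabular recognizer might look as follows. This is a sketch under assumptions, not the authors' implementation: it assumes four rule shapes (A → a, A → ε, A → B C and A → B), each guarded by the filter pred applied to the column of the left index, and it reaches the least table by repeated passes instead of an agenda. The toy grammar, the toy filter and all identifiers are hypothetical.

```python
from typing import Callable, FrozenSet, List, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]      # (A, rhs); rhs has at most two symbols

TOY_RULES: List[Rule] = [("S", ("A", "B")), ("A", ("a",)), ("B", ("b",))]
TOY_TERMINALS = {"a", "b"}


def toy_pred(tau: FrozenSet[str]) -> FrozenSet[str]:
    """Hypothetical filter: S and A may follow INIT, B may follow A."""
    out: Set[str] = set()
    if "INIT" in tau:
        out |= {"S", "A"}
    if "A" in tau:
        out.add("B")
    return frozenset(out)


def recognize(rules: List[Rule], terminals: Set[str],
              pred: Callable[[FrozenSet[str]], FrozenSet[str]],
              a_init: str, start: str, v: str) -> bool:
    n = len(v)
    U = {(i, j): set() for i in range(n + 1) for j in range(i, n + 1)}
    U[0, 0].add(a_init)

    def column(i: int) -> FrozenSet[str]:
        # U_i: everything stored in some U[k, i] with k no greater than i.
        return frozenset().union(*(U[k, i] for k in range(i + 1)))

    changed = True
    while changed:                      # naive iteration to the least table
        changed = False
        for (i, j), cell in U.items():
            allowed = pred(column(i))   # left-to-right filter for left index i
            for lhs, rhs in rules:
                if lhs in cell or lhs not in allowed:
                    continue
                if len(rhs) == 0:
                    ok = i == j
                elif len(rhs) == 1 and rhs[0] in terminals:
                    ok = j == i + 1 and v[i] == rhs[0]
                elif len(rhs) == 1:
                    ok = rhs[0] in cell
                else:
                    ok = any(rhs[0] in U[i, k] and rhs[1] in U[k, j]
                             for k in range(i, j + 1))
                if ok:
                    cell.add(lhs)
                    changed = True
    return start in U[0, n]
```

With the toy filter, B is never considered at position 0, because no A has been found to its left; an unfiltered run would insert it there and discard it only later.</Paragraph>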
    <Paragraph position="7"> We now specify a grammar transformation, based on the definition of $\mathcal{A}_{2LR}$.</Paragraph>
    <Paragraph position="8"> Definition 4 Let $\mathcal{A}_{2LR} = (\Sigma, Q_{2LR}, T_{2LR}, q_{in}, q_{fin})$ be the 2LR automaton associated with a CFG G.</Paragraph>
    <Paragraph position="9"> The 2LR cover associated with G is the CFG $C_{2LR}(G)$ whose rules are given by:</Paragraph>
    <Paragraph position="11"> (i) $(a, q') \to a$, for every $(X, q) \stackrel{a}{\mapsto} (X, q)\,(a, q') \in T_{2LR}$; (ii) $(\varepsilon) \to \varepsilon$, for every $(X, q) \stackrel{\varepsilon}{\mapsto} (X, q)\,(\varepsilon) \in T_{2LR}$; (iii) $(X\beta) \to (X, q)\,(\beta)$, for every $(X, q)\,(\beta) \stackrel{\varepsilon}{\mapsto} (X\beta) \in T_{2LR}$;  (iv) $(A, q') \to (\alpha)$, for every $(X, q)\,(\alpha) \stackrel{\varepsilon}{\mapsto} (X, q)\,(A, q') \in T_{2LR}$. Observe that there is a direct, one-to-one correspondence between transitions of $\mathcal{A}_{2LR}$ and productions of $C_{2LR}(G)$.</Paragraph>
    <Paragraph position="12"> The accompanying function pred is defined as follows ($q$, $q'$, $q''$ range over the stack elements): $\mathit{pred}(\tau) = \{q \mid q' q'' \stackrel{\varepsilon}{\mapsto} q \in T_{2LR}\} \cup \{q \mid q' \in \tau,\ q' \stackrel{z}{\mapsto} q' q \in T_{2LR}\} \cup \{q \mid q' \in \tau,\ q' q'' \stackrel{\varepsilon}{\mapsto} q' q \in T_{2LR}\}$.</Paragraph>
    <Paragraph position="13"> The above definition implies that only the tabular equivalents of the shift, initiate and goto transitions are subject to actual filtering; the simulation of the gathering transitions does not depend on elements in $\tau$.</Paragraph>
    <Paragraph position="14"> Finally, the distinguished nonterminal from the cover used to initialize the table is $q_{in}$. Thus we start with $(\triangleright, \{S\triangleleft\}) \in U_{0,0}$.</Paragraph>
    <Paragraph position="15"> The 2LR cover introduces spurious ambiguity: where some grammar G would allow a certain number of parses to be found for a certain input, the grammar $C_{2LR}(G)$ in general allows more parses. This problem is in part solved by the filtering function pred. The remaining spurious ambiguity is avoided by a particular way of constructing the parse trees, described in what follows.</Paragraph>
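    <Paragraph> The three parts of pred can be read off a list of transitions directly. The following sketch assumes a transition is encoded as a triple (delta1, z, delta2) of tuples and a string; the helper name make_pred and the toy transitions with their one-letter stack symbols are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Stack = Tuple[str, ...]
Transition = Tuple[Stack, str, Stack]   # (delta1, z, delta2); z == "" is epsilon


def make_pred(transitions: List[Transition]) -> Callable[[Iterable[str]], FrozenSet[str]]:
    # First part: results of transitions that replace two symbols by one
    # (the gathering shape); these never depend on the argument tau.
    always = {d2[0] for d1, _, d2 in transitions
              if len(d1) == 2 and len(d2) == 1}

    def pred(tau: Iterable[str]) -> FrozenSet[str]:
        tau = set(tau)
        out = set(always)
        for d1, _, d2 in transitions:
            # Second and third parts: the transition keeps its bottom symbol
            # q' and puts q on top of it; q is predicted only if q' is in tau.
            if len(d2) == 2 and d1[0] == d2[0] and d1[0] in tau:
                out.add(d2[1])
        return frozenset(out)

    return pred


# Hypothetical transitions over one-letter stack symbols.
TOY_T: List[Transition] = [
    (("p",), "a", ("p", "r")),      # push r on p while reading a
    (("p", "s"), "", ("t",)),       # replace p s by t
    (("p", "u"), "", ("p", "w")),   # replace u by w on top of p
    (("x",), "", ("x", "y")),       # push y on x
]
toy_pred = make_pred(TOY_T)
```

In this toy example, t is predicted for every argument, whereas r and w are predicted only when p is among the elements already found.</Paragraph>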
    <Paragraph position="16"> After Algorithm 1 has recognized a given input, the set of all parse trees can be computed as $\mathit{tree}(q_{fin}, 0, n)$ where the function tree, which determines sets of either parse trees or lists of parse trees for entries in $U$, is recursively defined by:  (i) $\mathit{tree}((a, q'), i, j)$ is the set $\{a\}$. This set contains a single parse tree consisting of a single node labelled $a$.</Paragraph>
    <Paragraph position="17"> (ii) $\mathit{tree}(\varepsilon, i, i)$ is the set $\{\varepsilon\}$. This set consists of an empty list of trees.</Paragraph>
    <Paragraph position="18"> (iii) $\mathit{tree}(X\beta, i, j)$ is the union of the sets $T^k_{(X\beta),i,j}$, where $i \le k \le j$, $(\beta) \in U_{k,j}$, and there is at least one $(X, q) \in U_{i,k}$ and $(X\beta) \to (X, q)\,(\beta)$ in $C_{2LR}(G)$, for some $q$. For each such $k$, select one such $q$. We define $T^k_{(X\beta),i,j} = \{t \cdot ts \mid t \in \mathit{tree}((X, q), i, k) \wedge ts \in \mathit{tree}(\beta, k, j)\}$. Each $t \cdot ts$ is a list of trees, with head $t$ and tail $ts$.</Paragraph>
    <Paragraph position="19"> (iv) $\mathit{tree}((A, q'), i, j)$ is the union of the sets $T^\alpha_{(A,q'),i,j}$ where $(\alpha) \in U_{i,j}$ is such that $(A, q') \to (\alpha)$ in $C_{2LR}(G)$. We define $T^\alpha_{(A,q'),i,j} = \{\mathit{glue}(A, ts) \mid ts \in \mathit{tree}(\alpha, i, j)\}$.</Paragraph>
    <Paragraph position="20">  The function glue constructs a tree from a fresh root node labelled A and the trees in list ts as immediate subtrees.</Paragraph>
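    <Paragraph> The two building blocks of this definition, prepending a head tree to a list and gluing a list under a fresh root, can be sketched as follows. The encoding of a tree as a pair (label, subtrees), the helper names leaf and prepend, and the toy rule S → a a are assumptions made for this sketch; the lookup in the table $U$ is left out.

```python
from typing import Set, Tuple

Tree = Tuple[str, tuple]            # (label, tuple of immediate subtrees)
TreeList = Tuple[Tree, ...]         # a list of trees, stored as a tuple


def leaf(a: str) -> Tree:
    """Clause (i): a single node labelled a."""
    return (a, ())


def prepend(heads: Set[Tree], tails: Set[TreeList]) -> Set[TreeList]:
    """Clause (iii): all lists t . ts with head t and tail ts."""
    return {(t,) + ts for t in heads for ts in tails}


def glue(label: str, ts: TreeList) -> Tree:
    """Clause (iv): a fresh root with the trees in ts as immediate subtrees."""
    return (label, tuple(ts))


# Hypothetical walk-through for a rule S -> a a over the input "aa".
EMPTY: Set[TreeList] = {()}                     # clause (ii): one empty list
LIST_A = prepend({leaf("a")}, EMPTY)            # lists for the suffix (a)
LIST_AA = prepend({leaf("a")}, LIST_A)          # lists for the suffix (a a)
TREES = {glue("S", ts) for ts in LIST_AA}
```

Because Python sets discard duplicates, this sketch would tolerate repeated contributions, but it would still recompute them; a packed representation has to avoid them in the first place.</Paragraph>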
    <Paragraph position="21"> We emphasize that in the third clause above, one should not consider more than one $q$ for given $k$ in order to prevent spurious ambiguity. (In fact, for fixed $X$, $i$, $k$ and for different $q$ such that $(X, q) \in U_{i,k}$, $\mathit{tree}((X, q), i, k)$ yields the exact same set of trees.) With this proviso, the degree of ambiguity, i.e. the number of parses found by the algorithm for any input, is reduced to exactly that of the source grammar.</Paragraph>
    <Paragraph position="22"> A practical implementation would construct the parse trees on-the-fly, attaching them to the table entries, allowing packing and sharing of subtrees (cf. the literature on parse forests (Tomita, 1986; Billot and Lang, 1989)). Our algorithm actually only needs one (packed) subtree for several $(X, q) \in U_{i,k}$ with fixed $X$, $i$, $k$ but different $q$. The resulting parse forests would then be optimally compact, contrary to some other LR-based tabular algorithms, as pointed out by Rekers (1992), Nederhof (1993) and Nederhof (1994b).</Paragraph>
  </Section>
  <Section position="8" start_page="242" end_page="243" type="metho">
    <SectionTitle>
6 Analysis of the algorithm
</SectionTitle>
    <Paragraph position="0"> In this section, we investigate how the steps performed by Algorithm 1 (applied to the 2LR cover) relate to those performed by $\mathcal{A}_{2LR}$, for the same input. We define a subrelation $\models^+$ of $\vdash^+$ as: $(\delta, uw) \models^+ (\delta\delta', w)$ if and only if $(\delta, uw) = (\delta, z_1 z_2 \cdots z_m w) \vdash (\delta\delta_1, z_2 \cdots z_m w) \vdash \cdots \vdash (\delta\delta_m, w) = (\delta\delta', w)$, for some $m \ge 1$, where $|\delta_k| &gt; 0$ for all $k$, $1 \le k \le m$. Informally, we have $(\delta, uw) \models^+ (\delta\delta', w)$ if configuration $(\delta\delta', w)$ can be reached from $(\delta, uw)$ without the bottom-most part $\delta$ of the intermediate stacks being affected by any of the transitions; furthermore, at least one element is pushed on top of $\delta$.</Paragraph>
    <Paragraph position="1"> The following characterization relates the automaton $\mathcal{A}_{2LR}$ and Algorithm 1 applied to the 2LR cover. Symbol $q \in Q_{2LR}$ is eventually added to $U_{i,j}$ if and only if for some $\delta$: $(q_{in}, a_1 \cdots a_n) \vdash^* (\delta, a_{i+1} \cdots a_n) \models^+ (\delta q, a_{j+1} \cdots a_n)$. In words, $q$ is found in entry $U_{i,j}$ if and only if, at input position $j$, the automaton would push some element $q$ on top of some lower part of the stack that remains unaffected while the input from $i$ to $j$ is being read.</Paragraph>
    <Paragraph position="2"> The above characterization, whose proof is not reported here, is the justification for calling the resulting algorithm tabular LR parsing. In particular, for a grammar for which $\mathcal{A}_{2LR}$ is deterministic, i.e. for an LR(0) grammar, the number of steps performed by $\mathcal{A}_{2LR}$ and the number of steps performed by the above algorithm are exactly the same. In the case of grammars which are not LR(0), the tabular LR algorithm is more efficient than for example a backtrack realisation of $\mathcal{A}_{2LR}$.</Paragraph>
    <Paragraph position="3"> For determining the order of the time complexity of our algorithm, we look at the most expensive step, which is the computation of an element $(X\beta) \in U_{i,j}$ from two elements $(X, q) \in U_{i,k}$ and $(\beta) \in U_{k,j}$, through $(X, q)\,(\beta) \stackrel{\varepsilon}{\mapsto} (X\beta) \in T_{2LR}$. In a straightforward realisation of the algorithm, this step can be applied $O(|T_{2LR}| \cdot |v|^3)$ times (once for each $i$, $k$, $j$ and each transition), each step taking a constant amount of time. We conclude that the time complexity of our algorithm is $O(|T_{2LR}| \cdot |v|^3)$. As far as space requirements are concerned, each set $U_{i,j}$ or $U_i$ contains at most $|Q_{2LR}|$ elements. (One may assume an auxiliary table storing each $U_i$.) This results in a space complexity $O(|Q_{2LR}| \cdot |v|^2)$. The entries in the table represent single stack elements, as opposed to pairs of stack elements following Lang (1974) and Leermakers (1989). This has been investigated before by Nederhof (1994a, p. 25) and Villemonte de la Clergerie (1993, p. 155).</Paragraph>
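    <Paragraph> The cubic factor can be made explicit by counting index triples. The following worked bound is a sketch that writes $n = |v|$ and assumes that the expensive step is attempted at most once per transition and per triple $(i, k, j)$ with $0 \le i \le k \le j \le n$.

```latex
\[
\#\{(i,k,j) : 0 \le i \le k \le j \le n\}
  = \binom{n+3}{3}
  = \frac{(n+1)(n+2)(n+3)}{6},
\]
\[
\text{so the number of such steps is at most }
|T_{2LR}| \cdot \frac{(n+1)(n+2)(n+3)}{6}
  = O\bigl(|T_{2LR}| \cdot |v|^3\bigr).
\]
```

For example, an input of length $n = 2$ gives $\binom{5}{3} = 10$ triples per transition.</Paragraph>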
  </Section>
  <Section position="9" start_page="243" end_page="243" type="metho">
    <SectionTitle>
7 Empirical results
</SectionTitle>
    <Paragraph position="0"> We have performed some experiments with Algorithm 1 applied to $\mathcal{A}_{2LR}$ and $\mathcal{A}'_{LR}$, for 4 practical context-free grammars. For $\mathcal{A}'_{LR}$ a cover was used analogous to the one in Definition 4; the filtering function remains the same.</Paragraph>
    <Paragraph position="1"> The first grammar generates a subset of the programming language ALGOL 68 (van Wijngaarden and others, 1975). The second and third grammars generate a fragment of Dutch, and are referred to as the CORRie grammar (Vosse, 1994) and the Deltra grammar (Schoorl and Belder, 1990), respectively.</Paragraph>
    <Paragraph position="2"> These grammars were stripped of their arguments in order to convert them into context-free grammars.</Paragraph>
    <Paragraph position="3"> The fourth grammar, referred to as the Alvey grammar (Carroll, 1993), generates a fragment of English and was automatically generated from a unification-based grammar.</Paragraph>
    <Paragraph position="4"> The test sentences have been obtained by automatic generation from the grammars, using the Grammar Workbench (Nederhof and Koster, 1992), which uses a random generator to select rules; therefore these sentences do not necessarily represent input typical of the applications for which the grammars were written. Table 1 summarizes the test material. [Table 1 caption, as far as recoverable: ... some of their dimensions, and the average length of the test sentences (20 sentences of various length for each grammar).]</Paragraph>
  </Section>
  <Section position="10" start_page="243" end_page="244" type="metho">
    <SectionTitle>
[Table 2 column headers: $\mathcal{A}'_{LR}$, $\mathcal{A}_{2LR}$]
</SectionTitle>
    <Paragraph position="0"> [Table 2 caption, as far as recoverable: ... time per sentence.]</Paragraph>
    <Paragraph position="1"> Our implementation is merely a prototype, which means that the absolute duration of the parsing process is little indicative of the actual efficiency of more sophisticated implementations. Therefore, our measurements have been restricted to implementation-independent quantities, viz. the number of elements stored in the parse table and the number of elementary steps performed by the algorithm. In a practical implementation, such quantities will strongly influence the space and time complexity, although they do not represent the only determining factors. Furthermore, all optimizations of the time and space efficiency have been left out of consideration.</Paragraph>
    <Paragraph position="2"> Table 2 presents the costs of parsing the test sentences. The first and third columns give the number of entries stored in table U, the second and fourth columns give the number of elementary steps that were performed.</Paragraph>
    <Paragraph position="3"> An elementary step consists of the derivation of one element in $Q'_{LR}$ or $Q_{2LR}$ from one or two other elements. The elements that are used in the filtering process are counted individually. We give an example for the case of $\mathcal{A}'_{LR}$. Suppose we derive an element $q' \in U_{i,j}$ from an element $(A \to \bullet \alpha) \in U_{i,j}$, warranted by two elements $q_1, q_2 \in U_i$, $q_1 \ne q_2$, through pred, in the presence of $q_1\,(A \to \bullet \alpha) \stackrel{\varepsilon}{\mapsto} q_1\,q' \in T'_{LR}$ and $q_2\,(A \to \bullet \alpha) \stackrel{\varepsilon}{\mapsto} q_2\,q' \in T'_{LR}$. We then count two parsing steps, one for $q_1$ and one for $q_2$.</Paragraph>
    <Paragraph position="4"> Table 2 shows that there is a significant gain in space and time efficiency when moving from $\mathcal{A}'_{LR}$ to $\mathcal{A}_{2LR}$.</Paragraph>
    <Paragraph position="5"> Apart from the dynamic costs of parsing, we have also measured some quantities relevant to the construction and storage of the two types of tabular LR parser. These data are given in Table 3.</Paragraph>
    <Paragraph position="6"> We see that the number of states is strongly reduced with regard to traditional LR parsing. In the case of the Alvey grammar, moving from $|\mathcal{R}_{LR}|$ to $|\mathcal{R}_{2LR}|$ amounts to a reduction to 20.3 %. Whereas time- and space-efficient computation of $\mathcal{R}_{LR}$ for this grammar is a serious problem, computation of $\mathcal{R}_{2LR}$ will not be difficult on any modern computer. Also significant is the reduction from $|T'_{LR}|$ to $|T_{2LR}|$, especially for the larger grammars. These quantities correlate with the amount of storage needed for naive representation of the respective automata.</Paragraph>
  </Section>
</Paper>