<?xml version="1.0" standalone="yes"?>
<Paper uid="E91-1012">
  <Title>Non-deterministic Recursive Ascent Parsing René Leermakers</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The ascent recognizer
</SectionTitle>
    <Paragraph position="0"> One way to make the recognizer more deterministic is by combining functions corresponding to a number of competing items into one function. Let the set of all items of G be given by In. Subsets of I6; are called states, and we use q to be an arbitrary state, lWe associate to each state q a function, re-using the above operator \[.\],</Paragraph>
    <Paragraph position="2"> As above, the function reports which parts of the sentence can be derived. But as the function is associated to a set q of items, it has to do so for each item in q. If we define the initial state q0 = {S' --* .S}, now S --,&amp;quot; xl...xn is equivalent to (S' ---* .S,n) * \[q0\](0). Before proceeding, we need a couple of definitions.</Paragraph>
    <Paragraph position="3"> Let ini(q) be the set of initial items for state q, that are derived from q by the closure operation: ini(q) = { B --* .AIB -. A ^ A --* a.3 * q A 3 =C/ B-r}.</Paragraph>
    <Paragraph position="4"> The double arrow =C/, denotes a left-most-symbol rewriting Ba =e~ Cfla, using a non-e rule B ---, Cfl. The transition function goto is defined by (B * V)</Paragraph>
    <Paragraph position="6"> by relating to each state q not only the above \[q\], but also a function__ that we take to be the result of applying operator \[.\] to the state: \[q\] : V x N --* 2 Ideg xN It has the specification</Paragraph>
    <Paragraph position="8"> are recursively implemented by</Paragraph>
    <Paragraph position="10"> This is equivalent to the earlier version because we may replace the clause B ~ e by B ---, .e * ini(q). Indeed, if state q has item A --* a.fl and if there is a left-most-symbol derivation/3 =~* B-r then all items B --* .A are included in ini(q).</Paragraph>
    <Paragraph position="11"> For establishing the correctness of \[q--\] notice that</Paragraph>
    <Paragraph position="13"> By the definition of goto, if A ---, a.B-r * q then A --, aB.-r * goto(q, B). tlence, with the specification of \[q\], So may be rewritten as</Paragraph>
    <Paragraph position="15"> Also, as before, ~ =~* C'r implies that all items C ~ .g are in ini(q), and the existence of C -* .B~ in ini(q)</Paragraph>
    <Paragraph position="17"> In the computation of \[q0\](0), functions are needed only for states in the canonical collection of LR(0) states \[6\] for G, i.e. for every state that can be reached from the initial state by repeated application of the goto function.</Paragraph>
    <Paragraph position="18"> Note that in general the state C/ will be among these, and that both \[C/\](i) and \[g\](B, i) are empty sets for all i _&gt; 0 and B E V.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Deterministic variants
</SectionTitle>
    <Paragraph position="0"> One can prove that, if the grammar is LR(0), each recognizer function for a canonical LR(0) state results in a set with at most one element. The functions for non-empty q may in this case be rephrased as \[q\](i): if, for some I, I E q A final(l) t__hen return {(I, i)} else if B --..e E ini(q) then ret__.urn \[q\](B, i) else if i &lt; n then return \[q\](xi+~, i + 1) else return fi \[q\](B,i): if \[9oto(q, B)\](i) = C/ then return ~ else let (I, j) be the unique element of \[goto(q, B)\](i). Then: if pop(I) E q then return {(pop(l), j)} else return \[q\](Ihs(l), j) fl fi Reversely, the implementations of \[q\](i) and \[q\](B,i) of the previous section can be seen as non-deterministic versions of the present formulation, which therefore provides an intuitive picture that may be helpful to understand the non-deterministic parsing process in an operational way.</Paragraph>
    <Paragraph position="1"> Each function can be replaced by a procedure that, instead of returning a function result, assigns the result to a global (set) variable. As this set variable may contain at most one element, it can be represented by three variables, a boolean b, an item R and an integer i. If a function would have resulted in the set {(I,j)}, the global variables are set to b = TRUE, R = I and i = j. A function value ~ is represented by b = FALSE. Also the arguments of the functions are superfluous now. The rble of argument i can be played by the global variable with the same na__.rne, and lhs(R)can be used instead of argument B of \[q\]. Consequently, procedure \[C/\] becomes a statement b := FALSE, whereas for non-emp.~, q one gets the procedures (keeping the names \[q\] and \[q\], trusting no confusion will arise): \[q\] : if, for some I, I E q A final(l) then R := I</Paragraph>
    <Paragraph position="3"> if b. then if pop(R) E q then R := pop(R)</Paragraph>
    <Paragraph position="5"> Note that these procedures do not depend on the details of the right hand side of R. Only the number of symbols before the dot is relevant for the test &amp;quot;pop(R) E q&amp;quot;. Therefore, R can be replaced by two variables X E V and an integer I, making the following substitutions in the previous procedures:</Paragraph>
    <Paragraph position="7"> After these substitutions, one gets close to the recursive ascent recognizer as it was presented in \[1\]. A recognizer that is virtually the same as in \[l~s obtained by replacing the tail-recursive procedure \[q\] by an iterative loop.</Paragraph>
    <Paragraph position="8"> Then one is left with one procedure for each state. While parsing there is, at each instance, a stack of activated procedures that corresponds to the stacks that are explicitly maintained in conventional implementations of deterministic LR-parsers.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Complexity
</SectionTitle>
    <Paragraph position="0"> For LL(0) grammars the recursive descent recognizer is deterministic and works in linear time. The same is true of the ascent recognizer for LR(0) grammars. In the general, non-deterministic, case the recursive descent and ascent recognizers need exponential time unless the functions are implemented as memo-functions \[3\]. Memo-functions memorize for which arguments they have been called. If a function is called with the same arguments as before, the function returns the previous result without recomputing it. In conventional programming languages memo-functions are not available, but they can easily be implemented. Devices like graph-structured stacks \[4\], parse matrices \[7\], or welbformed - 65 substring tables \[8\], are in fact low-level realizations of the abstract notion of memo-functions. The complexity analysis of the recognizers is quite simple. There are O(n) different invocations of parser functions. The functions call at most O(n) other functions, that all result in a set with O(n) elements (note that there exist only O(n) pairs (I, j) with I E IG, i _&lt; j _&lt; n). Merging these sets to one set with no duplicates can be accomplished in O(n 2) time on a random access machine. Hence, the total time-complexity is O(na). The space needed for storing function results is O(n) per invocation, i.e. O(n 2) for the whole recognizer.</Paragraph>
    <Paragraph position="1"> The above considerations only hold if the parser terminates. The recursive descent parser terminates for all grammars that are not left-recursive. For the recursive ascent parser, the situation is more complicated. If the gra_m.mmar has a cyclic derivation B -** B, the execution of \[q\](B, i) leads to a call of itself. Also, there may be a cycle of transitions labeled by non-terminals that derive e, e.g. if goto(q, B) = q A B ---, e, so that the execution of \[q\](i) leads to a call of itself. There are non-cyclic grammars that suffer from such a cycle (e.g. S --* SSb, S --* e). Hence, the ascent parser does not terminate if the grammar is cyclic or if it leads to a cycle of transitions labeled b_.~ non-terminals that derive e. Otherwise, execution of \[q\](B, i) can only lead to calls of \[p\](i) with p ~ q and to calls of \[q\](C,k), such that either k &gt; i or C--** BAC ~ B. As there are only finitely many such p, C, the parser terminates. Note that both the recursive descent and ascent recognizer terminate for any grammar, if the recognizer functions are implemented as memo-functions with the property that a call of a function with some arguments yields $ while it is under execution. For instance, if execution of \[q\](i) leads to a call of itself, the second call is to yield ~. A remark of this kind, for the recursive descent parser, was first made in ref. \[8\]. The recursive descent parser then becomes virtually equivalent to a version of the standard Earley algorithm \[9\] that stores items A ---* a./~ in parse matrix entry Ti i if/~ ---,* xi+l...xi, instead of storing it if a --*deg x~+l...xj.</Paragraph>
    <Paragraph position="2"> The space required for a parser that also calculates a parse forest, is dominated by this forest. We show in the next section that it may be compressed into a cubic amount of space. In the complexity domain our ascent parser beats its rival, Tomita's parsing method \[4\], which is non-polynomial: for each integer k there exists a grammar such that the complexity of the Tomita parser is worse than n k.</Paragraph>
    <Paragraph position="3"> In addition to the complexity as a function of sentence length, one may also consider the complexity as a function of grammar size. It is clear that both time and space complexity are proportional to the number of parsing procedures. The number of procedures of the recursive descent parser is proportional to the number of items, and hence a linear function of the grammar size. The recursive ascent parser, however, contains two functions for each LR-state and is hence proportional to the size of the canonical collection of LR(0) states. In the worst case, this size is an exponential function of grammar size, but in the average natural language case there seems to be a linear, or even sublinear, dependence \[4\].</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Parse forest
</SectionTitle>
    <Paragraph position="0"> Usually, the recognition process is followed by the construction of parse trees. For ambiguous grammars, it becomes an issue how to represent the set of parse trees as compactly as possible. Below, we describe how to obtain a cubic representation in cubic time. We do so in three steps.</Paragraph>
    <Paragraph position="1"> In the first step, we observe that ambiguity often arises locally: given a certain context C\[-\], there might be several parse subtrees tl...tk (all deriving the same substring xi+l...xj from the same symbol A) that fit in that same context, leading to the parse trees C\[tl\], eft2\] ..... c\[th\] for the given string zl...zn. Instead of representing these parse trees separately, repeating each time the context C, we can represent them collectively as C\[{~1, ..., tk}\]. Of course, this idea should be applied recursively. Technically, this leads to a kind of tree-llke structure in which each child is a set of substructures rather than a single one.</Paragraph>
    <Paragraph position="2"> The sharing of context can be carried one step further.</Paragraph>
    <Paragraph position="3"> If we have, in one and the same context, a number of applied occurrences of a production rule A ---, a/~ which share also the same parse forest for a, we can represent the context of A ---* a~ itself and the common parse forest for a only once and fit the set of parse forests for fl into that. Again this idea has to be applied recursively. Technically, this leads to a binary representation of parse trees, with each node having at most two sons, and to the application of the context sharing technique to this binary representation.</Paragraph>
    <Paragraph position="4"> These two ideas are captured by introducing a function f with the interpretation that f(f3, i,j) represents the parse forest of all derivations from /~ E V* to zi+~...x~, for all i,j such that 0 &lt; i &lt; j &lt; n. The following recursive definitions fix the parse forest representation formally:</Paragraph>
    <Paragraph position="6"> a .---*&amp;quot; xi+l...x~}, for all A E VN, f(AB/3, i, j) = {(f(A, i, k), f(B#, k, J))l i &lt; k &lt; jAA ---,&amp;quot; xi+l...Xk ^ B/~ --~&amp;quot; xk+l...xj}, for all A, B E V.</Paragraph>
    <Paragraph position="7"> The representation for the set of parse trees is then just f(S, 0, n).</Paragraph>
    <Paragraph position="8"> We now come to our third step. Suppose, for the moment, that the guards a ---,* xi+l...xj and the like, occurring above, can be evaluated in some way or another. Then we can use function f to compute the representation of the set of parse trees for sentence xl...xn. If we make use of memo-functions to avoid repeated computation of a function applied to the same arguments, we see that there are at most O(n 2) function evaluations. - 66 If we represent function values by re\]erences to the set representations rather than by the sets themselves, the most complicated function evaluation consumes an additional amount of storage that is O(n): for j - i + 1 values of k we have to perform the construction of a pair of (copies of) two references, costing a unit amount of storage each. Therefore, the total amount of space needed for the representation of all parse trees is O(n3). The evaluation of the guards ct ---.&amp;quot; xi+l...xj etc. amounts exactly to solving a collection of recognition problems. Note that a top-down parser is possible that merges the recognition and tree-building phases, by writing</Paragraph>
    <Paragraph position="10"> for all A, B E V, the other cases for f being left unchanged. Note the similarity between the recognizing part of this algorithm and the descent recognizer of section 2. Again, this parser is a cubic algorithm if we use memo-functions. Another approach is to apply a bottom-up recognizer first and derive from it a set P containing triples (/i, i,j) only if/3 ---'&amp;quot; xi+l...xj, and at least those triples (/i, i,j) for which the guards/3 ---** xi+a ...xj are evaluated during the computation of f(S, O, n) (i.e., for each derivation S ---.&amp;quot; xl...xkAxj+l...Zn &amp;quot;-* Xl...XkOl/iXj+l...Xn &amp;quot;-'** zl...xiflzj+l...xn &amp;quot;~&amp;quot; xl...xn, the triples (/i,i,j) and (A,k,j) should be in P). The simplest way to obtain such P from our recognizer is to assume an implementation of memo-functions that enables access to the memoized function results, after executing \[q0\](O). Then one has the disposal of the set {(/i, i,j)l\[q\](i ) was invocated and (A --* a./i, j) e \[q\](i)} Clearly, (/i,i,j) is only in this set if /i --+&amp;quot; xi+l...x i. Note, however, that no pairs (A --~ ./i,j) are included in \[q\](i) (except if A = S'). We remedy th__is with a slight change of the specifications of \[q\] and \[q\], defining</Paragraph>
    <Paragraph position="12"> A recursive implementation of the recognition functions now is</Paragraph>
    <Paragraph position="14"> If we define, for this revised recognizer,</Paragraph>
    <Paragraph position="16"> {(A, i, j)l\[q\](i) was invocated and (a --, .~,j) e \[q\](i)}u {(x~+~,i,i+ DI0 &lt; i &lt; n}, it contains all triples that are needed in f(S, O, n), and we may write the forest constructing function as</Paragraph>
    <Paragraph position="18"> f(AB/i, i, j) ---- {(I(A, i, k), f(B/3, k, J))l (A, i, k) e P A (Bit, k, j) e P}, for all A, B e V, the other cases for f being left unchanged again. There exists a representation of P in quadratic space such that the presence or absence of an arbitrary triple can be decided upon in unit time. As a result, the time complexity of f(S, O, n) is cubic.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Extended CF grammars
</SectionTitle>
    <Paragraph position="0"> An extended CF grammar consists of grammar rules with regular expressions at the right hand side. Every extended CF grammar can be translated into a normal CF grammar by replacing each right hand side by a regular (sub)grammar. The strong generative power is different from CF grammars, however, as the degree of the nodes in a derivation tree is unbounded. To apply our recognizer directly to extended grammars, a few of the foregoing definitiovs have to be revised.</Paragraph>
    <Paragraph position="1"> As before, a grammar rule is written A --, a, but with a now a regular expression with Na symbols (elements of V). Defining T + = 1...N,, and Ta = 0...Na, regular expression tr can be characterized by 1. a mapping C/~ : T~ + ~ V associating a grammar symbol to each number.</Paragraph>
    <Paragraph position="2"> 2.. a function succo : To --* 2 T+ mapping each number to its set of successors. The regular expression can start with tile symbols corresponding to the numbers in succo(O).</Paragraph>
    <Paragraph position="3"> 3. a set a,~ E 2 7`0 of numbers of symbols the regular expression can end with.</Paragraph>
    <Paragraph position="4"> Note that 0 is not associated to a symbol in V and is not a possible element of succ,,(k). It can be element of a,~ though, in which case there is an empty path through the regular expression.</Paragraph>
    <Paragraph position="5"> We define an item as a pair (A --, a,k), with the interpretation that number k is 'just before the dot'. The correspondence with dotted rules is the following. Let a = B1...Bt, then a is a simple regular expression characterized by ~ba(k) = Bk, succa(k) = {k + 1} if 0 &lt; k &lt; l, succo(l) = {~, and a,, = {I}. Item (A ---. a,0) corresponds to the initial item A ---* .a and (A ---* a, k) to the dotted-rule item with the dot just after Bk.</Paragraph>
    <Paragraph position="6"> The predicate final for the new kind of items is defined by final((A ---* a, k)) = (k E an) Given a set q of items, we define</Paragraph>
    <Paragraph position="8"> The function pop becomes set-valued and the transition function can be defined in terms of it (remember: ~ = q U ini(q)):</Paragraph>
    <Paragraph position="10"> A recursive ascent recognizer is now implemented by</Paragraph>
    <Paragraph position="12"> The initial state q0 is {(S' ---* S, 0)}, and a sentence xl...x, is grammatical if ((S' --* S, 0), n) * \[qo\](O). The recognizer is deterministic if 1. there is no shift-reduce or reduce-reduce conflict, i.e. every state has at most one final item, and in case it has a final item it has no items (A --, ~,j) with k e succ,~(j) A ~b,~(k) * VT.</Paragraph>
    <Paragraph position="13"> 2. for all reachable states q, q N ini(q) = ~, and for all I there is at most one J * ~ such that J E pop(I).</Paragraph>
    <Paragraph position="14"> In the deterministic case, the analysis of section 4 can be repeated with one exception: extended grammar items can not be represented by a non-terminal and an integer that equals the number of symbols before thc dot, as this notion is irrelevant in the case of regular expressions. In standard presentations of deterministic LR-parsing this leads to almost unsurmountable problems \[5\].</Paragraph>
  </Section>
class="xml-element"></Paper>