<?xml version="1.0" standalone="yes"?>
<Paper uid="E93-1036">
  <Title>Generalized Left-Corner Parsing</Title>
  <Section position="3" start_page="305" end_page="306" type="metho">
    <SectionTitle>
2 Left-corner parsing
</SectionTitle>
    <Paragraph position="0"> Before we define LC parsing, we first define some notions strongly connected with this kind of parsing.</Paragraph>
    <Paragraph position="1"> We define a spine to be a path in a parse tree which begins at some node which is not the first son of its father (or which does not have a father), then proceeds downwards every time taking the leftmost son, and finally ends in a leaf.</Paragraph>
    <Paragraph position="2"> We define the relation / between nonterminals such that B / A if and only if there is a rule A --* B a, where a denotes some sequence of grammar symbols.</Paragraph>
    <Paragraph position="3"> The transitive and reflexive closure of / is denoted by L*, which is called the left-corner relation. Informally, we have that B/* A if and only if it is possible  to have a spine in some parse tree in which B occurs below A (or 13 = A). We pronounce B Z* A as &amp;quot;B is a left corner of A&amp;quot;.</Paragraph>
    <Paragraph position="4"> We define the set GOAL to be the set consisting of S, the start symbol, and of all nonterminals A which occur in a rule of the form B--* t~ A fl where is not e (the empty sequence of grammar symbols).</Paragraph>
    <Paragraph position="5"> Informally, a nonterminal is in GOAL if and only if it may occur at the first node of some spine.</Paragraph>
    <Paragraph position="6"> We explain LC parsing by means of the small context-free grammar below. No claims are made about the linguistic relevance of this grammar. Note that we have transformed lexieal ambiguity into grammatical ambiguity by introducing the nonterminals VorN and VorP.</Paragraph>
    <Paragraph position="8"> The algorithm reads the input from left to right.</Paragraph>
    <Paragraph position="9"> The elements on the parse stack are either nonterminals (the goal elements) or items (the item elements). Items consist of a rule in which a dot has been inserted somewhere in the right side to separate the members which have been recognized from those which have not.</Paragraph>
    <Paragraph position="10"> Initially, the parse stack consists only of the start symbol, which is the first goal, as indicate in Figure 1. The indicated parse corresponds with one of the two possible readings of &amp;quot;time flies like an arrow&amp;quot; according to the grammar above.</Paragraph>
    <Paragraph position="11"> We define a nondeterministic LC parser by the parsing steps which are possible according to the following clauses: la. If the element on top of the stack is the nonterminal A and if the first symbol of the remaining input is t, then we may remove t from the input and push an item \[B --~ t * ~\] onto the stack, provided B /* A.</Paragraph>
    <Paragraph position="12"> lb. If the element on top of the stack is the non-terminal A, then we may push an item \[B --~ .\] onto the stack, provided B /* A. (The item \[B --* .\] is derived from an epsilon rule B ---, c.)  2. If the element on top of the stack is the item \[A ~ c~ . t /~\] and if the first symbol of the remaining input is t, then we may remove t from the input and replace the item by the item \[A ---+ Ott o ill.</Paragraph>
    <Paragraph position="13"> 3. If the top-most two elements on the stack are B \[A ~ ~ .\], then we may replace the item by an item of the form \[C --* A * ill, provided C Z* B. 4. If the top-most three elements on the stack are \[a -~ ft. A 7\] A \[A -~ ~ 4, then we may replace these three elements by the item \[B --* fl A * 7\]5. If a step according to one of the previous clauses ends with an item \[A ~ t~ * B ~ on top of the stack, where B is a nonterminal, then we subsequently push B onto the stack.</Paragraph>
    <Paragraph position="14"> 6. If the stack consists only of the two elements S  \[S--* a .\] and if the input has been completely read, then we may successfully terminate the parsing process.</Paragraph>
    <Paragraph position="15"> Note that only nonterminals from GOAL will occur as separate elements on the stack.</Paragraph>
    <Paragraph position="16"> The nondeterministie LC parsing algorithm defined above uses one symbol of lookahead in case of terminal left corners. The algorithm is therefore deterministic for the LC(0) grammars, according to the definition of LC(k) grammars in \[Soisalon-Soininen and Ukkonen, 1979\]. (This definition is incompatible with that of \[Rosenkrantz and Lewis II, 1970\].) The exact formulation of the algorithm above is chosen to simplify the treatment of generalized LC parsing in the next section. The strict separation between goal elements and item elements has also been achieved in \[Perlin, 1991\], as opposed to \[Schabes, 1991\].</Paragraph>
  </Section>
  <Section position="4" start_page="306" end_page="308" type="metho">
    <SectionTitle>
3 Generalizing left-corner parsing
</SectionTitle>
    <Paragraph position="0"> The construction of Lang can be used to form deterministic table-driver parsing algorithms from non-deterministic push-down automata. Because left-corner parsers are also push-down automata, Lang's construction can also be applied to formulate a deterministic parsing algorithm based on LC parsing.</Paragraph>
    <Paragraph position="1"> The parsing algorithm we propose in this paper does however not follow straightforwardly from Lang's construction. If we applied the construction directly, then not as much sharing would be provided as we would like. This is caused by the fact that sharing of computation of different search paths is interrupted if different elements occur on top of the stack (or just beneath the top if elements below the top are investigated).</Paragraph>
    <Paragraph position="2"> To explain this more carefully we focus on Clause 3 of the nondeterministic LC parser. Assume the following situation. Two different search paths have at the same time the same item element \[A --* a o\] on top of the stack. The goal elements (say B' and B&amp;quot; ) below that item element are different however in both search paths.</Paragraph>
    <Paragraph position="3"> This means that the step which replaces \[A ---* a o\] by \[C ---, A deg/~\], which is done for both search paths (provided both C/* B' and C/* B&amp;quot;), is done separately because B' and B&amp;quot; differ. This is unfortunate  because sharing of computation in this case is desirable both for efficiency reasons but also because it would simplify the construction of a most-compact parse forest.</Paragraph>
    <Paragraph position="4"> Related to the fact that we propose to implement the parse table by means of a graph-structured stack, our solution to this problem lies in the introduction of goal elements consisting of sets of nonterminals from GOAL, instead of single nonterminals from GOAL.</Paragraph>
    <Paragraph position="5"> As an example, Figure 2 shows the state of the graph-structured stack for the situation just after reading &amp;quot;time flies&amp;quot;. Note that this state represents the states of two different search paths of a nondeterministic LC parser after reading &amp;quot;time flies&amp;quot;, one of which is the state after Step 3 in Figure 1.</Paragraph>
    <Paragraph position="6"> We see that the goals NP and VP are merged in one goal element so that there is only one edge from the item element labelled with \[VorN ~ &amp;quot;flies&amp;quot; deg\] to those goals.</Paragraph>
    <Paragraph position="7"> Merging goals in one stack element is of course only useful if those goals have at least one left corner in common. For the simplicity of the algorithm, we even allow merging of two goals in one goal element if these goals have anything to do with each other with respect to the left-corner relation /*.</Paragraph>
    <Paragraph position="8"> Formally, we define an equivalence relation ~ on nonterminals, which is the reflexive, transitive, and symmetric closure of L. An equivalence class of this relation which includes nonterminal A will be denoted by \[A\]. Each goal element will now consist of a subset of some equivalence class of ~.</Paragraph>
    <Paragraph position="9"> In the running example, the goal elements consist of subsets of {S, NP,VP, PP}, which is the only equivalence class in this example.</Paragraph>
    <Paragraph position="10"> Figures 3 and 4 give the complete generalized LC parsing algorithm. At this stage we do not want to complicate the algorithm by allowing epsilon rules in the grammar. Consequently, Clause lb of the non-deterministic LC parser will have no corresponding piece of code in the GLC parsing algorithm. For the other clauses, we will indicate where they can be retraced in the new algorithm. In Section 4 we explain how our algorithm can be extended so that also grammars with epsilon rules can be handled.</Paragraph>
    <Paragraph position="11"> The nodes and arrows in the parse forest are constructed by means of two functions: MAKE_NODE (X) constructs a node with label X, which is a terminal or nonterminal. It returns (the address of) that node.</Paragraph>
    <Paragraph position="12"> A node is associated with a number of lists of sons, which are other nodes in the forest. Each list represents an alternative derivation of the nonterminal with which the node is labelled. Initially, a node is associated with an empty collection of lists of sons.</Paragraph>
    <Paragraph position="13"> ADD_SUBNODE (m, 1) adds a list of sons I to the node m.</Paragraph>
    <Paragraph position="14"> In the algorithm, an item element el labelled with \[A --* Xx ... X,n * .a\] is associated with a list of nodes deriving X1 ..... Xm. This list is accessed by SONS (el). A list consisting of exactly one node m is denoted by &lt;m&gt;, and list concatenation is denoted by the operator +.</Paragraph>
    <Paragraph position="15"> A goal element g contains for every nonterminal A such that A L* P for some P in g a value NODE (g, A), which is the node representing some derivation of A found at the current input position, provided such a derivation exists, and NODE (9, A) is NIL otherwise.</Paragraph>
    <Paragraph position="16"> In the graph-structured stack there may be an edge from an item element to a unique goal element, and from a goal in a goal element to a number of item elements. For item element el, SUCCESSOR (el) yields the unique goal element to which there is an edge from el. For goal element g and goal P in g, SUCCESSORS (g, P) yields the zero or more item elements to which there is an edge from P in g.</Paragraph>
    <Paragraph position="17"> The global variables used by the algorithm are the</Paragraph>
    <Paragraph position="19"> following.</Paragraph>
    <Paragraph position="20"> a0 al ... an The symbols in the input string. i The current input position.</Paragraph>
    <Paragraph position="21"> r The root of the parse forest. It has the value NIL at the end of the algorithm if no parse has been found.</Paragraph>
    <Paragraph position="22"> r and Fnezt The sets of goal elements containing goals to be fulfilled from the current and next input position on, respectively.</Paragraph>
    <Paragraph position="23"> I and Inezt The sets of item elements labelled with \[A ~ a * t ~ such that a shift may be performed through t at the current and next input position, respectively.</Paragraph>
    <Paragraph position="24"> F The set of pairs (g, A) such that a derivation from A has been found for g at the current input position. In other words, F is the set of all pairs (g, A) such that NODE (g, A) ~ NIL.</Paragraph>
    <Paragraph position="25"> The graph-structured stack (which is initially empty) and the rules of the grammar are implicit global data structures.</Paragraph>
    <Paragraph position="26"> In a straightforward implementation, the relation /* is recorded by means of one large s' x s boolean matrix, where s is the number of nonterminals in the grammar, and s' is the number of elements in GOAL. We can do better however by using the fact that A Z* B is never true if A 7~ B. We propose the storage of Z* for every equivalence class of ,,, separately, i.e. we store one t' x t boolean matrix for every class of ,,, with t members, t ~ of which are in GOAL.</Paragraph>
    <Paragraph position="27"> We furthermore need a list of all rules A --* X a for each terminal and nonterminal X. A small optimization of top-town filtering (see also Section 6) can be achieved by grouping the rules in these lists according to the left sides A.</Paragraph>
    <Paragraph position="28"> Note that the storage of the relation Z* is the main obstacle to a linear-sized parser.</Paragraph>
    <Paragraph position="29"> The time needed to generate a parser is determined by the time needed to compute Z* and the classes of -~, which is quadratic in the size of the grammar.</Paragraph>
  </Section>
  <Section position="5" start_page="308" end_page="310" type="metho">
    <SectionTitle>
4 Adapting the algorithm for arbitrary context-free grammars
</SectionTitle>
    <Paragraph position="0"> arbitrary context-free grammars The generalized LC parsing algorithm from the previous section is only specified for grammars without epsilon rules. Allowing epsilon rules would not only complicate the algorithm but would for some grammars also introduce the danger of non-termination of the parsing process.</Paragraph>
    <Paragraph position="1"> There are two sources of non-termination for non-deterministic LC and LR parsing: cyelicity and hidden left-recursion. A grammar is said to be cyclic if there is some derivation of the form A ---+ A. A grammar is said to be hidden left-recursive if A --* B a, B -+* e, and c~ --+* A ~, for some A, B, a, and ~. Hidden left recursion is a special case of left recursion where the fact is &amp;quot;hidden&amp;quot; by an emptygenerating nonterminal. (A nonterminal is said to be nonfalse if it generates the empty string.) Both sources of non-termination have been studied extensively in \[Nederhof and Koster, 1993; Nederhof and Sarbo, 1993\].</Paragraph>
    <Paragraph position="2"> An obvious way to avoid non-termination for non-deterministic LC parsers in case of hidden left-recursive grammars is the following. We generalize the relation i so that B L A if and only if there is a rule A -~ p B fl, where p is a (possibly empty) sequence of grammar symbols such that /~ --* e. Clause lb is eliminated and to compensate this, Clauses la and 3 are modified so that they take into account prefixes of right sides which generate the empty string: la. If the element on top of the stack is the nonterminal A and if the first symbol of the remaining input is t, then we may remove t from the input and push an item \[B -* p t * a\] onto the stack, provided B Z* A and p -+* e.</Paragraph>
    <Paragraph position="3">  3. If the top-most two elements on the stack are B \[A --+ a .\], then we may replace the item by an item of the form \[C --+ p A * fl\], provided C L* B and # -+* e.</Paragraph>
    <Paragraph position="4">  These clauses now allow for nonfalse members at the beginning of right sides. To allow for other nonfalse members we need an extra seventh clause: 4 7. If the element on top of the stack is the item \[A --+ t~ * B fl\], then we may replace this item by the item \[A --+ a B * fl\], provided B --+* e.</Paragraph>
    <Paragraph position="5"> The same idea can be used in a straightforward way to make generalized LC parsing suitable for 4Actually, an eighth clause is necessary to handle the special case where S, the start symbol, is nonfalse, and the input is empty. We omit this clause for the sake of clarity.</Paragraph>
    <Paragraph position="6">  hidden left-recursive grammars, similar to the way this is handled in \[Schabes, 1991\] and \[Leermakers, 1992\]. The only technical problem is that, in order to be able to construct a complete parse forest, we need precomputed subforests which derive the empty string in every way from nonfalse nonterminals. This precomputation consists of performing m A C/= MAKE_NODE (A) for each nonfalse nonterminal A, (where m A are specific variables, one for each nonterminal A) and subsequently performing ADD_SUBNODE (mA, &lt;ms1,... , mBk&gt; ) for each rule A -~ B1 ... Bk consisting only of nonfalse nonterminals. The variables m A now contain pointers to the required subforests.</Paragraph>
    <Paragraph position="7"> GLC parsing is guaranteed to terminate also for cyclic grammars, in which case the infinite amount of parses is reflected by cyclic forests, which are also discussed in \[Nozohoor-Farshi, 1991\].</Paragraph>
  </Section>
  <Section position="6" start_page="310" end_page="311" type="metho">
    <SectionTitle>
5 Parsing in cubic time
</SectionTitle>
    <Paragraph position="0"> The size of parse forests, even of those which are optimally dense, can be more than cubic in the length of the input. More precisely, the number of nodes in a parse forest is O(nP+l), where p is the length of the right side of the longest rule.</Paragraph>
    <Paragraph position="1"> Using the normal representation of parse forests does therefore not allow cubic parsing algorithms for arbitrary grammars. There is however a kind of shorthand for parse forests which allows a representation which only requires cubic space.</Paragraph>
    <Paragraph position="2"> For example, suppose that of some rule A --* c~ fl, the prefix a of the right side derives the same part of the input in more than one way, then these derivations may be combined in a new kind of packed node. Instead of the multiple derivations from a, this packed node is then combined with the derivations from j3 deriving subsequent input. We call packing of derivations from prefixes of right sides subpaclC/ing to distinguish this from normal packing of derivations from one nonterminal.</Paragraph>
    <Paragraph position="3"> Subpacking has been discussed in \[Billet and Lang, 1989: Leiss, 1990; Leermakers, 1991\]; see also \[Sheil, 1976\].</Paragraph>
    <Paragraph position="4"> Connected with cubic representation of parse forests is cubic parsing. The GLC parsing algorithm in Section 3 has a time complexity of O(nP+l). The algorithm can be easily changed so that, with a little amount of overhead, the time complexit~ is re- 3 duced to O(n ), similar to the algorithms in \[Perlin, 1991\] and \[Leermakers, 1992\], and the algorithm produces parse forests with subpacking, which require only O(n 3) space for storage.</Paragraph>
    <Paragraph position="5"> We consider how this can be accomplished. First we define the underlying rule of an item element labelled with \[A --~ a * fl\] to be the rule A --* a ft. Now suppose that two item elements ell and el2 with the same underlying rule, with the dot at the same position and with the same successor are created at the same input position, then we may perform subpacking for the prefix of the right side before the dot. From then on, we only need one of the item elements ell and el2 for continuing the parsing process.</Paragraph>
    <Paragraph position="6"> Whether two item elements have one and the same goal element as successors cannot be efficiently veri- null fled. Therefore we propose to introduce a new kind of stack element which takes over the role of all former item elements whose successors are one and the same goal element and which have the same underlying rule.</Paragraph>
    <Paragraph position="7"> We leave the details to the imagination of the reader.</Paragraph>
  </Section>
  <Section position="7" start_page="311" end_page="311" type="metho">
    <SectionTitle>
6 Optimization of top-down filtering
</SectionTitle>
    <Paragraph position="0"> One of the most time-costly activities of generalized LC parsing is the check whether for a goal element g and a nonterminal A there is some goal P in g such that A Z* P. This check, which is sometimes called top-down filtering, occurs in the routines FIND_CORNERS and REDUCE. We propose some optimizations to reduce the number of goals P in g for which A Z* P has to be checked.</Paragraph>
    <Paragraph position="1"> The most straightforward optimization consists of annotating every edge from an item element labelled with \[A ~ a */~\] to a goal element g with the sub-set of goals in g which does not include those goals P for which A L* P has already been found to be false. This is the set of goals in g which are actually useful in top-down filtering when a new item element labelled with \[B ---* A. 7\] is created during a REDUCE (see the piece of code in REDUCE corresponding with Clause 3 of the nondeterministic LC parser). The idea is that if A L* P does not hold for goal P in g, then neither does B Z* P if A L B.</Paragraph>
    <Paragraph position="2"> This optimization can be realized very easily if sets of goals are implemented as lists.</Paragraph>
    <Paragraph position="3"> A second optimization is useful if / is such that there are many nonterminals A such that there is only one B with A PS B. In case we have such a non-terminal A which is not a goal, then no top-down filtering needs to be performed when a new item element labelled with \[B --* A * a\] is created during a REDUCE. This can be explained by the fact that if for some goal P we have A Z* P, and ifA C/ P, and if there is only one B such that A / B, then we already know that B z* p.</Paragraph>
    <Paragraph position="4"> There are many more of these optimizations but not all of these give better performance in all cases.</Paragraph>
    <Paragraph position="5"> It depends heavily on the properties of / whether the gain in time while performing the actual top-down filtering (i.e. performing the tests A /* P for some P in a particular subset of the goals in a goal element g) outweighs the time needed to set up extra administration for the purpose of reducing those subsets of the goals.</Paragraph>
  </Section>
  <Section position="8" start_page="311" end_page="312" type="metho">
    <SectionTitle>
7 Preliminary results
</SectionTitle>
    <Paragraph position="0"> Only recently the author has implemented a GLC parser. The algorithm as presented in this paper has been implemented almost literally, with the treatment of epsilon rules as suggested in Section 4. A small adaptation has been made to deal with terminals of different lengths.</Paragraph>
    <Paragraph position="1"> Also recently, some members of our department have completed the implementation of a GLR parser.</Paragraph>
    <Paragraph position="2"> Because both systems have been implemented using different programming languages, fair comparison of the two systems is difficult. Specific problems which occurred concerning the efficient calculation of LR tables and the correct treatment of epsilon rules for GLR parsing suggest that GLR parsing requires more effort to implement than GLC parsing.</Paragraph>
    <Paragraph position="3"> Preliminary tests show that the division of nonterminals into equivalence classes yields disappointing results. In all tested cases, one large class contained most of the nonterminals.</Paragraph>
    <Paragraph position="4"> The first optimization discussed in Section 6 proved to be very useful. The number of goals which had to be considered could in some cases be reduced to one fifth.</Paragraph>
    <Paragraph position="5"> Conclusions We have discussed a parsing algorithm for context-free grammars called generalized LC parsing. This parsing algorithm has the following advantages over generalized LR parsing (in order of decreasing importance). null * The size of a parser is much smaller; if we neglect the storage of the relation /', the size is even linear in the size of the grammar. Related to this, only a little amount of time is needed to generate a parser.</Paragraph>
    <Paragraph position="6"> * The generated parse forests are as compact as possible.</Paragraph>
    <Paragraph position="7"> * Cyclic and hidden left-recursive grammars can be handled more easily and more efficiently (Section 4).</Paragraph>
    <Paragraph position="8"> * As Section 5 shows, GLC parsing can more easily be made to run in cubic time for arbitrary context-free grammars. Furthermore, this can be done without much loss of efficiency in practical cases.</Paragraph>
    <Paragraph position="9"> Because LR parsing is a more refined form of parsing than LC parsing, generalized LR parsing may at least for some grammars be more efficient than generalized LC parsing. 5 However, we feel that this does not outweigh the disadvantages of the large sizes and generation times of LR parsers in general, which renders GLR parsing unfeasible in some natural language applications.</Paragraph>
    <Paragraph position="10"> GLC parsing does not suffer from these defects.</Paragraph>
    <Paragraph position="11"> We therefore propose this parsing algorithm as a reasonable alternative to GLR parsing. Because of the small generation time of GLC parsers, we expect this kind of parsing to be particularly appropriate during the development of grammars, when grammars SThe ratio between the time complexities of GLC parsing and GLR parsing is smaller than some constant, which is dependent on the grammar.</Paragraph>
    <Paragraph position="12">  change often and consequently new parsers have to be generated many times.</Paragraph>
    <Paragraph position="13"> As we have shown in this paper, the implementation of GLC parsing using a graph-structured stack allows many optimizatious. These optimizatious would be less straightforward and possibly less effective if a two-dimensional matrix was used for the implementation of the parse table. Furthermore, matrices require a large amount of space, especially for long input, causing overhead for initialization (at least if no optimizations are used).</Paragraph>
    <Paragraph position="14"> In contrast, the time and space requirements of GLC parsing using a graph-structured stack are only a negligible quantity above that of nondeterministic LC parsing if no nondeterminism occurs (e.g. if the grammar is LC(O)). Only in the worst-case does a graph-structured stack require the same amount of space as a matrix.</Paragraph>
    <Paragraph position="15"> In this paper we have not considered GLC parsing with more lookahead than one symbol for terminal left corners. The reason for this is that we feel that one of the main advantages of our parsing algorithm over GLIt parsing is the small sizes of the parsers. Adding more lookahead requires larger tables and may therefore reduce the advantage of generalized LC parsing over its Lit counterpart.</Paragraph>
    <Paragraph position="16"> On the other hand, the phenomenon reported in \[Billot and Lang, 1989\] and \[Lankhorst, 1991\] that the time complexity of GLIt parsing sometimes worsens if more lookahead is used, does possibly not apply to GLC parsing. For GLIt parsing, more lookahead may result in more Lit states, which may result in less sharing of computation. For GLC parsing there is however no relation between the amount of lookahead and the amount of sharing of computation. Therefore, a judicious use of extra lookahead may on the whole be advantageous to the usefulness of GLC parsing.</Paragraph>
  </Section>
class="xml-element"></Paper>