File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/02/p02-1015_metho.xml
Size: 17,099 bytes
Last Modified: 2025-10-06 14:07:57
<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1015"> <Title>Parsing Non-Recursive Context-Free Grammars</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The CKY algorithm </SectionTitle> <Paragraph position="0"> In this section we present our first parsing algorithm, based on the so-called CKY algorithm (Harrison, 1978) and exploiting a decomposition of computations of PDAs cast in a specific form. We start with a construction that translates the non-recursive input CFGa3a8a7 into a PDA accepting the same language.</Paragraph> <Paragraph position="1"> Let a3a8a7 a100 a12a36a13a41a15a14a17a19a15a82a20a22a15a23a25a24 . The PDA associated a24 of the PDA can always and uniquely be decomposed into consecutive subcomputations, which we call segments, each starting with zero or more push transitions, followed by a single scan transition and by zero or more pop transitions. In what follows, we will formalize this basic idea and exploit it within our parsing algorithm.</Paragraph> <Paragraph position="2"> We writea83 a80a100a123a132 a84 to indicate that there is a computation a12a89a83a39a15a53a45 a24 a93 a0 a12a97a84a39a15a102a24 of the PDA such that all of the following three conditions hold: (i) either a133a83a6a133a100a50a134 or a133a84a5a133a100a114a134 ; (ii) the computation starts with zero or more push transitions, followed by one scan transition readinga45 and by zero or more pop transitions; (iii) if a133a83a6a133a18a135 a134 then the top-most symbol ofa83 must be in the right-hand side of a pop or scan transition (i.e., top-most in the stack at the end of a previous segment) and if a133a84a6a133a116a135 a134 , then the top-most symbol ofa84 must be the left-hand side of a push or scan transition (i.e., top-most in the stack at the beginning of a following segment).</Paragraph> <Paragraph position="3"> Let a136a8a137a36a138a140a139a94a141 a100a114a142a62a16a63a65a64a43a63a67a66a85a143 a37 a142a79a144a133a146a145a62 
a15a76 a106a62a77a76a121a74a27a147a79a9a110a143 a37 formal definition of relation a132 above is provided in Figure 1 by means of a deduction system. We assign a procedural interpretation to such a system following Shieber et al. (1995), resulting in an algorithm for the computation of the relation.</Paragraph> <Paragraph position="4"> We now turn to an important property of segments. Any computation a12a62 a63a105a64a43a63a67a66a15a53a45a98a154a123a155a43a155a44a155a88a45a153a156 a24 a93 a0</Paragraph> <Paragraph position="6"/> <Paragraph position="8"> a154 is a suffix of a84a95a7. This is done by the deduction system given in Figure 2, which defines the relation a100a123a132 a166 . The second side condition of inference rule (5) checks whether a segment</Paragraph> <Paragraph position="10"> may be the first or last segment in a computation.</Paragraph> <Paragraph position="11"> Figure 3 illustrates a computation of a PDA recognizing a string a45 a154a45a128a179a113a45a127a180a82a45a116a181 . A horizontal line segment in the curve represents a scan transition, an upward line segment represents a push transition, and a downward line segment represents a pop transition. The shaded areas represent segments a83 a7 a80a113a175a100a126a132 a84 a7. As an example, the area labelled I represents a62a131a63a65a64a43a63a73a66 a80a43a182a100a126a132 a62a16a63a65a64a43a63a73a66a183a62 a154a62 a179 , for certain stack symbols a62 a154 and a62 a179 , where the left edge of the shaded area represents a62a131a63a65a64a43a63a73a66 and the right edge represents a62 a63a105a64a43a63a67a66a62 a154a62 a179 . Note that segments</Paragraph> <Paragraph position="13"> a84a98a7 abstract away from the stack symbols that are pushed and then popped again. Furthermore, in the context of the whole computation, segments abstract away from stack symbols that are not accessed during a subcomputation. 
As an example, the shaded area labelled III represents segment</Paragraph> <Paragraph position="15"> a79 , for certain stack symbols a76 a154 , a76 a179 and a79 , and this abstracts away from the stack symbols that may occur belowa76 a154 anda79 .</Paragraph> <Paragraph position="16"> Figure 4 illustrates how two adjacent segments are combined. The dashed box in the left-hand side of the picture represents stack symbols from the right edge of segment II that need not be explicitly represented by segment III, as discussed above. We may assume that these symbols exist, so that II and III can be combined into the larger computation in the right-hand side of the picture. Note that if a computation a83 a171a100a123a132a58a166 a84 is obtained as the combination of two segments as in Figure 4, then some internal details of these segments are abstracted away, i.e., stack elements that were pushed and again popped in the combined computation are no longer recorded.</Paragraph> <Paragraph position="17"> This abstraction is a key feature of the parsing algorithm to be presented next, in that it considerably reduces the time complexity as compared with that of an algorithm that investigates all computations of the PDA in isolation.</Paragraph> <Paragraph position="18"> We are now ready to present our parsing algorithm, which is the main result of this section. The algorithm combines the deduction system in Figure 2, as applied to the PDA encoding the input grammara3 a7, with the CKY algorithm as applied to the parsing grammar a3a5a4 . (We assume that a3a9a4 is in CNF.) The parsing algorithm may rule out many combinations of segments from Figure 2 that are inconsistent with the language generated bya3a6a4 . Also ruled out are structural compositions of segments that are inconsistent with the structure that a3 a4 assigns to the corresponding substrings.</Paragraph> <Paragraph position="19"> The parsing algorithm is again specified as a deduction system, presented in Figure 5. 
The algorithm manipulates items of the form a106a26a69a15a88a83a185a15a186a84a21a110, where a26 is a nonterminal of a3 a4 and a83 , a84 are stacks of the PDA encoding a3a11a7. Such an item indicates that there</Paragraph> <Paragraph position="20"> is some terminal string a52 that is derivable from a26 in a3 a4 , and such that a12a89a83a39a15a88a52 a24 a93 a0 a12a193a84a39a15a102a24 . If the item</Paragraph> <Paragraph position="22"> then the intersection of the language generated by a3a9a4 and the language accepted by the PDA (generated by a3a8a7) is non-empty.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Earley's algorithm </SectionTitle> <Paragraph position="0"> The CKY algorithm from Figure 5 can be seen to filter out a selection of the computations that may be derived by the deduction system from Figure 2. One may however be even more selective in determining which computations of the PDA to consider. The basis for the algorithm in this section is Earley's algorithm (Earley, 1970). This algorithm differs from the CKY algorithm in that it satisfies the correct-prefix property (Harrison, 1978).</Paragraph> <Paragraph position="1"> The new algorithm is presented in Figure 6.</Paragraph> <Paragraph position="2"> There are now two types of item involved. The first type of item has the form a106a26a30a27 a48a194a109a58a49a90a133a127a87a194a195a92a83a39a15a85a87a194a195a165a84a21a110, where a26a163a27 a48a196a109a197a49 has the same role as the dotted rules in Earley's original algorithm. The second and third components are stacks of the PDA as before, but these stacks now contain a distinguished position, indicated by a195 . The existence of an item a106a26a198a27 a48a32a109a165a49a35a133a152a87a19a195a199a83a39a15a88a87a19a195a131a84a21a110 implies that able from a48 . 
This is quite similar to the meaning we assigned to the items of the CKY algorithm, but here not all stack symbols in a87a151a83 and a87a18a84 are involved in this computation: only the symbols in a83 and a84 are now accessed, while all symbols in a87 remain unaffected. The portion of the stack represented by a87 is needed to ensure the correct-prefix property in subsequent computations following from this item, in case all of the symbols in a84 are popped.</Paragraph> <Paragraph position="3"> The correct-prefix property is ensured in the following sense. The existence of an item a106a26a200a27a99a48a28a109 a49a199a133a43a87a77a195a8a83a185a15a85a87a165a195a6a84a21a110 implies that (i) there is a string a52 that is both a prefix of a string accepted by the PDA and of a string generated by the CFG such that after</Paragraph> <Paragraph position="4"> processing a52 , a26 is expanded in a left-most derivation and some stack can be obtained of which a87a151a83 represent the top-most elements, and (ii) a48 is rewritten to a51 and while processing a51 the PDA replaces the stack elements a83 by a84 (see footnote 3). The second type of item has the form a106a26a118a27a119a48a90a109 a49a96a133a95a87a205a195a206a83a39a15a88a87a121a195a197a84a194a133a117a81a124a202a43a110. The first three components are the same as before, and a81 indicates that we wish to know whether a stack with top-most symbols a81a8a87a151a83 may arise after reading a prefix of a string that may also lead to expansion of nonterminal a26 in a left-most derivation. Such an item results if it is detected that the existence of a81 below a87a95a83 needs to be ensured in order to continue the computation under the constraint of the correct-prefix property.</Paragraph> <Paragraph position="5"> Our algorithm also makes use of segments, as computed by the algorithm from Figure 1. Consistently with rule (5) from Figure 2, we write such that a62 a31a58a136a8a137a36a138a60a139a97a141a41a167 a76 a31a90a149a39a141a151a150 . 
The use of segments that were computed bottom-up is a departure from pure left-to-right processing in the spirit of Earley's original algorithm. The motivation is that we have found empirically that the use of rule (2) was essential for avoiding a large part of the exponential behaviour; note that this rule considers at most a number of stacks that is quadratic in the size of the PDA.</Paragraph> <Paragraph position="6"> The first inference rule (11) can be easily justified: we want to investigate strings that are both generated by the grammar and recognized by the PDA, so we begin by combining the start symbol and a matching right-hand side from the grammar with the initial stack for the PDA.</Paragraph> <Paragraph position="7"> Segments are incorporated into the left-to-right computation by rules (12) and (13). These two rules are the equivalents of (9) and (10) from Figure 5.</Paragraph> <Paragraph position="8"> Note that in the case of (13) we require the presence of a87 below the marker in the antecedent. This indicates that a stack with top-most symbols a87a95a83 and a dotted rule a26a28a27a55a48a122a109a8a45a128a49 can be obtained by simultaneously processing a string from left to right by the grammar and the PDA. Thereby, we may continue the derivation with the item in the consequent without violating the correct-prefix property. [Footnote 3: We naturally assume that the PDA itself satisfies the correct-prefix property, which is guaranteed by the construction from Section 3 and the fact that a207 a175 is reduced.]</Paragraph> <Paragraph position="9"> Rule (14) states that if a segment presupposes the existence of stack elements that are not yet available, we produce an item that starts a backward computation. We do this one symbol at a time, starting with</Paragraph> <Paragraph position="10"> the symbol a81 just beneath the part of the stack that is already available. 
This will be discussed more carefully below.</Paragraph> <Paragraph position="11"> The predictor step of Earley's algorithm is represented by (15), and the completer step by rules (16) and (17). These latter two are very similar to (12) and (13) in that they incorporate a smaller derivation in a larger derivation.</Paragraph> <Paragraph position="12"> Rules (18) and (19) repeat computations that have been done before, but in a backward manner, in order to propagate the information that deeper stack symbols are needed than those currently available, in particular that we want to know whether a certain stack symbol a81 may occur below the currently available parts of the stack. In (18) this query is passed on to the beginning of the context-free rule, and in (19) this query is passed on backwards through a predictor step. In the antecedent of rule (18) the position of the marker is irrelevant, and is not indicated explicitly. Similarly, for rule (19) we assume the position of the marker is copied unaltered from the first antecedent to the consequent.</Paragraph> <Paragraph position="13"> If we find the required stack symbol a81 , we propagate the information forward that this symbol may indeed occur at the specified position in the stack.</Paragraph> <Paragraph position="14"> This is implemented by rules (20) and (21). Rule (20) corresponds to the predictor step (15), but (20) passes on a larger portion of the stack than (15).</Paragraph> <Paragraph position="15"> Rule (15) only transfers the top-most symbol a62 to the consequent, in order to keep the stacks as shallow as possible and to achieve a high degree of sharing of computation.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Empirical results </SectionTitle> <Paragraph position="0"> We have implemented the two algorithms and tested them on non-recursive input CFGs and a parsing CFG. 
We have had access to six input CFGs of the form described by Langkilde (2000). As the parsing CFG we have taken a small hand-written grammar of about 100 rules. While this small size is not at all typical of practical grammars, it suffices to demonstrate the applicability of our algorithms.</Paragraph> <Paragraph position="1"> The results of the experiments are reported in Figure 1. We have ordered the input grammars by size, according to the number of nonterminals (or the number of nodes in the forest, following the terminology of Langkilde (2000)).</Paragraph> <Paragraph position="2"> The second column presents the number of strings generated by the input CFG, or more accurately, the number of derivations, as the grammars contain some ambiguity. The high numbers show that without a doubt the naive solution of processing the input grammars by enumerating individual strings (derivations) is not a viable option.</Paragraph> <Paragraph position="3"> The third column shows the size, expressed as number of states, of a lattice (acyclic finite automaton) that would result from unfolding the grammar (Knight and Langkilde, 2000). Although this approach could be of more practical interest than the naive approach of enumerating all strings, it still leads to large intermediate results. In fact, practical context-free parsing algorithms for finite automata have cubic time complexity in the number of states, and derive a number of items that is quadratic in the number of states.</Paragraph> <Paragraph position="4"> The next column presents the number of segments a84 . These apply to both algorithms. We only compute segments a83 a80a100a123a132 a84 for terminals a45 that also occur in the parsing grammar. (Further obvious optimizations in the case of Earley's algorithm were found to lead to no more than a slight reduction of produced segments.) The last two columns present the number of items specific to the two algorithms in Figures 5 and 6, respectively. 
Although our two algorithms are exponential in the number of stack symbols in the worst case, just as approaches that enumerate all strings or that unfold a3a11a7 into a lattice, we see that the numbers of items are relatively moderate if we compare them to the number of strings generated by the input grammars.</Paragraph> <Paragraph position="5"> Earley's algorithm generally produces more items than the CKY algorithm. An exception is the last input CFG; it seems that the number of items that Earley's algorithm needs to consider in order to maintain the correct-prefix property is very sensitive to qualities of the particular input CFG.</Paragraph> <Paragraph position="6"> The present implementations use a trie to store stacks; the arcs in the trie closest to the root represent stack symbols closest to the top of the stacks. For example, for storing a83 a80a100a126a132 a84 , the algorithm represents a83 and a84 by their corresponding nodes in the trie, and it indexes a83 a80a100a126a132 a84 twice, once through each associated node. Since the trie is doubly linked (i.e. we may traverse the trie upwards as well as downwards), we can always reconstruct the stacks from the corresponding nodes. This structure is also convenient for finding pairs of matching stacks, one of which may be deeper than the other, as required by the inference rules from e.g. Figure 5, since given the first stack in such a pair, the second can be found by traversing the trie either upwards or downwards.</Paragraph> </Section> </Paper>
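<!--
The trie-based storage of stacks described in the last paragraph can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the paper does not name its implementation language, and the names TrieNode, StackTrie, intern, stack_of and deeper_stacks are hypothetical. Stacks are assumed to be given as lists of symbols with the top-most symbol first, so that arcs near the root correspond to symbols near the top of the stack.

```python
class TrieNode:
    """One node per stored stack; the path from the root spells the stack, top-most symbol first."""

    def __init__(self, symbol=None, parent=None):
        self.symbol = symbol      # stack symbol on the arc leading into this node
        self.parent = parent      # upward link (None for the root)
        self.children = {}        # downward links, keyed by stack symbol

    def child(self, symbol):
        """Return the child reached by `symbol`, creating it if necessary."""
        node = self.children.get(symbol)
        if node is None:
            node = TrieNode(symbol, self)
            self.children[symbol] = node
        return node


class StackTrie:
    """Hypothetical container that maps each distinct stack to a unique trie node."""

    def __init__(self):
        self.root = TrieNode()

    def intern(self, stack):
        """Return the node for `stack`, a list of symbols with the top-most symbol first."""
        node = self.root
        for symbol in stack:
            node = node.child(symbol)
        return node

    @staticmethod
    def stack_of(node):
        """Reconstruct the stack (top-most symbol first) by walking upwards to the root."""
        symbols = []
        while node.parent is not None:
            symbols.append(node.symbol)
            node = node.parent
        symbols.reverse()
        return symbols

    @staticmethod
    def deeper_stacks(node):
        """Yield the nodes of all stored stacks that have the stack at `node` as their top part."""
        pending = [node]
        while pending:
            current = pending.pop()
            yield current
            pending.extend(current.children.values())
```

Because equal stacks share one node, a segment can be indexed by the pair of nodes for its two stacks. Walking upwards from a node reaches the shallower stacks that form its top part, and walking downwards reaches the deeper ones, which is the kind of lookup the text describes for pairs of matching stacks.
-->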