<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2036"> <Title>Left-To-Right Parsing and Bilexical Context-Free Grammars</Title> <Section position="3" start_page="272" end_page="272" type="metho"> <SectionTitle> 2 Bilexical context-free grammars </SectionTitle> <Paragraph position="0"> In this section we introduce the grammar formalism we investigate in this paper. This formalism, originally presented in (Eisner and Satta, 1999), is an abstraction of the language models adopted by several state-of-the-art real-world parsers (see Section 1).</Paragraph> <Paragraph position="1"> We specify a non-stochastic version of the formalism, noting that probabilities may be attached to the rewrite rules exactly as in stochastic CFG (Gonzales and Thomason, 1978; Wetherell, 1980). We assume that the reader is familiar with context-free grammars. Here we follow the notation of (Harrison, 1978; Hopcroft and Ullman, 1979).</Paragraph> <Paragraph position="2"> A context-free grammar (CFG) is a tuple G = (VN, VT, P, S), where VN and VT are finite, disjoint sets of nonterminal and terminal symbols, respectively, S ∈ VN is the start symbol, and P is a finite set of productions having the form A → α, where A ∈ VN and α ∈ (VN ∪ VT)*. A "derives" relation, written ⇒, is associated with a CFG as usual. We use the reflexive and transitive closure of ⇒, written ⇒*, and define L(G) accordingly. The size of a CFG G is defined as |G| = Σ_(A→α)∈P |Aα|.
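The size measure just defined is easy to make concrete. Below is a minimal sketch (the grammar encoding and all names are our own, not from the paper) that represents a CFG as a list of productions and computes |G| as the sum of |Aα| over all productions A → α:

```python
# Illustrative sketch only: a production A -> alpha is a pair (A, alpha),
# and |G| sums |A alpha| = 1 + len(alpha) over all productions.
from typing import List, Tuple

Production = Tuple[str, Tuple[str, ...]]  # (A, alpha)

def grammar_size(productions: List[Production]) -> int:
    """|G| = sum of |A alpha| over all productions A -> alpha."""
    return sum(1 + len(alpha) for _, alpha in productions)

# A toy CNF grammar: S -> A B, A -> a, B -> b
toy = [("S", ("A", "B")), ("A", ("a",)), ("B", ("b",))]
print(grammar_size(toy))  # 3 + 2 + 2 = 7
```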
If every production in P has the form A → BC or A → a, for A, B, C ∈ VN and a ∈ VT, then G is said to be in Chomsky Normal Form (CNF).</Paragraph> <Paragraph position="3"> A CFG G = (VN, VT, P, S[$]) in CNF is called a bilexical context-free grammar if there exists a set VD, called the set of delexicalized nonterminals, such that nonterminals from VN are of the form A[a], consisting of A ∈ VD and a ∈ VT, and every production in P has one of the following two forms: (i) A[a] → B[b] C[c], a ∈ {b, c}; (ii) A[a] → a.</Paragraph> <Paragraph position="4"> A nonterminal A[a] is said to have terminal symbol a as its lexical head. Note that in a parse tree for G, the lexical head of a nonterminal is always "inherited" from some daughter symbol (i.e., from some symbol in the right-hand side of a production). In the sequel, we also refer to the set VT as the lexicon of the grammar.</Paragraph> <Paragraph position="5"> A bilexical CFG can encode lexically specific preferences in the form of binary relations on lexical items. For instance, one might specify P so as to contain the production VP[solve] → V[solve] NP[puzzles] but not the production VP[eat] → V[eat] NP[puzzles]. This allows derivation of VP constituents such as "solve two puzzles", while forbidding "eat two puzzles". See (Eisner and Satta, 1999) for further discussion.</Paragraph> <Paragraph position="6"> The cost of this expressiveness is a very large grammar. Indeed, we have |G| = O(|VD|³ · |VT|²), and in practical applications |VT| ≫ |VD| > 1. Thus, the growth of the grammar size is dominated by the square of the size of the working lexicon. Even if we conveniently group lexical items with distributional similarities into the same category, in practical applications the resulting grammar might have several thousand productions.
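The O(|VD|³ · |VT|²) growth can be made concrete by enumerating every possible binary bilexical production for a small delexicalized set and lexicon. The sketch below is our own illustration, not part of the paper's construction; for each triple of delexicalized nonterminals and each pair of lexical heads, the parent's head is inherited from one of the two daughters:

```python
# Hedged sketch: enumerate all rules A[a] -> B[b] C[c] with a in {b, c}.
# Per (A, B, C) triple there are |VT|^2 head pairs, each contributing one
# rule when b == c and two rules when b != c.
from itertools import product

def all_bilexical_binary_rules(VD, VT):
    rules = []
    for A, B, C in product(VD, repeat=3):
        for b, c in product(VT, repeat=2):
            for a in {b, c}:  # the head is inherited from a daughter
                rules.append((f"{A}[{a}]", (f"{B}[{b}]", f"{C}[{c}]")))
    return rules

VD = ["S", "VP", "NP"]             # 3 delexicalized nonterminals
VT = ["solve", "eat", "puzzles"]   # a 3-word lexicon
rules = all_bilexical_binary_rules(VD, VT)
# |VD|^3 * |VT| * (2|VT| - 1) = 27 * 3 * 5 = 405 distinct rules
print(len(rules))  # 405
```

Even this toy 3-word lexicon produces hundreds of binary rules; with a realistic lexicon the |VT|² factor dominates, as the text observes.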
Parsing strategies that cannot work in time sublinear in the size of the lexicon, and in the size of the whole input grammar, are therefore very inefficient in these cases.</Paragraph> </Section> <Section position="4" start_page="272" end_page="273" type="metho"> <SectionTitle> 3 Correct-prefix property </SectionTitle> <Paragraph position="0"> So-called left-to-right strategies are standardly adopted in algorithms for natural language parsing. Although intuitive, the notion of left-to-right parsing has no precise mathematical meaning: in a pathological way, one could read the input string from left to right, store it into some data structure, and then perform syntactic analysis with a non-left-to-right strategy. In this paper we focus on a precise definition of left-to-right parsing, known in the literature as correct-prefix property parsing (Sippu and Soisalon-Soininen, 1990). Several algorithms commonly used in natural language parsing satisfy this property, for instance Earley's algorithm (Earley, 1970), tabular left-corner and PLR parsing (Nederhof, 1994) and tabular LR parsing (Tomita, 1986).</Paragraph> <Paragraph position="1"> Let VT be some alphabet. A generic string over VT is denoted as w = a1 ··· an, with n ≥ 0 and ai ∈ VT (1 ≤ i ≤ n); in case n = 0, w equals the empty string ε. For integers i and j with 1 ≤ i ≤ j ≤ n, we write w[i, j] to denote the string ai ai+1 ··· aj; if i > j, we define w[i, j] = ε.</Paragraph> <Paragraph position="2"> Let G = (VN, VT, P, S) be a CFG and let w = a1 ··· an with n ≥ 0 be some string over VT. A recognizer for the CFG class is an algorithm R that, on input (G, w), decides whether w ∈ L(G). We say that R satisfies the correct-prefix property (CPP) if the following condition holds. Algorithm R processes the input string from left to right, "consuming" one symbol ai at a time.
If for some i, 0 ≤ i ≤ n, the set of derivations in G having the form S ⇒* w[1, i]γ, γ ∈ (VN ∪ VT)*, is empty, then R rejects and halts, and it does so before consuming symbol ai+1, if i < n. In this case, we say that R has detected an error at position i in w. Note that the above property forces the recognizer to do relevant computation for each terminal symbol that is consumed.</Paragraph> <Paragraph position="3"> We say that w[1, i] is a correct prefix for a language L if there exists a string z such that w[1, i]z ∈ L. In the natural language parsing literature, the CPP is sometimes defined with the following condition in place of the above. If for some i, 0 ≤ i ≤ n, w[1, i] is not a correct prefix for L(G), then R rejects and halts, and it does so before consuming symbol ai+1, if i < n. Note that the latter definition asks for a stronger condition, and the two definitions are equivalent only in case the input grammar G is reduced.¹ While the above-mentioned parsing algorithms satisfy the former definition of CPP, they do not satisfy the latter. In fact, we are not aware of any practically used parsing algorithm that satisfies the latter definition of CPP.</Paragraph> <Paragraph position="4"> One needs to distinguish CPP parsing from some well-known parsing algorithms in the literature that process the symbols in the right-hand side of each grammar production from left to right, but that do not exhibit any left-to-right dependency between different productions. In particular, processing of the right-hand side of some production may be initiated at some input position without consultation of productions, or parts of productions, that may have been found to cover parts of the input to the left of that position. These algorithms may also consult input symbols from left to right, but the processing that takes place to the right of some position i does not strictly depend on the processing that has taken place to the left of i.
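The CPP behavior defined above can be sketched operationally. The following bare-bones Earley-style recognizer is our own simplified illustration (it assumes a grammar without empty productions); it consumes tokens left to right and reports the first position at which no derivation S ⇒* w[1, i]γ survives:

```python
# Hedged sketch of CPP behavior via a minimal Earley recognizer.
# grammar: dict mapping nonterminal -> list of right-hand-side tuples.
# Returns the 1-based position of the first detected error, or None if
# every prefix of tokens extends to some derivation from `start`.
def earley_first_error(grammar, start, tokens):
    n = len(tokens)
    nonterminals = set(grammar)
    chart = [set() for _ in range(n + 1)]      # items: (lhs, rhs, dot, origin)
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(n + 1):
        changed = True
        while changed:                          # predictor + completer closure
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs) and rhs[dot] in nonterminals:
                    for prod in grammar[rhs[dot]]:          # predict
                        item = (rhs[dot], prod, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item); changed = True
                elif dot == len(rhs):                       # complete
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            item = (l2, r2, d2 + 1, o2)
                            if item not in chart[i]:
                                chart[i].add(item); changed = True
        if i == n:
            return None
        for lhs, rhs, dot, origin in chart[i]:              # scan token i+1
            if dot < len(rhs) and rhs[dot] == tokens[i]:
                chart[i + 1].add((lhs, rhs, dot + 1, origin))
        if not chart[i + 1]:
            return i + 1        # CPP: reject before consuming the next symbol

# S -> a S | b generates a*b
g = {"S": [("a", "S"), ("b",)]}
print(earley_first_error(g, "S", list("aab")))  # None: every prefix survives
print(earley_first_error(g, "S", list("aba")))  # 3: "ab" cannot be extended by "a"
```

The "predictor" step mentioned below is exactly the top-down prediction that gives the algorithm its correct-prefix behavior.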
Examples are pure bottom-up methods, such as left-corner parsing without top-down filtering (Wiren, 1987).</Paragraph> <Paragraph position="5"> Algorithms that do satisfy the CPP make use of some form of top-down prediction. Top-down prediction can be implemented at parse time, as in the case of Earley's algorithm, by means of the "predictor" step, or it can be precompiled, as in the case of left-corner parsing (Rosenkrantz and Lewis, 1970), by means of the left-corner relation, or, as in the case of LR parsers (Sippu and Soisalon-Soininen, 1990), through the closure function used in the construction of LR states.</Paragraph> </Section> <Section position="5" start_page="273" end_page="274" type="metho"> <SectionTitle> 4 Recognition without precompilation </SectionTitle> <Paragraph position="0"> In this section we consider recognition algorithms that do not require off-line compilation of the input grammar. Among algorithms that satisfy the CPP, the most popular example of a recognizer that does not require grammar precompilation is perhaps Earley's algorithm (Earley, 1970). (¹ A context-free grammar G is reduced if every nonterminal of G can be part of at least one derivation that rewrites the start symbol into some string of terminal symbols.)</Paragraph> <Paragraph position="1"> We show here that methods in this family cannot be extended to work in time independent of the size of the lexicon, in contrast with bidirectional recognition algorithms.</Paragraph> <Paragraph position="2"> The result presented below rests on the following, quite obvious, assumption. There exists a constant c, depending on the underlying computation model, such that in k > 0 elementary computation steps any recognizer can only read up to c · k productions from set P. In what follows, and without any loss of generality, we assume c = 1.
Apart from this assumption, no other restriction is imposed on the representation of the input grammar or on the access to the elements of sets VN, VT and P.</Paragraph> <Paragraph position="3"> Theorem 1 Let f be any function of two variables defined on the natural numbers. No recognizer for bilexical context-free grammars that satisfies the CPP can run on input (G, w) in an amount of time bounded by f(|VD|, |w|), where VD is the set of delexicalized nonterminals of G.</Paragraph> <Paragraph position="4"> Proof. Assume the existence of a recognizer R satisfying the CPP and running in f(|VD|, |w|) steps or less. We show how to derive a contradiction.</Paragraph> <Paragraph position="5"> Let q ≥ 1 be an integer. Define a bilexical CFG Gq</Paragraph> <Paragraph position="7"> and where set Pq contains all and only the following productions: Note that there are q bridging productions in Gq. Also, note that VD = {A, T} does not depend on the choice of q. Thus, we will simply write VD. Choose q > max{f(|VD|, 2), 1}. On input (Gq, b_{q+2} b_{q+1}), R does not detect any error at position 1, that is, after having read the first symbol b_{q+2} of the input string. This is because A[b_1] ⇒* b_{q+2} γ with γ = T[b_{q+1}] T[b_q] T[b_{q-1}] ··· T[b_1] is a valid derivation in Gq. Since R executes no more than f(|VD|, 2) steps, from our assumption that reading a production takes unit time it follows that there must be an integer k, 1 ≤ k ≤ q, such that the bridging production A[b_k] → A[b_{k+1}] T[b_k] is not read from Gq. Construct then a new grammar G′q by replacing in Gq the production A[b_k] → A[b_{k+1}] T[b_k] with the new production A[b_k] → T[b_k] A[b_{k+1}], leaving everything else unchanged. It follows that, on input (G′q, b_{q+2} b_{q+1}), R behaves exactly as before and does not detect any error at position 1. But this is a contradiction, since there is no derivation in G′q of the form A[b_1] ⇒* b_{q+2} γ, γ ∈ (VN ∪ VT)*, as can be easily verified.
∎ We can use the above result in the comparison of left-to-right and bidirectional recognizers. The recognition of bilexical context-free languages can be carried out by existing bidirectional algorithms in time independent of the size of the lexicon and without any precompilation of the input bilexical grammar. For instance, the algorithms presented in (Eisner and Satta, 1999) allow recognition in time O(|VD|³ |w|⁴).² Theorem 1 states that this time bound cannot be met if we require the CPP and if the input grammar is not precompiled. In the next section, we will consider the possibility that the input grammar is in a precompiled form.</Paragraph> </Section> <Section position="6" start_page="274" end_page="276" type="metho"> <SectionTitle> 5 Recognition with precompilation </SectionTitle> <Paragraph position="0"> In this section we consider recognition algorithms that satisfy the CPP and allow off-line, polynomial-time compilation of the working grammar. We focus on a class of bilexical context-free grammars where recognition requires the stacking of a number of unresolved lexical dependencies that is proportional to the length of the input string. We provide evidence that the above class of recognizers performs much less efficiently for these grammars than existing bidirectional recognizers.</Paragraph> <Paragraph position="1"> We assume that the reader is familiar with the notions of deterministic and nondeterministic finite automata. We follow here the notation in (Hopcroft and Ullman, 1979). A nondeterministic finite automaton (FA) is a tuple M = (Q, Σ, δ, q0, F), where Q and Σ are finite, disjoint sets of state and alphabet symbols, respectively, q0 ∈ Q and F ⊆ Q are the initial state and the set of final states, respectively, and δ is a total function mapping Q × Σ to 2^Q, the power set of Q. Function δ represents the transitions of the automaton.
Given a string w = a1 ··· an, n ≥ 0, an accepting computation in M for w is a sequence q0, a1, q1, a2, q2, ..., an, qn such that qi ∈ δ(qi−1, ai) for 1 ≤ i ≤ n, and qn ∈ F. The language L(M) is the set of all strings in Σ* that admit at least one accepting computation in M. The size of M is defined as |M| = Σ_{q∈Q, a∈Σ} |δ(q, a)|. The automaton M is deterministic if, for every q ∈ Q and a ∈ Σ, we have |δ(q, a)| = 1.</Paragraph> <Paragraph position="2"> We call quasi-determinizer any algorithm A that satisfies the following two conditions: 1. A takes as input a nondeterministic FA M = (Q, Σ, δ, q0, F) and produces as output a device DM that, when given a string w as input, decides whether w ∈ L(M). (² More precisely, the running time for these algorithms is O(|VD|³ |w|³ min{|VT|, |w|}). In cases of practical interest, we always have |w| < |VT|.)</Paragraph> <Paragraph position="3"> 2. There exists a polynomial pA such that every DM runs in an amount of time bounded by pA(|w|).</Paragraph> <Paragraph position="4"> We remark that, given a nondeterministic FA M specified as above, known algorithms allow simulation of M on an input string w in time O(|M| |w|) (see for instance (Aho et al., 1974, Thm. 9.5) or (Sippu and Soisalon-Soininen, 1988, Thm. 3.38)). In contrast, a quasi-determinizer produces a device that simulates M in an amount of time independent of the size of M itself.</Paragraph> <Paragraph position="5"> A standard example of a quasi-determinizer is the so-called power-set construction, used to convert a nondeterministic FA into a language-equivalent deterministic FA (see for instance (Hopcroft and Ullman, 1979, Thm. 2.1) or (Sippu and Soisalon-Soininen, 1988, Thm. 3.30)). In fact, there exist constants c and d such that any deterministic FA can be simulated on an input string w in an amount of time bounded by c |w| + d. This requires the function δ to be stored as a |Q| × |Σ| two-dimensional array with values in Q.
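The O(|M| |w|) on-the-fly simulation mentioned above can be sketched directly: maintain the set of states reachable after each prefix, so each input symbol touches each transition at most once. The encoding below is our own illustration, not a construction from the paper:

```python
# Hedged sketch of standard NFA simulation in O(|M| * |w|).
# delta: dict mapping (state, symbol) -> set of successor states.
def nfa_accepts(delta, q0, finals, w):
    active = {q0}                      # states reachable after the prefix read so far
    for a in w:
        active = {q2 for q in active for q2 in delta.get((q, a), ())}
        if not active:                 # no run survives this prefix
            return False
    return bool(active & finals)

# NFA accepting strings over {a, b} whose second-to-last symbol is 'a'
delta = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "a"): {2},    (1, "b"): {2},
}
print(nfa_accepts(delta, 0, {2}, "bab"))   # True
print(nfa_accepts(delta, 0, {2}, "abb"))   # False
```

The per-symbol cost is bounded by the total number of transitions leaving the active states, hence the |M| factor that the quasi-determinizer is meant to eliminate.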
This is a standard representation for automata-like structures; see (Gusfield, 1997, Sect. 6.5) for discussion.</Paragraph> <Paragraph position="6"> We now pose the question of the time efficiency of a quasi-determinizer, and consider the amount of time needed in the construction of DM. In (Meyer and Fischer, 1971; Stearns and Hunt, 1981) it is shown that there exist (infinitely many) nondeterministic FAs with state set Q such that any language-equivalent deterministic FA must have at least 2^|Q| states. This means that the power-set construction cannot work in polynomial time in the size of the input FA. Despite much effort, no algorithm has been found, to the authors' knowledge, that can simulate a nondeterministic FA M on an input string w in time linear in |w| and independent of |M|, if only polynomial-time precompilation of M is allowed. Even if we relax the linear-time restriction and consider recognition of w in polynomial time, for some fixed polynomial, it seems unlikely that the problem can be solved if only polynomial-time precompilation of M is allowed. Furthermore, if we consider precompilation of nondeterministic FAs into "partially determinized" FAs that would allow recognition in polynomial (or even exponential) time in |w|, it seems unlikely that the analysis required for this precompilation could consider fewer than exponentially many combinations of states that may be active at the same time for the original nondeterministic FA.
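The power-set construction discussed above can be sketched as follows (our own encoding, for illustration only): DFA states are sets of NFA states, and on the classic "n-th symbol from the end" family of NFAs the number of such sets grows exponentially, which is the blow-up behind the conjecture:

```python
# Hedged sketch of the power-set (subset) construction: after it, each
# input symbol costs O(1) table lookups, but the DFA may have up to 2^|Q|
# states (Meyer and Fischer, 1971).
def subset_construction(delta, q0, alphabet):
    start = frozenset({q0})
    dfa, todo = {}, [start]
    while todo:
        S = todo.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in alphabet:
            T = frozenset(q2 for q in S for q2 in delta.get((q, a), ()))
            dfa[S][a] = T
            todo.append(T)
    return start, dfa

# NFA for "the 2nd symbol from the end is 'a'"; determinizing the
# n-th-from-the-end family yields on the order of 2^n DFA states.
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "a"): {2}, (1, "b"): {2}}
start, dfa = subset_construction(delta, 0, "ab")
print(len(dfa))  # 4 reachable DFA states for n = 2
```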
Finally, although more powerful formalisms have been shown to represent some regular languages much more succinctly than FAs (Meyer and Fischer, 1971), while allowing polynomial-time parsing, it seems unlikely that this could hold for regular languages in general.</Paragraph> <Paragraph position="7"> Conjecture There is no quasi-determinizer that works in polynomial time in the size of the input automaton.</Paragraph> <Paragraph position="8"> Before turning to our main result, we need to develop some additional machinery. Let M = (Q, Σ, δ, q0, F) be a nondeterministic FA and let w = a1 ··· an ∈ L(M), where n ≥ 0. Let q0, a1, q1, ..., an, qn be an accepting computation for w in M, and choose some symbol $ ∉ Σ. We can now encode the accepting computation as ($, q0)(a1, q1) ··· (an, qn), where we pair alphabet symbols with states, prepending $ to make up for the difference in the number of alphabet symbols and states. We now provide a construction that associates M with a bilexical CFG GM. Strings in L(GM) are obtained by pairing strings in L(M) with encodings of their accepting computations (see below for an example).</Paragraph> <Paragraph position="9"> Definition 1 Let M = (Q, Σ, δ, q0, F) be a nondeterministic FA. Choose two symbols $, # ∉ Σ, and let Δ = {(a, q) | a ∈ Σ ∪ {$}, q ∈ Q}. A bilexical CFG GM = (VN, VT, P, C[($, q0)]) is specified as follows: (i) VN = {T[a] | a ∈ VT} ∪ {C[α], C′[α] | α ∈ Δ}; (ii) VT = Δ ∪ Σ ∪ {#}; (iii) P contains all and only the productions (a)–(d) of the construction. We give an example of the above construction. Consider an automaton M and a string w = a1 a2 a3 such that w ∈ L(M). Let ($, q0)(a1, q1)(a2, q2)(a3, q3) be the encoding of an accepting computation in M for w. Then the string a1 a2 a3 # (a3, q3)(a2, q2)(a1, q1)($, q0) belongs to L(GM).
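The pairing of a string with the reversed encoding of its accepting computation can be made concrete with a small sketch (names are our own; this only builds the strings of L(GM), not the grammar itself):

```python
# Hedged sketch: encode an accepting computation q0, a1, q1, ..., an, qn
# as ($, q0)(a1, q1) ... (an, qn), then form the G_M string
# w # <reversed encoding>, as in the example above.
def encode_computation(symbols, states):
    """symbols = [a1, ..., an], states = [q0, ..., qn]."""
    assert len(states) == len(symbols) + 1
    return [("$", states[0])] + list(zip(symbols, states[1:]))

def gm_string(symbols, states):
    pairs = encode_computation(symbols, states)
    return list(symbols) + ["#"] + pairs[::-1]

w = ["a1", "a2", "a3"]
qs = ["q0", "q1", "q2", "q3"]
print(gm_string(w, qs))
# ['a1', 'a2', 'a3', '#', ('a3', 'q3'), ('a2', 'q2'), ('a1', 'q1'), ('$', 'q0')]
```

Note that the computation appears reversed after the # marker, which is what forces a CPP recognizer to keep all lexical dependencies unresolved until the marker is reached.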
The tree depicted in Figure 1 represents a derivation in GM of such a string.</Paragraph> <Paragraph position="10"> The following fact will be used below.</Paragraph> <Paragraph position="11"> Lemma 1 For each w ∈ Σ*, w# is a correct prefix for L(GM) if and only if w ∈ L(M).</Paragraph> <Paragraph position="12"> Outline of the proof. We claim the following fact. For each k ≥ 0, a1, a2, ..., ak ∈ Σ and q0, q1, ..., qk ∈ Q, we have qi ∈ δ(qi−1, ai) for all i (1 ≤ i ≤ k)</Paragraph> <Paragraph position="13"> if and only if C[($, q0)] ⇒* a1 ··· ak C[(ak, qk)] (ak−1, qk−1) ··· ($, q0). The claim can be proved by induction on k, using productions (a) to (c) from Definition 1.</Paragraph> <Paragraph position="14"> Let ·^R denote the reverse operator on strings. From the above claim, and using production (d) from Definition 1, one can easily show that L(GM) = {w#u | w ∈ L(M), u^R encodes an accepting computation for w}.</Paragraph> <Paragraph position="15"> The lemma directly follows from this relation. ∎ We can now provide the main result of this section. To this end, we refine the definition of recognizer presented in Section 3. A recognizer for the CFG class is an algorithm R that has random access to some data structure C(G), obtained by means of some off-line precompilation of a CFG G. On input w, which is a string over the terminal symbols of G, R decides whether w ∈ L(G). The definition of the CPP extends in the obvious way to recognizers working with precompiled grammars.</Paragraph> <Paragraph position="16"> Theorem 2 Let p be any polynomial in two variables.
If the conjecture about quasi-determinizers holds true, then no recognizer exists that</Paragraph> <Paragraph position="18"> (i) has random access to a data structure C(G) precompiled from a bilexical CFG G in polynomial time in |G|, (ii) runs in an amount of time bounded by p(|VD|, |w|), where VD is the set of delexicalized nonterminals of G and w is the input string, and (iii) satisfies the CPP.</Paragraph> <Paragraph position="19"> Proof. Assume there exists a recognizer R that satisfies conditions (i) to (iii) in the statement of the theorem. We show how this entails that the conjecture about quasi-determinizers is false. We use algorithm R to specify a quasi-determinizer A. Given a nondeterministic FA M, A goes through the following steps.</Paragraph> <Paragraph position="20"> 1. A constructs grammar GM as in Definition 1. 2. A precompiles GM as required by R, producing data structure C(GM).</Paragraph> <Paragraph position="21"> 3. A returns a device DM specified as follows. Given a string w as input, DM runs R on string w#. If R detects an error at any position i, 0 ≤ i ≤ |w#|, then DM rejects and halts; otherwise DM accepts and halts.</Paragraph> <Paragraph position="22"> From Lemma 1 we have that DM accepts w if and only if w ∈ L(M). Since R runs in time p(|VD|, |w|) and since GM has a set of delexicalized nonterminals independent of M, we have that there exists a polynomial pA such that every DM works in an amount of time bounded by pA(|w|). We therefore conclude that A is a quasi-determinizer.</Paragraph> <Paragraph position="23"> It remains to be shown that A works in polynomial time in |M|. Step 1 can be carried out in time O(|M|). The compilation at Step 2 takes polynomial time in |GM|, following our hypotheses on R, and hence polynomial time in |M|, since |GM| = O(|M|). Finally, the construction of DM at Step 3 can easily be carried out in time O(|M|) as well.
∎ In addition to Theorem 1, Theorem 2 states that, even in case the input grammar is compiled off-line and in polynomial time, we cannot perform CPP recognition for bilexical context-free grammars in time polynomial in the grammar size and the input string length but independent of the lexicon size. This is true with at least the same evidence that supports the conjecture on quasi-determinizers. Again, this should be contrasted with the time performance of existing bidirectional algorithms, which allow recognition for bilexical context-free grammars in time O(|VD|³ |w|⁴).</Paragraph> <Paragraph position="24"> In order to complete our investigation of the above problem, in Appendix A we show that, when we drop the polynomial-time restriction on the grammar precompilation, it is indeed possible to get rid of any |VT| factor from the running time of the recognizer.</Paragraph> </Section> </Paper>