<?xml version="1.0" standalone="yes"?> <Paper uid="P92-1012"> <Title>RECOGNITION OF LINEAR CONTEXT-FREE REWRITING SYSTEMS*</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> RECOGNITION OF LINEAR CONTEXT-FREE REWRITING SYSTEMS* </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> The class of linear context-free rewriting systems has been introduced as a generalization of a class of grammar formalisms known as mildly context-sensitive. The recognition problem for linear context-free rewriting languages is studied at length here, presenting evidence that, even in some restricted cases, it cannot be solved efficiently. This entails the existence of a gap between, for example, tree adjoining languages and the subclass of linear context-free rewriting languages that generalizes the former class; such a gap is attributed to &quot;crossing configurations&quot;. A few other interesting consequences of the main result, concerning the recognition problem for linear context-free rewriting languages, are also discussed.</Paragraph> </Section> <Section position="3" start_page="0" end_page="89" type="metho"> <SectionTitle> 1 INTRODUCTION </SectionTitle> <Paragraph position="0"> Since the late 1970s, there has been considerable interest within computational linguistics in rewriting systems that enlarge the generative power of context-free grammars (CFG), from both the weak and the strong perspective, while remaining far below the power of the class of context-sensitive grammars (CSG). The term mildly context-sensitive (MCS) has been proposed for the class of systems studied (see [Joshi et al., 1991] for discussion). 
The rather surprising fact that many of these systems have been shown to be weakly equivalent has led researchers to generalize the elementary operations involved in only apparently different formalisms, with the aim of capturing the underlying similarities. *I am indebted to Anuj Dawar, Shyam Kapur and Owen Rambow for technical discussion on this work. I am also grateful to Aravind Joshi for his support in this research. None of these people is responsible for any error in this work. This research was partially funded by the following grants: ARO grant DAAL 03-89-C-0031, DARPA grant N00014-90-J-1863, NSF grant IRI 90-16592 and Ben Franklin grant 91S.3078C-1.</Paragraph> <Paragraph position="1"> The most remarkable attempts in this direction are found in [Vijay-Shanker et al., 1987] and [Weir, 1988] with the introduction of linear context-free rewriting systems (LCFRS) and in [Kasami et al., 1987] and [Seki et al., 1989] with the definition of multiple context-free grammars (MCFG); both these classes have been inspired by the much more powerful class of generalized context-free grammars (GCFG; see [Pollard, 1984]). In the definition of these classes, the generalization goal has been combined with a few theoretically motivated constraints, among which is the requirement of efficient parsability; this paper is concerned with that requirement. We show that, from the perspective of efficient parsability, a gap is still found between MCS and some subclasses of LCFRS.</Paragraph> <Paragraph position="2"> More precisely, the class of LCFRS is carefully studied along two interesting dimensions, to be precisely defined in the following: a) the fan-out of the grammar and b) the production length. From previous work (see [Vijay-Shanker et al., 1987]) we know that the recognition problem for LCFRS is in P when both dimensions are bounded [footnote 1]. We complete the picture by observing NP-hardness for all three remaining cases. 
If P ≠ NP, our result reveals an undesired dissimilarity between well known formalisms like TAG, HG, LIG and others for which the recognition problem is known to be in P (see [Vijay-Shanker, 1987] and [Vijay-Shanker and Weir, 1992]) and the subclass of LCFRS that is intended to generalize these formalisms. We investigate the source of the suspected additional complexity and derive some other practical consequences from the obtained results. Footnote 1: P is the class of all languages decidable in deterministic polynomial time; NP is the class of all languages decidable in nondeterministic polynomial time.</Paragraph> </Section> <Section position="4" start_page="89" end_page="92" type="metho"> <SectionTitle> 2 TECHNICAL RESULTS </SectionTitle> <Paragraph position="0"> This section presents the two most important technical results of this paper. A full discussion of some interesting implications for recognition and parsing is deferred to Section 3. Due to the scope of the paper, proofs of Theorems 1 and 2 below are not carried out in all their details: we only present formal specifications for the studied reductions and discuss the intuitive ideas behind them.</Paragraph> <Section position="1" start_page="89" end_page="89" type="sub_section"> <SectionTitle> 2.1 PRELIMINARIES </SectionTitle> <Paragraph position="0"> Different formalisms in which rewriting is applied independently of the context have been proposed in computational linguistics for the treatment of natural language, where the definition of the elementary rewriting operation varies from system to system.</Paragraph> <Paragraph position="1"> The class of linear context-free rewriting systems (LCFRS) has been defined in [Vijay-Shanker et al., 1987] with the intention of capturing, through a generalization, common properties that are shared by all these formalisms.</Paragraph> <Paragraph position="2"> The basic idea underlying the definition of LCFRS is to impose two major restrictions on 
rewriting.</Paragraph> <Paragraph position="3"> First of all, rewriting operations are applied in the derivation of a string in a way that is independent of the context. As a second restriction, rewriting operations are generalized by means of abstract composition operations that are linear and nonerasing.</Paragraph> <Paragraph position="4"> In an LCFR system, both restrictions are realized by defining an underlying context-free grammar where each production is associated with a function that encodes a composition operation having the above properties. The following definition is essentially the same as the one proposed in [Vijay-Shanker et al., 1987].</Paragraph> <Paragraph position="5"> Definition 1 A rewriting system $G = (V_N, V_T, P, S)$ is a linear context-free rewriting system if: (i) $V_N$ is a finite set of nonterminal symbols, $V_T$ is</Paragraph> <Paragraph position="7"> a finite set of terminal symbols, $S \in V_N$ is the start symbol; every symbol $A \in V_N$ is associated with an integer $\varphi(A) > 0$, called the fan-out of $A$; (ii) $P$ is a finite set of productions of the form $A \rightarrow f(B_1, B_2, \ldots, B_r)$, $r \ge 0$, $A, B_i \in V_N$, $1 \le i \le r$, with the following restrictions: (a) $f$ is a function in $C^D$, where $D = (V_T^*)^{\phi}$, $\phi$ is the sum of the fan-out of all $B_i$'s and $C = (V_T^*)^{\varphi(A)}$; (b) $f(x_{1,1}, \ldots, x_{1,\varphi(B_1)}, \ldots, x_{r,\varphi(B_r)})$ is defined by some</Paragraph> <Paragraph position="9"> grouping into $\varphi(A)$ sequences of all and only the elements in the sequence $x_{1,1}, \ldots, x_{r,\varphi(B_r)}, a_1, \ldots, a_\alpha$, $\alpha \ge 0$, where $a_i \in V_T$, $1 \le i \le \alpha$.</Paragraph> <Paragraph position="10"> The languages generated by LCFR systems are called LCFR languages. We assume that the start symbol has fan-out one. Every LCFR system $G$ is naturally associated with an underlying context-free grammar $G_u$. The usual context-free derivation relation, written $\Rightarrow_{G_u}$, will be used in the following to denote underlying derivations in $G$. 
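As an illustration of Definition 1, the following is a minimal Python sketch of a toy LCFR system for the language of strings a^n b^n c^n d^n. The grammar itself, the function names (f_start, g_wrap, h_empty, derive) and the encoding of productions as Python triples are assumptions made for this example and are not taken from the paper. The nonterminal A has fan-out two and S has fan-out one; each function uses every argument component exactly once, so it is linear and nonerasing.

```python
from itertools import product


def f_start(a):
    # S -> f(A): concatenate the two components of A into one string.
    return (a[0] + a[1],)


def g_wrap(a):
    # A -> g(A): add terminals around each component; each variable is used once.
    return ("a" + a[0] + "b", "c" + a[1] + "d")


def h_empty():
    # A -> h(): a production with an empty right-hand side (r = 0).
    return ("", "")


# Hypothetical encoding: (left-hand side, composition function, right-hand side).
PRODUCTIONS = [
    ("S", f_start, ["A"]),
    ("A", g_wrap, ["A"]),
    ("A", h_empty, []),
]


def derive(symbol, height):
    """Return all tuples derivable from symbol by derivation trees of at most the given height."""
    if height == 0:
        return set()
    results = set()
    for lhs, func, rhs in PRODUCTIONS:
        if lhs != symbol:
            continue
        # One set of candidate tuples per right-hand side nonterminal.
        choices = [derive(b, height - 1) for b in rhs]
        for args in product(*choices):
            results.add(func(*args))
    return results
```

Calling derive("S", 4) enumerates the 1-tuples obtained from trees of height at most four; the exhaustive enumeration is only meant to make the definition concrete and is not a recognition algorithm.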
We will also use the reflexive and transitive closure of such a relation, written $\Rightarrow^*_{G_u}$. As a convention, whenever the evaluation of all functions involved in an underlying derivation starting with $A$ results in a $\varphi(A)$-tuple $w$ of terminal strings, we will say that $A$ derives $w$ and write $A \Rightarrow^*_G w$. Given a nonterminal $A \in V_N$, the language $L(A)$ is the set of all $\varphi(A)$-tuples $w$ such that $A \Rightarrow^*_G w$. The language generated by $G$, $L(G)$, is the set $L(S)$. Finally, we will call LCFRS(k) the class of all LCFRS's with fan-out bounded by $k$, $k > 0$, and r-LCFRS the class of all LCFRS's whose productions have right-hand side length bounded by $r$, $r > 0$.</Paragraph> </Section> <Section position="2" start_page="89" end_page="90" type="sub_section"> <SectionTitle> 2.2 HARDNESS FOR NP </SectionTitle> <Paragraph position="0"> The membership problem for the class of linear context-free rewriting systems is represented by means of a formal language LRM as follows. Let $G$ be a grammar in LCFRS and $w$ be a string in $V_T^*$, for some alphabet $V_T$; the pair $(G, w)$ belongs to LRM if and only if $w \in L(G)$. Set LRM naturally represents the problem of the recognition of a linear context-free rewriting language when we take into account both the grammar and the string as input variables. In the following we will also study the decision problems LRM(k) and r-LRM, defined in the obvious way. The next statement is a characterization of r-LRM.</Paragraph> <Paragraph position="1"> Theorem 1 3SAT $\le_p$ 1-LRM.</Paragraph> <Paragraph position="2"> Outline of the proof. Let $(U, C)$ be an arbitrary instance of the 3SAT problem, where $U = \{u_1, \ldots, u_p\}$ is a set of variables and $C = \{c_1, \ldots, c_n\}$ is a set of clauses; each clause in $C$ is represented by a string of length three over the alphabet of all literals, $L_U = \{u_1, \bar{u}_1, \ldots, u_p, \bar{u}_p\}$. 
The main idea in the following reduction is to use the derivations of the grammar to guess truth assignments for $U$ and to use the fan-out of the nonterminal symbols to work out the dependencies among different clauses in $C$. For every $1 \le k \le p$ let $\mathcal{A}_k = \{c_j \mid u_k \text{ is a substring of } c_j\}$ and let $\bar{\mathcal{A}}_k = \{c_j \mid \bar{u}_k \text{ is a substring of } c_j\}$; let also $w = c_1 c_2 \cdots c_n$. We define a linear context-free rewriting system $G = (V_N, C, P, S)$ such that $V_N = \{T_i, F_i \mid 1 \le i \le p + 1\} \cup \{S\}$, every nonterminal (but $S$) has fan-out $n$ and $P$ contains the following productions ($f_I$ denotes the identity function on $(C^*)^n$):</Paragraph> <Paragraph position="4"> ... $\rightarrow f_I(F_{k+1})$, where $\gamma^{(k,i)}(x_1, \ldots, x_n) = (x_1, \ldots, x_i c_i, \ldots, x_n)$; (iv) $T_{p+1} \rightarrow f_{p+1}()$, $F_{p+1} \rightarrow f_{p+1}()$, where $f_{p+1}() = (\varepsilon, \ldots, \varepsilon)$. From the definition of $G$ it directly follows that $w \in L(G)$ implies the existence of a truth assignment that satisfies $C$. The converse can be shown starting from a truth assignment that satisfies $C$ and constructing a derivation for $w$ using (finite) induction on the size of $U$. The fact that $(G, w)$ can be constructed in deterministic polynomial time is also straightforward (note that each function $f^{(j)}$ or $\gamma^{(k,j)}$ in $G$ can be specified by an integer $j$, $1 \le j \le n$). □ The next result is a characterization of LRM(k) for every $k \ge 2$.</Paragraph> <Paragraph position="5"> Theorem 2 3SAT $\le_p$ LRM(2).</Paragraph> <Paragraph position="6"> Outline of the proof. Let $(U, C)$ be a generic instance of the 3SAT problem, $U = \{u_1, \ldots, u_p\}$ and $C = \{c_1, \ldots, c_n\}$ being defined as in the proof of Theorem 1. The idea in the studied reduction is the following. We define a rather complex string $w^{(1)} w^{(2)} \cdots w^{(p)} w_c$, where $w_c$ is a representation of the set $C$ and $w^{(i)}$ controls the truth assignment for the variable $u_i$, $1 \le i \le p$. Then we construct a grammar $G$ such that $w^{(i)}$ can be derived by $G$ only in two possible ways and only by using the first string components of a set of nonterminals $N^{(i)}$ of fan-out two. 
In this way the derivation of the substring $w^{(1)} w^{(2)} \cdots w^{(p)}$ by nonterminals $N^{(1)}, \ldots, N^{(p)}$ corresponds to a guess of a truth assignment for $U$. Most importantly, the right string components of nonterminals in $N^{(i)}$ derive the symbols within $w_c$ that are compatible with the truth assignment chosen for $u_i$. In the following we specify the instance $(G, w)$ of LRM(2) that is associated to $(U, C)$ by our reduction. For every $1 \le i \le p$, let $\mathcal{A}_i = \{c_j \mid u_i \text{ is included in } c_j\}$ and $\bar{\mathcal{A}}_i = \{c_j \mid \bar{u}_i \text{ is included in } c_j\}$; let also $m_i = |\mathcal{A}_i| + |\bar{\mathcal{A}}_i|$. Let $Q = \{a_i, b_i \mid 1 \le i \le p\}$ be an alphabet of symbols not already used; for every $1 \le i \le p$, let $w^{(i)}$ denote a sequence of $m_i + 1$ alternating symbols $a_i$ and $b_i$, i.e. $w^{(i)} \in (a_i b_i)^+ \cup (a_i b_i)^* a_i$. Let $G = (V_N, Q \cup C, P, S)$; we define $V_N = \{S\} \cup \{A_j^{(i)} \mid 1 \le i \le p, 1 \le j \le m_i\}$ and $w = w^{(1)} w^{(2)} \cdots w^{(p)} c_1 c_2 \cdots c_n$. In order to specify the productions in $P$, we need to introduce further notation. We define a function $\alpha$ such that, for every $1 \le i \le p$, the clauses $c_{\alpha(i,1)}, c_{\alpha(i,2)}, \ldots, c_{\alpha(i,|\mathcal{A}_i|)}$ are all the clauses in $\mathcal{A}_i$ and the clauses $c_{\alpha(i,|\mathcal{A}_i|+1)}, \ldots, c_{\alpha(i,m_i)}$ are all the clauses in $\bar{\mathcal{A}}_i$. For every $1 \le i \le p$, let $\gamma(i, 1) = a_i b_i$ and let $\gamma(i, h) = a_i$ (resp. $b_i$) if $h$ is even (resp. odd), $2 \le h \le m_i$; let also $\bar{\gamma}(i, h) = a_i$ (resp. $b_i$) if $h$ is odd (resp. even), $1 \le h \le m_i - 1$, and let $\bar{\gamma}(i, m_i) = a_i b_i$ (resp. $b_i a_i$) if $m_i$ is odd (resp. even). Finally, let $z = \sum_{i=1}^{p} m_i$. The following productions define set $P$ (the example in Figure 1 shows the two possible ways of deriving by means of $P$ the substring $w^{(i)}$ and the corresponding part of $c_1 \cdots c_n$).</Paragraph> <Paragraph position="7"> Figure 1: ... corresponding to the choice $u_i$ = true/false. 
This forces the grammar to guess a subset of the clauses contained in $\mathcal{A}_i$/$\bar{\mathcal{A}}_i$, in such a way that all of the clauses in $C$ are derived only once if and only if there exists a truth assignment that satisfies $C$.</Paragraph> <Paragraph position="8"> where $f$ is a function of $2z$ string variables defined as</Paragraph> <Paragraph position="10"> and for every $1 \le j \le n$, $y_j$ is any sequence of all variables $y_h^{(i)}$ such that $\alpha(i, h) = j$.</Paragraph> <Paragraph position="11"> It is easy to see that $|G|$ and $|w|$ are polynomially related to $|U|$ and $|C|$. From a derivation of $w \in L(G)$, we can exhibit a truth assignment that satisfies $C$ simply by reading the derivation of the prefix string $w^{(1)} w^{(2)} \cdots w^{(p)}$. Conversely, starting from a truth assignment that satisfies $C$ we can prove $w \in L(G)$ by means of (finite) induction on $|U|$: this part requires a careful inspection of all items in the definition of $G$. □</Paragraph> </Section> <Section position="3" start_page="90" end_page="92" type="sub_section"> <SectionTitle> 2.3 COMPLETENESS FOR NP </SectionTitle> <Paragraph position="0"> The previous results entail NP-hardness for the decision problem represented by language LRM; here we are concerned with the issue of NP-completeness.</Paragraph> <Paragraph position="1"> Although in the general case membership of LRM in NP remains an open question, we discuss in the following a normal form for the class LCFRS that enforces completeness for NP (i.e. the proposed normal form does not affect the hardness result discussed above). The result entails NP-completeness for problems r-LRM ($r \ge 1$) and LRM(k) ($k \ge 2$).</Paragraph> <Paragraph position="2"> We start with some definitions. In a linear context-free rewriting system $G$, a derivation $A \Rightarrow^*_G w$ such that $w$ is a tuple of null strings is called a null derivation. A cyclic derivation has the underlying form $A \Rightarrow^*_{G_u}$ 
$\alpha A \beta$, where both $\alpha$ and $\beta$ derive tuples of empty strings and the overall effect of the evaluation of the functions involved in the derivation is a bare permutation of the string components of tuples in $L(A)$ (no recombination of components is admitted). A cyclic derivation is minimal if it is not composed of other cyclic derivations. Because of null derivations in $G$, a derivation $A \Rightarrow^*_G w$ can have length not bounded by any polynomial in $|G|$; this peculiarity is inherited from context-free languages (see for example [Sippu and Soisalon-Soininen, 1988]). The same effect on the length of a derivation can be caused by the use of cyclic subderivations: in fact there exist permutations of $k$ elements whose period is not bounded by any polynomial in $k$. Let $\mathcal{N}$ and $\mathcal{C}$ be the sets of all nonterminals that can start a null or a cyclic derivation respectively; it can be shown that both these sets can be constructed in deterministic polynomial time by using standard algorithms for the computation of graph closure.</Paragraph> <Paragraph position="3"> For every $A \in \mathcal{C}$, let $C(A)$ be the set of all permutations associated with minimal cyclic derivations starting with $A$. We define a normal form for the class LCFRS by imposing some bound on the length of minimal cyclic derivations: this does not alter the weak generative power of the formalism, the only consequence being that some canonical base is imposed for (underlying) cyclic derivations. On the basis of such a restriction, representations for sets $C(A)$ can be constructed in deterministic polynomial time, again by graph closure computation.</Paragraph> <Paragraph position="4"> Under the above assumption, we outline here a proof of LRM $\in$ NP. Given an instance $(G, w)$ of the LRM problem, a nondeterministic Turing machine $M$ can decide whether $w \in L(G)$ in time polynomial in $|(G, w)|$ as follows. 
$M$ guesses a &quot;compressed&quot; representation $\rho$ for a derivation $\rho'$ of the form $S \Rightarrow^*_G w$ such that: (i) null subderivations within $\rho'$ are represented by just one step in $\rho$, and (ii) cyclic derivations within $\rho'$ are represented in $\rho$ by just one step that is associated with a guessed permutation of the string components of the involved tuple.</Paragraph> <Paragraph position="5"> We can show that the size of $\rho$ is bounded by a polynomial in $|(G, w)|$. Furthermore, we can verify in deterministic polynomial time whether $\rho$ is a valid derivation of $w$ in $G$. The non-obvious part is verifying the permutation guessed in (ii) above. This requires a test for membership in the group generated by permutations in $C(A)$: such a problem can be solved in deterministic polynomial time (see [Furst et al., 1980]).</Paragraph> </Section> </Section> <Section position="5" start_page="92" end_page="92" type="metho"> <SectionTitle> 3 IMPLICATIONS </SectionTitle> <Paragraph position="0"> In the previous section we have presented general results regarding the membership problem for two subclasses of the class LCFRS. Here we want to discuss the interesting status of &quot;crossing dependencies&quot; within formal languages, on the basis of the above results. Furthermore, we will also derive some observations concerning the existence of highly efficient algorithms for the recognition of fan-out and production-length bounded LCFR languages, a problem which is already known to be in the class P.</Paragraph> <Section position="1" start_page="92" end_page="92" type="sub_section"> <SectionTitle> 3.1 CROSSING </SectionTitle> <Paragraph position="0"/> </Section> </Section> <Section position="6" start_page="92" end_page="93" type="metho"> <SectionTitle> CONFIGURATIONS </SectionTitle> <Paragraph position="0"> As seen in Section 2, LCFRS(2) is the class of all LCFRS of fan-out bounded by two, and the membership problem for the corresponding class of languages is NP-complete. 
Since LCFRS(1) = CFG and the membership problem for context-free languages is in P, we want to know what is added to the definition of LCFRS(2) that accounts for the difference (assuming that a difference exists between P and NP). We show in the following how a binary relation on (sub)strings derived by a grammar in LCFRS(2) is defined in a natural way and, by discussing the previous result, we will argue that the additional complexity that is perhaps found within LCFRS(2) is due to the lack of constraints on the way pairs of strings in the defined relation can be composed within these systems.</Paragraph> <Paragraph position="1"> Let $G \in$ LCFRS(2); in the general case, any nonterminal in $G$ having fan-out two derives a set of pairs of strings; these sets define a binary relation that is called here co-occurrence. Given two pairs $(w_1, w_1')$ and $(w_2, w_2')$ of strings in the co-occurrence relation, there are basically two ways of composing their string components within a rule of $G$: either by nesting (wrapping) one pair within the other, e.g. $w_1 w_2 w_2' w_1'$, or by creating a crossing configuration, e.g. $w_1 w_2 w_1' w_2'$; note how in a crossing configuration the co-occurrence dependencies between the substrings are &quot;crossed&quot;. A close inspection of the construction exhibited by Theorem 2 shows that grammars containing an unbounded number of crossing configurations can be computationally complex if no restriction is provided on the way these configurations are mutually composed. An intuitive idea of why such a lack of restriction can lead to the definition of complex systems is given in the following. In [Seki et al., 1989] a tabular method has been presented for the recognition of general LCFR languages as a generalization of the well known CYK algorithm for the recognition of CFG's (see for instance [Younger, 1967] and [Aho and Ullman, 1972]). 
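The two ways of composing co-occurring pairs described above, nesting and crossing, can be sketched in a few lines of Python. The function names and the bracket example are illustrative assumptions and are not part of the paper; the sketch only shows the order in which the four string components are concatenated.

```python
def nest(p, q):
    """Wrap pair q inside pair p, giving w1 w2 w2' w1'."""
    return p[0] + q[0] + q[1] + p[1]


def cross(p, q):
    """Interleave the two pairs so that their dependencies cross, giving w1 w2 w1' w2'."""
    return p[0] + q[0] + p[1] + q[1]


# Hypothetical co-occurring pairs: two kinds of matching brackets.
square = ("[", "]")
round_ = ("(", ")")

nested = nest(square, round_)    # brackets close in reverse order of opening
crossed = cross(square, round_)  # brackets close in the same order as they open
```

In the nested string each pair encloses a contiguous region, which is the situation handled by context-free style recognition; in the crossed string the region between the components of one pair contains only one component of the other pair.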
In the following we will apply such a general method to the recognition of LCFRS(2), with the aim of having an intuitive understanding of why it might be difficult to parse unrestricted crossing configurations. Let $w$ be an input string of length $n$. In Figure 2, the case of a production $p_1 : A \rightarrow f(B_1, B_2, \ldots, B_r)$ is depicted in which a number $r$ of crossing configurations are composed in a way that is easy to recognize; in fact the right-hand side of $p_1$ can be recognized step by step.</Paragraph> <Paragraph position="3"> Figure 2: ... a production $p_1 : A \rightarrow f(B_1, B_2, \ldots, B_r)$ where each of the right-hand side nonterminals has fan-out two.</Paragraph> <Paragraph position="4"> For a symbol $X$, assume that the sequence $X, (i_1, i_2), \ldots, (i_{q-1}, i_q)$ means that $X$ derives the substrings of $w$ that match the positions $(i_1, i_2), \ldots, (i_{q-1}, i_q)$ within $w$; assume also that $A[t]$ denotes the result of the $t$-th step in the recognition of $p_1$'s right-hand side, $1 \le t \le r$. Then each elementary step in the recognition of $p_1$ can be schematically represented as an inference rule as follows: $$\frac{A[t], (i_1, i_{t+1}), (j_1, j_{t+1}) \qquad B_{t+1}, (i_{t+1}, i_{t+2}), (j_{t+1}, j_{t+2})}{A[t+1], (i_1, i_{t+2}), (j_1, j_{t+2})} \qquad (1)$$ The computation in (1) involves six indices ranging over $\{1..n\}$; therefore in the recognition process such a step will be computed no more than $O(n^6)$ times.</Paragraph> <Paragraph position="6"> Figure 3: ... production $p_2 : A \rightarrow f(B_1, B_2, \ldots, B_r)$; every nonterminal $B_i$ has fan-out two.</Paragraph> <Paragraph position="7"> On the contrary, Figure 3 presents a production $p_2$ defined in such a way that its recognition is considerably more complex. Note that the co-occurrence of the two strings derived by $B_1$ is crossed once, the co-occurrence of the two strings derived by $B_2$ is crossed twice, and so on; in fact crossing dependencies in $p_2$ are sparse in the sense that the adjacency property found in production $p_1$ is lost. 
This forces a tabular method such as the one discussed above to keep track of the distribution of the co-occurrences recognized so far, by using an unbounded number of index pairs.</Paragraph> <Paragraph position="8"> A few among the first steps in the recognition of $p_2$'s right-hand side are as follows: $$\frac{A[2], (i_1, i_4), (i_5, i_6) \qquad B_3, (i_4, i_5), (i_8, i_9)}{A[3], (i_1, i_6), (i_8, i_9)}$$ $$\frac{A[3], (i_1, i_6), (i_8, i_9) \qquad B_4, (i_6, i_7), (i_{11}, i_{12})}{A[4], (i_1, i_7), (i_8, i_9), (i_{11}, i_{12})}$$ $$\frac{A[4], (i_1, i_7), (i_8, i_9), (i_{11}, i_{12}) \qquad B_5, (i_7, i_8), (i_{13}, i_{14})}{A[5], (i_1, i_9), (i_{11}, i_{12}), (i_{13}, i_{14})} \qquad (2)$$ From Figure 3 we can see that a different order in the recognition of $A$ by means of production $p_2$ will not improve the computation.</Paragraph> <Paragraph position="9"> Our argument about crossing configurations shows why it might be that recognition/parsing of LCFRS(2) cannot be done efficiently. If this is true, we have a gap between LCFR systems and well known mildly context-sensitive formalisms whose membership problem is known to have polynomial solutions. We conclude that, in the general case, the addition of restrictions on crossing configurations should be seriously considered for the class LCFRS. As a final remark, we derive from Theorem 2 a weak generative result. An open question about LCFRS(k) is the existence of a canonical bilinear form: to our knowledge no construction is known that, given a grammar $G \in$ LCFRS(k), returns a weakly equivalent grammar $G' \in$ 2-LCFRS(k).</Paragraph> <Paragraph position="10"> Since we know that the membership problem for 2-LCFRS(k) is in P, Theorem 2 entails that the construction under investigation cannot take polynomial time, unless P=NP. 
The reader can easily work out the details.</Paragraph> <Section position="1" start_page="93" end_page="93" type="sub_section"> <SectionTitle> 3.2 RECOGNITION OF r-LCFRS(k) </SectionTitle> <Paragraph position="0"> Recall from Section 2 that the class r-LCFRS(k) is defined by simultaneously imposing on the class LCFRS the bounds $k$ and $r$ on the fan-out and on the length of a production's right-hand side respectively.</Paragraph> <Paragraph position="1"> These classes have been discussed in [Vijay-Shanker et al., 1987], where the membership problem for the corresponding languages has been shown to be in P, for every fixed $k$ and $r$. By introducing the notion of degree of a grammar in LCFRS, actual polynomial upper-bounds have been derived in [Seki et al., 1989]: this work entails the existence of an integer function $u(r, k)$ such that the membership problem for r-LCFRS(k) can be solved in (deterministic) time $O(|G| \, |w|^{u(r,k)})$. Since we know that the membership problems for r-LCFRS and LCFRS(k) are NP-hard, the fact that $u(r, k)$ is a (strictly increasing) non-asymptotic function is quite expected.</Paragraph> <Paragraph position="2"> With the aim of finding efficient parsing algorithms, in the following we want to know to what extent the polynomial upper-bounds mentioned above can be improved. Let us consider for the moment the class 2-LCFRS(k); if we restrict ourselves to the normal form discussed in Section 2.3, we know that the recognition problem for this class is NP-complete. Assume that we have found an optimal recognizer for this class that runs in worst case time $l(G, w, k)$; therefore function $l$ determines the best lower-bound for our problem. Two cases then arise. In the first case, $l$ is not bounded by any polynomial $p$ in $|G|$ and $|w|$: we can easily derive that P ≠ NP. In fact, if the converse is true, then there exists a Turing machine $M$ that is able to recognize 2-LCFRS in deterministic time $|(G, w)|^q$, for some $q$. 
For every $k > 0$, construct a Turing machine $M^{(k)}$ in the following way. Given $(G, w)$ as input, $M^{(k)}$ tests whether $G \in$ 2-LCFRS(k) (which is trivial); if the test fails, $M^{(k)}$ rejects, otherwise it simulates $M$ on input $(G, w)$. We see that $M^{(k)}$ is a recognizer for the class 2-LCFRS(k) that runs in deterministic time $|(G, w)|^q$. Now select $k$ such that, for a worst case input $w \in \Sigma^*$ and $G \in$ 2-LCFRS(k), we have $l(G, w, k) > |(G, w)|^q$: we have a contradiction, because $M^{(k)}$ will be a recognizer for 2-LCFRS(k) that runs in less than the lower-bound claimed for this class. In the second case, on the other hand, we have that $l$ is bounded by some polynomial $p$ in $|G|$ and $|w|$; a similar argument applies, exhibiting a proof that P=NP.</Paragraph> <Paragraph position="3"> From the previous argument we see that finding the &quot;best&quot; recognizer for 2-LCFRS(k) is as difficult as solving the P vs. NP question, an extremely difficult problem. The argument applies as well to r-LCFRS(k) in general; we then have evidence that considerable improvement of the known recognition techniques for r-LCFRS(k) can be a very difficult task.</Paragraph> </Section> </Section> </Paper>