<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1302"> <Title>Context-Free Parsing through Regular Approximation</Title> <Section position="4" start_page="13" end_page="18" type="metho"> <SectionTitle> 3 The Structure of Parse Trees </SectionTitle> <Paragraph position="0"> We define a spine in a parse tree to be a path that runs from the root down to some leaf. Our main interest in spines lies in the sequences of grammar symbols at nodes bordering on spines.</Paragraph> <Paragraph position="1"> A simple example is the set of parse trees such as the one in Figure 1 (a), for a 3-line grammar of palindromes. It is intuitively clear that the language is not regular: the grammar symbols to the left of the spine from the root to ε &quot;communicate&quot; with those to the right of the spine. More precisely, the prefix of the input up to the point where it meets the final node of the spine determines the suffix after that point, in such a way that an unbounded number of symbols from the prefix needs to be taken into account.</Paragraph> <Paragraph position="2"> A formal explanation for why the grammar may not generate a regular language relies on the following definition [4]: Definition 1 A grammar is self-embedding if there is some A ∈ N such that A →* αAβ, for some α ≠ ε and β ≠ ε.</Paragraph> <Paragraph position="3"> In order to avoid the somewhat unfortunate term nonselfembedding (or noncenter-embedding [11]) we define a strongly regular grammar to be a grammar that is not self-embedding. Strong regularity informally means that when a section of a spine in a parse tree repeats itself, then either no grammar symbols occur to the left of that section of the spine, or no grammar symbols occur to the right. This prevents the &quot;unbounded communication&quot; between the two sides of the spine exemplified by the palindrome grammar.</Paragraph> <Paragraph position="4"> We now prove that strongly regular grammars generate regular languages. For an arbitrary grammar, we define the set of recursive nonterminals as N̄ = {A ∈ N | ∃α, β [A →⁺ αAβ]}. We determine the partition 𝒩 of N̄ consisting of subsets N1, N2, ..., Nk, for some k ≥ 0, of mutually recursive nonterminals:</Paragraph> <Paragraph position="6"> N1 ∪ N2 ∪ ... ∪ Nk = N̄; ∀i [Ni ≠ ∅] and ∀i, j [i ≠ j ⇒ Ni ∩ Nj = ∅]; ∃i [A ∈ Ni ∧ B ∈ Ni] ⇔ ∃α1, β1, α2, β2 [A →* α1Bβ1 ∧ B →* α2Aβ2], for all A, B ∈ N̄. We now define the function recursive from 𝒩 to the set {left, right, self, cyclic}: recursive(Ni) = left if ¬∃(A → αBβ) ∈ P [A, B ∈ Ni ∧ α ≠ ε] and ∃(A → αBβ) ∈ P [A, B ∈ Ni ∧ β ≠ ε]; recursive(Ni) = right if ∃(A → αBβ) ∈ P [A, B ∈ Ni ∧ α ≠ ε] and ¬∃(A → αBβ) ∈ P [A, B ∈ Ni ∧ β ≠ ε]; recursive(Ni) = self if both such rules exist; recursive(Ni) = cyclic if neither exists.</Paragraph> <Paragraph position="8"> When recursive(Ni) = left, Ni consists of only left-recursive nonterminals, which does not mean it cannot also contain right-recursive nonterminals, but in that case right recursion amounts to application of unit rules. When recursive(Ni) = cyclic, it is only such unit rules that take part in the recursion.</Paragraph> <Paragraph position="9"> That recursive(Ni) = self, for some i, is a sufficient and necessary condition for the grammar to be self-embedding. Therefore, we have to prove that if recursive(Ni) ∈ {left, right, cyclic}, for all i, then the grammar generates a regular language.
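As a concrete illustration of these definitions, the following minimal Python sketch computes the partition of mutually recursive nonterminals and the classification just defined. The grammar representation (a dictionary from nonterminals to lists of right-hand sides) and all identifiers are assumptions made for this example; they are not part of the paper.

# Illustrative sketch (not from the paper): classify recursion in a CFG.
# A grammar is a dict: nonterminal -> list of right-hand sides (tuples of symbols).
# Symbols that appear as keys are nonterminals; all other symbols are terminals.

def reachable(grammar):
    """For each nonterminal A, the set of nonterminals B with A =>+ alpha B beta."""
    edges = {A: {X for rhs in rhss for X in rhs if X in grammar}
             for A, rhss in grammar.items()}
    reach = {A: set(edges[A]) for A in grammar}
    changed = True
    while changed:                      # transitive closure by iteration
        changed = False
        for A in grammar:
            new = set(reach[A])
            for B in reach[A]:
                new |= reach[B]
            if new != reach[A]:
                reach[A], changed = new, True
    return reach

def partition(grammar):
    """The sets N_i of mutually recursive nonterminals."""
    reach = reachable(grammar)
    recursive_nts = {A for A in grammar if A in reach[A]}
    sets, seen = [], set()
    for A in recursive_nts:
        if A in seen:
            continue
        Ni = {B for B in recursive_nts if B in reach[A] and A in reach[B]} | {A}
        sets.append(Ni)
        seen |= Ni
    return sets

def classify(grammar, Ni):
    """recursive(N_i) in {'left', 'right', 'self', 'cyclic'}."""
    left_gen = right_gen = False        # material to the left/right of a recursive member
    for A, rhss in grammar.items():
        if A not in Ni:
            continue
        for rhs in rhss:
            for k, X in enumerate(rhs):
                if X in Ni:
                    left_gen |= k > 0
                    right_gen |= k < len(rhs) - 1
    if left_gen and right_gen:
        return 'self'
    if left_gen:
        return 'right'
    if right_gen:
        return 'left'
    return 'cyclic'

# The palindrome grammar S -> epsilon | a S a | b S b is self-embedding:
palindromes = {'S': [(), ('a', 'S', 'a'), ('b', 'S', 'b')]}
print([classify(palindromes, Ni) for Ni in partition(palindromes)])  # ['self']

Run on the palindrome grammar, the sketch classifies its single recursive set as self, in line with the observation that this grammar is self-embedding.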
Our proof differs from an existing proof [3] in that it is fully constructive: Figure 2 presents an algorithm for creating a finite transducer that recognizes as input all strings from the language generated by the grammar, and produces output strings of a form to be discussed shortly.</Paragraph> <Paragraph position="10"> The process is initiated at the start symbol, and from there the process descends the grammar in all ways until terminals are encountered. Descending the grammar is straightforward in the case of rules of which the left-hand side is not a recursive nonterminal: the subautomata found recursively for members in the right-hand side will be connected. In the case of recursive nonterminals, the process depends on whether the nonterminals in the corresponding set from 𝒩 are mutually left-recursive or right-recursive; if they are both, which means they are cyclic, then either subprocess can be applied; in the code in Figure 2 cyclic and left-recursive subsets Ni are treated uniformly.</Paragraph> <Paragraph position="11"> We discuss the case that the nonterminals are left-recursive or cyclic. One new state is created for each nonterminal in the set. The transitions that are created for terminals and nonterminals not in Ni are connected in a way that is reminiscent of the construction of left-corner parsers [lr].</Paragraph> <Paragraph position="12"> The output of the transducer consists of a list of filter items interspersed with input symbols. A filter item is a rule with a distinguished position in the right-hand side, indicated by a diamond. The part to the left of the diamond generates a part of the input just to the left of the current input position. The part to the right of the diamond potentially generates a subsequent part of the input. A string consisting of filter items and input symbols can be seen as a representation of a parse, different from some existing representations [11, 9, 12].</Paragraph> <Paragraph position="13"> At this point we use only initial filter items, from the set I_init.</Paragraph> <Paragraph position="15"> For every rule there is exactly one initial filter item. The diamond holds the rightmost position, unless we are dealing with a right-recursive rule.</Paragraph> <Paragraph position="16"> An example is given in Figure 3. Four states have been labelled according to the names they are given in procedure make_fst. There are two states that are labelled qB. This can be explained by the fact that nonterminal B can be reached by descending the grammar from S in two essentially distinct ways.</Paragraph> <Paragraph position="18"> procedure make_fst(q0, α, q1): if α = ε then let Δ = Δ ∪ {(q0, ε|ε, q1)} elseif α = a, some a ∈ Σ, then let Δ = Δ ∪ {(q0, a|a, q1)} elseif α = Xβ, some X ∈ V, β ∈ V* such that |β| > 0, then let q = fresh_state; make_fst(q0, X, q); make_fst(q, β, q1) else let A = α; (* α must consist of a single nonterminal *) if A ∈ Ni, some i, then for each B ∈ Ni do let qB = fresh_state end; if recursive(Ni) = right then for each (B → X1 ... Xm) ∈ P such that B ∈ Ni ∧ X1, ..., Xm ∉ Ni</Paragraph> <Paragraph position="20"> else for each (A → β) ∈ P (* A is not recursive *) do let q = fresh_state; make_fst(q0, β, q); let Δ = Δ ∪ {(q, ε|(A → β ◇), q1)} end end end end.</Paragraph> <Paragraph position="21"> procedure fresh_state(): create some fresh object q; let K = K ∪ {q}; return q end.</Paragraph> <Paragraph position="22"> Figure 2. 
Transformation from a strongly regular grammar G = (Σ, N, P, S) to a finite transducer T = (K, Σ, Σ ∪ I_init, Δ, s, F).
4 Tabular Simulation of Finite Transducers
After a finite transducer has been obtained, it may sometimes be turned into a deterministic transducer [13]. However, this is not always possible since not all regular transductions can be</Paragraph> <Paragraph position="24"> described by means of deterministic finite transducers. In this case, input can be processed by simulating a nondeterministic transducer in a tabular way.</Paragraph> <Paragraph position="25"> Assume we have a finite transducer T = (K, Σ1, Σ2, Δ, s, F) and an input string a1 ... an. We create two tables. The first table K′ contains entries of the form (i, q1), where 0 ≤ i ≤ n and q1 ∈ K. Such an entry indicates that the transducer may be in state q1 after reading input from position 0 up to i. The second table Δ′ contains entries of the form ((i, q1), v, (j, q2)), where v ∈ Σ2*. Such an entry indicates that furthermore the transducer may go from state q1 to q2 in a single step, by reading the input from position i to position j while producing v as output.</Paragraph> <Paragraph position="26"> The preferred way of looking at these two tables is as a set of states and a set of transitions of a finite automaton ℱ = (K′, Σ2, Δ′, (0, s), F′), where F′ is a subset of {n} × F.</Paragraph> <Paragraph position="27"> Initially K′ = {(0, s)} and Δ′ = ∅. Then the following is repeated until no more new elements can be added to K′ or Δ′: 1. We choose a state (i, q1) ∈ K′ and a transition (q1, ai+1 ... aj | v, q2) ∈ Δ.</Paragraph> <Paragraph position="28"> 2. We add a state (j, q2) to K′ and a transition ((i, q1), v, (j, q2)) to Δ′ if not already present. We then define F′ = K′ ∩ ({n} × F). The input a1 ... an is recognized by T when F′ is nonempty. The language accepted by ℱ is the set of output strings that a1 ... an is associated with by T [1].</Paragraph> <Paragraph position="29"> Before continuing with the next phases of processing, as presented in the following sections, we may first reduce the automaton, i.e. we may remove the transitions that do not contribute to any paths from (0, s) to a state in F′. For simplifying the discussion in the next section, we further assume that ℱ is transformed such that all transitions (q, v, q′) ∈ Δ′ satisfy |v| = 1. For the running example of Figure 3 we may then obtain the finite automaton indicated by the thick lines in Figure 4. (Two ε-transitions were implicitly eliminated and the automaton has been reduced.) The time demand of the construction of ℱ from T and a1 ... an is linear, measured both in n and in the size of T. Note that in general the language accepted by ℱ may be infinite in case the grammar is cyclic.</Paragraph>
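To make the tabular simulation concrete, here is a minimal Python sketch of the construction of K′ and Δ′ just described. The transducer representation (transitions as tuples of source state, input string, output string, target state), the toy transducer at the end, and all names are assumptions made for this illustration; they are not the paper's implementation.

# Illustrative sketch (not from the paper): tabular simulation of a
# nondeterministic finite transducer T on input a1...an, yielding the
# automaton F' = (K', Sigma2, Delta', (0, s), F') described above.

def simulate(transitions, start, finals, word):
    """transitions: iterable of (q1, in_str, out_str, q2); word: the input string.
    Returns (K_prime, Delta_prime, F_prime)."""
    n = len(word)
    K_prime = {(0, start)}
    Delta_prime = set()
    agenda = [(0, start)]
    while agenda:
        i, q1 = agenda.pop()
        for (p, in_str, out_str, q2) in transitions:
            if p != q1:
                continue
            j = i + len(in_str)
            # a transition applies if its input matches the next input symbols
            if j <= n and word[i:j] == in_str:
                if (j, q2) not in K_prime:
                    K_prime.add((j, q2))
                    agenda.append((j, q2))
                Delta_prime.add(((i, q1), out_str, (j, q2)))
    F_prime = {(n, q) for q in finals if (n, q) in K_prime}
    return K_prime, Delta_prime, F_prime

# A toy transducer that copies a's and b's and appends a marker at the end
# (purely for illustration; it is not the transducer of Figures 2 and 3):
toy = [('s', 'a', 'a', 's'), ('s', 'b', 'b', 's'), ('s', '', '#', 'f')]
K1, D1, F1 = simulate(toy, 's', {'f'}, 'ab')
print(sorted(F1))   # [(2, 'f')]  -- the input is recognized

The two returned sets correspond directly to the tables K′ and Δ′, and the worklist ensures that each state (i, q) is expanded only once, which keeps the construction linear in n and in the size of the transducer.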
</Section> <Section position="5" start_page="18" end_page="19" type="metho"> <SectionTitle> 5 Retrieving a Parse Forest </SectionTitle> <Paragraph position="0"> Using the compact representation of all possible output strings discussed above, we can obtain the structure of the input according to the context-free grammar; by &quot;structure&quot; of the input we mean the collection of all parse trees. Again, we use a tabular representation, called a parse forest [6, 10, 2].</Paragraph> <Paragraph position="1"> Our particular kind of parse forest is a table U consisting of dotted items of the form [q, A → α • β, q′], where q and q′ are states from K′ and A → αβ is a rule. The dot indicates how far recognition of the right-hand side has progressed. To be more precise, the meaning of the above dotted item is that the input symbols on a path from q to q′ can be derived from β.</Paragraph> <Paragraph position="2"> Note that recognition of right-hand sides is done from right to left, i.e. in reversed order with respect to Earley's algorithm [6].</Paragraph> <Paragraph position="3"> For a certain instance of a rule, the initial position of the dot is given by the position of the diamond in the corresponding filter item.</Paragraph> <Paragraph position="4"> There are several ways to construct U. For presentational reasons our algorithm will be relatively simple, in the style of the CYK algorithm [8]: 1. Initially U is empty.</Paragraph> <Paragraph position="5"> 2. We perform one of the following until no more new elements can be added to U: (a) We choose a transition (q, A → α ◇, q′) ∈ Δ′ and add an item [q, A → α •, q′] to U. (b) We choose a transition (q, A → α ◇ B, q′) ∈ Δ′ and an item [q′, B → • γ, q″] ∈ U and add an item [q, A → α • B, q″] to U.</Paragraph> <Paragraph position="6"> (c) We choose a transition (q, a, q′) ∈ Δ′ and an item [q′, A → αa • β, q″] ∈ U and add an item [q, A → α • aβ, q″] to U.</Paragraph> <Paragraph position="7"> (d) We choose a pair of items [q, B → • γ, q′], [q′, A → αB • β, q″] ∈ U and add an item [q, A → α • Bβ, q″] to U.</Paragraph> <Paragraph position="9"> Assume the grammar is G = (Σ, N, P, S). The following is to be performed for each set Ni ∈ 𝒩 such that recursive(Ni) = self.</Paragraph> <Paragraph position="10"> 1. Add new nonterminals to N, for all A, B ∈ Ni. 2. Add new rules to P, for all A, B, C, D, E ∈ Ni: rules that rewrite A to a new nonterminal, rules that rewrite a new nonterminal to ε, and, for each old rule of one of the forms (C → Y1 ... Ym), (D → αC Y1 ... Ym Eβ), (A → Y1 ... Ym Cβ) and (A → αC Y1 ... Ym) in P, with Y1, ..., Ym ∉ Ni, a rule that generates the segment Y1 ... Ym flanked by the appropriate new nonterminals. 3. Remove from P the old rules of the form A → α, where A ∈ Ni. 4. Reduce the grammar.</Paragraph> <Paragraph position="11"> The items produced for the running example are represented as the thin lines in Figure 4.</Paragraph> </Section>
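As an illustration of steps (a)-(d) above, the following Python sketch computes the table U by a naive fixpoint iteration over the transitions of ℱ. The concrete encodings of transitions, filter items and dotted items are assumptions made for this example, not the paper's data structures.

# Illustrative sketch (not from the paper): steps (a)-(d) of Section 5 as a
# naive fixpoint computation.  A transition of F' is (q, label, q2), where
# label is ('sym', a) for an input symbol, or ('item', A, rhs, d) for a filter
# item A -> rhs with the diamond before position d (d == len(rhs) means the
# diamond is at the right end).  A dotted item is (q, A, rhs, dot, q2).

def build_forest(delta_prime):
    U = set()
    changed = True
    while changed:
        changed = False
        new = set()
        for (q, label, q2) in delta_prime:
            if label[0] == 'item':
                _, A, rhs, d = label
                if d == len(rhs):                        # (a): A -> alpha <>
                    new.add((q, A, rhs, d, q2))
                elif d == len(rhs) - 1:                  # (b): A -> alpha <> B
                    B = rhs[-1]                          # B is assumed to be a nonterminal
                    for (p, C, rhs2, dot, r) in U:
                        if p == q2 and C == B and dot == 0:
                            new.add((q, A, rhs, d, r))
            else:                                        # (c): move the dot left over a terminal
                a = label[1]
                for (p, A, rhs, dot, r) in U:
                    if p == q2 and dot > 0 and rhs[dot - 1] == a:
                        new.add((q, A, rhs, dot - 1, r))
        for (q, B, rhs_b, dot_b, q1) in list(U):         # (d): move the dot left over a
            if dot_b != 0:                               #      completely recognized B
                continue
            for (p, A, rhs, dot, r) in U:
                if p == q1 and dot > 0 and rhs[dot - 1] == B:
                    new.add((q, A, rhs, dot - 1, r))
        if not new <= U:
            U |= new
            changed = True
    return U

Since the numbers of states, rules and dot positions are finite, the iteration reaches a fixpoint; as in the text above, right-hand sides are recognized from right to left, so an item whose dot has reached the left end is completely recognized.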
<Section position="6" start_page="19" end_page="19" type="metho"> <SectionTitle> 6 Approximating a Context-Free Language </SectionTitle> <Paragraph position="0"> Section 3 presented a sufficient condition for the generated language to be regular, and explained when this condition is violated. This suggests how to change an arbitrary grammar so that it will come to satisfy the condition.</Paragraph> <Paragraph position="1"> The intuition is that the &quot;unbounded communication&quot; between the left and right sides of spines is broken. This is done by a transformation that operates separately on each set Ni such that recursive(Ni) = self, as indicated in Figure 5. After this, the grammar will be strongly regular.</Paragraph> <Paragraph position="2"> Consider the grammar of palindromes in the left half of Figure 1. The approximation algorithm leads to the grammar in the right half. Figure 1 (b) shows the effect on the structure of parse trees. Note that the left sides of former spines are treated by one of the two new nonterminals derived from S and the right sides by the other.</Paragraph> <Paragraph position="3"> This example deals with the special case that each nonterminal can lead to at most one recursive call of itself. The general case is more complicated and is treated elsewhere [15].</Paragraph> </Section> <Section position="7" start_page="19" end_page="22" type="metho"> <SectionTitle> 7 Obtaining Correct Parse Trees </SectionTitle> <Paragraph position="0"> In Section 5 we discussed how the table resulting from simulating the transducer should be interpreted in order to obtain a parse forest. However, we assumed then that the transducer had been constructed from a grammar that was strongly regular. In case the original grammar is not strongly regular we have to approach this task in a different way.</Paragraph> <Paragraph position="1"> One possibility is to first apply the grammar transformation from the previous section and subsequently perform the 2-phase process as before. However, this approach results in a parse forest that reflects the structure of the transformed grammar rather than that of the original grammar.</Paragraph> <Paragraph position="2"> The second and preferred approach is to incorporate the grammar transformation into the construction of the transducer. The accepted language is then the same as in the case of the first approach, but the symbols that occur in the output carry information about the rules from the original grammar.</Paragraph> <Paragraph position="3"> How the construction of the finite transducer from Figure 2 needs to be changed is indicated in Figure 6. We only show the part of the code which deals with the case that α consists of a single nonterminal.</Paragraph> <Paragraph position="4"> For nonterminals which are not in a set Ni such that recursive(Ni) = self, the same treatment as before is applied. Upon encountering a nonterminal B ∈ Ni such that recursive(Ni) = self, we consider the structure of the grammar if it is transformed according to Figure 5. This transformation creates new sets of recursive nonterminals, which have to be treated according to Figure 2 depending on whether they may be left-recursive or right-recursive.</Paragraph> <Paragraph position="5"> For example, given a fixed nonterminal B ∈ Ni, for some i such that recursive(Ni) = self, the new nonterminals derived from the pairs (A, B), for any A ∈ Ni, together form a set M in the transformed grammar for which recursive(M) = right. We may therefore construct the transducer as dictated by Figure 2 for this case. In particular, this relates to the rules of the transformed grammar in which a nonterminal of M appears at the right end of the right-hand side.</Paragraph> <Paragraph position="6"> Note that a nonterminal derived from a pair (A, C) does not belong to M but to another set, say M1, which in the transformed grammar satisfies recursive(M1) = right (or recursive(M1) = cyclic). Similarly, a nonterminal derived from a pair (C, A) belongs to a set, say M2, which satisfies recursive(M2) = left (or recursive(M2) = cyclic). Treatment of these nonterminals occurs in a deeper level of recursion of make_fst, and appears as separate cases in Figure 6.</Paragraph> <Paragraph position="7"> It is important to remember that the sets Ni in Figure 6 always refer to the nature of recursion in the original grammar; the transformed grammar is merely implicit in the given construction of the transducer, and helps us to understand the construction in terms of Figure 2. 
In addition to I_init, filter items from the following set are used: I_mid = {B → α ◇ Cβ | (B → αCβ) ∈ P ∧ ∃i [recursive(Ni) = self ∧ B, C ∈ Ni]}. The meaning of the diamond is largely unchanged with regard to Section 3. For example, for the rule D → αC Y1 ... Ym Eβ, which corresponds to a rule of the transformed grammar, the filter item D → αC Y1 ... Ym ◇ Eβ is output, which indicates that an instance of Y1 ... Ym (or an approximation thereof) has just been read, which is potentially preceded by an instance of αC and followed by an instance of Eβ. On the other hand, upon encountering a unit rule that is merely an artifact of the grammar transformation, no output symbol is generated.</Paragraph> <Paragraph position="8"> For retrieving the forest from ℱ we need to take into account the additional form of filter item. Now the following steps are required: (a) We choose (q, A → α ◇, q′) ∈ Δ′ and add [q, A → α •, q′] to U.</Paragraph> <Paragraph position="9"> (b) We choose (q, A → α ◇ B, q′) ∈ Δ′, such that (A → α ◇ B) ∈ I_init, and [q′, B → • γ, q″] ∈ U and add [q, A → α • B, q″] to U.</Paragraph> <Paragraph position="11"> else (* α must consist of a single nonterminal *) if α is of the form A ∈ Ni, some i, and recursive(Ni) ∈ {right, left, cyclic} then ...treatment as in Figure 2... elseif α is of the form B ∈ Ni, some i, and recursive(Ni) = self then (* we implicitly replace B by the corresponding new nonterminal of the transformed grammar *) a fresh state is created for each of the new nonterminals derived from the members of Ni; for each A ∈ Ni and (C → Y1 ... Ym) ∈ P such that C ∈ Ni ∧ Y1, ..., Ym ∉ Ni, make_fst builds a path deriving Y1 ... Ym from the appropriate fresh state and a transition labelled ε|(C → Y1 ... Ym ◇) is added to Δ; for each A ∈ Ni and (D → αC Y1 ... Ym Eβ) ∈ P such that C, D, E ∈ Ni ∧ Y1, ..., Ym ∉ Ni, make_fst builds a path deriving Y1 ... Ym and a transition labelled ε|(D → αC Y1 ... Ym ◇ Eβ) is added to Δ; unit rules that are artifacts of the transformation are traversed by make_fst without producing output, and an ε-transition connects q0 to the appropriate fresh state. The remaining cases of Figure 6, for new nonterminals that belong to the left-recursive and right-recursive sets discussed above, are treated analogously; there, for each (A → Y1 ... Ym Cβ) ∈ P such that A, C ∈ Ni ∧ Y1, ..., Ym ∉ Ni, the filter item (A → Y1 ... Ym ◇ Cβ) is output, and for each (A → αC Y1 ... Ym) ∈ P such that A, C ∈ Ni ∧ Y1, ..., Ym ∉ Ni, a path deriving Y1 ... Ym is built.</Paragraph> <Paragraph position="14"> (c) We choose (q, a, q′) ∈ Δ′ and [q′, A → αa • β, q″] ∈ U and add [q, A → α • aβ, q″] to U.</Paragraph> <Paragraph position="15"> (d) We choose [q, B → • γ, q′], [q′, A → αB • β, q″] ∈ U and: - if (A → α ◇ Bβ) ∈ I_mid, then add [q‴, A → α • Bβ, q″] to U for each (q‴, A → α ◇ Bβ, q) ∈ Δ′, and - otherwise, add [q, A → α • Bβ, q″] to U.</Paragraph> </Section> <Section position="8" start_page="22" end_page="23" type="metho"> <SectionTitle> 8 Empirical Results </SectionTitle> <Paragraph position="0"> The implementation was completed recently. 
Initial experiments allow some tentative conclusions, reported here.</Paragraph> <Paragraph position="1"> We have compared the 2-phase algorithm to a traditional tabular context-free parsing algorithm. In order to allow a fair comparison, we have taken a mixed parsing strategy that applies a set of dotted items comparable to that of Section 7. Assuming the input is given by a1 ... an as before, the steps are given by: (a) We choose i, such that 0 ≤ i ≤ n, and (A → α ◇) ∈ I_init and add [i, A → α •, i] to U. (b) We choose [i, B → • γ, j] ∈ U and (A → α ◇ B) ∈ I_init, and add [i, A → α • B, j] to U. (c) We choose [i + 1, A → αai+1 • β, j] ∈ U and add [i, A → α • ai+1β, j] to U. (d) We choose [i, B → • γ, j], [j, A → αB • β, k] ∈ U and add [i, A → α • Bβ, k] to U. For the experiments we have taken a grammar for German, generated automatically through EBL, of which a considerable part contains self-embedding. The transducer was determinized and minimized as if it were a finite automaton, i.e. in a transition (q, v|w, q′) the pair v|w is treated as one symbol, and the pair ε|ε is treated as the empty string. The test sentences were obtained using a random generator [14].</Paragraph> <Paragraph position="2"> For a given input sentence, we define T1 and T2 to be the number of steps that are performed for the respective phases of the 2-phase algorithm: first, the creation of ℱ from the input a1 ... an, and second, the creation of U from ℱ. We define Tcf to be the number of steps that are performed for the direct construction of table U from a1 ... an by the above tabular algorithm.</Paragraph> <Paragraph position="3"> Concerning the two processes with context-free power, viz. Tcf and T2, we have observed that in the majority of cases there is a reduction in the number of steps from Tcf to T2. This can be a reduction from several hundreds of steps to less than 10. In individual cases however, especially for long sentences, T2 can be larger than Tcf. This can be explained by the fact that ℱ may have many more states than the input sentence has positions, which leads to less sharing of computation.</Paragraph> <Paragraph position="4"> Adding T1 and T2 in many cases leads to higher numbers of steps than Tcf. At this stage we cannot say whether this implies that the 2-phase idea is not useful. Many refinements, especially concerning the reduction of the number of states of ℱ in order to enhance sharing of computation, have as yet not been explored.</Paragraph> <Paragraph position="5"> In this context, we observe that the size of the repertoire of filter items has conflicting consequences for the overall complexity. If T outputs no filter items, then it reduces to a recognizer, which can be determinized. Consequently, T1 will be equal to the sentence length, but T2 will be no less than (and in fact identical to) Tcf. If on the other hand T outputs many types of filter item, then determinization and minimization is more difficult and consequently ℱ may be large and both T1 and T2 may be high.</Paragraph> </Section> <Section position="9" start_page="23" end_page="23" type="metho"> <SectionTitle> Acknowledgements </SectionTitle> <Paragraph position="0"> Parts of this research were carried out within the framework of the Priority Programme Language and Speech Technology (TST), while the author was employed at the University of Groningen. The TST-Programme is sponsored by NWO (Dutch Organization for Scientific Research). 
This work was further funded by the German Federal Ministry of Education, Science, Research and Technology (BMBF) in the framework of the VERBMOBIL Project under Grant 01 IV 701 V0. The responsibility for the contents lies with the author.</Paragraph> </Section> </Paper>