<?xml version="1.0" standalone="yes"?> <Paper uid="J00-1003"> <Title>Practical Experiments with Regular Approximation of Context-Free Languages</Title> <Section position="4" start_page="22" end_page="32" type="intro"> <SectionTitle> 4. Methods of Regular Approximation </SectionTitle> <Paragraph position="0"> This section describes a number of methods for approximating a context-free grammar by means of a finite automaton. Some published methods did not mention self-embedding explicitly as the source of nonregularity for the language, and suggested that approximations should be applied globally to the complete grammar. Where this is the case, we adapt the method so that it is more selective and deals with self-embedding locally.</Paragraph> <Paragraph position="1"> The approximations are integrated into the construction of the finite automaton from the grammar, which was described in the previous section. A separate incarnation of the approximation process is activated upon finding a nonterminal A such that A ∈ Ni and recursive(Ni) = self, for some i. This incarnation then only pertains to the set of rules of the form B → α, where B ∈ Ni. In other words, nonterminals not in Ni are treated by this incarnation of the approximation process as if they were terminals.</Paragraph> <Section position="1" start_page="22" end_page="24" type="sub_section"> <SectionTitle> 4.1 Superset Approximation Based on RTNs </SectionTitle> <Paragraph position="0"> The following approximation was proposed in Nederhof (1997). The presentation here, however, differs substantially from the earlier publication, which treated the approximation process entirely on the level of context-free grammars: a self-embedding grammar was transformed in such a way that it was no longer self-embedding.
A finite automaton was then obtained from the grammar by the algorithm discussed above.</Paragraph> <Paragraph position="1"> The presentation here is based on recursive transition networks (RTNs) (Woods 1970). We can see a context-free grammar as an RTN as follows: We introduce two states qA and q′A for each nonterminal A, and m + 1 states q0, ..., qm for each rule A → X1 ··· Xm. The states for a rule A → X1 ··· Xm are connected with each other and to the states for the left-hand side A by one transition (qA, ε, q0), a transition (qi−1, Xi, qi) for each i such that 1 ≤ i ≤ m, and one transition (qm, ε, q′A). (Actually, some epsilon transitions are avoided in our implementation, but we will not be concerned with such optimizations here.) In this way, we obtain a finite automaton with initial state qA and final state q′A for each nonterminal A and its defining rules A → X1 ··· Xm. This automaton can be seen as one component of the RTN. The complete RTN is obtained by the collection of all such finite automata for different nonterminals.</Paragraph> <Paragraph position="2"> An approximation now results if we join all the components into one big automaton, and if we approximate the usual mechanism of recursion by replacing each transition (q, A, q′) by two transitions (q, ε, qA) and (q′A, ε, q′). The construction is illustrated in Figure 5.</Paragraph> <Paragraph position="3"> In terms of the original grammar, this approximation can be informally explained as follows: Suppose we have three rules B → αAβ, B′ → α′Aβ′, and A → γ. Top-down, left-to-right parsing would proceed, for example, by recognizing α in the first rule; it would then descend into rule A → γ, and recognize γ; it would then return to the first rule and subsequently process β.
In the approximation, however, the finite automaton &quot;forgets&quot; which rule it came from when it starts to recognize γ, so that it may subsequently recognize β′ in the second rule.</Paragraph> <Paragraph position="4"> Application of the RTN method for the grammar in (a): the RTN is given in (b), and (c) presents the approximating finite automaton. We assume A is the start symbol, and therefore qA becomes the initial state and q′A becomes the final state in the approximating automaton. For the sake of presentational convenience, the above describes a construction working on the complete grammar. However, our implementation applies the construction separately for each nonterminal in a set Ni such that recursive(Ni) = self, which leads to a separate subautomaton of the compact representation (Section 3).</Paragraph> <Paragraph position="5"> See Nederhof (1998) for a variant of this approximation that constructs finite transducers rather than finite automata.</Paragraph> <Paragraph position="6"> We have further implemented a parameterized version of the RTN approximation.</Paragraph> <Paragraph position="7"> A state of the nondeterministic automaton is now also associated with a list H of length |H| strictly smaller than a number d, which is the parameter of the method. This list represents a history of rule positions that were encountered in the computation leading to the present state.</Paragraph> <Paragraph position="8"> More precisely, we define an item to be an object of the form [A → α • β], where A → αβ is a rule from the grammar. These are the same objects as the &quot;dotted&quot; productions of Earley (1970). The dot indicates a position in the right-hand side.</Paragraph> <Paragraph position="9"> The unparameterized RTN method had one state qI for each item I, and two states qA and q′A for each nonterminal A.
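The two steps of the unparameterized construction, building one component per nonterminal and then short-circuiting recursion with epsilon transitions, can be sketched in a few lines of Python. The grammar encoding, state names, and helper functions below are our own illustrative choices, not the implementation used in the experiments reported here:

```python
# Minimal sketch of the unparameterized RTN superset approximation.
# Rules are (lhs, rhs) pairs; state names are illustrative only.

def rtn_approximation(rules, nonterminals):
    """Return transitions (state, label, state) of an epsilon-NFA;
    a label of None stands for an epsilon transition."""
    transitions = []
    for r, (lhs, rhs) in enumerate(rules):
        transitions.append((("q", lhs), None, ("q", r, 0)))          # (qA, eps, q0)
        for i, x in enumerate(rhs):
            transitions.append((("q", r, i), x, ("q", r, i + 1)))    # (q_{i-1}, Xi, qi)
        transitions.append((("q", r, len(rhs)), None, ("q'", lhs)))  # (qm, eps, q'A)
    # Approximate recursion: replace each transition (q, A, q') on a
    # nonterminal A by (q, eps, qA) and (q'A, eps, q').
    flattened = []
    for p, x, q in transitions:
        if x in nonterminals:
            flattened.append((p, None, ("q", x)))
            flattened.append((("q'", x), None, q))
        else:
            flattened.append((p, x, q))
    return flattened

def recognizes(transitions, start, final, word):
    """Standard NFA simulation with epsilon closure."""
    def closure(states):
        states, stack = set(states), list(states)
        while stack:
            s = stack.pop()
            for p, x, q in transitions:
                if p == s and x is None and q not in states:
                    states.add(q)
                    stack.append(q)
        return states
    current = closure({start})
    for a in word:
        current = closure({q for p, x, q in transitions if p in current and x == a})
    return final in current
```

For the self-embedding grammar S → a S b | c, the flattened automaton accepts every string of the language but also unbalanced strings such as acbb, which is exactly the superset behavior described above.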
The parameterized RTN method has one state q_{I,H} for each item I and each list of items H that represents a valid history for reaching I, and two states q_{A,H} and q′_{A,H} for each nonterminal A and each list of items H that represents a valid history for reaching A. Such a valid history is defined to be a list H with 0 ≤ |H| < d that represents a series of positions in rules that could have been invoked before reaching I or A, respectively. More precisely, if we set H = I1 ... In, then each Im (1 ≤ m ≤ n) should be of the form [Am → αm • Bmβm], and for 1 ≤ m < n we should have Am = Bm+1. Furthermore, for a state q_{I,H} with I = [A → α • β] we demand A = B1 if n > 0. For a state q_{A,H} we demand A = B1 if n > 0. (Strictly speaking, states q_{A,H} and q_{I,H}, with |H| < d − 1 and I = [A → α • β], will only be needed if A_{|H|} is the start symbol in the case |H| > 0, or if A is the start symbol in the case H = ε.) The transitions of the automaton that pertain to terminals in right-hand sides of rules are very similar to those in the case of the unparameterized method: For a state q_{I,H} with I of the form [A → α • aβ], we create a transition (q_{I,H}, a, q_{I′,H}), with I′ = [A → αa • β].</Paragraph> <Paragraph position="11"> Similarly, we create epsilon transitions that connect left-hand sides and right-hand sides of rules: For each state q_{A,H} there is a transition (q_{A,H}, ε, q_{I,H}) for each item I = [A → • α], for some α, and for each state of the form q_{I′,H}, with I′ = [A → α •], there is a transition (q_{I′,H}, ε, q′_{A,H}).</Paragraph> <Paragraph position="12"> For transitions that pertain to nonterminals in the right-hand sides of rules, we need to manipulate the histories. For a state q_{I,H} with I of the form [A → α • Bβ], we create two epsilon transitions. One is (q_{I,H}, ε, q_{B,H′}), where H′ is defined to be IH if |IH| < d, and to be the first d − 1 items of IH, otherwise.
Informally, we extend the history by the item I representing the rule position that we have just come from, but the oldest information in the history is discarded if the history becomes too long. The second transition is (q′_{B,H′}, ε, q_{I′,H}), with I′ = [A → αB • β].</Paragraph> <Paragraph position="13"> If the start symbol is S, the initial state is q_{S,ε} and the final state is q′_{S,ε} (after the symbol S in the subscripts we find empty lists of items). Note that the parameterized method with d = 1 concurs with the unparameterized method, since the lists of items then remain empty.</Paragraph> <Paragraph position="14"> An example with parameter d = 2 is given in Figure 6. For the unparameterized method, each I = [A → α • β] corresponded to one state (Figure 5). Since reaching A can have three different histories of length shorter than 2 (the empty history, since A is the start symbol; the history of coming from the rule position given by item [A → c • A]; and the history of coming from the rule position given by item [B → d • Ae]), in Figure 6 we now have three states of the form q_{I,H} for each I = [A → α • β], as well as three states of the form q_{A,H} and q′_{A,H}. The higher we choose d, the more precise the approximation is, since the histories allow the automaton to simulate part of the mechanism of recursion from the original grammar, and the maximum length of the histories corresponds to the number of levels of recursion that can be simulated accurately.</Paragraph> </Section> <Section position="2" start_page="24" end_page="24" type="sub_section"> <SectionTitle> 4.2 Refinement of RTN Superset Approximation </SectionTitle> <Paragraph position="0"> We rephrase the method of Grimley-Evans (1997) as follows: First, we construct the approximating finite automaton according to the unparameterized RTN method above.</Paragraph> <Paragraph position="1"> Then an additional mechanism is introduced that ensures for each rule A → X1 ··· Xm separately that the list of visits to the states q0, ..., qm satisfies some reasonable criteria: a visit to qi, with 0 ≤ i < m, should be followed by one to qi+1 or q0. The latter option amounts to a nested incarnation of the rule. There is a complementary condition for what should precede a visit to qi, with 0 < i ≤ m.</Paragraph> <Paragraph position="2"> Since only pairs of consecutive visits to states from the set {q0, ..., qm} are considered, finite-state techniques suffice to implement such conditions. This can be realized by attaching histories to the states as in the case of the parameterized RTN method above, but now each history is a set rather than a list, and can contain at most one</Paragraph> <Paragraph position="4"> Application of the parameterized RTN method with d = 2. We again assume A is the start symbol. States qm have not been labeled in order to avoid cluttering the picture.</Paragraph> <Paragraph position="5"> As confirmed by our own experiments, the nondeterministic finite automata resulting from this method may be quite large, even for small grammars. The explanation is that the number of such histories is exponential in the number of rules.</Paragraph> <Paragraph position="6"> We have refined the method with respect to the original publication by applying the construction separately for each nonterminal in a set Ni such that recursive(Ni) = self.</Paragraph> </Section> <Section position="3" start_page="24" end_page="30" type="sub_section"> <SectionTitle> 4.3 Subset Approximation by Transforming the Grammar </SectionTitle> <Paragraph position="0"> Putting restrictions on spines is another way to obtain a regular language. Several methods can be defined. The first method we present investigates spines in a very detailed way. It eliminates from the language only those sentences for which a subderivation is required of the form B →* αBβ, for some α ≠ ε and β ≠ ε.
The motivation is that such sentences do not occur frequently in practice, since these subderivations make them difficult for people to comprehend (Resnik 1992). Their exclusion will therefore not lead to much loss of coverage of typical sentences, especially for simple application domains.</Paragraph> <Paragraph position="1"> We express the method in terms of a grammar transformation in Figure 7. The effect of this transformation is that a nonterminal A is tagged with a set of pairs (B, Q), where B is a nonterminal occurring higher in the spine; for any given B, at most one such pair (B, Q) can be contained in the set. The set Q may contain the element l to indicate that something to the left of the part of the spine from B to A was generated. We are given a grammar G = (Σ, N, P, S). The following is to be performed for each set Ni ∈ 𝒩 such that recursive(Ni) = self.</Paragraph> <Paragraph position="2"> 1. For each A ∈ Ni and each F ⊆ Ni × 2^{{l,r}}, add the following nonterminal to N.</Paragraph> <Paragraph position="3"> • A^F.</Paragraph> <Paragraph position="4"> 2. For each A ∈ Ni, add the following rule to P.</Paragraph> <Paragraph position="5"> • A → A^∅.</Paragraph> <Paragraph position="6"> 3. For each (A → α0A1α1A2 ... αm−1Amαm) ∈ P such that A, A1, ..., Am ∈ Ni and no symbols from α0, ..., αm are members of Ni, and each F such that (A, {l, r}) ∉ F, add the following rule to P.</Paragraph> <Paragraph position="7"> • A^F → α0A1^{F1}α1 ... Am^{Fm}αm, where, for 1 ≤ j ≤ m: Fj = {(B, Q ∪ Q_j^l ∪ Q_j^r) | (B, Q) ∈ F′}; F′ = F ∪ {(A, ∅)} if ¬∃Q[(A, Q) ∈ F], and F′ = F otherwise; Q_j^l = ∅ if α0A1α1 ... Aj−1αj−1 = ε, and Q_j^l = {l} otherwise; Q_j^r = ∅ if αjAj+1αj+1 ... Amαm = ε, and Q_j^r = {r} otherwise.</Paragraph> <Paragraph position="8"> 4. Remove from P the old rules of the form A → α, where A ∈ Ni. 5. Reduce the grammar.</Paragraph> <Paragraph position="9"> Subset approximation by transforming the grammar.</Paragraph> <Paragraph position="10"> Similarly, r ∈ Q indicates that something to the right was generated. If Q = {l, r}, then we have obtained a derivation B →* αAβ, for some α ≠ ε and β ≠ ε, and further occurrences of B below A should be blocked in order to avoid a derivation with self-embedding.</Paragraph> <Paragraph position="11"> An example is given in Figure 8. The original grammar is implicit in the depicted parse tree on the left, and contains at least the rules S → A a, A → b B, B → C, and C → S. This grammar is self-embedding, since we have a subderivation S →* bSa. We explain how FB is obtained from FA in the rule A^{FA} → b B^{FB}. We first construct F′ = {(S, {r}), (A, ∅)} from FA = {(S, {r})} by adding (A, ∅), since no other pair of the form (A, Q) was already present. To the left of the occurrence of B in the original rule A → b B we find a nonempty string b. This means that we have to add l to all second components of pairs in F′, which gives us FB = {(S, {l, r}), (A, {l})}. In the transformed grammar, the lower occurrence of S in the tree is tagged with the set {(S, {l, r}), (A, {l}), (B, ∅), (C, ∅)}. The meaning is that higher up in the spine, we will find the nonterminals S, A, B, and C. The pair (A, {l}) indicates that since we saw A on the spine, something to the left has been generated, namely, b. The pair (B, ∅) indicates that nothing either to the left or to the right has been generated since we saw B. The pair (S, {l, r}) indicates that both to the left and to the right something has been generated (namely, b on the left and a on the right).
Figure 8 shows a parse tree in a self-embedding grammar (a), and the corresponding parse tree in the transformed grammar (b), for the transformation from Figure 7. For the moment we ignore step 5 of Figure 7, i.e., reduction of the transformed grammar. Since this indicates that an</Paragraph> <Paragraph position="12"> offending subderivation S →* αSβ has been found, further completion of the parse tree is blocked: the transformed grammar will not have any rules with left-hand side S^{(S,{l,r}),(A,{l}),(B,∅),(C,∅)}. In fact, after the grammar is reduced, any parse tree that is constructed can no longer even contain a node labeled by S^{(S,{l,r}),(A,{l}),(B,∅),(C,∅)}, or any nodes with labels of the form A^F such that (A, {l, r}) ∈ F.</Paragraph> <Paragraph position="13"> One could generalize this approximation in such a way that not all self-embedding is blocked, but only self-embedding occurring, say, twice in a row, in the sense of a subderivation of the form A →* α1Aβ1 →* α1α2Aβ2β1. We will not do so here, because already for the basic case above, the transformed grammar can be huge due to the high number of nonterminals of the form A^F that may result; the number of such nonterminals is exponential in the size of Ni.</Paragraph> <Paragraph position="14"> We therefore present, in Figure 9, an alternative approximation that has a lower complexity. By parameter d, it restricts the number of rules along a spine that may generate something to the left and to the right. We do not, however, restrict pure left recursion and pure right recursion. Between two occurrences of an arbitrary rule, we allow left recursion followed by right recursion (which leads to tag r followed by tag rl), or right recursion followed by left recursion (which leads to tag l followed by tag lr).</Paragraph> <Paragraph position="15"> An example is given in Figure 10.
As before, the rules of the grammar are implicit in the depicted parse tree. At the top of the derivation we find S. In the transformed grammar, we first have to apply S → S^{⊤,0}. The derivation starts with a rule S → A a, which generates a string (a) to the right of a nonterminal (A). Before we can apply zero or more of such rules, we first have to apply a unit rule S^{⊤,0} → S^{r,0} in the transformed grammar. For zero or more rules that subsequently generate something on the left, such as A → b B, we have to obtain a superscript containing rl, and in the example this is done by applying A^{r,0} → A^{rl,0}. Now we are finished with pure left recursion and pure right recursion, and apply B^{rl,0} → B^{⊥,0}. This allows us to apply one unconstrained rule, which appears in the transformed grammar as B^{⊥,0} → c S^{⊤,1} d. We are given a grammar G = (Σ, N, P, S). The following is to be performed for each set Ni ∈ 𝒩 such that recursive(Ni) = self. The value d stands for the maximum number of unconstrained rules along a spine, possibly alternated with a series of left-recursive rules followed by a series of right-recursive rules, or vice versa.</Paragraph> <Paragraph position="16"> 1. For each A ∈ Ni, each Q ∈ {⊤, l, r, lr, rl, ⊥}, and each f such that 0 ≤ f ≤ d, add the following nonterminals to N.</Paragraph> <Paragraph position="17"> • A^{Q,f}.</Paragraph> <Paragraph position="18"> 2. For each A ∈ Ni, add the following rule to P.</Paragraph> <Paragraph position="19"> • A → A^{⊤,0}.</Paragraph> <Paragraph position="20"> 3. For each A ∈ Ni and f such that 0 ≤ f ≤ d, add the following rules to P. 4. For each (A → Bα) ∈ P such that A, B ∈ Ni and no symbols from α are members of Ni, each f such that 0 ≤ f ≤ d, and each Q ∈ {r, lr}, add the following rule to P.</Paragraph> <Paragraph position="21"> • A^{Q,f} → B^{Q,f}α.</Paragraph> <Paragraph position="22"> 5. For each (A → αB) ∈ P such that A, B ∈ Ni and no symbols from α are members of Ni, each f such that 0 ≤ f ≤ d, and each Q ∈ {l, rl}, add the following rule to P.</Paragraph> <Paragraph position="23"> • A^{Q,f} → αB^{Q,f}.</Paragraph> <Paragraph position="24"> 6. For each (A → α0A1α1A2 ... αm−1Amαm) ∈ P such that A, A1, ..., Am ∈ Ni and no symbols from α0, ..., αm are members of Ni, and each f such that 0 ≤ f ≤ d, add the following rule to P, provided m = 0 ∨ f < d.</Paragraph> <Paragraph position="25"> • A^{⊥,f} → α0A1^{⊤,f+1}α1 ... Am^{⊤,f+1}αm.</Paragraph> <Paragraph position="26"> 7. Remove from P the old rules of the form A → α, where A ∈ Ni.</Paragraph> <Paragraph position="27"> 8. Reduce the grammar.</Paragraph> <Paragraph position="28"> A simpler subset approximation by transforming the grammar.</Paragraph> <Paragraph position="29"> Now the counter f has been increased from 0 at the start of the subderivation to 1 at the end. Depending on the value d that we choose, we cannot build derivations by repeating the subderivation S →* b c S d a an unlimited number of times: at some point the counter will exceed d. If we choose d = 0, then already the derivation at</Paragraph> <Paragraph position="31"> A parse tree in a self-embedding grammar (a), and the corresponding parse tree in the transformed grammar (b), for the simple subset approximation from Figure 9.</Paragraph> <Paragraph position="32"> Figure 10 (b) is no longer possible, since no nonterminal in the transformed grammar would contain 1 in its superscript.</Paragraph> <Paragraph position="33"> Because of the demonstrated increase of the counter f, this transformation is guaranteed to remove self-embedding from the grammar. However, it is not as selective as the transformation we saw before, in the sense that it may also block subderivations that are not of the form A →* αAβ.
Consider for example the subderivation from Figure 10, but with the lower occurrence of S replaced by any other nonterminal C that is mutually recursive with S, A, and B. Such a subderivation S →* b c C d a would also be blocked by choosing d = 0. In general, increasing d allows more of such derivations that are not of the form A →* αAβ, but also allows more derivations that are of that form.</Paragraph> <Paragraph position="34"> The reason for considering this transformation rather than any other that eliminates self-embedding is purely pragmatic: of the many variants we have tried that yield nontrivial subset approximations, this transformation has the lowest complexity in terms of the sizes of intermediate structures and of the resulting finite automata. In the actual implementation, we have integrated the grammar transformation and the construction of the finite automaton, which avoids reanalysis of the grammar to determine the partition of mutually recursive nonterminals after transformation. This integration makes use, for example, of the fact that for fixed Ni and fixed f, the set of nonterminals of the form A^{l,f}, with A ∈ Ni, is (potentially) mutually right-recursive. A set of such nonterminals can therefore be treated as the corresponding case from Figure 2, assuming the value right.</Paragraph> <Paragraph position="35"> The full formulation of the integrated grammar transformation and construction of the finite automaton is rather long and is therefore not given here. A very similar formulation, for another grammar transformation, is given in Nederhof (1998).</Paragraph> </Section> <Section position="4" start_page="30" end_page="31" type="sub_section"> <SectionTitle> 4.4 Superset Approximation through Pushdown Automata </SectionTitle> <Paragraph position="0"> The distinction between context-free languages and regular languages can be seen in terms of the distinction between pushdown automata and finite automata.
Pushdown automata maintain a stack that is potentially unbounded in height, which allows more complex languages to be recognized than in the case of finite automata. Regular approximation can be achieved by restricting the height of the stack, as we will see in Section 4.5, or by ignoring the distinction between several stacks when they become too high.</Paragraph> <Paragraph position="1"> More specifically, the method proposed by Pereira and Wright (1997) first constructs an LR automaton, which is a special case of a pushdown automaton. Then, stacks that may be constructed in the course of recognition of a string are computed one by one. However, stacks that contain two occurrences of a stack symbol are identified with the shorter stack that results by removing the part of the stack between the two occurrences, including one of the two occurrences. This process defines a congruence relation on stacks, with a finite number of congruence classes. This congruence relation directly defines a finite automaton: each class is translated to a unique state of the nondeterministic finite automaton, shift actions are translated to transitions labeled with terminals, and reduce actions are translated to epsilon transitions.</Paragraph> <Paragraph position="2"> The method has a high complexity. First, construction of an LR automaton, of which the size is exponential in the size of the grammar, may be a prohibitively expensive task (Nederhof and Satta 1996). This is, however, only a fraction of the effort needed to compute the congruence classes, of which the number is in turn exponential in the size of the LR automaton. If the resulting nondeterministic automaton is determinized, we obtain a third source of exponential behavior. The time and space complexity of the method are thereby bounded by a triple exponential function in the size of the grammar. 
This theoretical analysis seems to be in keeping with the high costs of applying this method in practice, as will be shown later in this article.</Paragraph> <Paragraph position="3"> As proposed by Pereira and Wright (1997), our implementation applies the approximation separately for each nonterminal occurring in a set Ni that reveals self-embedding. A different superset approximation based on LR automata was proposed by Baker (1981) and rediscovered by Heckert (1994). Each individual stack symbol is now translated to one state of the nondeterministic finite automaton. It can be argued theoretically that this approximation differs from the unparameterized RTN approximation from Section 4.1 only under certain conditions that are not likely to occur very often in practice. This consideration is confirmed by our experiments to be discussed later.</Paragraph> <Paragraph position="4"> Our implementation differs from the original algorithm in that the approximation is applied separately for each nonterminal in a set Ni that reveals self-embedding.</Paragraph> <Paragraph position="5"> A generalization of this method was suggested by Bermudez and Schimpf (1990).</Paragraph> <Paragraph position="6"> For a fixed number d > 0 we investigate sequences of the d topmost elements of stacks that may arise in the LR automaton, and we translate these to states of the finite automaton. More precisely, we define another congruence relation on stacks, such that we have one congruence class for each sequence of d stack symbols, and this class contains all stacks that have that sequence as their d topmost elements; we have a separate class for each stack that contains fewer than d elements. As before, each congruence class is translated to one state of the nondeterministic finite automaton.
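The congruence just described can be sketched as a small function mapping a stack to its class; the representation below is a hypothetical choice for illustration, not drawn from the cited implementations:

```python
# Sketch of the congruence of Bermudez and Schimpf (1990), as described
# above: stacks sharing their d topmost symbols fall into one class, and
# each stack of fewer than d symbols forms a class of its own.

def congruence_class(stack, d):
    """Map a stack (a tuple with the top at the right) to its class."""
    if len(stack) < d:
        return ("short", stack)    # separate class per short stack
    return ("top", stack[-d:])     # keyed by the d topmost symbols
```

Each distinct value returned would then become one state of the nondeterministic finite automaton.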
Note that the case d = 1 is equivalent to the approximation in Baker (1981).</Paragraph> <Paragraph position="7"> If we replace the LR automaton by a certain type of automaton that performs top-down recognition, then the method in Bermudez and Schimpf (1990) amounts to the parameterized RTN method from Section 4.1; note that the histories from Section 4.1 in fact function as stacks, the items being the stack symbols.</Paragraph> </Section> <Section position="5" start_page="31" end_page="31" type="sub_section"> <SectionTitle> 4.5 Subset Approximation through Pushdown Automata </SectionTitle> <Paragraph position="0"> By restricting the height of the stack of a pushdown automaton, one obstructs recognition of a set of strings in the context-free language, and therefore a subset approximation results. This idea was proposed by Krauwer and des Tombe (1981), Langendoen and Langsam (1987), and Pulman (1986), and was rediscovered by Black (1989) and recently by Johnson (1998). Since the latest publication in this area is the most explicit in its presentation, we will base our treatment on it, instead of going to the historical roots of the method.</Paragraph> <Paragraph position="1"> One first constructs a modified left-corner recognizer from the grammar, in the form of a pushdown automaton. The stack height is bounded by a low number; Johnson (1998) claims a suitable number would be 5. The motivation for using the left-corner strategy is that the height of the stack maintained by a left-corner parser is already bounded by a constant in the absence of self-embedding. If the artificial bound imposed by the approximation method is chosen to be larger than or equal to this natural bound, then the approximation may be exact.</Paragraph> <Paragraph position="2"> Our own implementation is more refined than the published algorithms mentioned above, in that it defines a separate left-corner recognizer for each nonterminal A such that A ∈ Ni and recursive(Ni) = self, for some i.
In the construction of one such recognizer, nonterminals that do not belong to Ni are treated as terminals, as in all other methods discussed here.</Paragraph> </Section> <Section position="6" start_page="31" end_page="32" type="sub_section"> <SectionTitle> 4.6 Superset Approximation by N-grams </SectionTitle> <Paragraph position="0"> An approximation from Seyfarth and Bermudez (1995) can be explained as follows.</Paragraph> <Paragraph position="1"> Define the set of all terminals reachable from nonterminal A to be ΣA = {a | ∃α, β[A →* αaβ]}. We now approximate the set of strings derivable from A by ΣA*, which is the set of strings consisting of terminals from ΣA. Our implementation is made slightly more sophisticated by taking ΣA to be {X | ∃B, α, β[B ∈ Ni ∧ B → αXβ ∧ X ∉ Ni]}, for each A such that A ∈ Ni and recursive(Ni) = self, for some i. That is, each X ∈ ΣA is a terminal, or a nonterminal not in the same set Ni as A but immediately reachable from set Ni, through B ∈ Ni.</Paragraph> <Paragraph position="2"> This method can be generalized, inspired by Stolcke and Segal (1994), who derive N-gram probabilities from stochastic context-free grammars. By ignoring the probabilities, each N = 1, 2, 3, ... gives rise to a superset approximation that can be described as follows: The set of strings derivable from a nonterminal A is approximated by the set of strings a1 ... an such that * for each substring v = ai+1 ... ai+N (0 ≤ i ≤ n − N) we have A →* wvy, for some w and y, * for each prefix v = a1 ... ai (0 ≤ i ≤ n) such that i < N we have A →* vy, for some y, and * for each suffix v = ai+1 ... an (0 ≤ i ≤ n) such that n − i < N we have A →* wv, for some w.</Paragraph> <Paragraph position="3"> (Again, the algorithms that we actually implemented are more refined and take into account the sets Ni.) The approximation from Seyfarth and Bermudez (1995) can be seen as the case N = 1, which will henceforth be called the unigram method.
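For the unigram case, the construction reduces to collecting the reachable terminals and accepting any string over them. The following Python sketch is our own illustrative code, assuming a grammar given as (lhs, rhs) pairs in which every nonterminal is productive:

```python
# Sketch of the unigram (N = 1) superset approximation: the strings
# derivable from `start` are approximated by all strings over the
# terminals reachable from `start`. Rules are (lhs, rhs) pairs.

def reachable_terminals(rules, nonterminals, start):
    """Sigma_A: terminals occurring in some string derivable from start."""
    reached, frontier = {start}, [start]
    while frontier:                      # closure over reachable nonterminals
        a = frontier.pop()
        for lhs, rhs in rules:
            if lhs == a:
                for x in rhs:
                    if x in nonterminals and x not in reached:
                        reached.add(x)
                        frontier.append(x)
    return {x for lhs, rhs in rules if lhs in reached
              for x in rhs if x not in nonterminals}

def unigram_accepts(sigma, word):
    """Membership in Sigma_A*: every symbol must be a reachable terminal."""
    return all(a in sigma for a in word)
```

For S → a S b | c this yields ΣS = {a, b, c}, so the approximation accepts any string over these three symbols, ignoring order and balance entirely.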
We have also experimented with the cases N = 2 and N = 3, which will be called the bigram and trigram methods, respectively.</Paragraph> </Section> </Section> </Paper>