<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1069"> <Title>Probabilistic Parsing Strategies</Title> <Section position="4" start_page="3" end_page="3" type="metho"> <SectionTitle> 3 Parsing Strategies </SectionTitle> <Paragraph position="0"> The term &quot;parsing strategy&quot; is often used informally to refer to a class of parsing algorithms that behave similarly in some way. In this paper, we assign a formal meaning to this term, relying on the observation by (Lang, 1974) and (Billot and Lang, 1989) that many parsing algorithms for CFGs can be described in two steps. The first is a construction of push-down devices from CFGs, and the second is a method for handling nondeterminism (e.g. backtracking or dynamic programming). Parsing algorithms that handle nondeterminism in different ways but apply the same construction of push-down devices from CFGs are seen as realizations of the same parsing strategy.</Paragraph> <Paragraph position="1"> Thus, we define a parsing strategy to be a function S that maps a reduced CFG G = (Σ1, N, S, R) to a pair S(G) = (A, f) consisting of a reduced PDT A</Paragraph> <Paragraph position="3"> and a function f that maps a subset of Σ2* to a subset of R*, with the following properties: R ⊆ Σ2.</Paragraph> <Paragraph position="4"> For each string w ∈ Σ1* and each complete computation c on w, f(out(c)) = d is a (leftmost) derivation of w. Furthermore, each symbol from R occurs as often in out(c) as it occurs in d.</Paragraph> <Paragraph position="5"> Conversely, for each string w ∈ Σ1* and each derivation d of w, there is precisely one complete computation c on w such that f(out(c)) = d.</Paragraph> <Paragraph position="6"> If c is a complete computation, we will write f(c) to denote f(out(c)). The conditions above then imply that f is a bijection from complete computations to complete derivations.
Note that output strings of (complete) computations may contain symbols that are not in R, and the symbols that are in R may occur in a different order in v than in f(v) = d. The purpose of the symbols in Σ2 − R is to help this process of reordering of symbols from R in v, as needed for instance in the case of the left-corner parsing strategy (see (Nijholt, 1980, pp. 22-23) for discussion).</Paragraph> <Paragraph position="7"> A probabilistic parsing strategy is defined to be a function S that maps a reduced, proper and consistent PCFG (G, pG) to a triple S(G, pG) = (A, pA, f), where (A, pA) is a reduced, proper and consistent PPDT, with the same properties as a (non-probabilistic) parsing strategy, and in addition: for each complete derivation d and each complete computation c such that f(c) = d, pG(d) equals pA(c).</Paragraph> <Paragraph position="8"> In other words, a complete computation has the same probability as the complete derivation that it is mapped to by function f. An implication of this property is that for each string w ∈ Σ1*, the probabilities assigned to that string by (G, pG) and (A, pA) are equal.</Paragraph> <Paragraph position="9"> We say that probabilistic parsing strategy S′ is an extension of parsing strategy S if for each reduced CFG G and probability function pG we have S(G) = (A, f) if and only if S′(G, pG) = (A, pA, f) for some pA.</Paragraph> </Section> <Section position="5" start_page="3" end_page="3" type="metho"> <SectionTitle> 4 Correct-Prefix Property </SectionTitle> <Paragraph position="0"> In this section we present a necessary condition for the probabilistic extension of a parsing strategy. For a given PDT, we say a computation c is dead if it cannot be continued to become a complete computation. We say that a PDT has the correct-prefix property (CPP) if it does not allow any dead computations.
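To make the CPP test concrete, here is a minimal sketch, under the simplifying assumption that the (in general infinite) configuration space of the PDT has been abstracted to a finite directed graph; all node names below are hypothetical:

```python
# Hypothetical finite abstraction of a PDT's configuration graph:
# nodes are configurations, edges are single computation steps.

def can_reach(targets, edges):
    """Backward reachability: all nodes from which some target is reachable."""
    rev = {}
    for u, v in edges:
        rev.setdefault(v, set()).add(u)
    live = set(targets)
    stack = list(targets)
    while stack:
        v = stack.pop()
        for u in rev.get(v, ()):
            if u not in live:
                live.add(u)
                stack.append(u)
    return live

def has_cpp(start, finals, edges):
    """CPP holds iff every configuration reachable from the start is 'live',
    i.e. can still be continued to a final (complete) configuration."""
    # forward reachability from start = backward reachability in reversed graph
    fwd = can_reach([start], [(v, u) for (u, v) in edges])
    live = can_reach(finals, edges)
    return fwd <= live

edges = [("q0", "q1"), ("q1", "qf"), ("q0", "qdead")]
print(has_cpp("q0", ["qf"], edges))  # → False: qdead cannot reach qf
```

A configuration is live if some continuation from it reaches a final configuration; the CPP fails exactly when a configuration reachable from the start is not live (here, qdead, which corresponds to a dead computation).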
We also say that a parsing strategy has the CPP if it maps each reduced CFG to a PDT that has the CPP.</Paragraph> <Paragraph position="1"> Lemma 1 For each reduced CFG G, there is a probability function pG such that PCFG (G, pG) is proper and consistent, and pG(d) > 0 for all complete derivations d.</Paragraph> <Paragraph position="2"> Proof. Since G is reduced, there is a finite set D consisting of complete derivations d, such that for each rule π in G there is at least one d ∈ D in which π occurs. Let nπ,d be the number of occurrences of rule π in derivation d ∈ D, and let nπ be ∑d∈D nπ,d, the total number of occurrences of π in D. Let nA be the sum of nπ for all rules π with A in the left-hand side. A probability function pG can be defined through &quot;maximum-likelihood estimation&quot; such that pG(π) = nπ / nA for each rule π = A → α.</Paragraph> <Paragraph position="3"> For all nonterminals A, ∑π=A→α pG(π) = ∑π=A→α nπ / nA = nA / nA = 1, which means that the PCFG (G, pG) is proper. Furthermore, it has been shown in (Chi and Geman, 1998; Sánchez and Benedí, 1997) that a PCFG (G, pG) is consistent if pG was obtained by maximum-likelihood estimation using a set of derivations. Finally, since nπ > 0 for each π, also pG(π) > 0 for each π, and pG(d) > 0 for all complete derivations d.</Paragraph> <Paragraph position="4"> We say a computation is a shortest dead computation if it is dead and none of its proper prefixes is dead. Note that each dead computation has a unique prefix that is a shortest dead computation. For a PDT A, let TA be the union of the set of all complete computations and the set of all shortest dead computations.</Paragraph> <Paragraph position="5"> Lemma 2 For each proper PPDT (A, pA), ∑c∈TA pA(c) ≤ 1.</Paragraph> <Paragraph position="6"> Proof.
The proof is a trivial variant of the proof that for a proper PCFG (G, pG), the sum of pG(d) for all derivations d cannot exceed 1, which is shown by (Booth and Thompson, 1973).</Paragraph> <Paragraph position="7"> From this, the main result of this section follows. Theorem 3 A parsing strategy that lacks the CPP cannot be extended to become a probabilistic parsing strategy.</Paragraph> <Paragraph position="8"> Proof. Take a parsing strategy S that does not have the CPP. Then there is a reduced CFG G = (Σ1, N, S, R), with S(G) = (A, f) for some A and f, and a shortest dead computation c allowed by A.</Paragraph> <Paragraph position="9"> It follows from Lemma 1 that there is a probability function pG such that (G, pG) is a proper and consistent PCFG and pG(d) > 0 for all complete derivations d. Assume we also have a probability function pA such that (A, pA) is a proper and consistent PPDT and pA(c′) = pG(f(c′)) for each complete computation c′. Since A is reduced, each transition τ must occur in some complete computation c′. Furthermore, for each complete computation c′ there is a complete derivation d such that f(c′) = d, and pA(c′) = pG(d) > 0. Therefore, pA(τ) > 0 for each transition τ, and pA(c) > 0, where c is the above-mentioned dead computation.</Paragraph> <Paragraph position="10"> Due to Lemma 2, 1 ≥ ∑c′∈TA pA(c′) ≥ pA(c) + ∑c′ complete pA(c′) > ∑c′ complete pA(c′) = ∑d pG(d) = ∑w∈Σ1* pG(w), so that ∑w∈Σ1* pG(w) < 1.</Paragraph> <Paragraph position="12"> This is in contradiction with the consistency of (G, pG). Hence, a probability function pA with the properties we required above cannot exist, and therefore S cannot be extended to become a probabilistic parsing strategy.</Paragraph> </Section> <Section position="6" start_page="3" end_page="3" type="metho"> <SectionTitle> 5 Strong Predictiveness </SectionTitle> <Paragraph position="0"> In this section we present our main result, which is a sufficient condition allowing the probabilistic extension of a parsing strategy.
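As an aside, the maximum-likelihood construction used in the proof of Lemma 1 is easy to make concrete. A minimal sketch, assuming rules are encoded as (left-hand side, right-hand side) pairs and derivations as lists of rules; the toy derivation set D is hypothetical:

```python
from collections import Counter

def ml_estimate(derivations):
    """Maximum-likelihood estimation as in the proof of Lemma 1:
    p_G(pi) = n_pi / n_A for each rule pi = A -> alpha."""
    n_rule = Counter()   # n_pi: occurrences of each rule pi in D
    n_lhs = Counter()    # n_A: occurrences summed over rules with lhs A
    for d in derivations:
        for (lhs, rhs) in d:
            n_rule[(lhs, rhs)] += 1
            n_lhs[lhs] += 1
    return {pi: n / n_lhs[pi[0]] for pi, n in n_rule.items()}

D = [
    [("S", "a S"), ("S", "b")],  # S => a S => a b
    [("S", "b")],                # S => b
]
p = ml_estimate(D)
# p[("S", "b")] = 2/3, p[("S", "a S")] = 1/3
```

By construction the estimates for each left-hand side sum to 1, so the resulting PCFG is proper; consistency follows from the results of (Chi and Geman, 1998; Sánchez and Benedí, 1997) cited above.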
We start with a technical result that was proven in (Abney et al., 1999; Chi, 1999; Nederhof and Satta, 2003).</Paragraph> <Paragraph position="1"> Lemma 4 Given a non-proper PCFG (G, pG), G = (Σ, N, S, R), there is a probability function p′G such that PCFG (G, p′G) is proper and, for every complete derivation d, p′G(d) = (1/C)·pG(d), where C = ∑S⇒d′w, w∈Σ* pG(d′).</Paragraph> <Paragraph position="2"> Note that if PCFG (G, pG) in the above lemma is consistent, then C = 1 and (G, p′G) and (G, pG) define the same distribution on derivations. The normalization procedure underlying Lemma 4 makes use of quantities ∑A⇒dw, w∈Σ* pG(d) for each A ∈ N. These quantities can be computed to any degree of precision, as discussed for instance in (Booth and Thompson, 1973) and (Stolcke, 1995). Thus normalization of a PCFG can be effectively computed. For a fixed PDT, we define the binary relation ↝ on stack symbols by: Y ↝ Y′ if and only if (Y, w, ε) ⊢* (Y′, ε, v) for some w ∈ Σ1* and v ∈ Σ2*. In words, some subcomputation of the PDT may start with stack Y and end with stack Y′. Note that all stacks that occur in such a subcomputation must have height of 1 or more. We say that a (P)PDA or a (P)PDT has the strong predictiveness property (SPP) if the existence of three transitions X ↦ X Y, X Y1 ↦ Z1 and X Y2 ↦ Z2, with Y ↝ Y1 and Y ↝ Y2, implies Z1 = Z2.</Paragraph> <Paragraph position="4"> Informally, this means that when a subcomputation starts with some stack α and some push transition τ, then solely on the basis of τ we can uniquely determine what stack symbol Z1 = Z2 will be on top of the stack in the firstly reached configuration with stack height equal to |α|. Another way of looking at it is that no information may flow from higher stack elements to lower stack elements that was not already predicted before these higher stack elements came into being, hence the term &quot;strong predictiveness&quot;.
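The quantities used in the normalization of Lemma 4 can be approximated by simple fixed-point iteration, in the spirit of the references above. A minimal sketch, with a hypothetical grammar encoding in which each nonterminal maps to a list of (probability, right-hand-side nonterminals) pairs:

```python
from math import prod

def partition(rules, iters=200):
    """Approximate Z(A) = sum of p_G(d) over complete derivations d from A,
    by iterating Z(A) <- sum over rules A->alpha of p * prod of Z over the
    nonterminals in alpha (terminals contribute factor 1)."""
    Z = {A: 0.0 for A in rules}
    for _ in range(iters):
        Z = {A: sum(p * prod(Z[B] for B in rhs) for (p, rhs) in rules[A])
             for A in rules}
    return Z

# A non-proper toy grammar: S -> S S with weight 0.4, S -> a with weight 0.5.
rules = {"S": [(0.4, ["S", "S"]), (0.5, [])]}
Z = partition(rules)
print(round(Z["S"], 3))  # → 0.691, the smallest root of z = 0.4 z^2 + 0.5
```

Starting from Z(A) = 0, the iteration converges monotonically to the smallest non-negative fixed point, here (1 − √0.2)/0.8 ≈ 0.691 < 1, which exhibits the probability mass lost by the non-proper grammar and gives the constant C by which Lemma 4 rescales.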
We say that a parsing strategy has the SPP if it maps each reduced CFG to a PDT with the SPP.</Paragraph> <Paragraph position="5"> Theorem 5 Any parsing strategy that has the CPP and the SPP can be extended to become a probabilistic parsing strategy.</Paragraph> <Paragraph position="6"> Proof. Consider a parsing strategy S that has the CPP and the SPP, and a proper, consistent and reduced PCFG (G, pG), G = (Σ1, N, S, R). Let S(G) = (A, f).</Paragraph> <Paragraph position="8"> We will show that there is a probability function pA such that (A, pA) is a proper and consistent PPDT, and pA(c) = pG(f(c)) for all complete computations c.</Paragraph> <Paragraph position="9"> We first construct a PPDT (A, p′A) as follows.</Paragraph> <Paragraph position="10"> For each scan transition τ = X ↦x,y Y in Δ, let p′A(τ) = pG(y) in case y ∈ R, and p′A(τ) = 1 otherwise. For all remaining transitions τ ∈ Δ, let p′A(τ) = 1. Note that (A, p′A) may be non-proper. Still, from the definition of f it follows that, for each complete computation c, we have p′A(c) = pG(f(c)), (1)</Paragraph> <Paragraph position="12"> and so our PPDT is consistent.</Paragraph> <Paragraph position="13"> We now map (A, p′A) to a language-equivalent PCFG (G′, pG′).</Paragraph> <Paragraph position="15"> Its set of rules R′ contains the following rules with the specified associated probabilities: X → Y Z with pG′(X → Y Z) = p′A(X ↦ X Y), for each push transition X ↦ X Y ∈ Δ, with Z the unique stack symbol such that there is at least one transition X Y′ ↦ Z ∈ Δ with Y ↝ Y′; X → x Y with pG′(X → x Y) = p′A(X ↦x Y), for each scan transition X ↦x Y ∈ Δ; and Y → ε with pG′(Y → ε) = 1, for each stack symbol Y such that there is at least one transition X Y ↦ Z ∈ Δ or such that Y = Xfin.</Paragraph> <Paragraph position="16"> It is not difficult to see that there exists a bijection f′ from complete computations of A to complete derivations of G′, and that we have pG′(f′(c)) = p′A(c) (2)</Paragraph> <Paragraph position="18"> for each complete computation c. Thus (G′, pG′) is consistent.
However, note that (G′, pG′) need not be proper.</Paragraph> <Paragraph position="19"> By Lemma 4, we can construct a new PCFG (G′, p′G′) that is proper and consistent, and such that pG′(d) = p′G′(d), for each complete derivation d of G′. Thus, for each complete computation c of A, we have p′G′(f′(c)) = pG′(f′(c)). (3)</Paragraph> <Paragraph position="21"> We now transfer back the probabilities of rules of (G′, p′G′) to the transitions of A. Formally, we define a new probability function pA such that, for each τ ∈ Δ, pA(τ) = p′G′(π), where π is the rule in R′ that has been constructed from τ as specified above.</Paragraph> <Paragraph position="22"> It is easy to see that PPDT (A, pA) is now proper.</Paragraph> <Paragraph position="23"> Furthermore, for each complete computation c of A we have pA(c) = p′G′(f′(c)), (4)</Paragraph> <Paragraph position="25"> and so (A, pA) is also consistent. By combining equations (1) to (4) we conclude that, for each complete computation c of A, pA(c) = p′G′(f′(c)) = pG′(f′(c)) = p′A(c) = pG(f(c)).</Paragraph> <Paragraph position="27"> This means that strategy S can be probabilistically extended.</Paragraph> <Paragraph position="28"> Note that the construction in the proof above can be effectively computed (see discussion in Section 4 for effective computation of normalized PCFGs).</Paragraph> <Paragraph position="29"> The definition of p′A in the proof of Theorem 5 relies on the strings output by A. This is the main reason why we needed to consider PDTs rather than PDAs. Now assume an appropriate probability function pA has been computed, such that the source PCFG and (A, pA) define equivalent distributions on derivations/computations. Then the probabilities assigned to strings over the input alphabet are also equal.
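The grammar construction in the proof of Theorem 5 can be sketched in code. The encoding below is an assumption (push transitions X ↦ X Y as pairs, scan transitions X ↦x Y as triples, pop transitions X Y′ ↦ Z as triples, and the relation ↝ as a precomputed set of pairs); the transition-by-transition probability transfer is omitted:

```python
def cover_rules(push, scan, pop, leads, x_final):
    """Build the rules of the cover CFG G' for a PDA with the SPP.

    push: push transitions X |-> X Y, as (X, Y) pairs
    scan: scan transitions X -x-> Y, as (X, x, Y) triples
    pop:  pop transitions X Y' |-> Z, as (X, Yp, Z) triples
    leads: the relation Y ~> Y', as a set of (Y, Yp) pairs
    """
    rules = []
    for (X, Y) in push:
        # By the SPP, at most one Z can occur here for a given push transition.
        Zs = {Z for (Xp, Yp, Z) in pop if Xp == X and (Y, Yp) in leads}
        assert len(Zs) <= 1
        for Z in Zs:
            rules.append((X, [Y, Z]))   # X -> Y Z
    for (X, x, Y) in scan:
        rules.append((X, [x, Y]))       # X -> x Y
    poppable = {Yp for (_, Yp, _) in pop} | {x_final}
    for Y in sorted(poppable):
        rules.append((Y, []))           # Y -> epsilon
    return rules
```

The assertion marks exactly the point where the SPP is used: it guarantees that the symbol Z paired with each push transition is unique, so every push transition yields a single rule X → Y Z.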
We may subsequently ignore the output strings if the application at hand merely requires probabilistic recognition rather than probabilistic transduction; in other words, we may simplify PDTs to PDAs.</Paragraph> <Paragraph position="30"> The proof of Theorem 5 also leads to the observation that parsing strategies with the CPP and the SPP, as well as their probabilistic extensions, can be described as grammar transformations, as follows.</Paragraph> <Paragraph position="31"> A given (P)CFG is mapped to an equivalent (P)PDT by a (probabilistic) parsing strategy. By ignoring the output components of swap transitions we obtain a (P)PDA, which can be mapped to an equivalent (P)CFG as shown above. This observation gives rise to an extension with probabilities of the work on covers by (Nijholt, 1980; Leermakers, 1989).</Paragraph> </Section> <Section position="7" start_page="3" end_page="3" type="metho"> <SectionTitle> 6 Applications </SectionTitle> <Paragraph position="0"> Many well-known parsing strategies with the CPP also have the SPP. This is for instance the case for top-down parsing and left-corner parsing. As discussed in the introduction, it has already been shown that for any PCFG G there are equivalent PPDTs implementing these strategies, as reported in (Abney et al., 1999) and (Tendeau, 1995), respectively. Those results now follow more simply from our general characterization. Furthermore, PLR parsing (Soisalon-Soininen and Ukkonen, 1979; Nederhof, 1994) can be expressed in our framework as a parsing strategy with the CPP and the SPP, and thus we obtain as a new result that this strategy allows probabilistic extension.</Paragraph> <Paragraph position="1"> The above strategies are in contrast to the LR parsing strategy, which has the CPP but lacks the SPP, and therefore falls outside our sufficient condition. As we have already seen in the introduction, it turns out that LR parsing cannot be extended to become a probabilistic parsing strategy.
Related to LR parsing is ELR parsing (Purdom and Brown, 1981; Nederhof, 1994), which also lacks the SPP. By an argument similar to the one provided for LR, we can show that ELR parsing also cannot be extended to become a probabilistic parsing strategy. (See (Tendeau, 1997) for earlier observations related to this.) These two cases might suggest that the sufficient condition in Theorem 5 is tight in practice.</Paragraph> <Paragraph position="2"> Decidability of the CPP and the SPP obviously depends on how a parsing strategy is specified. As far as we know, in all practical cases of parsing strategies these properties can be easily decided.</Paragraph> <Paragraph position="3"> Also, observe that our results do not depend on the general behaviour of a parsing strategy S, but just on its &quot;point-wise&quot; behaviour on each input CFG. Specifically, if S does not have the CPP and the SPP, but for some fixed CFG G of interest we obtain a PDT A that has the CPP and the SPP, then we can still apply the construction in Theorem 5.</Paragraph> <Paragraph position="4"> In this way, any probability function pG associated with G can be converted into a probability function pA, such that the resulting PCFG and PPDT induce equivalent distributions. We point out that the CPP and the SPP of a fixed PDT can be efficiently decided using dynamic programming.</Paragraph> <Paragraph position="5"> One more consequence of our results is this. As discussed in the introduction, the properness condition reduces the number of parameters of a PPDT.</Paragraph> <Paragraph position="6"> However, our results show that if the PPDT has the CPP and the SPP then the properness assumption is not restrictive, i.e., by lifting properness we do not gain new distributions with respect to those induced by the underlying PCFG.</Paragraph> </Section> </Paper>