<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1044"> <Title>for Psycholinguistics</Title> <Section position="4" start_page="343" end_page="345" type="metho"> <SectionTitle> 3 Estimation of PCFGs </SectionTitle> <Paragraph position="0"> In this section we give a brief overview of some estimation methods for PCFGs. These methods will be later investigated to show that they always provide consistent PCFGs.</Paragraph> <Paragraph position="1"> In natural language processing applications, estimation of a PCFG is usually carried out on the basis of a tree bank, which in this paper we assume to be a sample, that is, a finite multiset, of complete derivations. Let D be such a sample, and let D be the underlying set of derivations. For d [?] D, we let f(d,D) be the multiplicity of d in D, that is, the number of occurrences of d in D. We define</Paragraph> <Paragraph position="3"> and let f(A,D) =summationtextaf(A - a,D).</Paragraph> <Paragraph position="4"> Consider a CFG G = (N,S,R,S) defined by all and only the nonterminals, terminals and rules observed in D. The criterion of maximum likelihood estimation (MLE) prescribes the construction of a PCFGG = (G,pG) such that pG maximizes the likelihood of D, defined as</Paragraph> <Paragraph position="6"> subject to the properness conditions summationtextapG(A a) = 1 for eachA [?] N. The maximization problem above has a unique solution, provided by the estimator (see for instance (Chi and Geman, 1998))</Paragraph> <Paragraph position="8"> We refer to this as the supervised MLE method.</Paragraph> <Paragraph position="9"> In applications in which a tree bank is not available, one might still use the MLE criterion to train a PCFG in an unsupervised way, on the basis of a sample of unannotated sentences, also called a corpus. Let us call C such a sample and C the underlying set of sentences. For w [?] C, we let f(w,C) be the multiplicity of w in C.</Paragraph> <Paragraph position="10"> Assume a CFG G = (N,S,R,S) that is able to generate all of the sentences in C, and possibly more. The MLE criterion prescribes the construction of a PCFG G = (G,pG) such that pG maximizes the likelihood of C, defined as</Paragraph> <Paragraph position="12"> subject to the properness conditions as in the supervised case above. The above maximization problem provides a system of |R |nonlinear equations (see (Chi and Geman, 1998))</Paragraph> <Paragraph position="14"> where Ep denotes an expectation computed under distribution p, and pG(d|w) is the probability of derivation d conditioned by sentence w (so that pG(d|w) > 0 only if y(d) = w). The solution to the above system is not unique, because of the nonlinearity. Furthermore, each solution of (9) identifies a point where the curve in (8) has partial derivatives of zero, but this does not necessarily correspond to a local maximum, let alone an absolute maximum. (A point with partial derivatives of zero that is not a local maximum could be a local minimum or even a so-called saddle point.) In practice, this system is typically solved by means of an iterative algorithm called inside/outside (Charniak, 1993), whichimplementstheexpectationmaximization (EM) method (Dempster et al., 1977). Starting with an initial function pG that probabilistically extends G, a so-called growth transformation is computed, defined as</Paragraph> <Paragraph position="16"> Following (Baum, 1972), one can show that pG(C) [?] pG(C). 
We now discuss a third estimation method for PCFGs, which was proposed in (Corazza and Satta, 2006). This method can be viewed as a generalization of the supervised MLE method to probability distributions defined over infinite sets of complete derivations. Let $D$ be an infinite set of complete derivations using nonterminal symbols in $N$, start symbol $S \in N$ and terminal symbols in $\Sigma$. We assume that the set of rules observed in $D$ is drawn from some finite set $R$. Let $p_D$ be a probability distribution defined over $D$, that is, a function from set $D$ to interval $[0, 1]$ such that $\sum_{d \in D} p_D(d) = 1$.

Consider the CFG $G = (N, \Sigma, R, S)$. Note that $D \subseteq D(G)$. We wish to extend $G$ to some PCFG $\mathcal{G} = (G, p_G)$ in such a way that $p_D$ is approximated by $p_G$ (viewed as a distribution over complete derivations) as well as possible according to some criterion. One possible criterion is minimization of the cross-entropy between $p_D$ and $p_G$, defined as the expectation, under distribution $p_D$, of the information of the derivations in $D$ computed under distribution $p_G$, that is

$$H(p_D \,\|\, p_G) \;=\; E_{p_D} \log \frac{1}{p_G} \;=\; -\sum_{d \in D} p_D(d) \cdot \log p_G(d). \tag{11}$$

We thus want to assign to the parameters $p_G(A \to \alpha)$, $A \to \alpha \in R$, the values that minimize (11), subject to the conditions $\sum_{\alpha} p_G(A \to \alpha) = 1$ for each $A \in N$. Note that minimization of the cross-entropy above is equivalent to minimization of the Kullback-Leibler distance between $p_D$ and $p_G$. Also note that the likelihood of an infinite set of derivations would always be zero and therefore cannot be considered here.

The solution to the above minimization problem provides the estimator

$$p_G(A \to \alpha) \;=\; \frac{E_{p_D}\, f(A \to \alpha, d)}{E_{p_D}\, f(A, d)} \;=\; \frac{\sum_{d \in D} p_D(d) \cdot f(A \to \alpha, d)}{\sum_{d \in D} p_D(d) \cdot f(A, d)}. \tag{12}$$

A proof of this result appears in (Corazza and Satta, 2006), and is briefly summarized in Appendix A, in order to make this paper self-contained. We call the above estimator the cross-entropy minimization method.

The cross-entropy minimization method can be viewed as a generalization of the supervised MLE method in (7), as shown in what follows. Let $\mathcal{D}$ and $D$ be defined as for the supervised MLE method. We define a distribution over $D$ as

$$p_D(d) \;=\; \frac{f(d, \mathcal{D})}{|\mathcal{D}|}, \tag{13}$$

where $|\mathcal{D}| = \sum_{d \in D} f(d, \mathcal{D})$ is the size of the sample. Distribution $p_D$ is usually called the empirical distribution associated with $\mathcal{D}$. Applying the estimator in (12) to $p_D$, the factors $\frac{1}{|\mathcal{D}|}$ cancel and we obtain

$$p_G(A \to \alpha) \;=\; \frac{\sum_{d \in D} f(d, \mathcal{D}) \cdot f(A \to \alpha, d)}{\sum_{d \in D} f(d, \mathcal{D}) \cdot f(A, d)} \;=\; \frac{f(A \to \alpha, \mathcal{D})}{f(A, \mathcal{D})}. \tag{14}$$

This is the supervised MLE estimator in (7). This reminds us of the well-known fact that maximizing the likelihood of a (finite) sample through a PCFG distribution amounts to minimizing the cross-entropy between the empirical distribution of the sample and the PCFG distribution itself.
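A sketch of the cross-entropy minimization estimator (12), again under the hypothetical rule encoding used above. An infinite support $D$ cannot be enumerated, so the function accepts a finite (or truncated) distribution over derivations; applying it to the empirical distribution (13) of a finite sample reproduces the supervised MLE estimator, as established in (14).

```python
from collections import Counter

def cross_entropy_min(p_D):
    """Estimator (12): expected rule frequencies under p_D, normalized
    per left-hand side.

    `p_D` maps each derivation (a tuple of (lhs, rhs) rules) to its
    probability; an infinite support must be truncated to a finite one
    to make the sums computable.
    """
    num, den = Counter(), Counter()
    for derivation, prob in p_D.items():
        for lhs, rhs in derivation:
            num[(lhs, rhs)] += prob          # E_{p_D} f(A -> alpha, d)
            den[lhs] += prob                 # E_{p_D} f(A, d)
    return {rule: v / den[rule[0]] for rule, v in num.items()}

# With p_D the empirical distribution (13) of the sample {d1, d1, d2},
# the estimator collapses to the supervised MLE estimator (7) / (14).
d1 = (("S", ("a",)),)
d2 = (("S", ("S", "S")), ("S", ("a",)), ("S", ("a",)))
print(cross_entropy_min({d1: 2 / 3, d2: 1 / 3}))
# ~ {('S', ('a',)): 0.8, ('S', ('S', 'S')): 0.2}, same as eq. (7)
```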
4 Renormalization

In this section we recall a renormalization technique for PCFGs that was used before in (Abney et al., 1999), (Chi, 1999) and (Nederhof and Satta, 2003) for different purposes, and that is exploited in the next section to prove our main results. In the remainder of this section, we assume a fixed, not necessarily proper PCFG $\mathcal{G} = (G, p_G)$, with $G = (N, \Sigma, R, S)$.

For each $A \in N$, we write $p_G(A \Rightarrow \Sigma^*)$ for the probability that $A$ derives some terminal string, that is, the sum of $p_G(d)$ over all complete derivations $d$ with $A \Rightarrow^d w$, $w \in \Sigma^*$. We define the renormalization of $\mathcal{G}$ as the PCFG $\mathcal{R}(\mathcal{G}) = (G, p_R)$ with, for each rule $A \to \alpha \in R$ written as $\alpha = u_0 A_1 u_1 A_2 \cdots u_{q-1} A_q u_q$, $q \ge 0$, $A_i \in N$, $u_j \in \Sigma^*$,

$$p_R(A \to \alpha) \;=\; p_G(A \to \alpha) \cdot \frac{\prod_{i=1}^{q} p_G(A_i \Rightarrow \Sigma^*)}{p_G(A \Rightarrow \Sigma^*)}. \tag{15}$$

It is not difficult to see that $\mathcal{R}(\mathcal{G})$ is a proper PCFG: by decomposing each complete derivation from $A$ according to its first rule, we have $p_G(A \Rightarrow \Sigma^*) = \sum_{A \to \alpha} p_G(A \to \alpha) \cdot \prod_{i=1}^{q} p_G(A_i \Rightarrow \Sigma^*)$, so the values $p_R(A \to \alpha)$ sum to one for each $A$.

We now show an important property of $\mathcal{R}(\mathcal{G})$, discussed before in (Nederhof and Satta, 2003) in the context of so-called weighted context-free grammars.

Lemma 1. For each derivation $d$ with $A \Rightarrow^d w$, $A \in N$ and $w \in \Sigma^*$, we have

$$p_R(d) \;=\; \frac{p_G(d)}{p_G(A \Rightarrow \Sigma^*)}. \tag{16}$$

Proof. The proof is by induction on the length of $d$, written $|d|$. If $|d| = 1$ we must have $d = (A \to w)$, and thus $p_R(d) = p_R(A \to w)$. In this case, the statement of the lemma directly follows from (15), since the product over nonterminal occurrences is empty.

Assume now $|d| > 1$ and let $\pi = (A \to \alpha)$ be the first rule used in $d$. Note that there must be at least one nonterminal symbol in $\alpha$. We can then write $\alpha$ as $u_0 A_1 u_1 A_2 \cdots u_{q-1} A_q u_q$, for $q \ge 1$, $A_i \in N$, $1 \le i \le q$, and $u_j \in \Sigma^*$, $0 \le j \le q$. In words, $A_1, \ldots, A_q$ are all of the occurrences of nonterminals in $\alpha$, as they appear from left to right. Consequently, we can write $d$ in the form $d = \pi \cdot d_1 \cdots d_q$ for some derivations $d_i$ with $A_i \Rightarrow^{d_i} w_i$, $1 \le i \le q$, so that $w = u_0 w_1 u_1 w_2 \cdots u_{q-1} w_q u_q$. Below we use the fact that $p_R(u_j \Rightarrow u_j) = p_G(u_j \Rightarrow u_j) = 1$ for each $j$ with $0 \le j \le q$. Using the definition of $p_R$ and the inductive hypothesis, we can write

$$\begin{aligned} p_R(d) \;&=\; p_R(A \to \alpha) \cdot \prod_{i=1}^{q} p_R(d_i) \\ &=\; p_G(A \to \alpha) \cdot \frac{\prod_{i=1}^{q} p_G(A_i \Rightarrow \Sigma^*)}{p_G(A \Rightarrow \Sigma^*)} \cdot \prod_{i=1}^{q} \frac{p_G(d_i)}{p_G(A_i \Rightarrow \Sigma^*)} \\ &=\; \frac{p_G(A \to \alpha) \cdot \prod_{i=1}^{q} p_G(d_i)}{p_G(A \Rightarrow \Sigma^*)} \;=\; \frac{p_G(d)}{p_G(A \Rightarrow \Sigma^*)}. \end{aligned}$$

As an easy corollary of Lemma 1, we have that $\mathcal{R}(\mathcal{G})$ is a consistent PCFG, as we can write

$$p_R(S \Rightarrow \Sigma^*) \;=\; \sum_{w \in \Sigma^*} \sum_{d \,:\, S \Rightarrow^d w} p_R(d) \;=\; \sum_{w \in \Sigma^*} \sum_{d \,:\, S \Rightarrow^d w} \frac{p_G(d)}{p_G(S \Rightarrow \Sigma^*)} \;=\; \frac{p_G(S \Rightarrow \Sigma^*)}{p_G(S \Rightarrow \Sigma^*)} \;=\; 1. \tag{17}$$
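To make (15) concrete, the sketch below approximates the quantities $p_G(A \Rightarrow \Sigma^*)$ by the standard least-fixed-point iteration for PCFG partition functions and then renormalizes the rule probabilities. The iteration scheme and the example grammar are our own illustrative choices; the paper's development only assumes the values $p_G(A \Rightarrow \Sigma^*)$ as given.

```python
from math import prod

def partition(p, nonterminals, iters=200):
    # Approximate p_G(A => Sigma*) for every nonterminal A by iterating
    # Z(A) = sum_{A -> alpha} p(A -> alpha) * prod of Z over the
    # nonterminal occurrences in alpha, starting from Z = 0, which
    # converges monotonically to the least fixed point.
    Z = {A: 0.0 for A in nonterminals}
    for _ in range(iters):
        Z = {A: sum(prob * prod(Z[X] for X in rhs if X in Z)
                    for (lhs, rhs), prob in p.items() if lhs == A)
             for A in nonterminals}
    return Z

def renormalize(p, nonterminals):
    # Renormalization (15); terminal symbols contribute a factor of 1,
    # matching the fact that p_G(u => u) = 1 for terminal strings u.
    Z = partition(p, nonterminals)
    return {(A, rhs): prob * prod(Z[X] for X in rhs if X in Z) / Z[A]
            for (A, rhs), prob in p.items()}

# A proper but inconsistent PCFG: S -> S S (0.6) | a (0.4) loses
# probability mass to non-terminating derivations; its partition
# function is p_G(S => Sigma*) = 2/3.
p = {("S", ("S", "S")): 0.6, ("S", ("a",)): 0.4}
p_R = renormalize(p, {"S"})
print(p_R)                # ~ {S -> S S: 0.4, S -> a: 0.6}
print(sum(p_R.values()))  # ~ 1.0: the rules for S again sum to one
```

The example starts from a proper but inconsistent grammar and, as Lemma 1 and (17) predict, ends with a proper and consistent one.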