<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1043">
  <Title>Cross-Entropy and Estimation of Probabilistic Context-Free Grammars</Title>
  <Section position="4" start_page="335" end_page="337" type="metho">
    <SectionTitle>
3 Estimation based on cross-entropy
</SectionTitle>
    <Paragraph position="0"> Let T be an infinite set of (finite) trees with internal nodes labeled by symbols in N, root nodes labeled by S [?] N and leaf nodes labeled by symbols  in S. We assume that the set of rules that are observed in the trees inT is drawn from some finite set R. Let pT be a probability distribution defined over T, that is, a function from T to set [0,1] such thatsummationtext t[?]T pT(t) = 1.</Paragraph>
    <Paragraph position="1"> The skeleton CFG underlying T is defined as G = (N,S,R,S). Note that we have T [?] T(G) and,inthegeneralcase,theremightbetreesinT(G) that do not appear in T. We wish anyway to approximate distribution pT the best we can, by turning G into some proper PCFG G = (G,pG) and setting parameters pG(A - a) appropriately, for each</Paragraph>
    <Paragraph position="3"> One possible criterion is to choose pG in such a way that the cross-entropy between pT and pG is minimized, where we now view pG as a probability distribution defined over T(G). The cross-entropy between pT and pG is defined as the expectation under distributionpT of the information, computed under distribution pG, of the trees in T(G)</Paragraph>
    <Paragraph position="5"> Since G should be proper, the minimization of (5) is subject to the constraintssummationtexta pG(A - a) = 1, for each A [?] N.</Paragraph>
    <Paragraph position="6"> To solve the minimization problem above, we use Lagrange multipliers lA for each A [?] N and define the form</Paragraph>
    <Paragraph position="8"> We now view [?] as a function of all the lA and the pG(A - a), and consider all the partial derivatives of [?]. For each A [?] N we have</Paragraph>
    <Paragraph position="10"> We now need to solve a system of|N|+|R|equations obtained by setting to zero all of the above partial derivatives. From each equation [?][?][?]pG(A-a) = 0</Paragraph>
    <Paragraph position="12"> We sum over all strings a such that (A - a) [?] R</Paragraph>
    <Paragraph position="14"> From each equation [?][?][?]lA = 0 we obtainsummationtext a pG(A - a) = 1 for each A [?] N (our original constraints). Combining with (8) we obtain</Paragraph>
    <Paragraph position="16"> The equations in (10) define the desired estimator for our PCFG, assigning to each rule A - a a probability specified as the ratio between the expected number of A - a and the expected number of A, under the distribution pT. We remark here that the minimization of the cross-entropy above is equivalent to the minimization of the Kullback-Leibler distance between pT and pG, viewed as tree distributions. Also, note that the likelihood of an infinite set of derivations would always be zero and therefore cannot be considered here.</Paragraph>
    <Paragraph position="17"> To be used in the next section, we now show that the PCFG G obtained as above is consistent. The line of our argument below follows a proof provided in (Chi and Geman, 1998) for the maximum likelihood estimator based on finite tree distributions. Without loss of generality, we assume that in G the start symbol S is never used in the right-hand side of a rule.</Paragraph>
    <Paragraph position="18"> For each A [?] N, let qA be the probability that a derivation in G rooted in A fails to terminate. We can then write</Paragraph>
    <Paragraph position="20"> The inequality follows from the fact that the events considered in the right-hand side of (11) are not mutually exclusive. Combining (10) and (11) we obtain</Paragraph>
    <Paragraph position="22"> where fc(B,t) indicates the number of times a node labeled by nonterminal B appears in the derivation tree t as a child of some other node.</Paragraph>
    <Paragraph position="23"> From our assumptions on the start symbol S, we have that S only appears at the root of the trees in T(G). Then it is easy to see that, for every A negationslash= S, we have EpTfc(A,t) = EpTf(A,t), while</Paragraph>
    <Paragraph position="25"> from which we conclude qS = 0, thus implying the consistency of G.</Paragraph>
  </Section>
  <Section position="5" start_page="337" end_page="339" type="metho">
    <SectionTitle>
4 Cross-entropy and derivational entropy
</SectionTitle>
    <Paragraph position="0"> In this section we present the main result of the paper. We show that, when G = (G,pG) is estimated by minimizing the cross-entropy in (5), then such cross-entropy takes the same value as the derivational entropy of G, defined in (3).</Paragraph>
    <Paragraph position="1"> In (Nederhof and Satta, 2004) relations are derived for the exact computation ofHd(pG). For later use, we report these relations below, under the assumption that G is consistent (see Section 3). We</Paragraph>
    <Paragraph position="3"> in(4). ForeachA [?] N, quantityoutG(A)isthesum of the probabilities of all trees generated by G, having root labeled by S and having a yield composed of terminal symbols with an unexpanded occurrence of nonterminal A. Again, we assume that symbol S does not appear in any of the right-hand sides of the rules in R. This means that S only appears at the root of the trees in T(G). Under this condition, quantities outG(A) can be exactly computed by solving the following system of linear equations</Paragraph>
    <Paragraph position="5"> We can now prove the equality</Paragraph>
    <Paragraph position="7"> where G is the PCFG estimated by minimizing the cross-entropy in (5), as described in Section 3.</Paragraph>
    <Paragraph position="8"> We start from the definition of cross-entropy</Paragraph>
    <Paragraph position="10"> Comparing (19) with (13) we see that, in order to prove the equality in (16), we need to show relations</Paragraph>
    <Paragraph position="12"> for every A [?] N. We have already observed in Section 3 that, under our assumption on the start symbol S, we have</Paragraph>
    <Paragraph position="14"> We now observe that, for any A [?] N with A negationslash= S and any t [?] T(G), we have</Paragraph>
    <Paragraph position="16"> Once more we use relation (18), which replaced in (23) provides</Paragraph>
    <Paragraph position="18"> Notice that the linear system in (14) and (15) and the linear system in (21) and (24) are the same. Thus we conclude that quantities EpT f(A,t) and outG(A) are the same for each A [?] N. This completes our proof of the equality in (16). Some examples will be discussed in Section 6.</Paragraph>
    <Paragraph position="19"> Besides its theoretical significance, the equality in (16) can also be exploited in the computation of the cross-entropy in practical applications. In fact, cross-entropy is used as a measure of tightness in comparing different models. In case of estimation from an infinite distribution pT, the definition of the cross-entropy H(pT ||pG) contains an infinite summation, which is problematic for the computation of such quantity. In standard practice, this problem is overcome by generating a finite sampleT(n) of large sizen, throughthedistributionpT, andthencomputing the approximation (Manning and Sch&amp;quot;utze, 1999)  where f(t,T(n)) indicates the multiplicity, that is, the number of occurrences, oftinT(n). However, in practical applications n must be very large in order to have a small error. Based on the results in this section, we can instead compute the exact value of H(pT ||pG) by computing the derivational entropy Hd(pG), using relation (13) and solving the linear system in (14) and (15), which takes cubic time in the number of nonterminals of the grammar.</Paragraph>
  </Section>
  <Section position="6" start_page="339" end_page="339" type="metho">
    <SectionTitle>
5 Estimation based on likelihood
</SectionTitle>
    <Paragraph position="0"> In natural language processing applications, the estimation of a PCFG is usually carried out on the basis of a finite sample of trees, called tree bank. The so-called maximum likelihood estimation (MLE) method is exploited, which maximizes the likelihood of the observed data. In this section we show that the MLE method is a special case of the estimation method presented in Section 3, and that the results of Section 4 also hold for the MLE method.</Paragraph>
    <Paragraph position="1"> Let T be a tree sample, and let T be the underlying set of trees. For t [?] T, we let f(t,T ) be the multiplicity of t in T . We define</Paragraph>
    <Paragraph position="3"> and let f(A,T ) = summationtexta f(A - a,T ). We can induce from T a probability distribution pT , defined over T, by letting for each t [?] T</Paragraph>
    <Paragraph position="5"> the empirical distribution of T .</Paragraph>
    <Paragraph position="6"> Assume that the trees in T have internal nodes labeled by symbols in N, root nodes labeled by S and leaf nodes labeled by symbols in S. Let also R be the finite set of rules that are observed in T . We define the skeleton CFG underlying T as G = (N,S,R,S). In the MLE method we probabilistically extend the skeleton CFG G by means of a function pG that maximizes the likelihood of T , defined as</Paragraph>
    <Paragraph position="8"> subject to the usual properness conditions on pG.</Paragraph>
    <Paragraph position="9"> Such maximization provides the estimator (see for instance (Chi and Geman, 1998))</Paragraph>
    <Paragraph position="11"> Letusconsidertheestimatorin(10). Ifwereplace distribution pT with our empirical distribution pT , we derive</Paragraph>
    <Paragraph position="13"> This is precisely the estimator in (28).</Paragraph>
    <Paragraph position="14"> From relation (29) we conclude that the MLE method can be seen as a special case of the general estimatorinSection3, withtheinputdistributiondefined over a finite set of trees. We can also derive the well-known fact that, in the finite case, the maximization of the likelihood pG(T ) corresponds to the minimization of the cross-entropy H(pT ||pG).</Paragraph>
    <Paragraph position="15"> Let nowG = (G,pG) be a PCFG trained onT using the MLE method. Again from relation (29) and Section 3 we have that G is consistent. This result has been firstly shown in (Chaudhuri et al., 1983) and later, with a different proof technique, in (Chi and Geman, 1998). We can then transfer the results of Section 4 to the supervised MLE method, showing the equality</Paragraph>
    <Paragraph position="17"> This result was not previously known in the literature on statistical parsing of natural language. Some examples will be discussed in Section 6.</Paragraph>
  </Section>
  <Section position="7" start_page="339" end_page="341" type="metho">
    <SectionTitle>
6 Some examples
</SectionTitle>
    <Paragraph position="0"> In this section we discuss a simple example with the aim of clarifying the theoretical results in the previous sections. For a real number q with 0 &lt; q &lt; 1,  entropies for three different corpora.</Paragraph>
    <Paragraph position="1"> consider the CFG G defined by the two rules S aS and S - a, and letGq = (G,pG,q) be the probabilistic extension of G with pG,q(S - aS) = q and pG,q(S - a) = 1 [?]q. This grammar is unambiguous and consistent, and each tree t generated by G has probability pG,q(t) = qi * (1 [?]q), where i [?] 0 is the number of occurrences of rule S - aS in t.</Paragraph>
    <Paragraph position="2"> We use below the following well-known relations</Paragraph>
    <Paragraph position="4"> The derivational entropy of Gq can be directly computed from its definition as</Paragraph>
    <Paragraph position="6"> See Figure 1 for a plot of Hd(pG,q) as a function of q.</Paragraph>
    <Paragraph position="7"> If a tree bank is given, composed of occurrences of trees generated by G, the value of q can be estimated by applying the MLE or, equivalently, by minimizingthecross-entropy. Weconsiderhereseveral tree banks, to exemplify the behaviour of the cross-entropy depending on the structure of the sample of trees. The first tree bank T contains a single tree t with a single occurrence of rule S - aS and a single occurrence of rule S - a. We then have pT (t) = 1 and pG,q(t) = q * (1 [?] q). The cross-entropy between distributions pT and pG,q is then H(pT ,pG,q) = [?]logq *(1[?]q) = [?]logq [?]log(1[?]q). (34) The cross-entropy H(pT ,pG,q), viewed as a function of q, is a convex-[?] function and is plotted in Figure 1 (line indicated by Kd = 1, see below). We can obtain its minimum by finding a zero for the first</Paragraph>
    <Paragraph position="9"> which gives q = 0.5. Note from Figure 1 that the minimum of H(pT ,pG,q) crosses the line corresponding to the derivational entropy, as should be expected from the result in Section 4.</Paragraph>
    <Paragraph position="10"> More in general, for integers d &gt; 0 and K &gt; 0, consider a tree sample Td,K consisting of d trees ti, 1 [?] i [?] d. Each ti contains ki [?] 0 occurrences of rule S - aS and one occurrence of rule S - a.</Paragraph>
    <Paragraph position="11"> Thus we have pTd,K(ti) = 1d and pG,q(ti) = qki * (1[?]q). We letsummationtextdi=1 ki = K. The cross-entropy is</Paragraph>
    <Paragraph position="13"> In Figure 1 we plot H(pTd,K,pG,q) in the case Kd = 0.5 and in the case Kd = 1.5. Again, we have that these curves intersect with the curve corresponding to the derivational entropy Hd(pG,q) at the points were they take their minimum values.</Paragraph>
  </Section>
class="xml-element"></Paper>