File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/e06-2025_metho.xml
Size: 12,062 bytes
Last Modified: 2025-10-06 14:10:06
<?xml version="1.0" standalone="yes"?> <Paper uid="E06-2025"> <Title>Theoretical Evaluation of Estimation Methods for Data-Oriented Parsing</Title> <Section position="3" start_page="0" end_page="185" type="metho"> <SectionTitle> 2 Estimation Methods </SectionTitle> <Paragraph position="0"> The DOP model and STSG formalism are described in detail elsewhere, for instance in (Bod, 1998). The main difference from PCFGs is that multiple derivations, using elementary trees of various sizes, can yield the same parse tree.</Paragraph> <Paragraph position="1"> The probability of a parse $p$ is therefore given by $P(p) = \sum_{d : \hat{d} = p} P(d)$, where $\hat{d}$ is the tree derived by derivation $d$, $P(d) = \prod_{t \in d} w(t)$, and $w(t)$ gives the weight of each elementary tree $t$ combined in the derivation $d$ (here treated as a multiset).</Paragraph> <Section position="1" start_page="0" end_page="184" type="sub_section"> <SectionTitle> 2.1 DOP1 </SectionTitle> <Paragraph position="0"> In Bod's original DOP implementation (Bod, 1993; Bod, 1998), henceforth DOP1, the weight of an elementary tree t is defined as its relative frequency (relative to other subtrees with the same root label) in the tree bank. That is, the weight $w_i$ of subtree $t_i$ is $w_i = f_i / \sum_{j : r(t_j) = r(t_i)} f_j$,</Paragraph> <Paragraph position="2"> where $f_i = f(t_i)$ gives the frequency of subtree $t_i$ in a corpus, and $r(t_i)$ is the root label of $t_i$.</Paragraph> <Paragraph position="3"> In his critique of this method, (Johnson, 2002) considers a situation where there is an STSG G (the target grammar) with a specific set of subtrees (t1 ... tN) and specific values of the weights (w1 ... wN). He evaluates an estimation procedure that produces a grammar G' (the estimated grammar) by looking at the difference between the weights of G and the expected weights of G'.</Paragraph> <Paragraph position="4"> Johnson's test for consistency is thus based on comparing the weight-distributions of the target grammar and the estimated grammar (see footnote 2). 
I will therefore refer to this test as the &quot;weight-distribution test&quot;. (Johnson, 2002) looks at an example grammar $G \in \mathrm{STSG}$ with the subtrees shown in figure 1. Johnson considers the case where the weights of all trees of the target grammar G are 0, except for w7, which is necessarily 1, and w4 and w6, which are w4 = p and w6 = 1 - p. He finds that the expected values of the weights w4 and w6 of the estimated grammar G' are:</Paragraph> <Paragraph position="6"> which differ from their target values for every value of p with 0 &lt; p &lt; 1. This analysis thus shows that DOP1 is unable to recover the true weights of the given STSG, and hence that the estimator is inconsistent with respect to the class of STSGs.</Paragraph> <Paragraph position="7"> Footnote 2: More precisely, the test is based on evaluating the estimator's behavior for any weight-distribution possible in the STSG model. (Prescher et al., 2003) give a more formal treatment of bias and consistency in the context of DOP.</Paragraph> <Paragraph position="8"> Although usually cited as showing the inadequacy of DOP1, Johnson's example is in fact not suitable for distinguishing DOP1 from alternative methods, because no possible estimation procedure can recover the true weights in the case considered. In the example there are only two complete trees that can be observed in the training data, corresponding to the trees t1 and t5. 
It is easy to see that when generating examples with the grammar in figure 1, the relative frequencies (see footnote 3) f1 ... f4 of the subtrees t1 ... t4 must all be the same, and equal to the frequency of the complete tree t1, which can be composed in the following ways from the subtrees in the original grammar:</Paragraph> <Paragraph position="10"> It follows that the expected frequency of each of these subtrees is:</Paragraph> <Paragraph position="12"> Similarly, the other frequencies are given by:</Paragraph> <Paragraph position="14"> From these equations it is immediately clear that, regardless of the amount of training data, the problem is simply underdetermined. The values of 6 weights w1 ... w6 (w7 = 1) given only 2 frequencies f1 and f5 (and the constraint that $\sum$</Paragraph> <Paragraph position="16"> possible estimation method will be able to reliably recover the true weights.</Paragraph> <Paragraph position="17"> The relevant test is whether, for all possible STSGs and in the limit of infinite data, the expected relative frequencies of trees under the estimated grammar equal the observed relative frequencies. I will refer to this test as the &quot;frequency-distribution test&quot;. As it turns out, the DOP1 method also fails this more lenient test. The easiest way to show this, again using figure 1, is as follows. The weights $w'_1 \ldots w'_7$ of grammar G' will, by definition, be set to the relative frequencies of the corresponding subtrees:</Paragraph> <Paragraph position="19"> to the size of the corpus.</Paragraph> <Paragraph position="20"> The grammar G' will thus produce the complete trees t1 and t5 with expected frequencies:</Paragraph> <Paragraph position="22"> Now consider the two possible complete trees t1 and t5, and the ratio of their frequencies f1/f5. 
In the estimated grammar G' this ratio becomes:</Paragraph> <Paragraph position="24"> That is, in the limit of infinite data, the estimation procedure not only fails, understandably, to find the target grammar among the many grammars that could have produced the observed frequencies, but in fact chooses a grammar that could never have produced these observed frequencies at all. This example shows that the DOP1 method is biased and inconsistent for the STSG class in the frequency-distribution test (see footnote 4).</Paragraph> </Section> <Section position="2" start_page="184" end_page="185" type="sub_section"> <SectionTitle> 2.2 Correction-factor approaches </SectionTitle> <Paragraph position="0"> Based on similar observations, (Bonnema et al., 1999; Bod, 2003) propose alternative estimation methods, which involve a correction factor to move probability mass from larger subtrees to smaller ones. For instance, Bonnema et al. replace</Paragraph> <Paragraph position="2"> where N(ti) gives the number of internal nodes in ti (such that $2^{-N(t_i)}$ is inversely proportional to the number of possible derivations of ti). Similarly, (Bod, 2003) changes the way frequencies fi are counted, with a similar effect. This approach solves the specific problem shown in equation (11). However, the following example shows that the correction-factor approaches cannot solve the more general problem.</Paragraph> <Paragraph position="3"> Footnote 4: Note that there are settings of the weights w1 ... w7 that generate a frequency-distribution that could also have been generated with a PCFG. The example given applies to such distributions as well, and therefore also shows the inconsistency of the DOP1 method for PCFG distributions. Consider the STSG in figure 2. The expected frequencies f1 ... f4 are here given by:</Paragraph> <Paragraph position="5"> Frequencies f5 ... f11 are again simple combinations of the frequencies f1 ... f4. 
Observations of these frequencies therefore do not add any extra information, and the problem of finding the weights of the target grammar is in general again underdetermined. But consider the situation where f3 = f4 = 0, f1 > 0 and f2 > 0.</Paragraph> <Paragraph position="6"> This constrains the possible solutions enormously.</Paragraph> <Paragraph position="7"> If we solve the following equations for w3 ... w11 with the constraint that probabilities with the same root label add up to 1 (i.e. $\sum_{i=1}^{9} w_i = 1$,</Paragraph> <Paragraph position="9"> we find, in addition to the obvious w3 = w4 = 0, the following solutions: w10 = w6 = w7 = w9 =</Paragraph> <Paragraph position="11"> serve no occurrences of trees t3 and t4 in the training sample, we know that at least one subtree in each derivation of these strings must have weight zero. However, any estimation method that uses the (relative) frequencies of subtrees and a (non-zero) correction factor based on the size of the subtrees will assign non-zero values to all weights w5 ... w11 if f1 > 0 and f2 > 0, as we assumed. In other words, these weight estimation methods for STSGs are also biased and inconsistent in the frequency-distribution test.</Paragraph> </Section> <Section position="3" start_page="185" end_page="185" type="sub_section"> <SectionTitle> 2.3 Shortest derivation estimators </SectionTitle> <Paragraph position="0"> Because the STSG formalism allows elementary trees of arbitrary size, every parse tree in a tree bank could in principle be incorporated in an STSG. That is, we can define a trivial estimator with the following weights:</Paragraph> <Paragraph position="2"> Such an estimator is not particularly interesting, because it does not generalize beyond the training data. It is worth noting, however, that this estimator is unbiased and consistent in the frequency-distribution test. 
(Prescher et al., 2003) prove that any unbiased estimator that uses the &quot;all subtrees&quot; representation has the same property, and conclude that lack of bias is not a desirable property.</Paragraph> <Paragraph position="3"> (Zollmann and Sima'an, 2005) propose an estimator based on held-out estimation. The training corpus is split into an estimation corpus EC and a held-out corpus HC. HC is parsed by searching for the shortest derivation of each sentence, using only fragments from EC. The elementary trees of the estimated STSG are assigned weights according to their usage frequencies u1, ..., uN in these shortest derivations:</Paragraph> <Paragraph position="5"> This approach solves the problem with bias described above, while still allowing for consistency, as Zollmann &amp; Sima'an prove. However, their proof only concerns consistency in the frequency-distribution test. As the corpus EC grows to be infinitely large, every parse tree in HC will also be found in EC, and the shortest derivation will therefore in the limit involve only a single elementary tree: the parse tree itself. Target STSGs with non-zero weights on smaller elementary trees will thus not be identified correctly, even with an infinitely large training set. In other words, the Zollmann &amp; Sima'an method, and other methods that converge to the &quot;complete parse tree&quot; solution, such as LS-DOP (Bod, 2003) and BackOff-DOP (Sima'an and Buratto, 2003), are inconsistent in the weight-distribution test.</Paragraph> </Section> </Section> <Section position="4" start_page="185" end_page="185" type="metho"> <SectionTitle> 3 Discussion &amp; Conclusions </SectionTitle> <Paragraph position="0"> A desideratum for parameter estimation methods is that they converge to the correct parameters given infinitely many data; that is, we would like an estimator to be consistent. 
The STSG formalism, however, allows for many different derivations of the same parse tree, and for many different grammars to generate the same frequency-distribution. Consistency in the weight-distribution test is therefore too stringent a criterion. We have shown that DOP1 and methods based on correction factors also fail the weaker frequency-distribution test.</Paragraph> <Paragraph position="1"> However, the only current estimation methods that are consistent in the frequency-distribution test have the linguistically undesirable property of converging to a distribution with all probability mass in complete parse trees. Although these methods fail the weight-distribution test for the whole class of STSGs, we argued earlier that this test is not the appropriate test either. Both the estimation methods for STSGs and the criteria for evaluating them thus require thorough rethinking. In forthcoming work we therefore study yet another estimator, and the linguistically motivated evaluation criterion of convergence to a maximally general STSG consistent with the training data (see footnote 5).</Paragraph> </Section></Paper>