<?xml version="1.0" standalone="yes"?> <Paper uid="J98-2002"> <Title>Generalizing Case Frames Using a Thesaurus and the MDL Principle</Title>
<Section position="5" start_page="222" end_page="222" type="metho"> <SectionTitle> \[BIRD, INSECT\] </SectionTitle>
<Paragraph position="0"> (Figure labels listing the tree cuts under comparison: \[BIRD, INSECT\], \[BIRD, bug, bee, insect\], \[swallow, crow, eagle, bird, INSECT\], \[swallow, crow, eagle, bird, bug, bee, insect\].) 1983, 1984, 1986, 1989), which has various desirable properties, as will be described later.3 MDL is a principle of data compression and statistical estimation from information theory, which states that the best probability model for given data is the one that requires the least code length in bits for the encoding of the model itself and of the given data observed through it.4 The former is the model description length and the latter the data description length.</Paragraph>
<Paragraph position="1"> In our current problem, it tends to be the case, in general, that a model nearer the root of the thesaurus tree, such as that in Figure 6, is simpler (in terms of the number of parameters) but tends to have a poorer fit to the data. In contrast, a model nearer the leaves of the thesaurus tree, such as that in Figure 4, is more complex but tends to have a better fit to the data. Table 2 shows the number of free parameters and the KL distance from the empirical distribution of the data (namely, the word-based distribution estimated by MLE) shown in Figure 2 for each of the five tree cut models.5 In the table, one can see that there is a trade-off between the simplicity of a model and its goodness of fit to the data.</Paragraph>
<Paragraph position="2"> In the MDL framework, the model description length is an indicator of model complexity, while the data description length indicates goodness of fit to the data. The MDL principle stipulates that the model that minimizes the sum total of the description lengths should be the best model (both for data compression and statistical estimation). In the remainder of this section, we will describe how we apply MDL to our current problem. We will then discuss the rationale behind using MDL in our present context.</Paragraph>
<Paragraph position="3"> 3 Estimation strategies related to MDL have been independently proposed and studied by various authors (Solomonoff 1964; Wallace and Boulton 1968; Schwarz 1978; Wallace and Freeman 1992). 4 We refer the interested reader to Quinlan and Rivest (1989) for an introduction to the MDL principle. 5 The KL distance (also known as KL-divergence or relative entropy), which is widely used in information theory and statistics, is a measure of distance between two distributions (e.g., Cover and Thomas 1991). It is always nonnegative and is zero if and only if the two distributions are identical, but it is asymmetric and hence not a metric (the usual notion of distance).</Paragraph> </Section>
<Section position="6" start_page="222" end_page="229" type="metho">
<Section position="1" start_page="223" end_page="224" type="sub_section"> <SectionTitle> 3.1 Calculating Description Length </SectionTitle>
<Paragraph position="0"> We first show how the description length for a model is calculated.
We use S to denote a sample (or set of data), which is a multiset of examples, each of which is an occurrence of a noun at a given slot r of a given verb v (i.e., duplication is allowed).</Paragraph>
<Paragraph position="1"> We let |S| denote the size of S as a multiset, and n ∈ S indicate the inclusion of n in S as a multiset. For example, the column labeled slot_value in Table 1 represents a sample S for the subject slot of fly, and in this case |S| = 10.</Paragraph>
<Paragraph position="2"> Given a sample S and a tree cut Γ, we employ MLE to estimate the parameters of the corresponding tree cut model M̂ = (Γ, θ̂), where θ̂ denotes the estimated parameters.</Paragraph>
<Paragraph position="3"> The total description length L(M̂, S) of the tree cut model M̂ and the sample S observed through M̂ is computed as the sum of the model description length L(Γ), the parameter description length L(θ̂ | Γ), and the data description length L(S | Γ, θ̂):
L(M̂, S) = L(Γ) + L(θ̂ | Γ) + L(S | Γ, θ̂)   (7)
Note that we sometimes refer to L(Γ) + L(θ̂ | Γ) as the model description length.</Paragraph>
<Paragraph position="6"> The model description length L(Γ) is a subjective quantity, which depends on the coding scheme employed. Here, we choose to assign the same code length to each cut and let:
L(Γ) = log |G|   (8)
where G denotes the set of all cuts in the thesaurus tree T.6 This corresponds to assuming that each tree cut model is equally likely a priori, in the Bayesian interpretation of MDL. (See Section 3.4.) The parameter description length L(θ̂ | Γ) is calculated by:
L(θ̂ | Γ) = (k/2) × log |S|   (9)
where |S| denotes the sample size and k denotes the number of free parameters in the tree cut model, i.e., k equals the number of nodes in Γ minus one. It is known to be best to use this number of bits to describe probability parameters in order to minimize the expected total description length (Rissanen 1984, 1986). An intuitive explanation of this is that the standard deviation of the maximum-likelihood estimator of each parameter is of the order 1/√|S|, and hence describing each parameter using more than -log(1/√|S|) = (1/2) log |S| bits would be wasteful for the estimation accuracy possible with the given sample size.</Paragraph>
<Paragraph position="9"> Finally, the data description length L(S | Γ, θ̂) is calculated by:
L(S | Γ, θ̂) = - Σ_{n ∈ S} log P(n)   (10)
where for simplicity we write P(n) for P_M̂(n | v, r). Recall that P(n) is obtained by MLE, namely, by normalizing the frequencies:
P̂(C) = f(C) / |S|,   P̂(n) = P̂(C) / |C|   for each noun n in class C ∈ Γ
where f(C) denotes the total frequency of nouns in class C in the sample S, and Γ is a tree cut. We note that, in fact, the maximum-likelihood estimate is the one that minimizes the data description length L(S | Γ, θ̂).</Paragraph>
<Paragraph position="12"> 6 Here and throughout, log denotes the logarithm to the base 2. For reasons why Equation 8 holds, see, for example, Quinlan and Rivest (1989).</Paragraph>
<Paragraph position="13"> Table 3 Calculating the description length for the model of Figure 5 (classes C: BIRD, bug, bee, insect).</Paragraph>
<Paragraph position="18"> With description length defined in the above manner, we wish to select a model with the minimum description length and output it as the result of generalization. Since we assume here that every tree cut has an equal L(Γ), technically we need only calculate and compare L'(M̂, S) = L(θ̂ | Γ) + L(S | Γ, θ̂) as the description length. For simplicity, we will sometimes write just L'(Γ) for L'(M̂, S), where Γ is the tree cut of M̂, when M̂ and S are clear from context.</Paragraph>
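<Paragraph> The following minimal Python sketch (not part of the paper) illustrates how Equations 9 and 10 are computed for a given cut. The class memberships and frequency counts are hypothetical stand-ins for the data of Table 1, with |S| = 10; only the calculation itself follows the definitions above.

import math

def description_length(cut, class_members, freq, sample_size):
    # Parameter description length (Equation 9): (k/2) * log2 |S|, with k = |cut| - 1.
    k = len(cut) - 1
    param_dl = (k / 2.0) * math.log2(sample_size)
    # Data description length (Equation 10): -sum_{n in S} log2 P(n),
    # with P(n) = f(C) / (|C| * |S|) for the class C containing n (MLE, uniform within C).
    data_dl = 0.0
    for cls in cut:
        members = class_members[cls]
        f_c = sum(freq.get(n, 0) for n in members)
        if f_c > 0:
            p_n = f_c / (len(members) * sample_size)
            data_dl += -f_c * math.log2(p_n)
    return param_dl, data_dl, param_dl + data_dl

# Hypothetical classes and counts (|S| = 10).
classes = {"BIRD": ["swallow", "crow", "eagle", "bird"],
           "INSECT": ["bug", "bee", "insect"]}
freq = {"swallow": 4, "crow": 2, "eagle": 2, "bee": 2}

for cut in (["BIRD", "INSECT"], ["BIRD", "bug", "bee", "insect"]):
    members = {c: classes.get(c, [c]) for c in cut}   # a bare noun acts as a singleton class
    print(cut, description_length(cut, members, freq, 10))
</Paragraph>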
<Paragraph position="20"> The description lengths for the data in Figure 1 using various tree cut models of the thesaurus tree in Figure 3 are shown in Table 4. (Table 3 shows how the description length is calculated for the model of tree cut \[BIRD, bug, bee, insect\].) These figures indicate that the model in Figure 6 is the best model, according to MDL. Thus, given the data in Table 1 as input, the generalization result shown in Table 5 is obtained.</Paragraph>
<Paragraph position="21"> Table 4 Description lengths of the tree cut models (parameter description length, data description length, and total L').
\[BIRD, bug, bee, insect\]  4.98  23.22  28.20
\[swallow, crow, eagle, bird, INSECT\]  6.64  22.39  29.03
\[swallow, crow, eagle, bird, bug, bee, insect\]  9.97  19.22  29.19
Table 5 Generalization result.
verb  slot_name  slot_value  probability
fly   arg1       BIRD        0.8
fly   arg1       INSECT      0.2</Paragraph> </Section>
<Section position="2" start_page="224" end_page="227" type="sub_section"> <SectionTitle> 3.2 An Efficient Algorithm </SectionTitle>
<Paragraph position="0"> In generalizing values of a case frame slot using MDL, we could, in principle, calculate the description length of every possible tree cut model and output a model with the minimum description length as the generalization result, if computation time were of no concern. But since the number of cuts in a thesaurus tree is exponential in the size of the tree (for example, it is easy to verify that for a complete b-ary tree of depth d it is of the order O(2^(b^(d-1)))), it is impractical to do so. Nonetheless, we were able to devise a simple and efficient algorithm based on dynamic programming, which is guaranteed to find a model with the minimum description length. Our algorithm, which we call Find-MDL, recursively finds the optimal MDL model for each child subtree of a given tree, appends all the optimal models of these subtrees, and returns the appended models, unless collapsing all the lower-level optimal models into a model consisting of a single node (the root node of the given tree) reduces the total description length, in which case it does so. The details of the algorithm are given in Figure 7. Note that for simplicity we describe Find-MDL as outputting a tree cut, rather than a complete tree cut model.</Paragraph>
<Paragraph position="1"> Here we let t denote a thesaurus (sub)tree, and root(t) the root of the tree t. Initially t is set to the entire tree. Also input to the algorithm is the co-occurrence data.
algorithm Find-MDL(t) := cut
1.  if
2.    t is a leaf node
3.  then
4.    return(\[t\])
5.  else
6.    for each child tree ti of t: ci := Find-MDL(ti)
7.    c := append(ci)
8.    if
9.      L'(\[root(t)\]) < L'(c)
10.   then
11.     return(\[root(t)\])
12.   else
13.     return(c)
Figure 7 The algorithm: Find-MDL.</Paragraph>
<Paragraph position="2"> Note in the above algorithm that the parameter description length is calculated as ((k + 1)/2) × log |S|, where k + 1 is the number of nodes in the current cut, both when t is the entire tree and when it is a proper subtree. This contrasts with the fact that the number of free parameters is k for the former, while it is k + 1 for the latter. For the purpose of finding a tree cut with the minimum description length, however, this distinction can be ignored (see Appendix A).</Paragraph>
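<Paragraph> To make the recursion of Figure 7 concrete, here is a minimal, self-contained Python sketch (not the authors' implementation). The small thesaurus mirrors the paper's BIRD/INSECT running example, but the frequency counts are made up; find_mdl compares L'(\[root(t)\]) against the appended optimal cuts of the child subtrees, exactly as in lines 8-13 of the pseudocode.

import math

# A thesaurus (sub)tree is ("NAME", [children...]); a leaf is ("noun", []).

def leaves(tree):
    # All nouns (leaf names) dominated by this (sub)tree.
    name, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaves(child)]

def l_prime(cut, tree_by_name, freq, sample_size):
    # L'(cut) = (|cut|/2) * log2 |S| + data description length, restricted to
    # the nouns dominated by the classes in the cut.
    param_dl = (len(cut) / 2.0) * math.log2(sample_size)
    data_dl = 0.0
    for cls in cut:
        members = leaves(tree_by_name[cls])
        f_c = sum(freq.get(n, 0) for n in members)
        if f_c > 0:
            p_n = f_c / (len(members) * sample_size)
            data_dl += -f_c * math.log2(p_n)
    return param_dl + data_dl

def find_mdl(tree, tree_by_name, freq, sample_size):
    # Return the tree cut (list of node names) with minimum L' for this subtree.
    name, children = tree
    if not children:                      # lines 1-4 of Figure 7
        return [name]
    cut = []                              # lines 6-7: append the children's optimal cuts
    for child in children:
        cut += find_mdl(child, tree_by_name, freq, sample_size)
    root_cut = [name]                     # lines 8-13: collapse to the root if that is cheaper
    if l_prime(root_cut, tree_by_name, freq, sample_size) < l_prime(cut, tree_by_name, freq, sample_size):
        return root_cut
    return cut

def index(tree, table=None):
    # Map every node name to its subtree, so l_prime can find class members.
    table = {} if table is None else table
    table[tree[0]] = tree
    for child in tree[1]:
        index(child, table)
    return table

# Hypothetical thesaurus and counts (|S| = 10).
bird = ("BIRD", [("swallow", []), ("crow", []), ("eagle", []), ("bird", [])])
insect = ("INSECT", [("bug", []), ("bee", []), ("insect", [])])
animal = ("ANIMAL", [bird, insect])
freq = {"swallow": 4, "crow": 2, "eagle": 2, "bee": 2}

print(find_mdl(animal, index(animal), freq, 10))   # ['BIRD', 'INSECT'] for these made-up counts

Because both the data description length and the parameter description length of a subtree depend only on the cut chosen within that subtree, each recursive call can be solved independently, which is what makes this dynamic-programming scheme valid.</Paragraph>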
<Paragraph position="5"> Figure 8 An example application of Find-MDL.</Paragraph>
<Paragraph position="6"> Figure 8 illustrates how the algorithm works (on the co-occurrence data shown at the bottom): In the recursive application of Find-MDL on the subtree rooted at AIRPLANE, the if-clause on line 9 evaluates to true, since L'(\[AIRPLANE\]) = 32.27 < L'(\[jet, helicopter, airplane\]) = 32.72, and hence \[AIRPLANE\] is returned. Then in the call to Find-MDL on the subtree rooted at ARTIFACT, the same if-clause evaluates to false, since L'(\[VEHICLE, AIRPLANE\]) = 40.97 < L'(\[ARTIFACT\]) = 41.09, and hence \[VEHICLE, AIRPLANE\] is returned.</Paragraph>
<Paragraph position="7"> Concerning the above algorithm, we show that the following proposition holds: Proposition 1 The algorithm Find-MDL terminates in time O(N × |S|), where N denotes the number of leaf nodes in the input thesaurus tree T and |S| denotes the input sample size, and outputs a tree cut model of T with the minimum description length (with respect to the encoding scheme described in Section 3.1).</Paragraph>
<Paragraph position="8"> Here we will give an intuitive explanation of why the proposition holds, and give the formal proof in Appendix A. The MLE of each node (class) is obtained simply by dividing the frequency of nouns within that class by the total sample size. Thus, the parameter estimation for each subtree can be done independently of the estimation of the parameters outside the subtree. The data description length for a subtree thus depends solely on the tree cut within that subtree, and its calculation can be performed independently for each subtree. As for the parameter description length for a subtree, it depends only on the number of classes in the tree cut within that subtree, and hence can be computed independently as well. The formal proof proceeds by mathematical induction, which verifies that the optimal model in any (sub)tree is either the model consisting of the root of the tree or the model obtained by appending the optimal submodels for its child subtrees.7</Paragraph> </Section>
<Section position="3" start_page="227" end_page="228" type="sub_section"> <SectionTitle> 3.3 Estimation, Generalization, and MDL </SectionTitle>
<Paragraph position="0"> When a discrete model (a partition Γ of the set of nouns in our present context) is fixed, and the estimation problem involves only the estimation of probability parameters, the classic maximum-likelihood estimation (MLE) is known to be satisfactory. In particular, the estimation of a word-based model is one such problem, since the partition is fixed and the size of the partition equals the number of nouns. Furthermore, for a fixed discrete model, it is known that MLE coincides with MDL: Given data S = {x_i : i = 1, ..., m}, MLE estimates the parameter P̂ that maximizes the likelihood with respect to the data; that is:
P̂ = argmax_P Π_{i=1}^{m} P(x_i)   (13)
It is easy to see that P̂ also satisfies:
P̂ = argmin_P Σ_{i=1}^{m} (-log P(x_i))   (14)
This is nothing but the MDL estimate in this case, since Σ_{i=1}^{m} -log P(x_i) is the data description length.</Paragraph>
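<Paragraph> As a small numerical check of this equivalence (a hedged sketch, not from the paper): for a fixed two-class partition with hypothetical counts, the data description length -Σ_i log2 P(x_i) is minimized at the relative-frequency (MLE) estimate.

import math

counts = {"BIRD": 8, "INSECT": 2}   # hypothetical class frequencies, |S| = 10

def data_dl(p_bird):
    # Data description length when P(BIRD) = p_bird and P(INSECT) = 1 - p_bird.
    return -(counts["BIRD"] * math.log2(p_bird) + counts["INSECT"] * math.log2(1.0 - p_bird))

best = min((i / 100.0 for i in range(1, 100)), key=data_dl)
print(best, round(data_dl(best), 2))   # 0.8 7.22 -- the MLE f(BIRD)/|S| gives the shortest code
</Paragraph>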
<Paragraph position="5"> When the estimation problem involves model selection, i.e., the choice of a tree cut in the present context, MDL's behavior significantly deviates from that of MLE. This is because MDL insists on minimizing the sum total of the data description length and the model description length, while MLE is still equivalent to minimizing the data description length only. So, for our problem of estimating a tree cut model, MDL tends to select a model that is reasonably simple yet fits the data quite well, whereas the model selected by MLE will be a word-based model (or a tree cut model equivalent to the word-based model8), as it will always manage to fit the data.</Paragraph>
<Paragraph position="6"> In statistical terms, the superiority of MDL as an estimation method is related to the fact we noted earlier that even though MLE can provide the best fit to the given data, the estimation accuracy of the parameters is poor when it is applied to a sample of modest size, as there are too many parameters to estimate. MLE is likely to estimate most parameters to be zero, and thus suffers from the data sparseness problem. Note in Table 4 that MDL avoids this problem by taking into account the model complexity as well as the fit to the data.</Paragraph>
<Paragraph position="7"> MDL stipulates that the model with the minimum description length should be selected both for data compression and estimation. This intimate connection between estimation and data compression can also be thought of as that between estimation and generalization, since in order to compress information, generalization is necessary. In our current problem, this corresponds to the generalization of individual nouns present in case frame instances in the data as classes of nouns present in a given thesaurus. For example, given the thesaurus in Figure 3 and frequency data in Figure 1, we would like our system to judge that the class BIRD and the noun bee can fill the subject slot of the verb fly. The problem of deciding whether to stop generalizing at BIRD and bee, or to generalize further to ANIMAL, has been addressed by a number of authors (Webster and Marcus 1989; Velardi, Pazienza, and Fasolo 1991; Nomiyama 1992). Minimization of the total description length provides a disciplined criterion for doing this.</Paragraph>
<Paragraph position="8"> 7 The process of finding the MDL model tends to be computationally demanding and is often intractable. When the model class under consideration is restricted to tree structures, however, dynamic programming is often applicable and the MDL model can be efficiently found. For example, Rissanen (1995) has devised an algorithm for learning decision trees. 8 Consider, for example, the case when the co-occurrence data is given as ...</Paragraph>
<Paragraph position="11"> A remarkable fact about MDL is that theoretical findings have indeed verified that MDL, as an estimation strategy, is near optimal in terms of the rate of convergence of its estimated models to the true model as data size increases.
When the true model is included in the class of models considered, the models selected by MDL converge to the true model at the rate of O((k* × log |S|) / (2|S|)), where k* is the number of parameters in the true model and |S| the data size, which is near optimal (Barron and Cover 1991; Yamanishi 1992).</Paragraph>
<Paragraph position="12"> Thus, in the current problem, MDL provides (a) a way of smoothing probability parameters to solve the data sparseness problem, and at the same time, (b) a way of generalizing nouns in the data to noun classes of an appropriate level, both as a corollary to the near optimal estimation of the distribution of the given data.</Paragraph> </Section>
<Section position="4" start_page="228" end_page="229" type="sub_section"> <SectionTitle> 3.4 The Bayesian Interpretation of MDL and the Choice of Encoding Scheme </SectionTitle>
<Paragraph position="0"> There is a Bayesian interpretation of MDL: MDL is essentially equivalent to the "posterior mode" in Bayesian terminology (Rissanen 1989). Given data S and a number of models, the Bayesian estimator (posterior mode) selects a model M̂ that maximizes the posterior probability:
M̂ = argmax_M (P(M) × P(S | M))   (15)
where P(M) denotes the prior probability of the model M and P(S | M) the probability of observing the data S given M. Equivalently, M̂ satisfies:
M̂ = argmin_M (-log P(M) - log P(S | M))   (16)
This is equivalent to the MDL estimate, if we take -log P(M) to be the model description length. Interpreting -log P(M) as the model description length translates, in the Bayesian estimation, to assigning larger prior probabilities to simpler models, since it is equivalent to assuming that P(M) = (1/2)^l(M), where l(M) is the description length of M. (Note that if we assign uniform prior probability P(M) to all models M, then (15) becomes equivalent to (13), giving the maximum-likelihood estimate.) Recall that in our definition of parameter description length, we assign a shorter parameter description length to a model with a smaller number of parameters k, which admits the above interpretation. As for the model description length (for tree cuts), we assigned an equal code length to each tree cut, which translates to placing no bias on any cut. We could have employed a different coding scheme, assigning shorter code lengths to cuts nearer the root. We chose not to do so partly because, for sufficiently large sample sizes, the parameter description length starts dominating the model description length anyway.</Paragraph>
<Paragraph position="1"> Another important property of the definition of description length is that it affects not only the effective prior probabilities on the models, but also the procedure for computing the model minimizing the measure. Indeed, our definition of model description length was chosen to be compatible with the dynamic programming technique, namely, its calculation can be performed locally for each subtree. For a different choice of coding scheme, it is possible that a simple and efficient MDL algorithm like Find-MDL may not exist.
We believe that our choice of model description length derives from a natural encoding scheme with a reasonable interpretation as a Bayesian prior, and at the same time allows an efficient algorithm for finding a model with the minimum description length.</Paragraph> </Section> </Section>
<Section position="7" start_page="229" end_page="229" type="metho"> <SectionTitle> 3.5 The Uniform Distribution Assumption and the Level of Generalization </SectionTitle>
<Paragraph position="0"> The uniform distribution assumption made in (4), namely that all nouns belonging to a class contained in the tree cut model are assigned the same probability, seems to be rather stringent. If one were to insist that the model be exactly accurate, then it would seem that the true model would be the word-based model resulting from no generalization at all. If we allow approximations, however, it is likely that some reasonable tree cut model with the uniform probability assumption will be a good approximation of the true distribution; in fact, a best model for a given data size. As we remarked earlier, since MDL balances the fit to the data against the simplicity of the model, one can expect that the model selected by MDL will be a reasonable compromise.</Paragraph>
<Paragraph position="1"> Nonetheless, it is still a shortcoming of our model that it contains an oversimplified assumption, and the problem is especially pressing when rare words are involved. Rare words may not be observed at a slot of interest in the data simply because they are rare, and not because they are unfit for that particular slot.9 To see how rare is too rare for our method, consider the following example.</Paragraph>
<Paragraph position="2"> Suppose that the class BIRD contains 10 words: bird, swallow, crow, eagle, parrot, waxwing, etc. Consider co-occurrence data having 8 occurrences of bird, 2 occurrences of swallow, 1 occurrence of crow, 1 occurrence of eagle, and 0 occurrences of all other words, as part of, say, 100 data points obtained for the subject slot of the verb fly. For this data set, our method would select the model that generalizes bird, swallow, etc. to the class BIRD, since the sum of the data and parameter description lengths for the BIRD subtree is 76.57 + 3.32 = 79.89 if generalized, and 53.73 + 33.22 = 86.95 if not generalized. For comparison, consider the data with 10 occurrences of bird, 3 occurrences of swallow, 1 occurrence of crow, and 0 occurrences of all other words, also as part of 100 data points for the subject slot of fly. In this case, our method would select the model that stops generalizing at bird, swallow, eagle, etc., because the description length for the same subtree now is 86.22 + 3.32 = 89.54 if generalized, and 55.04 + 33.22 = 88.26 if not generalized. These examples seem to indicate that our MDL-based method would choose to generalize even when there are relatively large differences in the frequencies of words within a class, but knows enough to stop generalizing when the discrepancy in frequencies is especially noticeable (relative to the given sample size).</Paragraph> </Section> </Paper>