<?xml version="1.0" standalone="yes"?>
<Paper uid="J02-2003">
  <Title>Class-Based Probability Estimation Using a Semantic Hierarchy</Title>
  <Section position="3" start_page="188" end_page="189" type="metho">
    <SectionTitle>
2. The Semantic Hierarchy
</SectionTitle>
    <Paragraph position="0"> The noun hierarchy of WordNet consists of senses, or what Miller (1998) calls lexicalized concepts, organized according to the &amp;quot;is-a-kind-of&amp;quot; relation. Note that we are using concept to refer to a lexicalized concept or sense and not to a set of senses; we use class to refer to a set of senses. There are around 66,000 different concepts in the noun hierarchy of WordNet version 1.6. A concept in WordNet is represented by a &amp;quot;synset,&amp;quot; which is the set of synonymous words that can be used to denote that concept. For example, the synset for the concept &lt;cocaine&gt; is {cocaine, cocain, coke, snow, C}. Let syn(c) be the synset for concept c, and let $cn(n) = \{c \mid n \in syn(c)\}$ be the set of concepts that can be denoted by noun n.</Paragraph>
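To make the definitions of syn(c) and cn(n) concrete, here is a minimal Python sketch; the SYNSETS table is a toy stand-in for the roughly 66,000 synsets of WordNet 1.6, and the entries are illustrative only.

```python
# Toy stand-in for the WordNet synset table: concept -> synset (set of nouns).
SYNSETS = {
    "<cocaine>": {"cocaine", "cocain", "coke", "snow", "C"},
    "<snow>":    {"snow"},                # the weather sense of "snow"
    "<entity>":  {"entity", "something"},
}

def syn(c):
    """syn(c): the set of nouns that can denote concept c."""
    return SYNSETS[c]

def cn(n):
    """cn(n) = {c | n in syn(c)}: the set of concepts noun n can denote."""
    return {c for c, synset in SYNSETS.items() if n in synset}

print(cn("snow"))   # {'<cocaine>', '<snow>'} -- "snow" is ambiguous
```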
    <Paragraph position="1"> The hierarchy has the structure of a directed acyclic graph (although only around 1% of the nodes have more than one parent), where the edges of the graph constitute what we call the &amp;quot;direct-isa&amp;quot; relation. Let isa be the transitive, reflexive closure of direct-isa; then c' is a hyponym of c if c' isa c. In fact, the hierarchy is not a single hierarchy but instead consists of nine separate subhierarchies, each headed by the most general kind of concept, such as &lt;entity&gt;, &lt;abstraction&gt;, &lt;event&gt;, and &lt;psychological feature&gt;. For the purposes of this work we add a common root dominating the nine subhierarchies, which we denote &lt;root&gt;. There are some important points that need to be clarified regarding the hierarchy. First, every concept in the hierarchy has a nonempty synset (except the notional concept &lt;root&gt;). Even the most general concepts, such as &lt;entity&gt;, can be denoted by some noun; the synset for &lt;entity&gt; is {entity, something}. Second, there is an important distinction between an individual concept and a set of concepts. For example, the individual concept &lt;entity&gt; should not be confused with the set or class consisting of concepts denoting kinds of entities. To make this distinction clear, we use $\overline{c} = \{c' \mid c' \text{ isa } c\}$ to denote the set of concepts dominated by concept c, including c itself. For example, $\overline{\langle animal \rangle}$ is the set consisting of those concepts corresponding to kinds of animals (including &lt;animal&gt; itself).</Paragraph>
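The set $\overline{c}$ can be computed by a simple traversal. A minimal sketch, assuming the hierarchy is given as a child-to-parents map (a toy fragment of the &lt;dog&gt; chain stands in for WordNet):

```python
# Toy "direct-isa" relation: child -> set of parents (a node in the DAG may
# have more than one parent, although only around 1% of WordNet nodes do).
PARENTS = {
    "<dog>": {"<canine>"},
    "<canine>": {"<carnivore>"},
    "<carnivore>": {"<placental mammal>"},
    "<placental mammal>": {"<animal>"},
    "<animal>": {"<entity>"},
    "<entity>": {"<root>"},
}

# Invert the map to get each node's children.
CHILDREN = {}
for child, parents in PARENTS.items():
    for p in parents:
        CHILDREN.setdefault(p, set()).add(child)

def dominated(c):
    """The set c-bar = {c' | c' isa c}: concepts dominated by c, including c."""
    result, stack = set(), [c]
    while stack:
        node = stack.pop()
        if node not in result:
            result.add(node)
            stack.extend(CHILDREN.get(node, ()))
    return result

print(dominated("<carnivore>"))  # {'<carnivore>', '<canine>', '<dog>'}
```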
    <Paragraph position="2"> The probability of a concept appearing as an argument of a predicate is written p(c | v, r), where c is a concept in WordNet, v is a predicate, and r is an argument position.  The focus in this article is on the arguments of verbs, but the techniques discussed can be applied to any predicate that takes nominal arguments, such as adjectives. The probability p(c  |v, r) is to be interpreted as follows: This is the probability that some noun n in syn(c), when denoting concept c, appears in position r of verb v (given v and r). The example used throughout the article is p(&lt;dog&gt; |run, subj), which is the conditional probability that some noun in the synset of &lt;dog&gt; , when denoting the concept &lt;dog&gt; , appears in the subject position of the verb run. Note that, in practice, no distinction is made between the different senses of a verb (although the techniques do allow such a distinction) and that each use of a noun is assumed to correspond to exactly one concept.</Paragraph>
  </Section>
  <Section position="4" start_page="189" end_page="195" type="metho">
    <SectionTitle>
3. Class-Based Probability Estimation
</SectionTitle>
    <Paragraph position="0"> This section explains how a set of concepts, or class, from WordNet can be used to estimate the probability of an individual concept. More specifically, we explain how a set of concepts $\overline{c'}$, where c' is some hypernym of concept c, can be used to estimate p(c | v, r). (Recall that $\overline{c'}$ denotes the set of concepts dominated by c', including c' itself.) One possible approach would be simply to substitute $\overline{c'}$ for the individual concept c. This is a poor solution, however, since $p(\overline{c'} \mid v, r)$ is the probability that some noun denoting a concept in $\overline{c'}$ appears in position r of verb v. For example, $p(\overline{\langle animal \rangle} \mid run, subj)$ is the probability that some noun denoting a kind of animal appears in the subject position of the verb run. Probabilities of sets of concepts are obtained by summing over the concepts in the set:</Paragraph>
    <Paragraph position="1"> $p(\overline{c'} \mid v, r) = \sum_{c \in \overline{c'}} p(c \mid v, r)$ (1)</Paragraph>
    <Paragraph position="2"> This means that $p(\overline{\langle animal \rangle} \mid run, subj)$ is likely to be much greater than p(&lt;dog&gt; | run, subj) and thus is not a good approximation of p(&lt;dog&gt; | run, subj).</Paragraph>
    <Paragraph position="3"> What can be done, though, is to condition on sets of concepts. If it can be shown that $p(v \mid \overline{c'}, r)$, for some hypernym c' of c, is a reasonable approximation of p(v | c, r), then we have a way of estimating p(c | v, r). The probability p(v | c, r) can be obtained from p(c | v, r) using Bayes' theorem:</Paragraph>
    <Paragraph position="4"> $p(v \mid c, r) = \frac{p(c \mid v, r)\, p(v \mid r)}{p(c \mid r)}$ (2)</Paragraph>
    <Paragraph position="5"> Since p(c | r) and p(v | r) are conditioned on the argument slot only, we assume these can be estimated satisfactorily using relative frequency estimates. Alternatively, a standard smoothing technique such as Good-Turing could be used.</Paragraph>
    <Paragraph position="6"> This leaves p(v | c, r). Continuing with the &lt;dog&gt; example, the proposal is to estimate p(run | &lt;dog&gt;, subj) using a relative-frequency estimate of $p(run \mid \overline{\langle animal \rangle}, subj)$, or an estimate based on a similar, suitably chosen class. Thus, assuming this choice of class, p(&lt;dog&gt; | run, subj) would be approximated as follows:</Paragraph>
    <Paragraph position="7"> $p(\langle dog \rangle \mid run, subj) \approx \frac{\hat{p}(run \mid \overline{\langle animal \rangle}, subj)\, \hat{p}(\langle dog \rangle \mid subj)}{\hat{p}(run \mid subj)}$ (3)</Paragraph>
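Given the chosen class, the estimate in equation (3) is a one-line computation. A minimal sketch, assuming the three relative-frequency estimators are available as functions (the names are ours, not the paper's):

```python
def class_based_estimate(c, v, r, top, p_v_given_class, p_c_given_r, p_v_given_r):
    """Estimate p(c | v, r) as in equation (3):
    p-hat(v | top-bar, r) * p-hat(c | r) / p-hat(v | r),
    where `top` is the hypernym chosen to generalize c
    (e.g. <animal> when c is <dog> and v is "run")."""
    return p_v_given_class(v, top, r) * p_c_given_r(c, r) / p_v_given_r(v, r)
```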
    <Paragraph position="8"> The following derivation shows that if $p(v \mid \overline{c''}, r)$ takes the same value for $\overline{c}$ and for the set dominated by each of the other daughters c'' of c', then $p(v \mid \overline{c'}, r) = p(v \mid \overline{c}, r)$; equations (4)-(9) establish this by expanding $p(v \mid \overline{c'}, r)$ over the partition of $\overline{c'}$.</Paragraph>
    <Paragraph position="10"> Note that the proof applies only to a tree, since it assumes that $\overline{c'}$ is partitioned by $\overline{c}$ and the sets of concepts dominated by each of the other daughters of c', which is not necessarily true for a directed acyclic graph (DAG). WordNet is a DAG but is a close approximation to a tree, and so we assume this will not be a problem in practice. The derivation in (4)-(9) shows how probabilities conditioned on sets of concepts can remain constant when moving up the hierarchy, and this suggests a way of finding a suitable set, $\overline{c'}$, as a generalization for concept c: initially set c' equal to c and move up the hierarchy, changing the value of c', until there is a significant change in $p(v \mid \overline{c'}, r)$. (Section 4 describes how a chi-square test is used to decide whether a significant change has occurred at a node.)</Paragraph>
    <Paragraph position="12"> So when finding a suitable level for the estimation of p(&lt;sandwich&gt; | eat, obj), for example, we first assume that $p(eat \mid \overline{\langle sandwich \rangle}, obj)$ is a good approximation of $p(eat \mid \langle sandwich \rangle, obj)$ and then apply the procedure to $p(eat \mid \overline{\langle sandwich \rangle}, obj)$.</Paragraph>
    <Paragraph position="13"> A feature of the proposed generalization procedure is that comparing probabilities of the form p(v | C, r), where C is a class, is closely related to comparing ratios of probabilities of the form p(C | v, r)/p(C | r) (for a given verb and argument position):</Paragraph>
    <Paragraph position="14"> $p(v \mid C, r) = \frac{p(C \mid v, r)}{p(C \mid r)}\, p(v \mid r)$ (10)</Paragraph>
    <Paragraph position="15"> Note that, for a given verb and argument position, p(v | r) is constant across classes.</Paragraph>
    <Paragraph position="16"> Equation (10) is of interest because the ratio p(C | v, r)/p(C | r) can be interpreted as a measure of association between the verb v and class C. This ratio is similar to pointwise mutual information (Church and Hanks 1990) and also forms part of Resnik's association score, which will be introduced in Section 6. Thus the generalization procedure can be thought of as one that finds &amp;quot;homogeneous&amp;quot; areas of the hierarchy, that is, areas consisting of classes that are associated to a similar degree with the verb (Clark and Weir 1999).</Paragraph>
    <Paragraph position="17"> Finally, we note that the proposed estimation method does not guarantee that the estimates form a probability distribution over the concepts in the hierarchy, and so a normalization factor is required:</Paragraph>
    <Paragraph position="18"> $p_{SC}(c \mid v, r) = \frac{\hat{p}(v \mid \overline{[c, v, r]}, r)\, \hat{p}(c \mid r)}{\sum_{c' \in \mathcal{C}} \hat{p}(v \mid \overline{[c', v, r]}, r)\, \hat{p}(c' \mid r)}$ (11)</Paragraph>
    <Paragraph position="19"> We use $p_{SC}$ to denote an estimate obtained using our method (since the technique finds sets of semantically similar senses, or &amp;quot;similarity classes&amp;quot;) and [c, v, r] to denote the class chosen for concept c in position r of verb v; $\hat{p}$ denotes a relative frequency estimate, and $\mathcal{C}$ denotes the set of concepts in the hierarchy.</Paragraph>
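A minimal sketch of the renormalization in equation (11), assuming the unnormalized class-based estimates have already been computed for every concept in the hierarchy:

```python
def normalize(unnormalized):
    """Rescale raw class-based estimates {concept: value} so that they sum
    to one over all concepts, giving the p_SC distribution of equation (11)."""
    total = sum(unnormalized.values())
    return {c: value / total for c, value in unnormalized.items()}
```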
    <Paragraph position="20"> Before providing the details of the generalization procedure, we give the relative-frequency estimates of the relevant probabilities and deal with the problem of ambiguous data. The relative-frequency estimates are ratios of the relevant estimated frequencies: for example, $\hat{p}(v \mid \overline{c}, r) = f(\overline{c}, v, r) / \sum_{v' \in V} f(\overline{c}, v', r)$, with $f(\overline{c}, v, r) = \sum_{c' \in \overline{c}} f(c', v, r)$; $\hat{p}(c \mid r)$ and $\hat{p}(v \mid r)$ are estimated analogously from the totals for slot r. Here f(c, v, r) is the number of (n, v, r) triples in the data in which n is being used to denote c, and V is the set of verbs in the data. The problem is that the estimates are defined in terms of frequencies of senses, whereas the data are assumed to be in the form of (n, v, r) triples: a noun, verb, and argument position. All the data used in this work have been obtained from the British National Corpus (BNC), using the system of Briscoe and Carroll (1997), which consists of a shallow-parsing component that is able to identify verbal arguments.</Paragraph>
    <Paragraph position="21"> 6 Li and Abe (1998) also develop a theoretical framework that applies only to a tree and turn WordNet into a tree by copying each subgraph with multiple parents. One way to extend the experiments in Section 7 would be to investigate whether this transformation has an impact on the results of those experiments.</Paragraph>
    <Paragraph position="22"> We take a simple approach to the problem of estimating the frequencies of senses, by distributing the count for each noun in the data evenly among all senses of the noun:</Paragraph>
    <Paragraph position="23"> $f(c, v, r) = \sum_{n \in syn(c)} \frac{f(n, v, r)}{|cn(n)|}$</Paragraph>
    <Paragraph position="24"> Here f(c, v, r) is an estimate of the number of times that concept c appears in position r of verb v, and |cn(n)| is the cardinality of cn(n). This is the approach taken by Li and Abe (1998), Ribas (1995), and McCarthy (2000).</Paragraph>
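A sketch of this even-split estimate of f(c, v, r); the (noun, verb, slot) triple format follows the text, and cn is the noun-to-senses mapping from Section 2:

```python
from collections import defaultdict

def sense_counts(triples, cn):
    """Estimate f(c, v, r) from (n, v, r) triples by splitting each noun's
    count evenly among all of the noun's senses cn(n)."""
    f = defaultdict(float)
    for n, v, r in triples:
        senses = cn(n)
        for c in senses:
            f[(c, v, r)] += 1.0 / len(senses)
    return f

# A noun with two senses contributes 0.5 to each sense's count, which is
# why the contingency tables in Section 4 contain fractional counts.
```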
    <Paragraph position="25">  Resnik (1998) explains how this apparently crude technique works surprisingly well. Alternative approaches are  described in Clark and Weir (1999) (see also Clark [2001]), Abney and Light (1999), and Ciaramita and Johnson (2000).</Paragraph>
    <Paragraph position="26"> 4. Using a Chi-Square Test to Compare Probabilities</Paragraph>
    <Paragraph position="27"> In this section we show how to test whether $p(v \mid \overline{c'}, r)$ changes significantly when considering a node higher in the hierarchy. Consider the problem of deciding whether $p(run \mid \overline{\langle canine \rangle}, subj)$ is a good approximation of $p(run \mid \overline{\langle dog \rangle}, subj)$. (&lt;canine&gt; is the parent of &lt;dog&gt; in WordNet.) To do this, the probabilities $p(run \mid \overline{c_i}, subj)$ are compared, where the $c_i$ are the children of &lt;canine&gt;. In this case, the null hypothesis of the test is that the probabilities $p(run \mid \overline{c_i}, subj)$ are the same for all children $c_i$ of &lt;canine&gt;. By judging the strength of the evidence against the null hypothesis, we can determine how similar the true probabilities are likely to be. If the test indicates that the probabilities are sufficiently unlikely to be the same, then the null hypothesis is rejected, and the conclusion is that $p(run \mid \overline{\langle canine \rangle}, subj)$ is not a good approximation of $p(run \mid \overline{\langle dog \rangle}, subj)$.</Paragraph>
    <Paragraph position="31"> An example contingency table, based on counts obtained from a subset of the BNC using the system of Briscoe and Carroll, is given in Table 1. (Recall that the frequencies are estimated by distributing the count for a noun equally among the noun's senses; this explains the fractional counts.) One column contains estimates of counts arising from each set $\overline{c_i}$ appearing in the subject position of the verb run; the other contains estimates of counts arising from each $\overline{c_i}$ appearing in the subject position of a verb other than run. The figures in brackets are the expected values if the null hypothesis is true.</Paragraph>
    <Paragraph position="32"> 7 Resnik takes a similar approach but divides the count evenly among the noun's senses and all the hypernyms of those senses.</Paragraph>
    <Paragraph position="33"> Table 1 Contingency table for the children of &lt;canine&gt; in the subject position of run.</Paragraph>
    <Paragraph position="39"> There is a choice of which statistic to use in conjunction with the chi-square test. The usual statistic encountered in textbooks is the Pearson chi-square statistic, defined as</Paragraph>
    <Paragraph position="40"> $X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$, with the log-likelihood chi-square statistic, $G^2 = 2 \sum_i O_i \log \frac{O_i}{E_i}$, as the main alternative, where the $O_i$ are the observed counts and the $E_i$ are the expected counts under the null hypothesis.</Paragraph>
    <Paragraph position="41"> The two statistics have similar values when the counts in the contingency table are large (Agresti 1996). The statistics behave differently, however, when the table contains low counts, and, since corpus data are likely to lead to some low counts, the question of which statistic to use is an important one. Dunning (1993) argues for the use of $G^2$; this question is taken up again in Section 7 and will be discussed further there. For now, we continue with the discussion of how the chi-square test is used in the generalization procedure.</Paragraph>
    <Paragraph position="42"> For Table 1, the value of $G^2$ is 3.8, and the value of $X^2$ is 2.5. Assuming a level of significance of $\alpha = 0.05$, the critical value is 12.6 (for six degrees of freedom). Thus, for this $\alpha$ value, the null hypothesis would not be rejected for either statistic, and the conclusion would be that there is no reason to suppose that $p(run \mid \overline{\langle canine \rangle}, subj)$ is not a reasonable approximation of $p(run \mid \overline{\langle dog \rangle}, subj)$.</Paragraph>
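Both statistics, the degrees of freedom, and the critical value can be computed with scipy. Since Table 1 itself is not reproduced above, the counts here are placeholders, so the statistics will not match the 3.8 and 2.5 quoted in the text:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Placeholder counts for the seven children of <canine>:
# column 0 = subject of "run", column 1 = subject of any other verb.
table = np.array([
    [2.5, 110.0],
    [0.5,  13.5],
    [0.0,   7.0],
    [0.3,   9.2],
    [0.1,   4.4],
    [0.0,   2.0],
    [0.2,   6.3],
])

# lambda_="pearson" gives X^2; lambda_="log-likelihood" gives G^2.
for name in ("pearson", "log-likelihood"):
    stat, pval, dof, expected = chi2_contingency(table, correction=False,
                                                 lambda_=name)
    print(name, round(stat, 2), "dof =", dof, "p =", round(pval, 3))

alpha = 0.05
print("critical value:", round(chi2.ppf(1 - alpha, df=6), 1))  # 12.6, as in the text
```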
    <Paragraph position="44"> As a further example, Table 2 gives counts for the children of &lt;liquid&gt; in the object position of drink. Again, the counts have been obtained from a subset of the BNC using the system of Briscoe and Carroll. Not all the sets dominated by the children of &lt;liquid&gt; are shown, as some, such as &lt;sheep dip&gt;, never appear in the object position of a verb in the data. This example is designed to show a case in which the null hypothesis is rejected. The value of $G^2$ for this table is 29.0, and the value of $X^2$ is of a similar size; the null hypothesis is rejected for $\alpha$ values greater than 0.005. This seems reasonable, since the probabilities associated with the children of &lt;liquid&gt; and the object position of drink would be expected to show a lot of variation across the children.</Paragraph>
    <Paragraph position="45"> A key question is how to select the appropriate value for $\alpha$. One solution is to treat $\alpha$ as a parameter and set it empirically by taking a held-out test set and choosing the value of $\alpha$ that maximizes performance on the relevant task. For example, Clark and Weir (2000) describes a prepositional phrase attachment algorithm that employs probability estimates obtained using the WordNet method described here. To set the value of $\alpha$, the performance of the algorithm on a development set could be compared across different values of $\alpha$, and the value that leads to the best performance could be chosen. Note that this approach sets no constraints on the value of $\alpha$: The value could be as high as 0.995 or as low as 0.0005, depending on the particular application.</Paragraph>
    <Paragraph position="46"> There may be cases in which the conditions for the appropriate application of a chi-square test are not met. One condition that is likely to be violated is the requirement that expected values in the contingency table not be too small. (A rule of thumb often found in textbooks is that the expected values should be greater than five.) One response to this problem is to apply some kind of thresholding and either ignore counts below the threshold, or apply the test only to tables that do not contain low counts. Ribas (1995), Li and Abe (1998), McCarthy (2000), and Wagner (2000) all use some kind of thresholding when dealing with counts in the hierarchy (although not in the context of a chi-square test). Another approach would be to use Fisher's exact test (Agresti 1996; Pedersen 1996), which can be applied to tables regardless of the size of the counts they contain. The main problem with this test is that it is computationally expensive, especially for large contingency tables.</Paragraph>
    <Paragraph position="47"> What we have found in practice is that applying the chi-square test to tables dominated by low counts tends to produce an insignificant result, and the null hypothesis is not rejected. The consequences of this for the generalization procedure are that low-count tables tend to result in the procedure moving up to the next node in the hierarchy. But given that the purpose of the generalization is to overcome the sparse-data problem, moving up a node is desirable, and therefore we do not modify the test for tables with low counts.</Paragraph>
    <Paragraph position="48"> The final issue to consider is which chi-square statistic to use. Dunning (1993) argues for the use of $G^2$ on the grounds that $X^2$ is unreliable when the contingency table contains low counts. In addition, Pedersen (2001) questions whether one statistic should be preferred over the other for the bigram acquisition task and cites Cressie and Read (1984), who argue that there are some cases where the Pearson statistic is more reliable than the log-likelihood statistic. Finally, the results of the pseudo-disambiguation experiments presented in Section 7 are at least as good, if not better, when using $X^2$ as when using $G^2$.</Paragraph>
  </Section>
  <Section position="5" start_page="195" end_page="198" type="metho">
    <SectionTitle>
5. The Generalization Procedure
</SectionTitle>
    <Paragraph position="0"> The procedure for finding a suitable class, $\overline{c'}$, to generalize concept c in position r of verb v works as follows. (We refer to $\overline{c'}$ as the &amp;quot;similarity class&amp;quot; of c with respect to v and r and to the hypernym c' as top(c, v, r), since the chosen hypernym sits at the &amp;quot;top&amp;quot; of the similarity class.) Initially, concept c is assigned to a variable top. Then, by working up the hierarchy, successive hypernyms of c are assigned to top, and this process continues until the probabilities associated with the sets of concepts dominated by top and the siblings of top are significantly different. Once a node is reached that results in a significant result for the chi-square test, the procedure stops, and top is returned as top(c, v, r). In cases where a concept has more than one parent, the parent is chosen that results in the lowest value of the chi-square statistic, as this indicates the probabilities are the most similar. The set $\overline{top(c, v, r)}$ is the similarity class of c for verb v and position r. Figure 1 gives an algorithm for determining top(c, v, r). To illustrate, consider generalizing &lt;soup&gt; in the object position of the verb stir; suppose the $G^2$ statistic is used, together with an $\alpha$ value of 0.05. Initially, top is set to &lt;soup&gt;, and the probabilities corresponding to the children of &lt;dish&gt; are compared: $p(stir \mid \overline{\langle soup \rangle}, obj)$, $p(stir \mid \overline{\langle lasagne \rangle}, obj)$, $p(stir \mid \overline{\langle haggis \rangle}, obj)$, and so on for the rest of the children. The chi-square test results in a $G^2$ value of 14.5, compared to a critical value of 55.8. Since $G^2$ is less than the critical value, the procedure moves up to the next node. This process continues until a significant result is obtained, which first occurs at &lt;substance&gt; when comparing the children of &lt;object&gt;. Thus &lt;substance&gt; is the chosen level of generalization. Now we show how the chosen level of generalization varies with $\alpha$ and how it varies with the size of the data set. A note of clarification is required before presenting the results. In related work on acquiring selectional preferences (Ribas 1995; McCarthy 1997; Li and Abe 1998; Wagner 2000), the level of generalization is often determined for a small number of hand-picked verbs and the result compared with the researcher's intuition about the most appropriate level for representing a selectional preference. According to this approach, if &lt;sandwich&gt; were chosen to represent &lt;hotdog&gt; in the object position of eat, this might be considered an undergeneralization, since &lt;food&gt; might be considered more appropriate. For this work we argue that such an evaluation is not appropriate; since the purpose of this work is probability estimation, the most appropriate level is the one that leads to the most accurate estimate, and this may or may not agree with intuition. Furthermore, we show in Section 7 that to generalize unnecessarily can be harmful for some tasks: If we already have lots of data regarding &lt;sandwich&gt;, why generalize any higher? Thus the purpose of this section is not to show that the acquired levels are &amp;quot;correct,&amp;quot; but simply to show how the levels vary with $\alpha$ and the sample size.</Paragraph>
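Figure 1 itself is not reproduced in this version, so the following Python sketch gives one reading of the procedure as described above. The helpers parents, children, and contingency_table are assumed, and rows with zero totals are assumed to have been removed before testing:

```python
from scipy.stats import chi2, chi2_contingency

def top_class(c, v, r, alpha, parents, children, contingency_table):
    """Find top(c, v, r): walk up the hierarchy from concept c, stopping at
    the first node whose siblings' probabilities p(v | sib-bar, r) differ
    significantly (chi-square test at significance level alpha).

    parents(node) / children(node) expose the hierarchy;
    contingency_table(nodes, v, r) builds a Table-1-style count table.
    """
    top = c
    while parents(top):
        # For a DAG node with several parents, keep the parent whose test
        # statistic is lowest, i.e. whose children look the most similar.
        best = None
        for p in parents(top):
            table = contingency_table(children(p), v, r)
            stat, _, dof, _ = chi2_contingency(table, correction=False,
                                               lambda_="log-likelihood")
            if best is None or stat < best[0]:
                best = (stat, dof, p)
        stat, dof, parent = best
        if stat > chi2.ppf(1 - alpha, df=dof):
            break          # significant difference: stop; return current top
        top = parent       # no significant difference: generalize one level
    return top
```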
    <Paragraph position="1"> To show how the level of generalization varies with changes in $\alpha$, top(c, v, obj) was determined for a number of hand-picked (c, v, obj) triples over a range of values for $\alpha$. The triples were chosen to give a range of strongly and weakly selecting verbs and a range of verb frequencies. The data were again extracted from a subset of the BNC using the system of Briscoe and Carroll (1997), and the $G^2$ statistic was used in the chi-square test. The results are shown in Table 3. The number of times the verb occurred with some object is also given in the table.</Paragraph>
    <Paragraph position="2"> The results suggest that the generalization level becomes more specific as $\alpha$ increases. This is to be expected, since, given a contingency table chosen at random, a higher value of $\alpha$ is more likely to lead to a significant result than a lower value. We also see that, for some cases, the value of $\alpha$ has little effect on the level. We would expect there to be less change in the level of generalization for strongly selecting verbs, such as drink and eat, and a greater range of levels for weakly selecting verbs such as see. This is because any significant difference in probabilities is likely to be more marked for a strongly selecting verb, and likely to be significant over a wider range of $\alpha$ values. The table provides only anecdotal evidence, but it lends some support to this argument.</Paragraph>
    <Paragraph position="3"> To investigate more generally how the level of generalization varies with changes in $\alpha$, and also with changes in sample size, we took 6,000 (c, v, obj) triples and calculated the difference in depth between c and top(c, v, r) for each triple. The 6,000 triples were taken from the first experimental test set described in Section 7, and the training data from this experiment were used to provide the counts. (The test set contains nouns, rather than noun senses, and so the sense of the noun that is most probable given the verb and object slot was used.) An average difference in depth was then calculated. To give an example of how the difference in depth was calculated, suppose &lt;dog&gt; generalized to &lt;placental mammal&gt; via &lt;canine&gt; and &lt;carnivore&gt;; in this case the difference would be three.</Paragraph>
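A sketch of the depth-difference computation, assuming top is a hypernym of c and parents maps a node to its parents:

```python
def generalization_depth(c, top, parents):
    """Number of generalized levels: the number of upward steps from c to
    top (e.g. <dog> -> <canine> -> <carnivore> -> <placental mammal> is 3)."""
    frontier, depth = {c}, 0
    while top not in frontier:
        frontier = {p for node in frontier for p in parents(node)}
        if not frontier:
            raise ValueError("top is not a hypernym of c")
        depth += 1
    return depth
```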
    <Paragraph position="4"> The results for various levels of $\alpha$ and different sample sizes are shown in Table 4. The figures in each column arise from using the contingency tables based on the complete training data, but with each count in the table multiplied by the percentage at the head of the column. Thus the 50% column is based on contingency tables in which each original count is multiplied by 50%, which is equivalent to using a sample one-half the size of the original training set. Reading across a row shows how the generalization varies with sample size, and reading down a column shows how it varies with $\alpha$. The results show clearly that the extent of generalization decreases with an increase in the value of $\alpha$, supporting the trend observed in Table 3. The results also show that the extent of generalization increases with a decrease in sample size. Again, this is to be expected, since any difference in probability estimates is less likely to be significant for tables with low counts.</Paragraph>
  </Section>
  <Section position="6" start_page="198" end_page="201" type="metho">
    <SectionTitle>
6. Alternative Class-Based Estimation Methods
</SectionTitle>
    <Paragraph position="0"> The approaches used for comparison are that of Resnik (1993, 1998), subsequently developed by Ribas (1995), and that of Li and Abe (1998), which has been adopted by McCarthy (2000). These have been chosen because they directly address the question of how to find a suitable level of generalization in WordNet.</Paragraph>
    <Paragraph position="1"> The first alternative uses the &amp;quot;association score,&amp;quot; which is a measure of how well a set of concepts, C, satisfies the selectional preferences of a verb, v, for an argument position, r:</Paragraph>
    <Paragraph position="2"> $A(C, v, r) = p(C \mid v, r) \log \frac{p(C \mid v, r)}{p(C \mid r)}$</Paragraph>
    <Paragraph position="3"> An estimate of the association score, $\hat{A}(C, v, r)$, can be obtained using relative frequency estimates of the probabilities. The key question is how to determine a suitable level of generalization for concept c, or, alternatively, how to find a suitable class to represent concept c (assuming the choice is from those classes that contain all concepts dominated by some hypernym of c). Resnik's solution to this problem (which he neatly refers to as the &amp;quot;vertical-ambiguity&amp;quot; problem) is to choose the class that maximizes the association score.</Paragraph>
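A minimal sketch of the association score and of Resnik's answer to the vertical-ambiguity problem; A_hat and hypernyms are assumed helpers, not Resnik's code:

```python
import math

def association_score(p_C_given_vr, p_C_given_r):
    """A(C, v, r) = p(C | v, r) * log(p(C | v, r) / p(C | r)),
    computed from the two relative-frequency estimates."""
    if p_C_given_vr == 0.0:
        return 0.0                # take 0 * log 0 to be 0
    return p_C_given_vr * math.log(p_C_given_vr / p_C_given_r)

def resnik_class(c, v, r, hypernyms, A_hat):
    """Resolve vertical ambiguity by choosing, from the classes dominated by
    the hypernyms of c, the one with the highest estimated association."""
    return max(hypernyms(c), key=lambda h: A_hat(h, v, r))
```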
    <Paragraph position="4"> It is not clear that the class with the highest association score is always the most appropriate level of generalization. For example, this approach does not always generalize appropriately for arguments that are negatively associated with some verb. To see why, consider the problem of deciding how well the concept &lt;location&gt; satisfies the preferences of the verb eat for its object. Since locations are not the kinds of things that are typically eaten, a suitable level of generalization would correspond to a class that has a low association score with respect to eat. However, &lt;location&gt; is a kind of &lt;entity&gt; in WordNet,  and choosing the class with the highest association score is likely to produce &lt;entity&gt; as the chosen class. This is a problem, because the association score of &lt;entity&gt; with respect to eat may be too high to reflect the fact that &lt;location&gt; is a very unlikely object of the verb.</Paragraph>
    <Paragraph position="5"> Note that the solution to the vertical-ambiguity problem presented in the previous sections is able to generalize appropriately in such cases. Continuing with the eat &lt;location&gt; example, our generalization procedure is unlikely to get as high as &lt;entity&gt; (assuming a reasonable number of examples of eat in the training data), since the probabilities corresponding to the daughters of &lt;entity&gt; are likely to be very different with respect to the object position of eat.</Paragraph>
    <Paragraph position="6"> The second alternative uses the minimum description length (MDL) principle. Li and Abe use MDL to select a set of classes from a hierarchy, together with their associated probabilities, to represent the selectional preferences of a particular verb. The preferences and class-based probabilities are then used to estimate probabilities of the form p(n | v, r), where n is a noun, v is a verb, and r is an argument slot. Li and Abe's application of MDL requires the hierarchy to be in the form of a thesaurus, in which each leaf node represents a noun and internal nodes represent the class of nouns that the node dominates. The hierarchy is also assumed to be in the form of a tree. The class-based models consist of a partition of the set of nouns (leaf nodes) and a probability associated with each class in the partition. The probabilities are the conditional probabilities of each class, given the relevant verb and argument position. Li and Abe refer to such a partition as a &amp;quot;cut&amp;quot; and the cut together with the probabilities as a &amp;quot;tree cut model.&amp;quot; The probabilities of the classes in a cut, $\Gamma$, satisfy the constraint $\sum_{C \in \Gamma} p(C \mid v, r) = 1$. (A possible cut returned by MDL is shown in Figure 2.)</Paragraph>
    <Paragraph position="7"> In order to determine the probability of a noun, the probability of a class is assumed to be distributed uniformly among the members of that class:</Paragraph>
    <Paragraph position="8"> $p(n \mid v, r) = \frac{p(C \mid v, r)}{|C|}$ for $n \in C$, $C \in \Gamma$ (20)</Paragraph>
    <Paragraph position="9"> Since WordNet is a hierarchy with noun senses, rather than nouns, at the nodes, Li and Abe deal with the issue of word sense ambiguity using the method described in Section 3, by dividing the count for a noun equally among the concepts whose synsets contain the noun. Also, since WordNet is a DAG, Li and Abe turn WordNet into a tree by copying each subgraph with multiple parents. And so that each noun in the data appears (in a synset) at a leaf node, Li and Abe remove those parts of the hierarchy dominated by a noun in the data (but only for that instance of WordNet corresponding to the relevant verb).</Paragraph>
    <Paragraph position="10"> An example cut showing part of the WordNet hierarchy is shown in Figure 3 (based on an example from Li and Abe [1998]; the dashed lines indicate parts of the hierarchy that are not shown in the diagram). This is a possible cut for the object position of the verb eat, and the cut consists of the following classes: &lt;life form&gt; , &lt;solid&gt; , &lt;fluid&gt; , &lt;food&gt; , &lt;artifact&gt; , &lt;space&gt; , &lt;time&gt; , &lt;set&gt; . (The particular choice of classes for the cut in this example is not too important; the example is designed to show how probabilities of senses are estimated from class probabilities.) Since the class in the cut containing &lt;pizza&gt; is &lt;food&gt; , the probability p(&lt;pizza&gt; |eat, obj) would be estimated as p(&lt;food&gt; |eat, obj)/|&lt;food&gt; |.</Paragraph>
    <Paragraph position="11"> Similarly, since the class in the cut containing &lt;mushroom&gt; is &lt;life form&gt; , the probability p(&lt;mushroom&gt; |eat, obj) would be estimated as p(&lt;life form&gt; |eat, obj)/|&lt;life form&gt; |.</Paragraph>
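A sketch of the tree cut estimate described above; the cut, its class probabilities, and the dominated-set helper are all assumed inputs:

```python
def cut_probability(concept, cut, p_class, dominated):
    """Li and Abe-style estimate: find the class in the cut that dominates
    the concept and spread that class's probability uniformly over its
    members, as in equation (20)."""
    for C in cut:
        members = dominated(C)
        if concept in members:
            return p_class[C] / len(members)
    return 0.0

# e.g. with <pizza> falling under <food> in the cut:
#   p(<pizza> | eat, obj) = p(<food> | eat, obj) / |<food>-bar|
```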
    <Paragraph position="12"> The uniform-distribution assumption (20) means that cuts close to the root of the hierarchy result in a greater smoothing of the probability estimates than cuts near the leaves. Thus there is a trade-off between choosing a model that has a cut near the leaves, which is likely to overfit the data, and a more general (simple) model near the root, which is likely to underfit the data. MDL looks ideally suited to the task of model selection, since it is designed to deal with precisely this trade-off. The simplicity of a model is measured using the model description length, which is an information-theoretic term and denotes the number of bits required to encode the model. The fit to the data is measured using the data description length, which is the number of bits required to encode the data (relative to the model). The overall description length is the sum of the model description length and the data description length, and the MDL principle is to select the model with the shortest description length.</Paragraph>
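As a compact summary of this trade-off (our notation; Li and Abe's paper should be consulted for their exact formulation), the description length of a tree cut model M with cut $\Gamma$ given a sample S is:

```latex
% k = number of free parameters of the model (one probability per class in
% the cut, minus one for the sum-to-one constraint); |S| = sample size.
L(M, S) =
  \underbrace{\frac{k}{2}\log_2 |S|}_{\text{model description length}}
  \; + \;
  \underbrace{\Bigl(-\sum_{n \in S} \log_2 \hat{p}_M(n)\Bigr)}_{\text{data description length}}
```

MDL selects the cut minimizing L(M, S).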
    <Paragraph position="13"> We used McCarthy's (2000) implementation of MDL. So that every noun is represented at a leaf node, McCarthy does not remove parts of the hierarchy, as Li and Abe do, but instead creates new leaf nodes for each synset at an internal node. McCarthy also does not transform WordNet into a tree, which is strictly required for Li and Abe's application of MDL. This did create a problem with overgeneralization: Many of the cuts returned by MDL were overgeneralizing at the &lt;entity&gt; node. The reason is that &lt;person&gt; , which is close to &lt;entity&gt; and dominated by &lt;entity&gt; , has two parents: &lt;life form&gt; and &lt;causal agent&gt; . This DAG-like property was responsible for the overgeneralization, and so we removed the link between &lt;person&gt; and &lt;causal agent&gt; . This appeared to solve the problem, and the results presented later for the average degree of generalization do not show an overgeneralization compared with those given in Li and Abe (1998).</Paragraph>
  </Section>
  <Section position="7" start_page="201" end_page="204" type="metho">
    <SectionTitle>
7. Pseudo-Disambiguation Experiments
</SectionTitle>
    <Paragraph position="0"> The task we used to compare the class-based estimation techniques is a decision task previously used by Pereira, Tishby, and Lee (1993) and Rooth et al. (1999). The task is to decide which of two verbs, v and v', is more likely to take a given noun, n, as an object. The test and training data were obtained as follows. A number of verb-direct object pairs were extracted from a subset of the BNC, using the system of Briscoe and Carroll. All those pairs containing a noun not in WordNet were removed, and each verb and argument was lemmatized. This resulted in a data set of around 1.3 million (v, n) pairs.</Paragraph>
    <Paragraph position="1"> To form a test set, 3,000 of these pairs were randomly selected such that each selected pair contained a fairly frequent verb. (Following Pereira, Tishby, and Lee, only those verbs that occurred between 500 and 5,000 times in the data were considered.) Each instance of a selected pair was then deleted from the data to ensure that the test data were unseen. The remaining pairs formed the training data. To complete the test set, a further fairly frequent verb, v', was randomly chosen for each (v, n) pair. The random choice was made according to the verb's frequency in the original data set, subject to the condition that the pair (v', n) did not occur in the training data. Given the set of (v, n, v') triples, the task is to decide whether (v, n) or (v', n) is the correct pair.</Paragraph>
    <Paragraph position="2">  We acknowledge that the task is somewhat artificial, but pseudo-disambiguation tasks of this kind are becoming popular in statistical NLP because of the ease with which training and test data can be created. We also feel that the pseudo-disambiguation task is useful for evaluating the different estimation methods, since it directly addresses the question of how likely a particular predicate is to take a given noun as an argument. An evaluation using a PP attachment task was attempted in Clark and Weir (2000), but the evaluation was limited by the relatively small size of the Penn Treebank.</Paragraph>
    <Paragraph position="3"> 11 We note that this procedure does not guarantee that the correct pair is more likely than the incorrect pair, because of noise in the data from the parser and also because a highly plausible incorrect pair could be generated by chance.</Paragraph>
    <Paragraph position="4">  Note: av.gen. is the average number of generalized levels; sd.gen. is the standard deviation.</Paragraph>
    <Paragraph position="5"> Using our approach, the disambiguation decision for each (v, n, v') triple was made according to the following procedure: choose v if $\max_{c \in cn(n)} p_{SC}(c \mid v, obj) > \max_{c \in cn(n)} p_{SC}(c \mid v', obj)$; choose v' if the inequality is reversed; else choose at random. If n has more than one sense, the sense is chosen that maximizes the relevant probability estimate; this explains the maximization over cn(n). The probability estimates were obtained using our class-based method, and the $G^2$ statistic was used for the chi-square test. This procedure was also used for the MDL alternative, but using the MDL method to estimate the probabilities.</Paragraph>
    <Paragraph position="6"> Using the association score, the decision for each test triple was made according to the following procedure: choose v if $\max_{c \in cn(n)} \max_{c' \in h(c)} \hat{A}(\overline{c'}, v, obj) > \max_{c \in cn(n)} \max_{c' \in h(c)} \hat{A}(\overline{c'}, v', obj)$; choose v' if the inequality is reversed; else choose at random. We use h(c) to denote the set consisting of the hypernyms of c. The inner maximization is over h(c), assuming c is the chosen sense of n, which corresponds to Resnik's method of choosing a set to represent c. The outer maximization is over the senses of n, cn(n), which determines the sense of n by choosing the sense that maximizes the association score.</Paragraph>
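Both decision rules can be sketched as follows; p_sc, A_hat, cn, and hypernyms are assumed implementations of the estimators and lexicon described above:

```python
import random

def disambiguate_sc(n, v, v_prime, cn, p_sc):
    """Similarity-class rule: pick the verb whose best noun sense receives
    the higher estimated probability, breaking ties at random."""
    score_v  = max(p_sc(c, v, "obj") for c in cn(n))
    score_vp = max(p_sc(c, v_prime, "obj") for c in cn(n))
    if score_v > score_vp:
        return v
    if score_vp > score_v:
        return v_prime
    return random.choice([v, v_prime])

def disambiguate_assoc(n, v, v_prime, cn, hypernyms, A_hat):
    """Assoc rule: maximize the association score over the senses of n and,
    within each sense, over the hypernym classes h(c)."""
    def score(verb):
        return max(max(A_hat(h, verb, "obj") for h in hypernyms(c))
                   for c in cn(n))
    s_v, s_vp = score(v), score(v_prime)
    if s_v > s_vp:
        return v
    if s_vp > s_v:
        return v_prime
    return random.choice([v, v_prime])
```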
    <Paragraph position="7"> The first set of results is given in Table 5. Our technique is referred to as the &amp;quot;similarity class&amp;quot; technique, and the approach using the association score is referred to as &amp;quot;Assoc.&amp;quot; The results are given for a range of $\alpha$ values and demonstrate clearly that the performance of similarity class varies little with changes in $\alpha$ and that similarity class outperforms both MDL and Assoc.</Paragraph>
    <Paragraph position="8"> Note: av.gen. is the average number of generalized levels; sd.gen. is the standard deviation.</Paragraph>
    <Paragraph position="9">  We also give a score for our approach using a simple generalization procedure, which we call &amp;quot;low class.&amp;quot; The procedure is to select the first class that has a count greater than zero (relative to the verb and argument position), which is likely to return a low level of generalization, on the whole. The results show that our generalization technique only narrowly outperforms the simple alternative. Note that, although low class is based on a very simple generalization method, the estimation method is still using our class-based technique, by applying Bayes' theorem and conditioning on a class, as described in Section 3; the difference is in how the class is chosen. To investigate the results, we calculated the average number of generalized levels for each approach. The number of generalized levels for a concept c (relative to a verb v and argument position r) is the difference in depth between c and top(c, v, r), as explained in Section 5. For each test case, the number of generalized levels for both verbs, v and v', was calculated, but only for the chosen sense of n. The results are given in the third column of Table 5 and demonstrate clearly that both MDL and Assoc are generalizing to a greater extent than similarity class. (The fourth column gives a standard deviation figure.) These results suggest that MDL and Assoc are overgeneralizing, at least for the purposes of this task.</Paragraph>
    <Paragraph position="10"> To investigate why the value for $\alpha$ had no impact on the results, we repeated the experiment, but with one fifth of the data. A new data set was created by taking every fifth pair of the original 1.3 million pairs. A test set of 3,000 triples was created from this new data set, as before, but this time only verbs that occurred between 100 and 1,000 times were considered. The results using these test and training data are given in Table 6.</Paragraph>
    <Paragraph position="11"> These results show a variation in performance across values for $\alpha$, with an optimal performance when $\alpha$ is around 0.75. (Of course, in practice, the value for $\alpha$ would need to be optimized on a held-out set.) But even with this variation, similarity class still outperforms MDL and Assoc across the whole range of $\alpha$ values. Note that the $\alpha$ values corresponding to the lowest scores lead to a significant amount of generalization, which provides additional evidence that MDL and Assoc are overgeneralizing for this task. The low-class method scores highly for this data set also, but given that the task is one that apparently favors a low level of generalization, the high score is not too surprising.</Paragraph>
    <Paragraph position="12"> As a final experiment, we compared the task performance using the $X^2$ statistic with the performance using $G^2$; the scores obtained with $X^2$ were at least as good. This suggests a possible explanation for the results presented here and those in Dunning (1993): that the $X^2$ statistic provides a less conservative test when counts in the contingency table are low. (By a conservative test we mean one in which the null hypothesis is not easily rejected.) A less conservative test is better suited to the pseudo-disambiguation task, since it results in a lower level of generalization, on the whole, which is good for this task. In contrast, the task that Dunning considers, the discovery of bigrams, is better served by a more conservative test.</Paragraph>
  </Section>
  <Section position="8" start_page="204" end_page="204" type="metho">
    <SectionTitle>
8. Conclusion
</SectionTitle>
    <Paragraph position="0"> We have presented a class-based estimation method that incorporates a procedure for finding a suitable level of generalization in WordNet. This method has been shown to provide superior performance on a pseudo-disambiguation task, compared with two alternative approaches. An analysis of the results has shown that the other approaches appear to be overgeneralizing, at least for this task. One of the features of the generalization procedure is the way that $\alpha$, the level of significance in the chi-square test, is treated as a parameter. This allows some control over the extent of generalization, which can be tailored to particular tasks. We have also shown that the task performance is at least as good when using the Pearson chi-square statistic as when using the log-likelihood chi-square statistic.</Paragraph>
    <Paragraph position="1"> There are a number of ways in which this work could be extended. One possibility would be to use all the classes dominated by the hypernyms of a concept, rather than just one, to estimate the probability of the concept. An estimate would be obtained for each hypernym, and the estimates combined in a linear interpolation. An approach similar to this is taken by Bikel (2000), in the context of statistical parsing.</Paragraph>
    <Paragraph position="2"> There is still room for investigation of the hidden-data problem when data are used that have not been sense disambiguated. In this article, a very simple approach is taken, in which the count for a noun is distributed evenly among the noun's senses.</Paragraph>
  </Section>
</Paper>