<?xml version="1.0" standalone="yes"?> <Paper uid="J04-1003"> <Title>Verb Class Disambiguation Using Informative Priors</Title> <Section position="3" start_page="50" end_page="98" type="metho"> <SectionTitle> 2. The Prior Model </SectionTitle>
<Paragraph position="0"> Consider again the sentences in (6). Assuming that we more often write something to someone rather than for someone, we would like to derive Message Transfer as the prevalent class for write rather than Performance. We view the choice of a class for a polysemous verb in a given frame as maximizing the joint probability P(c, f , v), where v is a verb subcategorizing for the frame f and inhabiting more than one Levin class:</Paragraph>
<Paragraph> P(c, f , v) = P(v) P(f|v) P(c|v, f ) (10) </Paragraph>
<Paragraph position="2"> Although the terms P(v) and P(f|v) can be estimated from the BNC (P(v) reduces to the number of times a verb is attested in the corpus, and P(f|v) can be obtained through parsing), the estimation of P(c|v, f) is somewhat problematic, since it relies on the frequency F(c, v, f). The latter could be obtained straightforwardly if we had access to a parsed corpus annotated with subcategorization and semantic-class information. Lacking such a corpus, we will assume that the semantic class determines the subcategorization patterns of its members independently of their identity (see (11)):</Paragraph>
<Paragraph> P(c|v, f ) ≈ P(c|f ) (11) </Paragraph>
<Paragraph position="4"> The independence assumption is a simplification of Levin's (1993) hypothesis that the argument structure of a given verb is a direct reflection of its meaning. The rationale behind the approximation in (11) is that since class formation is determined on the basis of diathesis alternations, it is the differences in subcategorization structure, rather than the identity of the individual verbs, that determine class likelihood. For example, if we know that some verb subcategorizes for the double object and the prepositional &quot;NP1 V NP2 to NP3&quot; frames, we can guess that it is a member of the Give class or the Message Transfer class without knowing whether this verb is give, write, or tell.</Paragraph>
<Paragraph position="5"> Note that the approximation in (11) assumes that verbs of the same class uniformly subcategorize (or not) for a given frame. This is evidently not true for all classes of verbs. For example, all Give verbs undergo the dative diathesis alternation, and therefore we would expect them to be attested in both the double-object and prepositional frame, but only a subset of Create verbs undergo the benefactive alternation. For example, the verb invent is a Create verb and can be attested only in the benefactive prepositional frame (I will invent a tool for you versus ?I will invent you a tool; see Levin [1993] for details). By applying Bayes' law we write P(c|f) as</Paragraph>
<Paragraph> P(c|f ) = P(f|c) P(c) / P(f ) (13) </Paragraph>
<Paragraph position="9"> It is easy to obtain P(v) from the lemmatized BNC (see (a) in Table 2). In order to estimate the probability P(f|v), we need to know how many times a verb is attested with a given frame. We acquired Levin-compatible subcategorization frames from the BNC after performing a coarse-grained mapping between Levin's frame descriptions and surface syntactic patterns without preserving detailed semantic information about argument structure and thematic roles. This resulted in 80 frame types that were broadly compatible with Levin. We used Gsearch (Corley et al.
2001), a tool that facilitates the search of arbitrary part-of-speech-tagged corpora for shallow syntactic patterns based on a user-specified context-free grammar and a syntactic query. We specified a chunk grammar for recognizing the verbal complex, NPs, and PPs and used Gsearch to extract tokens matching the frames specified in Levin. We discarded all frames with a frequency smaller than five, as they were likely to be unreliable given our heuristic approach. The frame probability P(f) (see the denominator in (13) and equation (e) in Table 2) was also estimated on the basis of the Levin-compatible subcategorization frames that were acquired from the BNC.</Paragraph>
<Paragraph position="10"> We cannot read off P(f|c) in (13) directly from the corpus, because the corpus is not annotated with verb classes. Nevertheless, Levin's (1993) classification records the syntactic frames that are licensed by a given verb class (for example, Give verbs license the double object and the &quot;NP1 V NP2 to NP3&quot; frame) and also the number and type of classes a given verb exhibits (e.g., write inhabits two classes, Performance and Message Transfer). Furthermore, we know how many times a given verb is attested with a certain frame in the corpus, as we have acquired Levin-compatible frames from the BNC (see (b) in Table 2). We first explain how we obtain F(f , c), which we rewrite as the sum of all occurrences of verbs v that are members of class c and are attested in the corpus with frame f (see (c) and (f) in Table 2).</Paragraph>
<Paragraph position="11"> For monosemous verbs the count F(c, f , v) reduces to the number of times these verbs have been attested in the corpus with a certain frame. For polysemous verbs, we additionally need to know the class in which they were attested in the corpus.</Paragraph>
<Paragraph position="12"> Note that we don't necessarily need an annotated corpus for class-ambiguous verbs whose classes license distinct frames (see example (5)), provided that we have extracted verb frames relatively accurately. For genuinely ambiguous verbs (i.e., verbs licensed by classes that take the same frame), given that we don't have access to a corpus annotated with verb class information, we distribute the frequency of the verb and its frame evenly across its semantic classes:</Paragraph>
<Paragraph> F(c, f , v) ≈ F(f , v) / |classes(v, f )| (14) </Paragraph>
<Paragraph position="14"> Here F(f , v) is the co-occurrence frequency of a verb and its frame and |classes(v, f)| is the number of classes verb v is a member of when found with frame f . The joint frequency of a class and its frame F(f , c) is then the sum of all verbs that are members of the class c and are attested with frame f in the corpus (see (f) in Table 2). Table 3 shows the estimation of the frequency F(c, f , v) for six verbs that are members of the Give class. Consider for example feed, which is a member of four classes: Give, Gorge, Feeding, and Fit. Of these classes, only Feeding and Give license the double-object and prepositional frames. This is why the co-occurrence frequency of feed with these frames is divided by two. The verb serve inhabits four classes. The double-object frame is licensed by the Give class, whereas the prepositional frame is additionally licensed by the Fulfilling class, and therefore the co-occurrence frequency F(&quot;NP1 V NP2 to NP3&quot;, serve) is equally distributed between these two classes. This is clearly a simplification, since one would expect F(c, f , v) to vary for different verb classes.</Paragraph>
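<Paragraph> To make the even-split estimation concrete, the following Python sketch implements equation (14) and the class-frame count (f) of Table 2. The class inventory, the frames each class licenses, and the verb-frame counts are illustrative stand-ins, not the actual Levin (1993) index or BNC counts:
    # Illustrative stand-ins for class membership, licensed frames, and F(f, v).
    CLASS_FRAMES = {
        "Give":            {"NP1 V NP2 NP3", "NP1 V NP2 to NP3"},
        "MessageTransfer": {"NP1 V NP2 NP3", "NP1 V NP2 to NP3"},
        "Performance":     {"NP1 V NP2", "NP1 V NP2 for NP3"},
    }
    VERB_CLASSES = {
        "give":  {"Give"},
        "write": {"Performance", "MessageTransfer"},
    }
    F_fv = {("write", "NP1 V NP2 to NP3"): 120, ("give", "NP1 V NP2 NP3"): 900}

    def classes_licensing(verb, frame):
        """The classes of verb that license frame; |classes(v, f)| is its size."""
        return [c for c in VERB_CLASSES[verb] if frame in CLASS_FRAMES[c]]

    def F_cfv(cls, frame, verb):
        """Equation (14): split F(f, v) evenly across the licensing classes."""
        licensors = classes_licensing(verb, frame)
        if cls not in licensors:
            return 0.0
        return F_fv.get((verb, frame), 0) / len(licensors)

    def F_fc(frame, cls):
        """Table 2 (f): F(f, c) as the sum of F(c, f, v) over member verbs."""
        return sum(F_cfv(cls, frame, v) for v, cs in VERB_CLASSES.items() if cls in cs)

    # write with "NP1 V NP2 to NP3" is licensed here only by MessageTransfer,
    # so the whole verb-frame count goes to that class.
    print(F_cfv("MessageTransfer", "NP1 V NP2 to NP3", "write"))  # 120.0
    print(F_fc("NP1 V NP2 to NP3", "MessageTransfer"))            # 120.0
</Paragraph>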
<Paragraph> However, note that according to this estimation, F(f , c) will vary across frames, reflecting differences in the likelihood of a class being attested with a certain frame.</Paragraph>
<Paragraph position="15"> Both terms P(f|c) and P(c) in (13) rely on the class frequency F(c) (see (c) and (d) in Table 2). We rewrite F(c) as the sum of all verbs attested in the corpus with class c (see (g) in Table 2). For monosemous verbs the estimate of F(v, c) reduces to the count of the verb in the corpus. Once again we cannot estimate F(v, c) for polysemous verbs directly. The task would be straightforward if we had a corpus of verbs, each labeled explicitly with class information. All we have is the overall frequency of a given verb in the BNC and the number of classes it is a member of according to Levin (1993).</Paragraph>
<Paragraph position="16"> Since polysemous verbs can generally be the realization of more than one semantic class, counts of semantic classes can be constructed by dividing the contribution from the verb by the number of classes it belongs to (Resnik 1993; Lauer 1995). We rewrite the frequency F(v, c) as shown in (h) in Table 2 and approximate P(c|v), the true distribution of the verb and its classes, as follows:</Paragraph>
<Paragraph> F(v, c) ≈ F(v) / |classes(v)| (15) </Paragraph>
<Paragraph position="18"> Here, F(v) is the number of times the verb v was observed in the corpus and |classes(v)| is the number of classes c it belongs to. For example, in order to estimate the frequency of the class Give, we consider all verbs that are listed as members of this class in Levin (1993). The class contains thirteen verbs, among which six are polysemous. We will obtain F(Give) by taking into account the verb frequency of the monosemous verbs (|classes(v)| is one in this case) as well as distributing the frequency of the polysemous verbs among their classes. For example, feed inhabits the classes Give, Gorge, Feeding, and Fit and occurs in the corpus 3,263 times. We will increment the count of F(Give) by 3,263/4. Table 3 illustrates the estimation of F(v, c) for six members of the Give class. The total frequency of the class is obtained by summing over the individual values of F(v, c) (see equation (g) in Table 2).</Paragraph>
<Paragraph position="19"> The approach in (15) relies on the simplifying assumption that the frequency of a verb is distributed evenly across its semantic classes. This is clearly not true for all verbs. Consider, for example, the verb rent, which inhabits the classes Give (Frank rented Peter his room) and Get (I rented my flat for my sister). Intuitively speaking, the Give sense of rent is more frequent than the Get sense; however, this is not taken into account in (15), primarily because we do not know the true distribution of the classes for rent. An alternative to (15) is to distribute the verb frequency unequally among verb classes. Even though we don't know how likely classes are in relation to a particular verb, we can approximate how likely classes are in general on the basis of their size (i.e., the number of verbs that are members of each class). We can then distribute a verb's frequency unequally, according to class size. This time we approximate P(c|v) (see (h) in Table 2) by P(c|amb class), the probability of class c given the ambiguity class amb class. The latter represents the set of classes a verb might inhabit:</Paragraph>
<Paragraph> F(v, c) ≈ F(v) * P(c|amb class) (16) </Paragraph>
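<Paragraph> A minimal Python sketch of the even-distribution scheme in (15); feed's BNC frequency (3,263) and its four classes are taken from the text, and the function reproduces the 3,263/4 increment derived above. The one-verb lexicon is an illustrative stand-in:
    VERB_CLASSES = {"feed": ["Give", "Gorge", "Feeding", "Fit"]}
    F_v = {"feed": 3263}  # corpus frequency of the verb in the BNC

    def F_vc_even(verb, cls):
        """Equation (15): F(v, c) is F(v) split evenly over the verb's classes."""
        classes = VERB_CLASSES[verb]
        return F_v[verb] / len(classes) if cls in classes else 0.0

    # Each of feed's four classes receives a quarter of its frequency.
    print(F_vc_even("feed", "Give"))  # 815.75, i.e., 3,263 / 4

    def F_c(cls):
        """Table 2 (g): F(c) sums F(v, c) over all member verbs of class c."""
        return sum(F_vc_even(v, cls) for v in VERB_CLASSES)

    print(F_c("Give"))  # 815.75 with this one-verb toy lexicon
</Paragraph>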
<Paragraph position="20"> We collapse verbs into ambiguity classes in order to reduce the number of parameters that must be estimated; we certainly lose information, but the approximation makes it easier to get reliable estimates from limited data. (Our use of ambiguity classes is inspired by a similar use in hidden Markov model-based part-of-speech tagging; Kupiec 1992.) We simply approximate P(c|amb class) using a heuristic based on class size:</Paragraph>
<Paragraph> P(c|amb class) ≈ size(c) / Σc′∈amb class size(c′) (17) </Paragraph>
<Paragraph position="21"> For each class we recorded the number of its members after discarding verbs whose frequency was less than one per million in the BNC. This gave us a first approximation of the size of each class. We then computed, for each polysemous verb, the total size of the classes of which it was a member. We calculated P(c|amb class) by dividing the former by the latter (see equation (17)). We obtained the class frequency F(c) by multiplying P(c|amb class) by the observed frequency of the verb in the BNC (see equation (16)). As an example, consider again F(Give), which is calculated by summing over all verbs that are members of this class (see (g) in Table 2). In order to add the contribution of the verb feed, we need to distribute its corpus frequency among the classes Give, Gorge, Feeding, and Fit. By multiplying the respective P(c|amb class) values for these classes by the frequency of feed in the BNC (3,263), we obtain the values of F(v, c) given in Table 4. Only the frequency F(feed, Give) is relevant for F(Give).</Paragraph>
<Paragraph position="22"> The estimation process just described involves at least one gross simplification, since P(c|amb class) is calculated without reference to the identity of the verb in question. For any two verbs that fall into the same set of classes, P(c|amb class) will be the same, even though one or both may be atypical in its distribution across the classes. Furthermore, the estimation tends to favor large classes, again irrespective of the identity of the verb in question. For example, the verb carry has three classes, Carry, Fit, and Cost. Intuitively speaking, the Carry class is the most frequent (e.g., Smoking can impair the blood which carries oxygen to the brain; I carry sugar lumps around with me). However, since the Fit class (e.g., Thameslink presently carries 20,000 passengers daily) is larger than the Carry class, it will be given a higher probability (.45 versus .4). Our estimation scheme is clearly a simplification, but it is an empirical question how much it matters. Tables 5 and 6 show the ten most frequent classes as estimated using (15) and (16). We explore the contribution of the two estimation schemes for P(c) in Experiments 1 and 2.</Paragraph>
<Paragraph position="2"> The probabilities P(f|c) and P(f|v) will be unreliable when the frequencies F(f , v) and F(f , c) are small and will be undefined when the frequencies are zero. Following Hindle and Rooth (1993), we smooth the observed frequencies as shown in Table 7; the smoothed estimates redistribute some probability mass across all classes, so that unseen class-frame combinations remain defined.</Paragraph>
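<Paragraph> The class-size heuristic in (16) and (17), sketched in Python, followed by a smoothing helper in the general style of Hindle and Rooth (1993). The class sizes are invented for illustration (Table 4's actual values are not reproduced in this extraction), and Table 7's exact smoothed estimators are likewise not shown; the helper only illustrates the back-off idea of adding the marginal probability as a pseudocount:
    VERB_CLASSES = {"feed": ["Give", "Gorge", "Feeding", "Fit"]}
    F_v = {"feed": 3263}
    # Hypothetical class sizes (member verbs above the 1-per-million cutoff).
    CLASS_SIZE = {"Give": 13, "Gorge": 5, "Feeding": 9, "Fit": 20}

    def p_c_given_amb_class(cls, amb_class):
        """Equation (17): class size over the total size of the ambiguity class."""
        return CLASS_SIZE[cls] / sum(CLASS_SIZE[c] for c in amb_class)

    def F_vc_weighted(verb, cls):
        """Equation (16): F(v, c) is F(v) weighted by P(c|amb class)."""
        amb_class = VERB_CLASSES[verb]
        if cls not in amb_class:
            return 0.0
        return F_v[verb] * p_c_given_amb_class(cls, amb_class)

    print(round(F_vc_weighted("feed", "Give"), 2))  # 13/47 of 3,263 = 902.53

    def smoothed(cond_count, total_count, marginal_p):
        """Hindle-and-Rooth-style smoothing for P(f|c) and P(f|v): adding the
        marginal P(f) as a pseudocount keeps zero counts defined."""
        return (cond_count + marginal_p) / (total_count + 1.0)

    print(smoothed(0, 500, 0.02))  # an unseen pair backs off to about P(f)/F(c)
</Paragraph>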
<Paragraph> We do not claim that this scheme is perfect, but any deficiencies it may have are almost certainly masked by the effects of approximations and simplifications elsewhere in the system.</Paragraph>
<Paragraph position="5"> We evaluated the performance of the model on all verbs listed in Levin (1993) that are polysemous (i.e., members of more than one class) and take frames characteristic of the widely studied dative and benefactive alternations (Pinker 1989; Boguraev and Briscoe 1989; Levin 1993; Goldberg 1995; Briscoe and Copestake 1999) and of the less well-known conative and possessor object alternations (see examples (1)-(4)). All four alternations seem fairly productive; that is, a large number of verbs undergo these alternations, according to Levin. A large number of classes license the frames that are relevant for these alternations, and the verbs that inhabit these classes are likely to exhibit class ambiguity: 20 classes license the double object frame, 22 license the prepositional frame &quot;NP1 V NP2 to NP3,&quot; 17 classes license the benefactive &quot;NP1 V NP2 for NP3&quot; frame, 118 (out of 200) classes license the transitive frame, and 15 classes license the conative &quot;NP1 V at NP2&quot; frame.</Paragraph>
<Paragraph position="6"> In Experiment 1 we use the model to test the hypothesis that subcategorization information can be used to disambiguate polysemous verbs. In particular, we concentrate on verbs like serve (see example (5)) that can be disambiguated solely on the basis of their frame. In Experiment 2 we focus on verbs that are genuinely ambiguous; that is, they inhabit a single frame and yet can be members of more than one semantic class (e.g., write, study; see examples (6)-(9)). In this case, we use the probabilistic model to assign a probability to each class the verb inhabits. The class with the highest probability represents the dominant meaning for a given verb.</Paragraph>
<Paragraph position="7"> 3. Experiment 1: Using Subcategorization to Resolve Verb Class Ambiguity</Paragraph>
<Section position="1" start_page="98" end_page="98" type="sub_section"> <SectionTitle> 3.1 Method </SectionTitle>
<Paragraph position="0"> In this experiment we focused solely on verbs whose meaning can be potentially disambiguated by taking into account their subcategorization frame. A model that performs badly on this task cannot be expected to produce any meaningful results for genuinely ambiguous verbs.</Paragraph>
<Paragraph position="1"> We considered 128 verbs with the double-object frame (2.72 average class ambiguity), 101 verbs with the prepositional frame &quot;NP1 V NP2 to NP3&quot; (2.59 average class ambiguity), 113 verbs with the frame &quot;NP1 V NP2 for NP3&quot; (2.63 average class ambiguity), 42 verbs with the frame &quot;NP1 V at NP2&quot; (3.05 average class ambiguity), and 39 verbs with the transitive frame (2.28 average class ambiguity). The task was the following: Given that we know the frame of a given verb, can we predict its semantic class? In other words, by varying the class c in the term P(c, f , v), we are trying to see whether the class that maximizes it is the one predicted by the lexical semantics and the argument structure of the verb in question. The model's responses were evaluated against Levin's (1993) classification. The model's performance was considered correct if it agreed with Levin in assigning a verb to an appropriate class given a particular frame.</Paragraph>
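<Paragraph> Experiment 1's protocol in sketch form: choose argmax over classes of P(c, f , v) for a (verb, frame) pair and score against Levin's classification. Since P(v) and P(f|v) do not vary with the candidate class, the argmax depends only on P(c|f) restricted to the verb's classes. The probability values below are invented; the serve example mirrors the double-object case discussed earlier:
    # Invented P(c|f) values; absent pairs mean the class does not license the frame.
    P_c_given_f = {
        ("Give",       "NP1 V NP2 NP3"):    0.40,
        ("Give",       "NP1 V NP2 to NP3"): 0.30,
        ("Fulfilling", "NP1 V NP2 to NP3"): 0.25,
    }
    VERB_CLASSES = {"serve": {"Give", "Fulfilling"}}

    def predict_class(verb, frame):
        """argmax_c P(c, f, v), which reduces to argmax_c P(c|f) over the
        verb's candidate classes."""
        candidates = VERB_CLASSES[verb]
        return max(candidates, key=lambda c: P_c_given_f.get((c, frame), 0.0))

    def accuracy(test_items):
        """test_items: (verb, frame, gold_class) triples, gold from Levin (1993)."""
        hits = sum(predict_class(v, f) == gold for v, f, gold in test_items)
        return hits / len(test_items)

    # Only Give licenses the double-object frame for serve, so the frame
    # disambiguates the verb, as in example (5).
    print(predict_class("serve", "NP1 V NP2 NP3"))         # Give
    print(accuracy([("serve", "NP1 V NP2 NP3", "Give")]))  # 1.0
</Paragraph>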
<Paragraph position="2"> Recall from Section 2 that we proposed two approaches for the estimation of the class probability P(c). We explore the influence of P(c) by obtaining two sets of results corresponding to the two estimation schemes.</Paragraph> </Section>
<Section position="2" start_page="98" end_page="98" type="sub_section"> <SectionTitle> 3.2 Results </SectionTitle>
<Paragraph position="0"> The model's accuracy is shown in Tables 8 and 9. The results in Table 8 were obtained using the estimation scheme for P(c) that relies on the even distribution of the frequency of a verb across its semantic classes (see equation (15)). The results in Table 9 were obtained using the alternative scheme that distributes verb frequency unequally among verb classes by taking class size into account (see equation (16)).</Paragraph>
<Paragraph position="1"> As mentioned in Section 3.1, the results were based on comparison of the model's performance against Levin's (1993) classification. We also compared the results to the baseline of choosing the most likely class P(c) (without taking subcategorization information into account). The latter was determined on the basis of the approximations described in Section 2 (see equation (g) in Table 2, as well as equations (15), (16), and (17)).</Paragraph>
<Paragraph position="2"> The model achieved an accuracy of 93.9% using either type of estimation for P(c).</Paragraph>
<Paragraph position="3"> It also outperformed the baseline by 38.1% (see Table 8) and 37.2% (see Table 9). One might expect an accuracy of 100%, since these verbs can be disambiguated solely on the basis of their frame. However, our model falls short of this, mainly because of the way we estimate the terms P(c) and P(f|c): We overemphasize the importance of class information without taking into account how individual verbs distribute across classes. Furthermore, we rely on frame frequencies acquired from the BNC using shallow syntactic analysis, which means that the correspondence between Levin's (1993) frames and our acquired frames is not one to one. Apart from the fact that our frames do not preserve much of the linguistic information detailed in Levin, the number of frames acquired for a given verb can be a subset or a superset of the frames available in Levin. Note that the two estimation schemes yield comparable performances. This is a positive result given the importance of P(c) in the estimation of P(c, f , v).</Paragraph>
<Paragraph position="4"> A more demanding task for our probabilistic model is posed by genuinely ambiguous verbs (i.e., verbs for which the mapping between meaning and subcategorization is not one to one). Although native speakers may have intuitions about the dominant interpretation for a given verb, this information is entirely absent from Levin (1993) and from the corpus on which our model is trained. In Experiment 2 we show how our model can be used to recover this information.</Paragraph>
<Paragraph position="5"> 4. Experiment 2: Using Corpus Distributions to Derive Verb Class Preferences</Paragraph> </Section>
<Section position="3" start_page="98" end_page="98" type="sub_section"> <SectionTitle> 4.1 Method </SectionTitle>
<Paragraph position="0"> We evaluated the performance of our model on 67 genuinely ambiguous verbs, that is, verbs that inhabit a single frame and can be members of more than one semantic class (e.g., write).
These verbs are listed in Levin (1993) and undergo the dative, benefactive, conative, and possessor object alternations. As in Experiment 1, we considered verbs with the double-object frame (3.27 average class ambiguity), verbs with the frame &quot;NP1 V NP2 to NP3&quot; (2.94 average class ambiguity), verbs with the frame &quot;NP1 V NP2 for NP3&quot; (2.42 average class ambiguity), verbs with the frame &quot;NP1 V at NP2&quot; (2.71 average class ambiguity), and transitive verbs (2.77 average class ambiguity). The model's predictions were compared against manually annotated data that was used only for testing purposes. The model was trained without access to a disambiguated corpus. More specifically, corpus tokens characteristic of the verb and frame in question were randomly sampled from the BNC and annotated with class information so as to derive the true distribution of the verb's classes in a particular frame. The verb selection procedure was as follows.</Paragraph>
<Paragraph position="1"> Given the restriction that these verbs be semantically ambiguous in a specific syntactic frame, we could not simply sample from the entire BNC, since this would decrease the chances of finding the verb in the frame we are interested in. Instead, a stratified sample was used: For all class-ambiguous verbs, tokens were randomly sampled from the parsed data used for the acquisition of verb frame frequencies. The model was evaluated on verbs for which a reliable sample could be obtained. This meant that verbs had to have a frame frequency larger than 50. For verbs with a frame frequency of at least 100, 100 tokens were randomly selected and annotated with verb class information. For verbs with a frame frequency between 50 and 100, no sampling took place; the entire set of tokens was manually annotated. This selection procedure resulted in 14 verbs with the double-object frame, 16 verbs with the frame &quot;NP1 V NP2 to NP3,&quot; 2 verbs with the frame &quot;NP1 V NP2 for NP3,&quot; 1 verb with the frame &quot;NP1 V at NP2,&quot; and 80 verbs with the transitive frame. From the transitive verbs we further randomly selected 34 verbs; these were manually annotated and used for evaluating the model's performance. (Although the model can yield predictions for any number of verbs, evaluation could not be performed for all 80 transitive verbs, as our judges would have had to annotate 8,000 corpus tokens.)</Paragraph>
<Paragraph position="2"> The selected tokens were annotated with class information by two judges, both linguistics graduate students. The classes were taken from Levin (1993) and augmented with the class Other, which was reserved for either corpus tokens that had the wrong frame or those for which the classes in question were not applicable. The judges were given annotation guidelines (for each verb) but no prior training (for details on the annotation study, see Lapata [2001]). The annotation provided a gold standard for evaluating the model's performance and enabled us to test whether humans agree on the class annotation task. We measured the judges' agreement on the annotation task using the kappa coefficient (Cohen 1960).</Paragraph>
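<Paragraph> The kappa coefficient can be computed directly from the two judges' parallel annotations. The sketch below uses the standard two-rater formulation, with invented labels rather than the study's actual annotations:
    from collections import Counter

    def cohen_kappa(labels_a, labels_b):
        """Cohen's (1960) kappa: observed agreement corrected for the
        agreement expected if both judges labeled at random."""
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
        return (observed - expected) / (1.0 - expected)

    judge1 = ["Feeding", "Feeding", "Give", "Feeding", "Other", "Give"]
    judge2 = ["Feeding", "Give", "Give", "Feeding", "Other", "Give"]
    print(round(cohen_kappa(judge1, judge2), 2))  # 0.74 on these invented labels
</Paragraph>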
<Paragraph> In general, the agreement on the class annotation task was good, with kappa values ranging from .66 to 1.00.</Paragraph> </Section> </Section>
<Section position="5" start_page="98" end_page="98" type="metho"> <SectionTitle> 4.2 Results </SectionTitle>
<Paragraph position="0"> We counted the performance of our model as correct if it agreed with the &quot;most preferred,&quot; that is, the most frequent, verb class, as determined in the manually annotated corpus sample by taking the average of the responses of both judges. As an example, consider the verb feed, which in the double-object frame is ambiguous between the classes Feeding and Give. According to the model, Feeding is the most likely class for feed.</Paragraph>
<Paragraph position="1"> Out of 100 instances of the verb feed in the double-object frame, 61 were manually assigned the Feeding class, 32 were assigned the Give class, and 6 were parsing mistakes (and therefore assigned the class Other). In this case the model's outcome is considered correct, given that the corpus tokens also reveal a preference for the Feeding class (i.e., the Feeding instances outnumber the Give ones).</Paragraph>
<Paragraph position="2"> As in Experiment 1, we explored the influence of the parameter P(c) on the model's performance by obtaining two sets of results corresponding to the two estimation schemes discussed in Section 2. The model's accuracy is shown in Tables 10 and 11.</Paragraph>
<Paragraph position="3"> The results in Table 10 were obtained using the estimation scheme for P(c) that relies on the even distribution of a verb's frequency across its semantic classes (see equation (15)). The results in Table 11 were obtained using a scheme that distributes verb frequency unequally among verb classes by taking class size into account (see equation (16)). As in Experiment 1, the results were compared to a simple baseline that defaults to the most likely class without taking verb frame information into account (see equation (g) in Table 2 as well as equations (15), (16), and (17)).</Paragraph>
<Paragraph position="4"> The model achieved an accuracy of 74.6% using the estimation scheme of equal distribution and an accuracy of 73.1% using the estimation scheme of unequal distribution. The difference between the two estimation schemes is not statistically significant (χ2(67) = 2.17, p = .84). Table 12 gives the distribution of classes for 12 polysemous verbs taking the double-object frame as obtained from the manual annotation of corpus tokens, together with interannotator agreement (k). We also give the (log-transformed) probabilities of these classes as derived by the model. (No probabilities are given for the Other class; this is not a Levin class, but it was used by the annotators, mainly to indicate parsing errors.)</Paragraph>
<Paragraph position="7"> The presence of the symbol ✓ indicates that the model's class preference for a given verb agrees with its distribution in the corpus. The absence of ✓ indicates disagreement. For the comparison shown in Table 12, model class preferences were derived using the equal-distribution estimation scheme for P(c) (see equation (15)).</Paragraph>
<Paragraph position="8"> As shown in Table 12, the model's predictions are generally borne out in the corpus data. Misclassifications are due mainly to the fact that the model does not take verb class dependencies into account. Consider, for example, the verb cook. According to the model, the most likely class for cook is Build.</Paragraph>
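<Paragraph> A sketch of how the gold-standard preference is read off an annotated sample and compared with the model's choice, using the feed counts reported above. Other tokens are set aside, and for brevity the sketch uses raw counts rather than the average over both judges:
    from collections import Counter

    def dominant_class(annotations):
        """The most frequent Levin class in the sample; Other (wrong frame or
        parsing mistakes) is excluded."""
        counts = Counter(a for a in annotations if a != "Other")
        return counts.most_common(1)[0][0]

    sample = ["Feeding"] * 61 + ["Give"] * 32 + ["Other"] * 6
    gold = dominant_class(sample)
    model_prediction = "Feeding"  # the model's preferred class for feed, as above
    print(gold, model_prediction == gold)  # Feeding True
</Paragraph>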
<Paragraph> Although it may generally be the case that Build verbs (e.g., make, assemble, build) are more frequent than Prepare verbs (e.g., bake, roast, boil), the situation is reversed for cook. The same is true for the verb shoot, which when attested in the double-object frame is more likely to be a Throw verb (Jamie shot Mary a glance) than a Get verb (I will shoot you two birds). Notice that our model is not context sensitive; that is, it does not derive class rankings tailored to specific verbs, primarily because this information is not readily available in the corpus, as explained in Section 2. However, we have effectively built a prior model of the joint distribution of verbs, their classes, and their syntactic frames that can be useful for disambiguating polysemous verbs in context. Our class disambiguation experiments are described below.</Paragraph> </Section>
<Section position="6" start_page="98" end_page="98" type="metho"> <SectionTitle> 5. Class Disambiguation </SectionTitle>
<Paragraph position="0"> In the previous sections we focused on deriving a model of the distribution of Levin classes without relying on annotated data and showed that this model infers the right class for genuinely ambiguous verbs 74.6% of the time without taking the local context of their occurrence into account. An obvious question is whether this information is useful for disambiguating tokens rather than types. In the following we report on a disambiguation experiment that takes advantage of this prior information.</Paragraph>
<Paragraph position="1"> Word sense disambiguation is often cast as a problem in supervised learning, where a disambiguator is induced from a corpus of manually sense-tagged text. The context within which the ambiguous word occurs is typically represented by a set of linguistically motivated features from which a learning algorithm induces a representative model that performs the disambiguation. A variety of classifiers have been employed for this task (see Mooney [1996] and Ide and Veronis [1998] for overviews), the most popular being decision lists (Yarowsky 1994, 1995) and naive Bayesian classifiers (Pedersen 2000; Ng 1997; Pedersen and Bruce 1998; Mooney 1996; Cucerzan and Yarowsky 2002). We employed a naive Bayesian classifier (Duda and Hart 1973) for our experiments, as it is a very convenient framework for incorporating prior knowledge and studying its influence on the classification task. In Section 5.1 we describe a basic naive Bayesian classifier and show how it can be extended with informative priors. In Section 5.2 we discuss the types of contextual features we use. We report on our experimental results in Section 6.</Paragraph>
<Section position="1" start_page="98" end_page="98" type="sub_section"> <SectionTitle> 5.1 Naive Bayes Classification </SectionTitle>
<Paragraph position="0"> A naive Bayesian classifier assumes that all the feature variables representing a problem are conditionally independent, given the value of the classification variable. In word sense disambiguation, the features (a1, . . . , an) represent the context surrounding the ambiguous word, and the classification variable c is the sense (Levin class in our case) of the ambiguous word in this particular context.</Paragraph>
<Paragraph position="3"> Within a naive Bayes approach, the probability of the class c given its context can be expressed as</Paragraph>
<Paragraph> P(c|a1, . . . , an) = P(c) ∏ P(ai|c) / P(a1, . . . , an) (18) </Paragraph>
<Paragraph position="4"> Since P(a1, . . . , an) is constant for all classes c, the problem reduces to finding the class c with the maximum value for the numerator:</Paragraph>
<Paragraph> c′ = argmax_c P(c) ∏ P(ai|c) (19) </Paragraph>
<Paragraph position="6"> If we choose the prior P(c) to be uniform,</Paragraph>
<Paragraph> P(c) = 1/|C| for every class c in the class inventory C (20) </Paragraph>
<Paragraph position="7"> it makes no contribution to the maximization, and the classifier reduces to</Paragraph>
<Paragraph> c′ = argmax_c ∏ P(ai|c) (21) </Paragraph>
<Paragraph position="8"> Note, however, that we developed in the previous section two types of nonuniform prior models. The first model derives P(c) heuristically from the BNC, ignoring the identity of the polysemous verb and its subcategorization profile, and the second model estimates the class distribution P(c, v, f) by taking the frame distribution into account. So, the naive Bayesian classifier in (21) can be extended with a nonuniform prior:</Paragraph>
<Paragraph> c′ = argmax_c P(c) ∏ P(ai|c) (22) </Paragraph>
<Paragraph> c′ = argmax_c P(c, f , v) ∏ P(ai|c, v, f ) (23) </Paragraph>
<Paragraph position="10"> where P(c) is estimated as shown in (d)-(g) in Table 2 and P(c, v, f ), the prior for each class c corresponding to verb v in frame f , is estimated as explained in Section 2 (see (13)). As before, a1, . . . , an are the contextual features. The probabilities P(ai|c) and P(ai|c, v, f ) are estimated from the co-occurrence of the feature ai with class c (for (22)) and with class c, verb v, and frame f (for (23)). For features that have zero counts, we use add-k smoothing (Johnson 1932), where k is a number less than one.</Paragraph> </Section>
<Section position="2" start_page="98" end_page="98" type="sub_section"> <SectionTitle> 5.2 Feature Space </SectionTitle>
<Paragraph position="0"> As is common in word sense disambiguation studies, we experimented with two types of context representations, collocations and co-occurrences. Co-occurrences simply indicate whether a given word occurs within some number of words to the left or right of an ambiguous word. In this case the contextual features are binary and represent the presence or absence of a particular word in the current or preceding sentence. We used four types of context in our experiments: left context (i.e., words occurring to the left of the ambiguous word), right context (i.e., words occurring to the right of the ambiguous word), the current sentence (i.e., words surrounding the ambiguous word), and the current sentence together with its immediately preceding sentence. Punctuation and capitalization were removed from the windows of context; noncontent words were included. The context words were represented as lemmas or parts of speech.</Paragraph>
<Paragraph position="1"> Collocations are words that are frequently adjacent to the word to be disambiguated. We considered 12 types of collocations. Examples of collocations for the verb write are illustrated in Table 13. The L columns in the table indicate the number of words to the left of the ambiguous word, and the R columns, the number of words to the right. So, for example, the collocation 1L3R represents one word to the left and three words to the right of the ambiguous word. Collocations again were represented as lemmas (see Table 13) or parts of speech.</Paragraph> </Section>
<Section position="3" start_page="98" end_page="98" type="sub_section"> <SectionTitle> 6.1 Method </SectionTitle>
<Paragraph position="0"> We tested the performance of our naive Bayesian classifiers on the 67 genuinely ambiguous verbs on which the prior models were tested.</Paragraph>
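<Paragraph> Putting (21)-(23) together, here is a minimal naive Bayes disambiguator with a pluggable prior and one collocation type (1L3R). The training sentences, prior values, and smoothing constant are illustrative stand-ins, not the study's actual data or settings; a uniform prior reproduces (21), while the informative prior plays the role of P(c, f , v) in (23):
    import math
    from collections import Counter, defaultdict

    K = 0.5  # add-k smoothing constant (k below one), as in the text

    def collocation_1L3R(tokens, verb_index):
        """Positional lemma features: one word left, three right of the verb."""
        feats = []
        for offset in (-1, 1, 2, 3):
            i = verb_index + offset
            if i in range(len(tokens)):
                feats.append((offset, tokens[i]))
        return feats

    class NaiveBayes:
        def __init__(self, prior):
            self.prior = prior                       # class name -> prior probability
            self.feat_counts = defaultdict(Counter)  # class name -> feature counts
            self.class_totals = Counter()
            self.vocab = set()

        def train(self, examples):                   # (features, class) pairs
            for feats, cls in examples:
                for f in feats:
                    self.feat_counts[cls][f] += 1
                    self.vocab.add(f)
                self.class_totals[cls] += len(feats)

        def log_p_feat(self, f, cls):
            """Add-k smoothed P(a_i | c), so unseen features stay defined."""
            num = self.feat_counts[cls][f] + K
            den = self.class_totals[cls] + K * len(self.vocab)
            return math.log(num / den)

        def predict(self, feats):
            """argmax_c log P(c) + sum_i log P(a_i | c), as in (22)/(23)."""
            return max(self.prior, key=lambda c: math.log(self.prior[c]) +
                       sum(self.log_p_feat(f, c) for f in feats))

    train = [
        (collocation_1L3R(["john", "write", "a", "letter", "to", "mary"], 1),
         "MessageTransfer"),
        (collocation_1L3R(["she", "write", "a", "song", "for", "him"], 1),
         "Performance"),
    ]
    nb = NaiveBayes({"MessageTransfer": 0.7, "Performance": 0.3})  # invented prior
    nb.train(train)
    print(nb.predict(collocation_1L3R(["he", "write", "a", "note", "to", "sue"], 1)))
    # MessageTransfer: the "to"-context features and the prior both favor it.
</Paragraph>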
<Paragraph position="1"> Recall that these models were trained without access to a disambiguated corpus; the disambiguated corpus was used only to determine, for a given verb and its frame, its most likely meaning overall (i.e., across the corpus) rather than the meaning of individual corpus tokens. The same corpus was used for the disambiguation of individual tokens, excluding tokens assigned the class Other. The naive Bayes classifiers were trained and tested using 10-fold cross-validation on a set of 5,002 examples. These were representative of the frames &quot;NP1 V NP2,&quot; &quot;NP1 V NP2 NP3,&quot; &quot;NP1 V NP2 to NP3,&quot; and &quot;NP1 V NP2 for NP3.&quot; The frame &quot;NP1 V at NP2&quot; was excluded from our disambiguation experiments, as it was represented solely by the verb kick (50 instances).</Paragraph>
<Paragraph position="2"> In this study we compare a naive Bayesian classifier that relies on a uniform prior (see (20)) against two classifiers that make use of nonuniform prior models: The classifier in (22) effectively uses as prior the baseline model P(c) from Section 2, whereas the classifier in (23) relies on the more informative model P(c, f , v). As a baseline for the disambiguation task, we simply assign the most common class in the training data to every instance in the test data, ignoring context and any form of prior information (Pedersen 2001; Gale, Church, and Yarowsky 1992a). We also report an upper bound on disambiguation performance by measuring how well human judges agree with one another (percentage agreement) on the class assignment task. Recall from Section 4.1 that our corpus was annotated by two judges with Levin-compatible verb classes.</Paragraph> </Section>
<Section position="4" start_page="98" end_page="98" type="sub_section"> <SectionTitle> 6.2 Results </SectionTitle>
<Paragraph position="0"> The results of our class disambiguation experiments are summarized in Figures 2-5.</Paragraph>
<Paragraph position="1"> In order to investigate differences among frames, we show how the naive Bayesian classifiers perform for each frame individually. Figures 2-5 (x-axis) also reveal the influence of collocational features of different sizes (see Table 13) on the classification task. Panel (b) in the figures presents the classifiers' accuracy when the collocational features are encoded as lemmas; in panel (c) of the figures, the context is represented as parts of speech, whereas in panel (a) of the figures, the context is represented by both lemmas and parts of speech.</Paragraph>
<Paragraph position="2"> As can be seen in the figures, the naive Bayesian classifier with our informative prior (P(c, f , v), IPrior in Figures 2-5) generally outperforms the baseline prior (P(c), BPrior in Figures 2-5), the uniform prior (UPrior in Figures 2-5), and the baseline (Baseline in Figures 2-5) for all frames. Good performances are attained with lemmas, parts of speech, and a combination of the two. The naive Bayesian classifier (IPrior) reaches the upper bound (UpBound in Figures 2-5) for the ditransitive frames &quot;NP1 V NP2 NP3,&quot; &quot;NP1 V NP2 to NP3,&quot; and &quot;NP1 V NP2 for NP3.&quot; The best accuracy (87.8%) for the transitive frame is achieved with the collocational features 0L2R, 1L2R, and 1L3R (see Figures 2(a)-(c)). For the double-object frame, the highest accuracy (90.8%) is obtained with features 0L1R and 0L3R (see Figures 3(b) and 3(c)).</Paragraph>
<Paragraph> [Figure: Word sense disambiguation accuracy for the &quot;NP1 V NP2 to NP3&quot; frame.] </Paragraph>
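<Paragraph> The disambiguation baseline and evaluation protocol in sketch form: every test instance receives the most common class in the training fold, under 10-fold cross-validation. The toy data stands in for the 5,002 annotated examples:
    from collections import Counter

    def most_common_class(train):
        return Counter(cls for _, cls in train).most_common(1)[0][0]

    def ten_fold_baseline_accuracy(data, folds=10):
        hits = total = 0
        for i in range(folds):
            test = data[i::folds]  # every folds-th example held out
            train = [x for j, x in enumerate(data) if j % folds != i]
            default = most_common_class(train)
            hits += sum(cls == default for _, cls in test)
            total += len(test)
        return hits / total

    toy = [(f"token{i}", "Give" if i % 3 else "Feeding") for i in range(100)]
    print(round(ten_fold_baseline_accuracy(toy), 2))  # 0.66 on this toy data
</Paragraph>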
<Paragraph position="3"> Similarly, for the ditransitive &quot;NP1 V NP2 to NP3&quot; frame, the features 0L3R and 0L1R yield the best accuracies (88.8%; see Figures 4(a)-(c)). Finally, for the &quot;NP1 V NP2 for NP3&quot; frame, accuracy (94.4%) is generally good for most features when an informative prior is used. In fact, neither the uniform prior nor the baseline prior P(c) outperforms the baseline for this frame.</Paragraph>
<Paragraph> [Figure: Word sense disambiguation accuracy for the &quot;NP1 V NP2 for NP3&quot; frame.] </Paragraph>
<Paragraph position="5"> The context encoding (lemmas versus parts of speech) does not seem to have a great influence on the disambiguation performance. Good accuracies are obtained with either parts of speech or lemmas; a combination of the two does not yield better results. The classifier with the informative prior P(c, f , v) also outperforms the baseline prior P(c) and the uniform prior when co-occurrences are used. However, the co-occurrences never outperform the collocational features for any of the four types of context. With co-occurrence features, the classifiers (regardless of the type of prior being used) never beat the baseline for the frames &quot;NP1 V NP2&quot; and &quot;NP1 V NP2 to NP3&quot;; accuracies above the baseline are achieved for the frames &quot;NP1 V NP2 NP3&quot; and &quot;NP1 V NP2 for NP3&quot; when an informative prior is used. Detailed results are summarized in the Appendix. Co-occurrences and windows of large sizes traditionally work well for topical distinctions (Gale, Church, and Yarowsky 1992b). Levin classes, however, typically capture differences in argument structure, that is, the types of objects or subjects that verbs select for. Argument structure is approximated by our collocational features. For example, a verb often taking a reflexive pronoun as its object is more likely to be a Reflexive Verb of Appearance than a verb that never subcategorizes for a reflexive object.</Paragraph>
<Paragraph position="6"> There is not enough variability among the wider contexts surrounding a polysemous verb to inform the class-disambiguation task, as the Levin classes often do not cross topical boundaries.</Paragraph> </Section> </Section>
<Section position="7" start_page="98" end_page="98" type="metho"> <SectionTitle> 7. Discussion </SectionTitle>
<Paragraph position="0"> In this article, we have presented a probabilistic model of verb class ambiguity based on Levin's (1993) semantic classification. Our results show that subcategorization information acquired automatically from corpora provides important cues for verb class disambiguation (Experiment 1). In the absence of subcategorization cues, corpus-based distributions and quantitative approximations of linguistic concepts can be used to derive a preference ordering on the set of verbal meanings (Experiment 2). The semantic preferences that we have generated can be thought of as default semantic knowledge, to be used in the absence of any explicit contextual or lexical semantic information to the contrary (see Table 12). We have also shown that these preferences are useful for disambiguating polysemous verbs within their local contexts of occurrence (Experiment 3).</Paragraph>
<Paragraph position="1"> The approach is promising in that it achieves satisfactory results with a simple model that has a straightforward interpretation in a Bayesian framework and does not rely on the availability of annotated data.
The model's parameters are estimated using simple distributions that can be extracted easily from corpora. Our model achieved an accuracy of 93.9% (over a baseline of 56.7%) on the class disambiguation task (Experiment 1) and an accuracy of 74.6% (over a baseline of 46.2%) on the task of deriving dominant verb classes (Experiment 2). Our disambiguation experiments reveal that this default semantic knowledge, when incorporated as a prior in a naive Bayes classifier, outperforms the uniform prior and the baseline of always defaulting to the most frequent class (Experiment 3). In fact, for three out of the four frames under study, our classifier with the informative prior achieved upper-bound performance.</Paragraph>
<Paragraph position="2"> Although our results are promising, it remains to be shown that they generalize across frames and alternations. Four types of alternations were investigated in this study. However, Levin lists 79 alternations and approximately 200 classes. Although distributions for different class/frame combinations can easily be derived automatically, it remains to be shown that these distributions are useful for all verbs, frames, and classes. Also note that the models described in the previous sections crucially rely on the acquisition of relatively accurate frames from the corpus. It is a matter of future work to examine how the quality of the acquired frames influences the disambiguation task. Also, the assumption that the semantic class determines the subcategorization patterns of its class members independently of their identity may not be harmless for all classes and frames.</Paragraph>
<Paragraph position="3"> Although our original aim was to develop a probabilistic framework that exploits Levin's (1993) linguistic classification and the systematic correspondence between syntax and semantics, a limitation of the model is that it cannot infer class information for verbs not listed in Levin. For these verbs, P(c), and hence P(c, f , v), will be zero. Recent work in computational linguistics (e.g., Schütze 1998) and cognitive psychology (e.g., Landauer and Dumais 1997) has shown that large corpora implicitly contain semantic information, which can be extracted and manipulated in the form of co-occurrence vectors. One possible approach would be to compute the centroid (geometric mean) of the vectors of all members of a semantic class. Given an unknown verb (i.e., a verb not listed in Levin), we can decide its semantic class by comparing its semantic vector to the centroids of all semantic classes. For example, we could determine class membership on the basis of the distance to the closest centroid representing a semantic class (see Patel, Bullinaria, and Levy [1998] for a proposal similar in spirit). Another approach, put forward by Dorr and Jones (1996), utilizes WordNet (Miller and Charles 1991) to find similarities (via synonymy) between unknown verbs and verbs listed in Levin. Once we have chosen a class for an unknown verb, we are entitled to assume that it will share the broad syntactic and semantic properties of that class.</Paragraph> </Section> </Paper>