Verb Sense and Subcategorization: Using Joint Inference to Improve Performance on Complementary Tasks

3 Model Structure and Inference

Our generative probabilistic model can be thought of as having three primary components: the sense model, relating the verb sense to the surrounding context; the subcategorization model, relating the verb subcategorization to the sentence; and the joint model, relating the sense and SCF of the verb to each other. More formally, the model is a factored representation of a joint distribution over these variables and the data: the verb sense (V), the verb SCF (C), the unordered context "bag-of-words" (W), and the sentence as an ordered sequence of words (S). The joint distribution P(V, C, W, S) is then factored as

P(V, C, W, S) = P(V) P(C|V) P(S|C) ∏_i P(W_i|V)

where W_i is the word type occurring in each position of the context (including the target sentence itself). The first two terms together define a joint distribution over verb sense (V) and SCF (C), the third term defines the subcategorization model, and the last term defines the sense model. A graphical model representation is shown in Figure 1.

[Figure 1: Graphical representation of the joint verb sense and subcategorization probabilistic model. Note that the box defines a plate, indicating that the model contains n copies of this variable.]

The model assumes the following generative process for a data instance of a particular verb. First we generate the sense of the target verb. Conditioned on the sense, we generate the SCF of the verb. (Note that the decision to generate sense and then SCF is arbitrary and forced by the desire to factor the model; we discuss reversing the order below.) Then, conditioned on the sense of the verb, we generate an unordered collection of context words. (For the Senseval-2 corpus, this collection includes not only the words in the sentence in which the verb occurs, but also the words in surrounding sentences.) Finally, conditioned on the SCF of the verb, we generate the immediate sentence containing the verb as an ordered sequence of words.

An apparent weakness of this model is that it double-generates the context words from the enclosing sentence: they are generated once by the sense model, as an unordered collection of words, and once by the subcategorization model, as an ordered sequence of words. The model is thus deficient in that it assigns a large portion of its probability mass to impossible cases: those instances which have words in the context that do not match those in the sentence. However, because the sentences are always observed, we only consider instances in the set of consistent cases, so the deficiency should be irrelevant for the purpose of reasoning about sense and SCF.

We discuss each of the model components in turn.
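The following is a minimal sketch of this generative story, using toy categorical distributions; the senses, SCF labels, words, sentences, and probabilities are invented for illustration and are not taken from the paper, and the per-SCF sentence distribution merely stands in for the PCFG component described in Section 3.2.

```python
import random

# Toy parameters (illustrative only): two senses and two SCFs for a hypothetical verb.
P_V = {"sense1": 0.6, "sense2": 0.4}                      # P(V)
P_C_given_V = {"sense1": {"NP": 0.7, "NP_PP": 0.3},       # P(C|V)
               "sense2": {"NP": 0.2, "NP_PP": 0.8}}
P_W_given_V = {"sense1": {"bank": 0.5, "money": 0.5},     # P(W_i|V), bag-of-words model
               "sense2": {"river": 0.6, "water": 0.4}}
P_S_given_C = {"NP":    {"she called him": 1.0},          # stand-in for the PCFG P(S|C)
               "NP_PP": {"she called him from home": 1.0}}

def sample(dist):
    """Draw one outcome from a dict mapping outcomes to probabilities."""
    r, total = random.random(), 0.0
    for outcome, p in dist.items():
        total += p
        if r <= total:
            return outcome
    return outcome  # guard against floating-point underflow

def generate_instance(n_context_words=5):
    v = sample(P_V)                          # 1. generate the verb sense
    c = sample(P_C_given_V[v])               # 2. generate the SCF given the sense
    w = [sample(P_W_given_V[v])              # 3. generate an unordered bag of context words
         for _ in range(n_context_words)]
    s = sample(P_S_given_C[c])               # 4. generate the target sentence given the SCF
    return v, c, w, s

if __name__ == "__main__":
    print(generate_instance())
```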
3.1 Verb Sense Model

The verb sense component of our model is an ordinary multinomial Naive Bayes "bag-of-words" model: P(V) ∏_i P(W_i|V). We learn the marginal over verb sense with maximum likelihood estimation (MLE) from the sense-annotated data. We learn the sense-conditional word model using smoothed MLE from the sense-annotated data; to smooth, we use Bayesian smoothing with a Dirichlet prior. The free smoothing parameter is determined empirically, once for all words in the data set. In the independent sense model, to infer the most likely sense given a context of words, P(V|W), we just find the V that maximizes P(V) ∏_i P(W_i|V). Inference in the joint model over sense and SCF is more complex, and is described below.

In order to make our system competitive with leading WSD systems we made an important modification to this basic model: we added relative position feature weighting. It is known that words closer to the target word are more predictive of sense, so it is reasonable to weight them more highly. We define a set of "buckets", a partition over the positions of the context words relative to the target verb, and we weight each context word feature with a weight given by its bucket, both when estimating model parameters at train time and when performing inference at test time. We use the following 8 relative position buckets: (-∞, -6], [-5, -3], -2, -1, 1, 2, [3, 5], and [6, ∞). The bucket weights are found empirically using a simple optimization procedure on k-fold training set accuracy. In ablation tests on this system we found that the use of relative position feature weighting, when combined with corresponding evidence attenuation (see Section 3.3), increased the accuracy of the standalone verb sense disambiguation model from 46.2% to 54.0%.

3.2 Verb Subcategorization Model

The verb SCF component of our model, P(S|C), represents the probability of particular sentences given each possible SCF. Because there are infinitely many possible sentences, a multinomial representation is infeasible, and we instead chose to encode the distribution using a set of probabilistic context-free grammars (PCFGs). A PCFG is created for each possible SCF: each PCFG yields only parse trees in which the distinguished verb subcategorizes in the specified manner (but other verbs can parse freely). Given an SCF-specific PCFG, we can determine the probability of the sentence using the inside algorithm, which sums the probabilities of all possible trees in the grammar producing the sentence.

To do this, we modified the exact PCFG parser of Klein and Manning (2003). In the independent SCF model, to infer the most likely SCF given a sentence, P(C|S), we just find the C that maximizes P(S|C)P(C). (For the independent model, the SCF prior is estimated using MLE from the training examples.) Inference in the joint model over sense and SCF is more complex, and is described below.
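A small sketch of this independent SCF inference is given below. The paper modifies the exact parser of Klein and Manning (2003); here, purely for illustration, a hand-written inside routine over toy grammars in Chomsky normal form is used instead, and the two hypothetical SCF-specific grammars, their rule probabilities, and the SCF prior are all invented.

```python
from collections import defaultdict

def inside_prob(words, grammar, start="S"):
    """Inside algorithm for a PCFG in Chomsky normal form.

    `grammar` maps lhs -> list of (rhs, prob), where rhs is either a 1-tuple
    (terminal,) or a 2-tuple (nonterminal, nonterminal). Returns P(sentence),
    the sum of the probabilities of all parses rooted at `start`.
    """
    n = len(words)
    inside = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                      # width-1 spans: lexical rules
        for lhs, rules in grammar.items():
            for rhs, p in rules:
                if rhs == (w,):
                    inside[i][i + 1][lhs] += p
    for width in range(2, n + 1):                      # wider spans: binary rules
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for lhs, rules in grammar.items():
                    for rhs, p in rules:
                        if len(rhs) == 2:
                            b, c = rhs
                            inside[i][j][lhs] += p * inside[i][k][b] * inside[k][j][c]
    return inside[0][n][start]

# Hypothetical SCF-specific grammars (not the grammars learned in the paper): one
# allows only an intransitive analysis of the distinguished verb, the other only a
# transitive (NP-complement) analysis; all other structure is omitted for brevity.
GRAMMARS = {
    "intransitive": {
        "S":  [(("NP", "V_intrans"), 1.0)],
        "NP": [(("she",), 0.5), (("him",), 0.5)],
        "V_intrans": [(("called",), 1.0)],
    },
    "transitive": {
        "S":  [(("NP", "VP"), 1.0)],
        "VP": [(("V_trans", "NP"), 1.0)],
        "NP": [(("she",), 0.5), (("him",), 0.5)],
        "V_trans": [(("called",), 1.0)],
    },
}
P_SCF = {"intransitive": 0.4, "transitive": 0.6}       # toy MLE prior P(C)

def best_scf(sentence):
    """Independent-model SCF inference: argmax_C P(S|C) P(C)."""
    scores = {c: inside_prob(sentence, g) * P_SCF[c] for c, g in GRAMMARS.items()}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    print(best_scf(["she", "called", "him"]))
```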
Learning this model, a set of SCF-specific PCFGs, from our SCF-annotated training data requires some care. Commonly, PCFGs are learned using MLE of rewrite rule probabilities from large sets of tree-annotated sentences. Thus, to learn SCF-specific PCFGs, it seems that we should select a set of annotated sentences containing the target verb, determine the SCF of the target verb in each sentence, create a separate corpus for each SCF of the target verb, and then learn SCF-specific grammars from the SCF-specific corpora. If we are careful to distinguish rules which dominate the target verb from those which do not, then the grammar will be constrained to generate trees in which the target verb subcategorizes in the specified manner, while other verbs can occur in general tree structures. The problem with this approach is that in order to create a broad-coverage grammar (which we will need in order for it to generalize accurately to unseen test instances) we will need a very large number of sentences in which the target verb occurs, and we do not have enough data for this approach.

Because we want to maximize the use of the available data, we must instead make use of every verb occurrence when learning SCF-specific rewrite rules. We can accomplish this by making a copy of each sentence for each verb occurrence (not just the target verb), determining the SCF of the distinguished verb in each sentence, partitioning the sentence copies by distinguished verb SCF, and learning SCF-specific grammars using MLE. Finally, we change the lexicon by forcing the distinguished verb tag to rewrite to only our target verb. The method we actually use is functionally equivalent to this latter approach, but altered for efficiency. Instead of making copies of sentences with multiple verbs, we use a dense representation. We determine the SCF of each verb in the sentence, and then annotate the verb and all nonterminal categories occurring above the verb in the tree, up to the root, with the SCF of the verb. Note that some nonterminals will then have multiple annotations. Then, to learn an SCF-specific PCFG, we count rules that have the specified SCF annotation as rules which can dominate the distinguished verb, and count all rules (including the SCF-specific ones) as general rules which cannot dominate the distinguished verb.
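A minimal sketch of this annotation-and-counting idea, for a single distinguished verb, is shown below. Trees are represented as nested tuples, the "^SCF" label suffix is an invented annotation format, and the example tree and SCF label are hypothetical; the paper's actual tree representation and SCF inventory are not reproduced here.

```python
from collections import Counter, defaultdict

# A tree is (label, child1, child2, ...); a leaf is a plain string.
# Hypothetical example: the distinguished verb "called" has SCF "NP" (transitive).
TREE = ("S",
        ("NP", "she"),
        ("VP", ("VBD", "called"), ("NP", "him")))

def annotate_path(tree, target_verb, scf):
    """Return a copy of `tree` in which every node dominating `target_verb`
    (up to the root) carries the verb's SCF annotation, e.g. 'VP^NP'."""
    if isinstance(tree, str):
        return tree, tree == target_verb
    label, children = tree[0], tree[1:]
    new_children, dominates = [], False
    for child in children:
        new_child, d = annotate_path(child, target_verb, scf)
        new_children.append(new_child)
        dominates = dominates or d
    new_label = f"{label}^{scf}" if dominates else label
    return (new_label, *new_children), dominates

def count_rules(tree, counts):
    """Accumulate CFG rule counts (lhs -> rhs tuple) from a single tree."""
    if isinstance(tree, str):
        return
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[label][rhs] += 1
    for child in children:
        count_rules(child, counts)

if __name__ == "__main__":
    annotated, _ = annotate_path(TREE, "called", "NP")
    counts = defaultdict(Counter)
    count_rules(annotated, counts)
    # Rules whose lhs carries '^NP' may dominate the distinguished verb in the
    # SCF-specific grammar; stripping the annotation yields the general rules
    # that cannot dominate it.
    for lhs, rhss in counts.items():
        for rhs, c in rhss.items():
            print(lhs, "->", " ".join(rhs), c)
```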
3.3 The Joint Model

Given a fully annotated dataset, it would be trivial to learn the parameters of the joint distribution over verb sense and SCF, P(V, C), using MLE. However, because we do not have access to such a dataset, we instead use the EM algorithm to "complete" the missing annotations with expectations, or soft assignments, over the values of the missing variable (we present the EM algorithm in detail below). Given this "completed" data, it is again trivial to learn the parameters of the joint probability model using smoothed MLE. We use simple Laplace add-one smoothing to smooth the distribution.

However, a small complication arises from the fact that the marginal distributions over senses and SCFs for a particular verb may differ between the two halves of our data set. They are, after all, wholly different corpora, assembled by different people for different purposes. For this reason, when testing the system on the sense corpus we would like to use a sense marginal distribution trained from the sense corpus, and when testing the system on the SCF corpus we would like to use an SCF marginal distribution trained from the SCF corpus. To address this, recall from above that the factoring we choose for the joint distribution is arbitrary. When performing sense inference we use the model P_v(V) P_j(C|V), where P_j(C|V) was learned from the complete data and P_v(V) was learned from the sense-marked examples only. When performing SCF inference we use the equivalent factoring P_c(C) P_j(V|C), where P_j(V|C) was learned from the complete data and P_c(C) was learned from the SCF-annotated examples only.

We made one additional modification to this joint model to improve performance. When performing inference in the model, we found it useful to differentially weight the different probability terms. The most obvious need for this comes from the fact that the sense-conditional word model employs relative position feature weighting, which can change the relative magnitude of the probabilities in this term. In particular, by using feature weights greater than 1.0 during inference we overestimate the actual amount of evidence. Even without the feature weighting, however, the word model can still overestimate the actual evidence, given that it encodes an incorrect independence assumption between word features (word occurrences in text are of course very highly correlated). The PCFG model also suffers from a less severe instance of the same problem: human languages are of course not context free, and there is in fact correlation between supposedly independent tree structures in different parts of the tree. To remedy this evidence overconfidence, it is helpful to attenuate, or downweight, the evidence terms accordingly. More generally, we place a weight on each of the probability terms used in inference calculations, yielding models of the following form (shown here for the sense-inference factoring; the SCF-inference factoring is weighted analogously):

P_v(V)^λ1 P_j(C|V)^λ2 P(S|C)^λ3 ∏_i P(W_i|V)^λ4

These λ weights are free parameters, and we find them by simple optimization on k-fold accuracy. In ablation tests on this system, we found that term weighting (particularly evidence attenuation) increased the accuracy of the standalone sense model from 51.9% to 54.0% at the fine-grained verb sense disambiguation task.

We now describe the precise EM algorithm used. Prior to running EM we first learn the independent sense and SCF model parameters from their respective datasets. We also initialize the joint sense and SCF distribution to the uniform distribution. Then we iterate over the following steps:

E-step: Using the current model parameters, for each datum in the sense-annotated corpus, compute expectations over the possible SCFs, and for each datum in the SCF-annotated corpus, compute expectations over the possible senses.

M-step: Use the completed data to reestimate the joint distribution over sense and SCF.

We run EM to convergence, which for our dataset occurs within 6 iterations. Additional iterations do not change the accuracy of our model. Early stopping of EM after 3 iterations was found to hurt k-fold sense accuracy by 0.1% and SCF accuracy by 0.2%. Early stopping of EM after only 1 iteration was found to hurt k-fold sense accuracy by a total of 0.2% and SCF accuracy by 0.4%. These may seem like small differences, but they are significant relative to the advantages given by the joint model (see below).
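The structure of this EM loop can be sketched as follows. Here the two corpora are reduced to precomputed likelihood tables with invented values (in the actual system these would come from the sense model P(W|V) and the PCFG model P(S|C)), so the sketch shows only the shape of the E- and M-steps, not the real estimation.

```python
from collections import defaultdict
import itertools

SENSES = ["sense1", "sense2"]
SCFS = ["intransitive", "transitive"]

# Hypothetical likelihood view of the two corpora (all values invented): each
# sense-annotated instance pairs its gold sense with P(S|C) for every SCF, and
# each SCF-annotated instance pairs its gold SCF with prod_i P(W_i|V) for every sense.
sense_corpus = [("sense1", {"intransitive": 1e-6, "transitive": 4e-6}),
                ("sense2", {"intransitive": 3e-6, "transitive": 1e-6})]
scf_corpus = [("transitive", {"sense1": 2e-9, "sense2": 5e-10}),
              ("intransitive", {"sense1": 1e-10, "sense2": 6e-9})]

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

# Initialize the joint distribution P(V, C) to uniform.
joint = {(v, c): 1.0 / (len(SENSES) * len(SCFS))
         for v, c in itertools.product(SENSES, SCFS)}

for _ in range(6):                               # the paper reports convergence within 6 iterations
    counts = defaultdict(float)
    # E-step: complete each corpus with expectations over its missing variable.
    for v, p_s_given_c in sense_corpus:          # expectation over SCFs: q(c) ∝ P(v, c) P(S|c)
        q = normalize({c: joint[(v, c)] * p_s_given_c[c] for c in SCFS})
        for c in SCFS:
            counts[(v, c)] += q[c]
    for c, p_w_given_v in scf_corpus:            # expectation over senses: q(v) ∝ P(v, c) P(W|v)
        q = normalize({v: joint[(v, c)] * p_w_given_v[v] for v in SENSES})
        for v in SENSES:
            counts[(v, c)] += q[v]
    # M-step: reestimate the joint with Laplace add-one smoothing.
    joint = normalize({(v, c): counts[(v, c)] + 1.0
                       for v, c in itertools.product(SENSES, SCFS)})

print(joint)
```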
In the E-step of EM, it is necessary to do inference over the joint model, computing posterior expectations of unknown variables conditioned on evidence variables. During the testing phase, it is also necessary to do inference, computing maximum a posteriori (MAP) values of unknown variables conditioned on evidence variables. In all cases we do exact Bayesian network inference, which involves conditioning on the evidence variables, summing over extraneous variables, and then either maximizing over the resulting factors of the query variables or normalizing them to obtain distributions over the query variables. At test time, when querying the MAP sense (or SCF) of an instance, we chose to maximize over the marginal distribution rather than over the joint sense and SCF distribution. We found empirically that this gave us higher accuracy on the individual tasks. If instead we were doing joint prediction, we would expect higher accuracy to result from maximizing over the joint.
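In a model this small, exact inference amounts to a few sums and products over the joint table. The sketch below computes the sense posterior by conditioning on the word and sentence evidence and marginalizing out the SCF, then takes the MAP sense over that marginal; all names and numbers are illustrative, not the paper's implementation.

```python
import itertools

SENSES = ["sense1", "sense2"]
SCFS = ["intransitive", "transitive"]

def sense_posterior(joint, p_s_given_c, p_w_given_v):
    """Exact inference for P(V | W, S): weight the joint by the evidence terms,
    sum the SCF variable out, and normalize.

    joint:        dict mapping (sense, scf) -> P(V, C)
    p_s_given_c:  dict mapping scf -> P(S | C), e.g. from the per-SCF PCFGs
    p_w_given_v:  dict mapping sense -> prod_i P(W_i | V), from the sense model
    """
    unnorm = {v: sum(joint[(v, c)] * p_s_given_c[c] for c in SCFS) * p_w_given_v[v]
              for v in SENSES}
    z = sum(unnorm.values())
    return {v: p / z for v, p in unnorm.items()}

def map_sense(joint, p_s_given_c, p_w_given_v):
    """MAP prediction over the *marginal* sense distribution (summing out C),
    as done at test time, rather than maximizing over the joint (V, C)."""
    post = sense_posterior(joint, p_s_given_c, p_w_given_v)
    return max(post, key=post.get)

if __name__ == "__main__":
    # Invented numbers for illustration only.
    joint = {(v, c): 0.25 for v, c in itertools.product(SENSES, SCFS)}
    joint[("sense1", "transitive")] = 0.4
    z = sum(joint.values())
    joint = {k: p / z for k, p in joint.items()}
    print(map_sense(joint,
                    p_s_given_c={"intransitive": 1e-6, "transitive": 5e-6},
                    p_w_given_v={"sense1": 2e-9, "sense2": 8e-10}))
```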