<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1601"> <Title>Unsupervised Discovery of a Statistical Verb Lexicon</Title> <Section position="4" start_page="1" end_page="2" type="metho"> <SectionTitle> 2 Learning Setting </SectionTitle> <Paragraph position="0"> Our goal is to learn a model which relates a verb, its semantic roles, and their possible syntactic realizations. As is the case with most semantic role labeling research, we do not attempt to model the syntax itself, and instead assume the existence of a syntactic parse of the sentence. The parse may be from a human annotator, where available, or from an automatic parser. We can easily run our system on completely unannotated text by first running an automatic tokenizer, part-of-speech tagger, and parser to turn the text into tokenized, tagged sentences with associated parse trees.</Paragraph> <Paragraph position="1"> In order to keep the model simple, and independent of any particular choice of syntactic representation, we use an abstract representation of syntax.</Paragraph> <Paragraph position="3"> [Figure 1: An example Penn Treebank sentence (wsj 2417), &quot;A deeper market plunge today could give them their first test.&quot;, the verb instance extracted from it, and the values of the model variables for this instance. The semantic roles listed are taken from the PropBank annotation, but are not observed in the unsupervised training method.]</Paragraph> <Paragraph position="4"> We define a small set of syntactic relations, listed in Table 1, each of which describes a possible syntactic relationship between the verb and a dependent. Our goal was to choose a set that provides sufficient syntactic information for the semantic role decision, while remaining accurately computable from any reasonable parse tree using simple deterministic rules. Our set does not include the relations direct object or indirect object, since this distinction cannot be made deterministically on the basis of syntactic structure alone; instead, we opted to number the noun phrase (np), complement clause (cl, xcl), and adjectival complement (acomp) dependents appearing in an unbroken sequence directly after the verb, since this is sufficient to capture the necessary syntactic information. The syntactic relations used in our experiments are computed from the typed dependencies returned by the Stanford Parser (Klein and Manning, 2003).</Paragraph> <Paragraph position="5"> We also must choose a representation for semantic roles. We allow each verb a small fixed number of roles, in a manner similar to PropBank's ARG0...ARG5. We also designate a single adjunct role which is shared by all verbs, similar to PropBank's ARGM role. We say &quot;similar&quot; because our system never observes the PropBank roles (or any human-annotated semantic roles) and so cannot possibly use the same names.</Paragraph> <Paragraph position="6"> Our system assigns arbitrary integer names to the roles it discovers, just as clustering systems give arbitrary names to the clusters they discover.3 [Footnote 3: In practice, while our system is not guaranteed to choose role names that are consistent with PropBank, it often does anyway, which is a consequence of the constrained form of the linking model.]</Paragraph> <Paragraph position="7"> [Figure 2: Graphical representation of the model, with example values for each variable. The rectangle is a plate, indicating that the model contains multiple copies of the variables shown within it: in this case, one for each dependent j. Variables observed during learning are shaded.]</Paragraph>
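As an illustration of the deterministic relation-numbering rule described above, here is a minimal sketch. It assumes the post-verbal dependents have already been extracted from the parse as (relation, head word) pairs in surface order, and that the numbered complement types share a single counter; these are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of the post-verbal numbering rule from Section 2.
# Assumes dependents arrive as (coarse_relation, head_word) pairs in surface
# order, already restricted to those following the verb.

NUMBERED = {"np", "cl", "xcl", "acomp"}

def number_postverbal(dependents):
    """Number np/cl/xcl/acomp dependents that form an unbroken sequence
    directly after the verb (np#1, np#2, ...); leave everything else as-is."""
    out, count, in_sequence = [], 0, True
    for rel, head in dependents:
        if in_sequence and rel in NUMBERED:
            count += 1
            out.append((f"{rel}#{count}", head))
        else:
            in_sequence = False   # the unbroken sequence has ended
            out.append((rel, head))
    return out

# Post-verbal dependents of "give" in the Figure 1 sentence:
print(number_postverbal([("np", "them"), ("np", "test")]))
# [('np#1', 'them'), ('np#2', 'test')]
```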
<Paragraph position="8"> Given these definitions, we convert our parsed corpora into a simple format: a set of verb instances, each of which represents an occurrence of a verb in a sentence. A verb instance consists of the base form (lemma) of the observed verb, and for each dependent of the verb, the dependent's syntactic relation and head word (represented as the base form with part of speech information). An example Penn Treebank sentence, and the verb instances extracted from it, are given in Figure 1.</Paragraph> </Section> <Section position="5" start_page="2" end_page="4" type="metho"> <SectionTitle> 3 Probabilistic Model </SectionTitle> <Paragraph position="0"> Our learning method is based on a structured probabilistic model of the domain. A graphical representation of the model is shown in Figure 2. The model encodes a joint probability distribution over the elements of a single verb instance, including the verb type, the particular linking, and for each dependent of the verb, its syntactic relation to the verb, semantic role, and head word.</Paragraph> <Paragraph position="1"> We begin by describing the generative process to which our model corresponds, using as our running example the instance of the verb give shown in Figure 1. We begin by generating the verb lemma v, in this case give. Conditioned on the choice of verb give, we next generate a linking ℓ, which defines both the set of core semantic roles to be expressed, as well as the syntactic relations that express them. In our example, we sample the ditransitive linking ℓ = {ARG0 → subj, ARG1 → np#2, ARG2 → np#1}. Conditioned on this choice of linking, we next generate an ordered linking o, giving a final position in the dependent list for each role and relation in the linking ℓ, while also optionally inserting one or more adjunct roles. In our example, we generate the vector o = [(ARG0,subj), (ARGM,?), (ARG2,np#1), (ARG1,np#2)]. In doing so we've specified positions for ARG0, ARG1, and ARG2 and added one adjunct role ARGM in the second position. Note that the length of the ordered linking o is equal to the total number of dependents M of the verb instance. Now we iterate through each of the dependents 1 ≤ j ≤ M, generating each in turn. For the core arguments, the semantic role rj and syntactic relation gj are completely determined by the ordered linking o, so it remains only to sample the syntactic relation for the adjunct role: here we sample g2 = np. We finish by sampling the head word of each dependent, conditioned on the semantic role of that dependent. In this example, we generate the head words plunge, today, them, and test for the roles ARG0, ARGM, ARG2, and ARG1, respectively.</Paragraph> <Paragraph position="6"> Before defining the model more formally, we pause to justify some of the choices made in designing the model. First, we chose to distinguish between a verb's core arguments and its adjuncts. While core arguments must be associated with a semantic role that is verb specific (such as the patient role of increase: the rates in our example), adjuncts are generated by a role that is verb independent (such as the time of a generic event: last month in our example). Linkings include mappings only for the core semantic roles, resulting in a small, focused set of possible linkings for each verb. A consequence of this choice is that we introduce uncertainty between the choice of linking and its realization in the dependent list, which we represent with the ordered linking variable o.4 [Footnote 4: An alternative modeling choice would have been to add a state variable to each dependent, indicating which of the roles in the linking have been &quot;used up&quot; by previous dependents.]</Paragraph>
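As a reading aid, the generative story just described can be sketched as follows. The parameter containers and the adjunct-penalty name are assumptions for illustration only; the distributions involved are defined precisely in the paragraphs below.

```python
import random

ADJUNCT = "ARGM"

def generate_instance(verb, linking_dist, adjunct_penalty, adjunct_rel_dist, word_dist):
    """Sample one verb instance: linking -> ordered linking -> dependents.
    linking_dist[verb] maps linkings (tuples of (role, relation) pairs) to
    probabilities; adjunct_rel_dist and word_dist[role] are multinomials."""
    # 1. Sample a linking l ~ P(l | v).
    linkings = list(linking_dist[verb])
    linking = random.choices(linkings, weights=[linking_dist[verb][l] for l in linkings])[0]
    # 2. Build the ordered linking o: add a geometric number of adjunct roles
    #    (a constant penalty per adjunct), then order all roles uniformly.
    slots = list(linking)
    while random.random() < adjunct_penalty:
        slots.append((ADJUNCT, None))          # adjunct relation sampled below
    random.shuffle(slots)
    # 3. Generate each dependent: relation and role from o, then the head word.
    dependents = []
    for role, rel in slots:
        if rel is None:                        # adjunct: sample its relation g_j
            rels = list(adjunct_rel_dist)
            rel = random.choices(rels, weights=[adjunct_rel_dist[r] for r in rels])[0]
        words = list(word_dist[role])
        word = random.choices(words, weights=[word_dist[role][w] for w in words])[0]
        dependents.append((rel, role, word))   # (g_j, r_j, w_j)
    return verb, linking, dependents
```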
<Paragraph position="7"> We now present the model formally as a factored joint probability distribution. We factor the joint probability distribution into a product of the probabilities of each instance, ∏_{i=1..N} P(v_i, ℓ_i, o_i, g_i, r_i, w_i), where we assume there are N instances, and we have used the vector notation g to indicate the vector of variables gj for all values of j (and similarly for r and w). We then factor the probability of each instance using the independencies shown in Figure 2 as follows: P(v, ℓ, o, g, r, w) = P(v) P(ℓ|v) P(o|ℓ) ∏_{j=1..M} P(gj|o) P(rj|o) P(wj|rj), where we have assumed that there are M dependents of this instance. The verb v is always observed in our data, so we don't need to define P(v). The probability of generating the linking given the verb P(ℓ|v) is a multinomial over possible linkings.5 Next, the probability of a particular ordering of the linking P(o|ℓ) is determined only by the number of adjunct dependents that are added to o. One pays a constant penalty for each adjunct that is added to the dependent list, but otherwise all orderings of the roles are equally likely.</Paragraph> <Paragraph position="12"> Formally, the ordering o is distributed according to the geometric distribution of the difference between its length and the length of ℓ, with constant parameter λ.6 [Footnote 6 (fragment): ...important ordering information is captured by the syntactic relations.] Next, P(gj|o) and P(rj|o) are completely deterministic for core roles: the syntactic relation and semantic role for position j are specified in the ordering o. For adjunct roles, we generate gj from a multinomial over syntactic relations.</Paragraph> <Paragraph position="13"> Finally, the word given the role P(wj|rj) is distributed as a multinomial over words.</Paragraph> <Paragraph position="14"> To allow for labeling elements of verb instances (verb types, syntactic relations, and head words) at test time that were unobserved in the training set, we must smooth our learned distributions. We use Bayesian smoothing: all of the learned distributions are multinomials, so we add pseudocounts, a generalization of the well-known add-one smoothing technique. Formally, this corresponds to a Bayesian model in which the parameters of these multinomial distributions are themselves random variables, distributed according to a Dirichlet distribution.7</Paragraph> <Section position="1" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 3.1 Linking Model </SectionTitle> <Paragraph position="0"> The most straightforward choice of a distribution for P(ℓ|v) would be a multinomial over all possible linkings. There are two problems with this simple implementation, both stemming from the fact that the space of possible linkings is large (there are O((|G|+1)^|R|) of them, where G is the set of syntactic relations and R is the set of semantic roles). First, most learning algorithms become intractable when they are required to represent uncertainty over such a large space. 
Second, the large space of linkings yields a large space of possible models, making learning more difficult.</Paragraph> <Paragraph position="1"> As a consequence, we have two objectives when designing P(ℓ|v): (1) constrain the set of linkings for each verb to a set of tractable size whose members are linguistically plausible, and (2) facilitate the construction of a structured prior distribution over this set, which gives higher weight to linkings that are known to be more common. Our solution is to model the derivation of each linking as a sequence of construction operations, an idea which is similar in spirit to that used by Eisner (2001). Each operation adds a new role to the linking, possibly replacing or displacing one of the existing roles.</Paragraph> <Paragraph position="2"> The complete list of linking operations is given in Table 2. To build a linking we select one operation from each list; the presence of a no-operation for each role means that a linking doesn't have to include all roles. Note that this linking derivation process is not shown in Figure 2, since it is possible to compile the resulting distribution over linkings into the simpler multinomial P(ℓ|v).</Paragraph> <Paragraph position="3"> More formally, we factor P(ℓ|v) as follows, where c is the vector of construction operations used to build ℓ: P(ℓ|v) = P(ℓ|c) P(c|v) = P(c|v). Note that in the second step we drop the term P(ℓ|c) since it is always 1 (a sequence of operations leads deterministically to a linking).</Paragraph> <Paragraph position="6"> Given this derivation process, it is easy to create a structured prior: we just place pseudocounts on the operations that are likely a priori across all verbs. We place high pseudocounts on the no-operations (which preserve simple intransitive and transitive structure) and low pseudocounts on all the rest. Note that the use of this structured prior has another desired side effect: it breaks the symmetry of the role names (because some linkings are more likely than others), which encourages the model to adhere to canonical role naming conventions, at least for commonly occurring roles like ARG0 and ARG1.</Paragraph> <Paragraph position="7"> The design of the linking model does incorporate prior knowledge about the structure of verb linkings and diathesis alternations. Indeed, the linking model provides a weak form of Universal Grammar, encoding the kinds of linking patterns that are known to occur in human languages.</Paragraph> <Paragraph position="8"> While not fully developed as a model of cross-linguistic verb argument realization, the model is not very English specific. It provides a not-very-constrained theory of alternations that captures common cross-linguistic patterns. Finally, though we do encode knowledge in the form of the model structure and associated prior distributions, note that we do not provide any verb-specific knowledge; this is left to the learning algorithm.</Paragraph> </Section> </Section>
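To illustrate how a derivation process of this kind can be compiled into the multinomial P(ℓ|v), here is a minimal sketch. The operation inventory and pseudocount values below are toy stand-ins (only add and no-op operations are shown, not the full inventory of Table 2, which also includes operations that replace or displace roles); the names are assumptions, not the authors' code.

```python
from itertools import product
from collections import defaultdict

# One list of candidate operations per core role; each operation either does
# nothing or adds (role -> relation) to the linking under construction.
OPERATIONS = {
    "ARG0": [("noop", None), ("add", ("ARG0", "subj"))],
    "ARG1": [("noop", None), ("add", ("ARG1", "np#1")), ("add", ("ARG1", "cl#1"))],
    "ARG2": [("noop", None), ("add", ("ARG2", "np#2"))],
}
# Structured prior: high pseudocounts on no-operations, low on the rest.
PSEUDO = {"noop": 5.0, "add": 0.1}

def linking_prior():
    """Enumerate every linking reachable by choosing one operation per list,
    accumulating prior mass from per-operation pseudocounts."""
    prior = defaultdict(float)
    for ops in product(*OPERATIONS.values()):
        linking = tuple(arg for kind, arg in ops if kind == "add")
        weight = 1.0
        for kind, _ in ops:
            weight *= PSEUDO[kind]
        prior[linking] += weight
    total = sum(prior.values())
    return {l: w / total for l, w in prior.items()}

for linking, p in sorted(linking_prior().items(), key=lambda kv: -kv[1])[:3]:
    print(round(p, 3), linking)
```

Because the no-operations carry high pseudocounts, the empty (intransitive-like) and single-argument linkings receive most of the prior mass, which is the intended effect of the structured prior.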
<Section position="6" start_page="4" end_page="5" type="metho"> <SectionTitle> 4 Learning </SectionTitle> <Paragraph position="0"> Our goal in learning is to find parameter settings of our model which are likely given the data. Using θ to represent the vector of all model parameters, if our data were fully observed, we could express our learning problem as θ* = argmax_θ ∏_{i=1..N} P(v_i, ℓ_i, o_i, g_i, r_i, w_i | θ).</Paragraph> <Paragraph position="2"> Because of the factorization of the joint distribution, this learning task would be trivial, computable in closed form from relative frequency counts. Unfortunately, in our training set the variables ℓ, o and r are hidden (not observed), leaving us with a much harder optimization problem: θ* = argmax_θ ∏_{i=1..N} Σ_{ℓ,o,r} P(v_i, ℓ, o, g_i, r, w_i | θ).</Paragraph> <Paragraph position="4"> In other words, we want model parameters which maximize the expected likelihood of the observed data, where the expectation is taken over the hidden variables for each instance. Although it is intractable to find exact solutions to optimization problems of this form, the Expectation-Maximization (EM) algorithm is a greedy search procedure over the parameter space which is guaranteed to increase the expected likelihood, and thus find a local maximum of the function.</Paragraph> <Paragraph position="5"> While the M-step is clearly trivial, the E-step at first looks more complex: there are three hidden variables for each instance, ℓ, o, and r, each of which can take an exponential number of values.</Paragraph> <Paragraph position="6"> Note, however, that conditioned on the observed set of syntactic relations g, the variables ℓ and o are completely determined by a choice of roles r for each dependent. So to represent uncertainty over these variables, we need only represent a distribution over possible role vectors r. Though in the worst case the set of possible role vectors is still exponential, we only need role vectors that are consistent with both the observed list of syntactic relations and a linking that can be generated by the construction operations. Empirically the number of linkings is small (less than 50) for each of the observed instances in our data sets.</Paragraph> <Paragraph position="7"> Then for each instance we construct a conditional probability distribution over this set, which is computable in terms of the model parameters: P(r | v, g, w) ∝ P(ℓ_r | v) P(o_r | ℓ_r) ∏_{j=1..M} P(gj | o_r) P(rj | o_r) P(wj | rj). We have denoted as ℓ_r and o_r the values of ℓ and o that are determined by each choice of r.</Paragraph> <Paragraph position="10"> To make EM work, there are a few additional subtleties. First, because EM is a hill-climbing algorithm, we must initialize it to a point in parameter space with slope (and without symmetries). We do so by adding a small amount of noise: for each dependent of each verb, we add a fractional count of 10^-6 to the word distribution of a semantic role selected at random. Second, we must choose when to stop EM: we run until the relative change in data log likelihood is less than 10^-4.</Paragraph>
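A minimal sketch of the E-step computation for a single instance, assuming the candidate role vectors consistent with the observed relations (and with a derivable linking) have already been enumerated, and assuming simple dictionary-based parameter containers; the names below are illustrative, not the authors' implementation.

```python
def e_step_posterior(verb, rels, words, candidates, params):
    """Posterior over role vectors r for one instance, up to normalization.
    candidates: list of (roles, linking) pairs consistent with the observed
    relations `rels`; params holds the multinomials described above."""
    scores = []
    for roles, linking in candidates:
        n_adjuncts = sum(1 for r in roles if r == "ARGM")
        score = params["linking"][verb].get(linking, 0.0)         # P(l_r | v)
        score *= params["adjunct_penalty"] ** n_adjuncts          # P(o_r | l_r), up to a constant
        for role, rel, word in zip(roles, rels, words):
            if role == "ARGM":                                    # adjunct relation g_j is sampled
                score *= params["adjunct_rel"].get(rel, 0.0)
            score *= params["word"][role].get(word, 0.0)          # P(w_j | r_j)
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores
```

The M-step then re-estimates each multinomial from the resulting expected counts, together with the pseudocounts described in Section 3.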
<Paragraph position="11"> A separate but important question is how well EM works for finding &quot;good&quot; models in the space of possible parameter settings. &quot;Good&quot; models are ones which list linkings for each verb that correspond to linguists' judgments about verb linking behavior. Recall that EM is guaranteed only to find a local maximum of the data likelihood function. There are two reasons why a particular maximum might not be a &quot;good&quot; model. First, because it is a greedy procedure, EM might get stuck in local maxima, and be unable to find other points in the space that have much higher data likelihood.</Paragraph> <Paragraph position="12"> We take the traditional approach to this problem, which is to use random restarts; however, empirically there is very little variance over runs. A deeper problem is that data likelihood may not correspond well to a linguist's assessment of model quality. As evidence that this is not the case, we have observed a strong correlation between data log likelihood and labeling accuracy.</Paragraph> </Section> <Section position="7" start_page="5" end_page="6" type="metho"> <SectionTitle> 5 Datasets and Evaluation </SectionTitle> <Paragraph position="0"> We train our models with verb instances extracted from three parsed corpora: (1) the Wall Street Journal section of the Penn Treebank (PTB), which was parsed by human annotators (Marcus et al., 1993), (2) the Brown Laboratory for Linguistic Information Processing corpus of Wall Street Journal text (BLLIP), which was parsed automatically by the Charniak parser (Charniak, 2000), and (3) the Gigaword corpus of raw newswire text (GW), which we parsed ourselves with the Stanford parser. [Table caption (fragment): ...in PropBank Section 23 and Section 24 for semantic role. Learned results are averaged over 5 runs.] In all cases, when training a model, we specify a set of target verb types (e.g., the ones in the test set), and build a training set by adding a fixed number of instances of each verb type from the PTB, BLLIP, and GW data sets, in that order.</Paragraph> <Paragraph position="2"> For the semantic role labeling evaluation, we use our system to label the dependents of unseen verb instances for semantic role. We use the sentences in PTB section 23 for testing, and PTB section 24 for development. The development set consists of 2507 verb instances and 833 different verb types, and the test set consists of 4269 verb instances and 1099 different verb types. Free parameters were tuned on the development set, and the test set was only used for final experiments.</Paragraph> <Paragraph position="3"> Because we do not observe the gold standard semantic roles at training time, we must choose an alignment between the guessed labels and the gold labels. We do so optimistically, by choosing the gold label for each guessed label which maximizes the number of correct guesses. This is a well-known approach to evaluation in unsupervised learning: when it is used to compute accuracy, the resulting metric is sometimes called cluster purity. While this amounts to &quot;peeking&quot; at the answers before evaluation, the amount of human knowledge that is given to the system is small: it corresponds to the effort required to hand assign a &quot;name&quot; to each label that the system proposes. As is customary, we divide the problem into two subtasks: identification (ID) and classification (CL). In the identification task, we identify the set of constituents which fill some role for a target verb: in our system we use simple rules to extract dependents of the target verb and their grammatical relations. In the classification task, the identified constituents are labeled for their semantic role by the learned probabilistic model.</Paragraph>
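A minimal sketch of the optimistic label alignment (cluster purity) described above, assuming the system's guesses and the gold labels are available as flat (guessed, gold) pairs; the data layout is an assumption for illustration.

```python
from collections import Counter, defaultdict

def purity_accuracy(pairs):
    """Map each guessed label to the gold label it most often co-occurs with,
    and compute accuracy under that mapping (cluster purity).
    pairs: iterable of (guessed_label, gold_label) over all dependents."""
    by_guess = defaultdict(Counter)
    for guess, gold in pairs:
        by_guess[guess][gold] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_guess.values())
    total = sum(sum(counts.values()) for counts in by_guess.values())
    return correct / total if total else 0.0

print(purity_accuracy([(0, "ARG0"), (0, "ARG0"), (1, "ARG1"), (1, "ARG0")]))  # 0.75
```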
<Paragraph position="5"> We report results on two variants of the basic classification task: coarse roles, in which all of the adjunct roles are collapsed to a single ARGM role (Toutanova, 2005), and core roles, in which we evaluate performance on the core semantic roles only (thus collapsing the ARGM and unlabeled categories). We do not report results on the all roles task, since our current model does not distinguish between different types of adjunct roles. For each task we report precision, recall, and F1.</Paragraph> </Section> </Paper>