File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/99/p99-1014_metho.xml
Size: 20,063 bytes
Last Modified: 2025-10-06 14:15:26
<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1014"> <Title>Detlef Prescher</Title> <Section position="3" start_page="0" end_page="106" type="metho"> <SectionTitle> 2 EM-Based Clustering </SectionTitle> <Paragraph position="0"> In our clustering approach, classes are derived directly from distributional data--a sample of pairs of verbs and nouns, gathered by parsing an unannotated corpus and extracting the fillers of grammatical relations. Semantic classes corresponding to such pairs are viewed as hidden variables or unobserved data in the context of maximum likelihood estimation from incomplete data via the EM algorithm. This approach allows us to work in a mathematically well-defined framework of statistical inference, i.e., standard monotonicity and convergence results for the EM algorithm extend to our method.</Paragraph> <Paragraph position="1"> The two main tasks of EM-based clustering are i) the induction of a smooth probability model on the data, and ii) the automatic discovery of class-structure in the data. Both of these aspects are respected in our application of lexicon induction. The basic ideas of our EM-based clustering approach were presented in Rooth (Ms).</Paragraph> <Paragraph position="2"> Our approach constrasts with the merely heuristic and empirical justification of similarity-based approaches to clustering (Dagan et al., to appear) for which so far no clear probabilistic interpretation has been given. The probability model we use can be found earlier in Pereira et al. (1993). However, in contrast to this ap-</Paragraph> <Paragraph position="4"> proach, our statistical inference method for clustering is formalized clearly as an EM-algorithm.</Paragraph> <Paragraph position="5"> Approaches to probabilistic clustering similar to ours were presented recently in Saul and Pereira (1997) and Hofmann and Puzicha (1998). There also EM-algorithms for similar probability models have been derived, but applied only to simpler tasks not involving a combination of EM-based clustering models as in our lexicon induction experiment. For further applications of our clustering model see Rooth et al. (1998).</Paragraph> <Paragraph position="6"> We seek to derive a joint distribution of verb-noun pairs from a large sample of pairs of verbs v E V and nouns n E N. The key idea is to view v and n as conditioned on a hidden class c E C, where the classes are given no prior interpretation. The semantically smoothed probability of a pair (v, n) is defined to be:</Paragraph> <Paragraph position="8"> The joint distribution p(c,v,n) is defined by p(c, v, n) = p(c)p(vlc)p(n\[c ). Note that by construction, conditioning of v and n on each other is solely made through the classes c.</Paragraph> <Paragraph position="9"> In the framework of the EM algorithm (Dempster et al., 1977), we can formalize clustering as an estimation problem for a latent class (LC) model as follows. We are given: (i) a sample space y of observed, incomplete data, corre17: scalar change sponding to pairs from VxN, (ii) a sample space X of unobserved, complete data, corresponding to triples from CxYxg, (iii) a set X(y) = {x E</Paragraph> <Paragraph position="11"> to the observation y, (iv) a complete-data specification pe(x), corresponding to the joint probability p(c, v, n) over C x V x N, with parameter-vector 0 : (0c, Ovc, OncJc E C, v e V, n E N), (v) an incomplete data specification Po(Y) which is related to the complete-data specification as the marginal probability Po(Y) -- ~~X(y)po(x). 
<Paragraph position="11"> The EM algorithm is directed at finding a value θ̂ of θ that maximizes the incomplete-data log-likelihood function L as a function of θ for a given sample Y, i.e., θ̂ = arg max_θ L(θ), where L(θ) = ln Π_{y ∈ Y} p_θ(y).</Paragraph> <Paragraph position="12"> As prescribed by the EM algorithm, the parameters of L(θ) are estimated indirectly by proceeding iteratively in terms of complete-data estimation for the auxiliary function Q(θ; θ^(t)), which is the conditional expectation of the complete-data log-likelihood ln p_θ(x) given the observed data y and the current fit of the parameter values θ^(t) (E-step). This auxiliary function is iteratively maximized as a function of θ (M-step), where each iteration is defined by the map θ^(t+1) = M(θ^(t)) = arg max_θ Q(θ; θ^(t)).</Paragraph> <Paragraph position="13"> Note that our application is an instance of the EM algorithm for context-free models (Baum et al., 1970), from which the following particularly simple reestimation formulae can be derived. Let x = (c, y) for fixed c and y, and let f(y) denote the frequency of the pair y in the sample. Then
θ̂_c = Σ_{y ∈ Y} f(y) p_θ(c|y) / Σ_{y ∈ Y} f(y),
θ̂_{vc} = Σ_{n ∈ N} f(v, n) p_θ(c|v, n) / Σ_{y ∈ Y} f(y) p_θ(c|y),
θ̂_{nc} = Σ_{v ∈ V} f(v, n) p_θ(c|v, n) / Σ_{y ∈ Y} f(y) p_θ(c|y).</Paragraph> <Paragraph position="14"> Intuitively, the conditional expectation of the number of times a particular v, n, or c choice is made during the derivation is prorated by the conditionally expected total number of times a choice of the same kind is made. As shown by Baum et al. (1970), these expectations can be calculated efficiently using dynamic programming techniques. Every such maximization step increases the log-likelihood function L, and a sequence of re-estimates eventually converges to a (local) maximum of L.</Paragraph>
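Under the same illustrative array conventions as above, one EM iteration implementing these re-estimation formulae can be sketched as follows; the pair-frequency table f is assumed here to be a dict from (v, n) index pairs to counts:

```python
import numpy as np

def em_step(p_c, p_vc, p_nc, f):
    """One EM iteration for the LC model (illustrative sketch).
    f: dict mapping observed (v, n) index pairs to their sample frequencies."""
    C, V = p_vc.shape
    _, N = p_nc.shape
    count_c = np.zeros(C)          # expected counts of class choices
    count_vc = np.zeros((C, V))    # expected counts of verb-given-class choices
    count_nc = np.zeros((C, N))    # expected counts of noun-given-class choices
    for (v, n), freq in f.items():
        post = p_c * p_vc[:, v] * p_nc[:, n]   # E-step: p(c, v, n)
        post /= post.sum()                     # p(c | v, n)
        count_c += freq * post
        count_vc[:, v] += freq * post
        count_nc[:, n] += freq * post
    # M-step: each expected count is prorated by the total of its kind.
    new_p_c = count_c / count_c.sum()
    new_p_vc = count_vc / count_vc.sum(axis=1, keepdims=True)
    new_p_nc = count_nc / count_nc.sum(axis=1, keepdims=True)
    return new_p_c, new_p_vc, new_p_nc
```

Iterating this step until the change in log-likelihood becomes negligible yields the locally optimal models whose behaviour is examined below.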
<Paragraph position="19"> In the following, we will present some examples of induced clusters. Input to the clustering algorithm was a training corpus of 1280715 tokens (608850 types) of verb-noun pairs participating in the grammatical relations of intransitive and transitive verbs and their subject- and object-fillers. The data were gathered from the maximal-probability parses which the probabilistic context-free grammar of (Carroll and Rooth, 1998) gave for the British National Corpus (117 million words).</Paragraph> <Paragraph position="20"> Fig. 1 shows class 5 of a model with 35 classes. At the top are listed the 20 most probable nouns in the p(n|5) distribution and their probabilities, and at the left are the 30 most probable verbs in the p(v|5) distribution; 5 is the class index. Those verb-noun pairs which were seen in the training data appear with a dot in the class matrix. Verbs with the suffix .as:s indicate the subject slot of an active intransitive. Similarly, .aso:s denotes the subject slot of an active transitive, and .aso:o denotes the object slot of an active transitive. Thus v in the above discussion actually consists of a combination of a verb with a subcat frame slot as:s, aso:s, or aso:o. Induced classes often have a basis in lexical semantics; class 5 can be interpreted as clustering agents, denoted by proper names, &quot;man&quot;, and &quot;woman&quot;, together with verbs denoting communicative action. Fig. 2 shows a cluster involving verbs of scalar change and things which can move along scales. Fig. 5 can be interpreted as involving different dispositions and modes of their execution.</Paragraph> </Section> <Section position="4" start_page="106" end_page="106" type="metho"> <SectionTitle> 3 Evaluation of Clustering Models </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="106" end_page="106" type="sub_section"> <SectionTitle> 3.1 Pseudo-Disambiguation </SectionTitle> <Paragraph position="0"> We evaluated our clustering models on a pseudo-disambiguation task similar to that performed in Pereira et al. (1993), but differing in detail. The task is to judge which of two verbs v and v' is more likely to take a given noun n as its argument, where the pair (v, n) has been cut out of the original corpus and the pair (v', n) is constructed by pairing n with a randomly chosen verb v' such that the combination (v', n) is completely unseen. Thus this test evaluates how well the models generalize over unseen verbs.</Paragraph> <Paragraph position="2"> The data for this test were built as follows. We constructed an evaluation corpus of (v, n, v') triples by randomly cutting a test corpus of 3000 (v, n) pairs out of the original corpus of 1280712 tokens, leaving a training corpus of 1178698 tokens. Each noun n in the test corpus was combined with a verb v' which was randomly chosen according to its frequency, such that the pair (v', n) appeared neither in the training nor in the test corpus; however, the elements v, v', and n were required to be part of the training corpus. Furthermore, we restricted the verbs and nouns in the evaluation corpus to the ones which occurred at least 30 times and at most 3000 times with some verb-functor v in the training corpus. The resulting 1337 evaluation triples were used to evaluate a sequence of clustering models trained from the training corpus.</Paragraph> <Paragraph position="5"> The clustering models we evaluated were parametrized in the starting values of the training algorithm, in the number of classes of the model, and in the number of iteration steps, resulting in a sequence of 3 × 10 × 6 models. Starting from a lower bound of 50 % random choice, accuracy was calculated as the number of times the model decided p(n|v) > p(n|v') out of all choices made. Fig. 3 shows the evaluation results for models trained with 50 iterations, averaged over starting values, and plotted against class cardinality. Different starting values had an effect of ± 2 % on the performance of the test. We obtained a value of about 80 % accuracy for models between 25 and 100 classes. Models with more than 100 classes show a small but stable overfitting effect.</Paragraph> </Section> <Section position="2" start_page="106" end_page="106" type="sub_section"> <SectionTitle> 3.2 Smoothing Power </SectionTitle> <Paragraph position="0"> A second experiment addressed the smoothing power of the model by counting the number of (v, n) pairs in the set V × N of all possible combinations of verbs and nouns which received a positive joint probability from the model. The V × N space for the above clustering models included about 425 million (v, n) combinations; we approximated the smoothing size of a model by randomly sampling 1000 pairs from V × N and returning the percentage of positively assigned pairs in the random sample. Fig. 4 plots the smoothing results for the above models against the number of classes.</Paragraph>
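For concreteness, both evaluation measures of this section can be sketched along the following lines, again under the illustrative array layout introduced earlier; the triple and sampling formats are assumptions made for exposition:

```python
import numpy as np

def p_vn(p_c, p_vc, p_nc, v, n):
    """Smoothed joint p(v, n) under the LC model, as sketched in Sect. 2."""
    return float(np.sum(p_c * p_vc[:, v] * p_nc[:, n]))

def p_n_given_v(p_c, p_vc, p_nc, v, n):
    """p(n|v) = p(v, n) / p(v), with p(v) = sum_c p(c) p(v|c)."""
    return p_vn(p_c, p_vc, p_nc, v, n) / float(np.sum(p_c * p_vc[:, v]))

def pseudo_disambiguation_accuracy(p_c, p_vc, p_nc, triples):
    """Fraction of (v, n, v2) evaluation triples for which the model
    prefers the attested verb, i.e. decides p(n|v) > p(n|v2)."""
    hits = sum(p_n_given_v(p_c, p_vc, p_nc, v, n) >
               p_n_given_v(p_c, p_vc, p_nc, v2, n)
               for (v, n, v2) in triples)
    return hits / len(triples)

def smoothing_power(p_c, p_vc, p_nc, sample_size=1000, seed=0):
    """Percentage of randomly sampled (v, n) pairs from V x N that receive
    a positive joint probability under the model."""
    rng = np.random.default_rng(seed)
    C, V = p_vc.shape
    _, N = p_nc.shape
    vs = rng.integers(0, V, sample_size)
    ns = rng.integers(0, N, sample_size)
    positive = sum(p_vn(p_c, p_vc, p_nc, v, n) > 0 for v, n in zip(vs, ns))
    return 100.0 * positive / sample_size
```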
<Paragraph position="1"> Starting values had an influence of ± 1 % on performance. Given the proportion of the number of types in the training corpus to the size of the V × N space, without clustering we have a smoothing power of 0.14 %, whereas, for example, a model with 50 classes and 50 iterations has a smoothing power of about 93 %.</Paragraph> <Paragraph position="2"> In accordance with the maximum likelihood paradigm, the number of training iterations had a decreasing effect on the smoothing performance, whereas the accuracy of the pseudo-disambiguation increased with the number of iterations. We found 50 iterations to be a good compromise in this trade-off.</Paragraph> </Section> </Section> <Section position="5" start_page="106" end_page="108" type="metho"> <SectionTitle> 4 Lexicon Induction Based on Latent Classes </SectionTitle> <Paragraph position="0"> The goal of the following experiment was to derive a lexicon of several hundred intransitive and transitive verbs with subcat slots labeled with latent classes.</Paragraph> <Section position="2" start_page="107" end_page="107" type="sub_section"> <SectionTitle> 4.1 Probabilistic Labeling with Latent Classes using EM-estimation </SectionTitle> <Paragraph position="0"> To induce latent classes for the subject slot of a fixed intransitive verb, the following statistical inference step was performed. Given a latent class model p_LC(·) for verb-noun pairs, and a sample n_1, ..., n_M of subjects for a fixed intransitive verb, we calculate the probability of an arbitrary subject n ∈ N by p(n) = Σ_{c ∈ C} p(c, n) = Σ_{c ∈ C} p(c) p_LC(n|c).</Paragraph> <Paragraph position="1"> The estimation of the parameter vector θ = (θ_c | c ∈ C) can be formalized in the EM framework by viewing p(n) or p(c, n) as a function of θ for fixed p_LC(·). The re-estimation formula resulting from the incomplete-data estimation for these probability functions has the following form (f(n) is the frequency of n in the sample of subjects of the fixed verb):
θ̂_c = Σ_{n ∈ N} f(n) p(c|n) / Σ_{n ∈ N} f(n).</Paragraph> <Paragraph position="2"> A similar EM induction process can also be applied to pairs of nouns, thus enabling the induction of latent semantic annotations for transitive verb frames. Given an LC model p_LC(·) for verb-noun pairs, and a sample (n_1, n_2)_1, ..., (n_1, n_2)_M of noun arguments (n_1 subjects and n_2 direct objects) for a fixed transitive verb, we calculate the probability of its noun argument pairs by p(n_1, n_2) = Σ_{c_1, c_2 ∈ C} p(c_1, c_2, n_1, n_2) = Σ_{c_1, c_2 ∈ C} p(c_1, c_2) p_LC(n_1|c_1) p_LC(n_2|c_2).</Paragraph> <Paragraph position="3"> The estimation of the parameter vector θ = (θ_{c_1 c_2} | c_1, c_2 ∈ C) can be formalized in an EM framework by viewing p(n_1, n_2) or p(c_1, c_2, n_1, n_2) as a function of θ for fixed p_LC(·). The re-estimation formula resulting from this incomplete-data estimation problem has the following simple form (f(n_1, n_2) is the frequency of (n_1, n_2) in the sample of noun argument pairs of the fixed verb):
θ̂_{c_1 c_2} = Σ_{(n_1, n_2)} f(n_1, n_2) p(c_1, c_2 | n_1, n_2) / Σ_{(n_1, n_2)} f(n_1, n_2).</Paragraph> <Paragraph position="4"> Note that the class distributions p(c) and p(c_1, c_2) for intransitive and transitive models can be computed also for verbs unseen in the training corpus.</Paragraph> </Section> <Section position="3" start_page="107" end_page="108" type="sub_section"> <SectionTitle> 4.2 Lexicon Induction Experiment </SectionTitle> <Paragraph position="0"> Experiments used a model with 35 classes. From maximal probability parses for the British National Corpus derived with a statistical parser (Carroll and Rooth, 1998), we extracted frequency tables for intransitive verb/subject pairs and transitive verb/subject/object triples. The 500 most frequent verbs were selected for slot labeling.</Paragraph>
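The slot-labeling step of Sect. 4.1 can be sketched in the same illustrative style: only the class distribution of the slot is re-estimated, while the clustering model's p_LC(n|c) (the rows of p_nc) stays fixed; the argument formats are assumptions made for exposition:

```python
import numpy as np

def label_subject_slot(p_nc, f_n, iterations=50):
    """Estimate the latent-class distribution p(c) for the subject slot of one
    intransitive verb, keeping the clustering model's p_LC(n|c) fixed.
    f_n: dict mapping noun indices to their frequency as subjects of this verb."""
    C = p_nc.shape[0]
    p_c = np.full(C, 1.0 / C)              # uniform starting distribution
    total = sum(f_n.values())
    for _ in range(iterations):
        counts = np.zeros(C)
        for n, freq in f_n.items():
            post = p_c * p_nc[:, n]        # p(c) p_LC(n|c)
            post /= post.sum()             # p(c|n)
            counts += freq * post
        p_c = counts / total               # re-estimation of theta_c
    return p_c
```

The most probable class under the resulting distribution serves as the slot label; the transitive case is analogous, with a distribution over pairs of classes (c_1, c_2).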
<Paragraph position="1"> Fig. 6 shows two verbs v for which the most probable class label is 5, a class which we earlier described as communicative action, together with the estimated frequencies f(n) p_θ(c|n) for those ten nouns n for which this estimate is highest. Fig. 7 shows corresponding data for the intransitive scalar motion sense of increase.</Paragraph> <Paragraph position="2"> Fig. 8 shows the intransitive verbs which take 17 as the most probable label. Intuitively, the verbs are semantically coherent. When compared to Levin (1993)'s 48 top-level verb classes, we found an agreement of our classification with her class of &quot;verbs of changes of state&quot;, except for the last three verbs in the list in Fig. 8, which is sorted by the probability of the class label.</Paragraph> <Paragraph position="3"> Similar results for German intransitive scalar motion verbs are shown in Fig. 9. The data for these experiments were extracted from the maximal-probability parses of a 4.1 million word corpus of German subordinate clauses, yielding 418290 tokens (318086 types) of pairs of verbs or adjectives and nouns. The lexicalized probabilistic grammar used for German is described in Beil et al. (1999). We compared the German example of scalar motion verbs to the linguistic classification of verbs given by Schuhmacher (1986) and found an agreement of our classification with the class of &quot;einfache Änderungsverben&quot; (simple verbs of change), except for the verbs anwachsen (increase) and stagnieren (stagnate), which were not classified there at all.</Paragraph> <Paragraph position="4"> Fig. 10 shows the most probable pair of classes for increase as a transitive verb, together with estimated frequencies for the head filler pair. Note that the object label 17 is the class found with intransitive scalar motion verbs; this correspondence is exploited in the next section.</Paragraph> </Section> </Section> <Section position="6" start_page="108" end_page="109" type="metho"> <SectionTitle> 5 Linguistic Interpretation </SectionTitle> <Paragraph position="0"> In some linguistic accounts, multi-place verbs are decomposed into representations involving (at least) one predicate or relation per argument. For instance, the transitive causative/inchoative verb increase is composed of an actor/causative verb combining with a one-place predicate in the structure on the left in Fig. 11. (Fig. 11 shows, in order: the lexical entry for the transitive verb increase; the corresponding lexical entry with induced classes as relational constants; the indexed open class root added as a conjunct in transitive scalar motion increase; and the induced entry for the related intransitive increase.) Linguistically, such representations are motivated by argument alternations (diathesis), case linking and deep word order, language acquisition, scope ambiguity, the desire to represent aspects of lexical meaning, and the fact that in some languages the postulated decomposed representations are overt, with each primitive predicate corresponding to a morpheme.
For references and recent discussion of this kind of theory see Hale and Keyser (1993) and Kural (1996).</Paragraph> <Paragraph position="1"> We will sketch an understanding of the lexical representations induced by latent-class labeling in terms of the linguistic theories mentioned above, aiming at an interpretation which combines computational learnability, linguistic motivation, and denotational-semantic adequacy.</Paragraph> <Paragraph position="2"> The basic idea is that latent classes are computational models of the atomic relation symbols occurring in lexical-semantic representations. As a first implementation, consider replacing the relation symbols in the first tree in Fig. 11 with relation symbols derived from the latent class labeling. In the second tree in Fig. 11, R17 and R8 are relation symbols with indices derived from the labeling procedure of Sect. 4. Such representations can be semantically interpreted in standard ways, for instance by interpreting relation symbols as denoting relations between events and individuals.</Paragraph> <Paragraph position="3"> Such representations are semantically inadequate for reasons given in philosophical critiques of decomposed linguistic representations; see Fodor (1998) for recent discussion. A lexicon estimated in the above way has as many primitive relations as there are latent classes. We guess there should be a few hundred classes in an approximately complete lexicon (which would have to be estimated from a corpus of hundreds of millions of words or more). Fodor's arguments, which are based on the very limited degree of genuine interdefinability of lexical items and on Putnam's arguments for the contextual determination of lexical meaning, indicate that the number of basic concepts has the order of magnitude of the lexicon itself. More concretely, a lexicon constructed along the above principles would identify verbs which are labelled with the same latent classes; for instance, it might identify the representations of grab and touch.</Paragraph> <Paragraph position="4"> For these reasons, a semantically adequate lexicon must include additional relational constants. We meet this requirement in a simple way, by including as a conjunct a unique constant derived from the open-class root, as in the third tree in Fig. 11. We introduce indexing of the open class root (copied from the class index) in order that homophony of open class roots not result in common conjuncts in semantic representations; for instance, we do not want the two senses of decline exemplified in decline the proposal and decline five percent to have a common entailment represented by a common conjunct. This indexing method works as long as the labeling process produces different latent class labels for the different senses.</Paragraph> <Paragraph position="5"> The last tree in Fig. 11 is the learned representation for the scalar motion sense of the intransitive verb increase. In our approach, learning the argument alternation (diathesis) relating the transitive increase (in its scalar motion sense) to the intransitive increase (in its scalar motion sense) amounts to learning representations with a common component R17 ∧ increase17. In this case, this is achieved.</Paragraph> </Section> </Paper>