<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3224"> <Title>A Distributional Analysis of a Lexicalized Statistical Parsing Model</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Frequencies </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Definitions and notation </SectionTitle>
<Paragraph position="0"> In this paper we will refer to any estimated distribution as a parameter that has been instantiated from a parameter class. For example, in an n-gram language model, p(w_i | w_{i-1}) is a parameter class, whereas the estimated distribution p̂(· | the) is a particular parameter from this class, consisting of estimates of every word that can follow the word "the".</Paragraph>
<Paragraph position="1"> For this work, we used the model described in (Bikel, 2002; Bikel, 2004). Our emulation of Collins' Model 2 (hereafter referred to simply as "the model") has eleven parameter classes, each of which employs up to three back-off levels, where back-off level 0 is just the "un-backed-off" maximal context history.1 In other words, a smoothed probability estimate is the interpolation of up to three different unsmoothed estimates. The notation and description for each of these parameter classes is shown in Table 1.</Paragraph>
[Footnote 1: Collins' model splits out the PM and PMw classes into left- and right-specific versions, and has two additional classes for dealing with coordinating conjunctions and inter-phrasal punctuation. Our emulation of Collins' model incorporates the information of these specialized parameter classes into the existing PM and PMw parameters.]
[Table 1 notes (fragment): a lexicalized nonterminal is a nonterminal label and its head word's part of speech (such as NP(NN)); the hidden nonterminal +TOP+ is added during training to be the parent of every observed tree.]
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Basic frequencies </SectionTitle>
<Paragraph position="0"> Before looking at the number of parameters in the model, it is important to bear in mind the amount of data on which the model is trained and from which actual parameters will be induced for each parameter class. The standard training set for English consists of Sections 02-21 of the Penn Treebank, which in turn consist of 39,832 sentences with a total of 950,028 word tokens (not including null elements).</Paragraph>
<Paragraph position="1"> There are 44,113 unique words (again, not including null elements), 10,437 of which occur 6 times or more.2 The trees consist of 904,748 brackets with 28 basic nonterminal labels, to which function tags such as -TMP and indices are added in the data to form 1184 observed nonterminals, not including preterminals. After tree transformations, the model maps these 1184 nonterminals down to just 43. There are 42 unique part-of-speech tags that serve as preterminals in the trees; the model prunes away three of these (``, '' and .).</Paragraph>
[Footnote 2: We mention this statistic because Collins' thesis experiments were performed with an unknown-word threshold of 6.]
<Paragraph position="2"> Induced from these training data, the model contains 727,930 parameters; thus, there are nearly as many parameters as there are brackets or word tokens. From a history-based grammar perspective, there are 727,930 types of history contexts from which futures are generated. However, 401,447 of these are singletons. The average count for a history context is approximately 35.56, while the average diversity is approximately 1.72. The model contains 1,252,280 unsmoothed maximum-likelihood probability estimates (727,930 × 1.72 ≈ 1,252,280). Even when a given future was not seen with a particular history, it is possible that one of its associated back-off contexts was seen with that future, leading to a non-zero smoothed estimate. The total number of possible non-zero smoothed estimates in the model is 562,596,053. Table 2 contains count and diversity statistics for the two parameter classes on which we will focus much of our attention, PM and PMw. Note how the maximal-context back-off levels (level 0) for both parameter classes have relatively little training: on average, raw estimates are obtained with history counts of only 10.3 and 4.4 in the PM and PMw classes, respectively. Conversely, observe how drastically the average number of transitions n increases as we remove dependence on the head word going from back-off level 0 to 1.</Paragraph>
[Table 2 notes: c and d denote the average history count and diversity, respectively; n = c·d is the average number of transitions from a history context to some future.]
</Section>
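To make the back-off interpolation of Section 3.1 concrete, the following is a minimal Python sketch of a parameter class with three back-off levels whose smoothed estimate interpolates the raw maximum-likelihood estimates bottom-up. The Witten-Bell-style weight n/(n + k·d) is an illustrative assumption rather than the model's exact smoothing formula, and the class and method names are hypothetical.

```python
from collections import defaultdict

class BackoffParameterClass:
    """Illustrative sketch (not the model's actual implementation) of a
    parameter class with up to three back-off levels.  Level 0 conditions on
    the full history context; higher levels drop context elements."""

    def __init__(self, num_levels=3):
        self.num_levels = num_levels
        # counts[level][history][future] -> raw count
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(num_levels)]

    def observe(self, future, histories):
        """histories: tuple of history contexts, one per back-off level."""
        for level, hist in enumerate(histories):
            self.counts[level][hist][future] += 1

    def _ml(self, level, hist, future):
        """Unsmoothed ML estimate plus the history count n and diversity d."""
        c = self.counts[level][hist]
        n = sum(c.values())
        return (c[future] / n if n else 0.0), n, len(c)

    def smoothed(self, future, histories, k=5.0):
        """Interpolate the unsmoothed estimates, most-general level first, so
        the level-0 (maximal-context) estimate gets the final say.
        lambda = n / (n + k*d) is a Witten-Bell-style weight (an assumption)."""
        p = 0.0
        for level in reversed(range(self.num_levels)):
            p_ml, n, d = self._ml(level, histories[level], future)
            lam = n / (n + k * d) if n else 0.0
            p = lam * p_ml + (1.0 - lam) * p
        return p
```

In this spirit, a smoothed PMw estimate backs off from a level-0 history that includes the head word to levels that drop it, which is why a poorly trained level 0 still yields a non-zero smoothed estimate.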
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Exploratory data analysis: a common distribution </SectionTitle>
<Paragraph position="0"> To begin to get a handle on these distributions, particularly the relatively poorly-trained and/or high-entropy distributions of the PMw class, it is useful to perform some exploratory data analysis. Figure 1 illustrates the 25th-most-frequent PMw history context as a tree fragment. In the top-down model, the following elements have been generated: a parent nonterminal PP(IN/with) (a PP headed by the word with, with the part-of-speech tag IN); the parent's head child IN; a right subcat bag containing NP-A (a single NP argument must be generated somewhere on the right side of the head child); and a partially-lexicalized right-modifying nonterminal, NP-A(NN).</Paragraph>
[Figure 1 caption (fragment): the history context shown as a tree fragment; the "..." represents the future that is to be generated given this history.]
<Paragraph position="1"> At this point in the process, a PMw parameter conditioning on all of this context will be used to estimate the probability of the head word of the NP-A(NN), completing the lexicalization of that nonterminal. If a candidate head word was seen in training in this configuration, then it will be generated conditioning on the full context that crucially includes the head word with; otherwise, the model will back off to a history context that does not include the head word. In Figure 2, we plot the cumulative density function of this history context. We note that of the 3258 words with non-zero probability in this context, 95% of the probability mass is covered by the 1596 most likely words.</Paragraph>
<Paragraph position="2"> In order to get a better visualization of the probability distribution, we plotted smoothed probability estimates versus the training-data frequencies of the words being generated. Figure 3(a) shows smoothed estimates that make use of the full context (i.e., include the head word with) wherever possible, and Figure 3(b) shows smoothed estimates that do not use the head word. Note how the plot in Figure 3(b) appears remarkably similar to the "true" distribution of 3(a). 3(b) looks like a slightly "compressed" version of 3(a) (in the vertical dimension), but the shape of the two distributions appears to be roughly the same. This observation will be confirmed and quantified by the experiments of §5.</Paragraph>
</Section> </Section>
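The coverage figure quoted above (1596 of 3258 words covering 95% of the mass) can be computed directly from a context's word distribution. Below is a small, hypothetical Python sketch of that calculation, applied to a toy Zipf-like distribution rather than the real PP(IN/with) context.

```python
def coverage_count(word_probs, mass=0.95):
    """Number of most-probable words needed to cover `mass` of the total
    probability in a distribution given as {word: probability}."""
    total = sum(word_probs.values())
    covered, needed = 0.0, 0
    for p in sorted(word_probs.values(), reverse=True):
        covered += p
        needed += 1
        if covered >= mass * total:
            break
    return needed

# Toy Zipf-like distribution over 3258 hypothetical words (not real data).
ranks = range(1, 3259)
z = sum(1.0 / r for r in ranks)
toy_dist = {f"w{r}": (1.0 / r) / z for r in ranks}
print(coverage_count(toy_dist, mass=0.95))
```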
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Entropies </SectionTitle>
<Paragraph position="0"> A good measure of the discriminative efficacy of a parameter is its entropy. Table 3 shows the average entropy of all distributions for each parameter class.4 By far the highest average entropy is for the PMw parameter class.</Paragraph>
[Footnote 4 (beginning elided): ...that jointly estimate the prior probability of a lexicalized nonterminal; however, these two parameter classes are not part of the generative model.]
<Paragraph position="1"> Having computed the entropy for every distribution in every parameter class, we can actually plot a "meta-distribution" of entropies for a parameter class, as shown in Figure 4. As an example of one of the data points of Figure 4, consider the history context explored in the previous section. While it may be one of the most frequent, it also has the highest entropy at 9.141 bits, as shown by Table 4. This value not only confirms but quantifies the long-held intuition that PP-attachment requires more than just the local phrasal context; it is, e.g., precisely why the PP-specific features of (Collins, 2000) were likely to be very helpful, as cases such as these are among the most difficult that the model must discriminate. In fact, of the top 50 highest-entropy distributions from PMw, 25 involve the configuration PP --> IN(IN/<prep>) NP-A(NN/...), where <prep> is some preposition whose tag is IN.</Paragraph>
[Table residue: per back-off level minimum, maximum, average, and median entropies for PM and PMw. Caption fragment: "...broken down by parent-head-modifier triple."]
<Paragraph position="4"> Somewhat disturbingly, these are also some of the most frequent constructions.</Paragraph>
<Paragraph position="5"> To gauge roughly the importance of these high-frequency, high-entropy distributions, we performed the following analysis. Assume for the moment that every word-generation decision is roughly independent of all others (this is clearly not true, given head-propagation). We can then compute the total entropy of word-generation decisions for the entire training corpus via Σ_c f(c) · H(c), where f(c) is the frequency of some history context c and H(c) is that context's entropy. The total modifier word-generation entropy for the corpus with the independence assumption is 3,903,224 bits. Of these, the total entropy for contexts of the form PP --> IN NP-A is 618,640 bits, representing a sizable 15.9% of the total entropy, and the single largest percentage of total entropy of any parent-head-modifier triple (see Figure 5).</Paragraph>
<Paragraph position="8"> On the opposite end of the entropy spectrum, there are tens of thousands of PMw parameters with extremely low entropies, mostly having to do with extremely low-diversity, low-entropy part-of-speech tags, such as DT, CC, IN or WRB. Perhaps even more interesting is the number of distributions with identical entropies: of the 206,234 distributions, there are only 92,065 unique entropy values. Distributions with the same entropy are all candidates for removal from the model, because most of their probability mass resides in the back-off distribution.</Paragraph>
<Paragraph position="9"> Many of these distributions are low- or one-count history contexts, justifying the common practice of removing transitions whose history count is below a certain threshold. This practice could be made more rigorous by relying on distributional similarity. Finally, we note that the most numerous low-entropy distributions (that are not trivial) involve generating right-modifier words of the head child of an SBAR parent. The model is able to learn these constructions extremely well, as one might expect.</Paragraph>
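The entropy and total-entropy quantities used throughout this section reduce to a few lines of code. The sketch below assumes each history context is available as a dictionary mapping futures to probabilities, paired with its training frequency; all names and the toy numbers are illustrative, not taken from the model.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)

def total_word_generation_entropy(contexts):
    """Sum over history contexts c of f(c) * H(c): the total word-generation
    entropy under the rough independence assumption described in the text."""
    return sum(freq * entropy(dist) for freq, dist in contexts.values())

# Toy example with two hypothetical history contexts.
contexts = {
    "PP -> IN(IN/with) NP-A(NN/...)": (1000, {"respect": 0.5, "regard": 0.3, "ease": 0.2}),
    "SBAR -> IN(IN/that) S-A(...)":   (400,  {"that": 0.99, "whether": 0.01}),
}
print(total_word_generation_entropy(contexts))
```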
<Paragraph position="10"> 5 Distributional similarity and bilexical statistics
We now return to the issue of bilexical statistics. As alluded to earlier, Gildea (2001) performed an experiment with his partial reimplementation of Collins' Model 1 in which he removed the maximal-context back-off level from PMw, which effectively removed all bilexical statistics from his model. Gildea observed that this change resulted in only a 0.5% drop in parsing performance. There were two logical possibilities for this behavior: either such statistics were not getting used due to sparse-data problems, or they were not informative for some reason. The prevailing view of the NLP community had been that bilexical statistics were sparse, and Gildea (2001) adopted this view to explain his results. Subsequently, we duplicated Gildea's experiment with a complete emulation of Collins' Model 2, and found that when the decoder requested a smoothed estimate involving a bigram while testing on held-out data, it received an estimate that actually made use of bilexical statistics a mere 1.49% of the time (Bikel, 2004). The conclusion was that the minuscule drop in performance from removing bigrams must have been due to the fact that they were barely able to be used. In other words, it appeared that bigram coverage was not nearly good enough for bigrams to have an impact on parsing performance, seemingly confirming the prevailing view.</Paragraph>
<Paragraph position="11"> But the 1.49% figure does not tell the whole story.</Paragraph>
<Paragraph position="12"> The parser pursues many incorrect and ultimately low-scoring theories in its search (in this case, using probabilistic CKY). So rather than asking how many times the decoder makes use of bigram statistics on average, a better question is to ask how many times the decoder can use bigram statistics while pursuing the top-ranked theory. To answer this question, we used our parser to constrain-parse its own output. That is, having trained it on Sections 02-21, we used it to parse Section 00 of the Penn Treebank (the canonical development test set) and then re-parse that section using its own highest-scoring trees (without lexicalization) as constraints, so that it only pursued theories consistent with those trees. As it happens, the number of times the decoder was able to use bigram statistics shot up to 28.8% overall, with a rate of 22.4% for NPB constituents.</Paragraph>
<Paragraph position="13"> So, bigram statistics are getting used; in fact, they are getting used more than 19 times as often when pursuing the highest-scoring theory as when pursuing any theory on average. And yet there is no disputing the fact that their use has a surprisingly small effect on parsing performance. The exploratory data analysis of §3.3 suggests an explanation for this perplexing behavior: the distributions that include the head word and those that do not are so similar as to make almost no difference in terms of parse accuracy.</Paragraph>
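A rough sketch of how such usage rates can be measured: at each request for a smoothed PMw estimate, record whether the maximal-context (head-word-bearing) back-off level had a nonzero history count in training; the rate of such hits over all requests yields figures like the 1.49% and 28.8% above. The hook below is purely hypothetical instrumentation, not the parser's actual interface.

```python
class BigramUsageMeter:
    """Hypothetical instrumentation: track how often a smoothed estimate
    actually draws on the head-word (maximal-context) back-off level."""

    def __init__(self):
        self.requests = 0
        self.hits = 0

    def record(self, level0_history_count):
        """Call once per smoothed-estimate request, passing the training
        count of the level-0 (head-word) history context."""
        self.requests += 1
        if level0_history_count > 0:
            self.hits += 1

    def rate(self):
        return self.hits / self.requests if self.requests else 0.0
```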
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Distributional similarity </SectionTitle>
<Paragraph position="0"> A useful metric for measuring distributional similarity, as explored by (Lee, 1999), is the Jensen-Shannon divergence, JS(p || q) = (1/2) D(p || avg_{p,q}) + (1/2) D(q || avg_{p,q}), where D is the Kullback-Leibler divergence (Cover and Thomas, 1991) and where avg_{p,q} = (1/2)(p + q). An interpretation for the Jensen-Shannon divergence due to Slonim et al. (2002) is that it is related to the log-likelihood that "the two sample distributions originate by the most likely common source," relating the quantity to the "two-sample problem". In our case, we have p = p(y | x1, x2) and q = p(y | x1), where y is a possible future and x1, x2 are elements of a history context, with q representing a back-off distribution using less context. Therefore, whereas the standard JS formulation is agnostic with respect to its two distributions, and averages them in part to ensure that the quantity is defined over the entire space, we have the prior knowledge that one history context is a superset of the other, that <x1> is defined wherever <x1, x2> is. In this case, then, we have a simpler, "one-sided" definition for the Jensen-Shannon divergence, but generalized to the multiple distributions that include an extra history component: Σ_{x2} p(x2 | x1) · D(p(· | x1, x2) || p(· | x1)). (3) An interpretation in our case is that this is the expected number of bits x2 gives you when trying to predict y.5 If we allow x2 to represent an arbitrary amount of context, then the Jensen-Shannon divergence JS(p_b || p_a) can be computed for any two back-off levels a, b such that b < a (meaning p_b is a distribution using more context than p_a). The actual value in bits of the Jensen-Shannon divergence between two distributions should be considered in relation to the number of bits of entropy of the more detailed distribution; that is, JS(p_b || p_a) should be considered relative to H(p_b). Having explored entropy in §4, we will now look at some summary statistics for JS divergence.</Paragraph>
</Section> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5.2 Results </SectionTitle>
<Paragraph position="0"> We computed the quantity in Equation 3 for every parameter in PMw that used maximal context (contained a head word) and its associated parameter that did not contain the head word. The results are listed in Table 5 (minimum, maximum, average, and median values). Note that, for this parameter class with a median entropy of 3.8 bits, we have a median JS divergence of only 0.097 bits. The distributions are so similar that the 28.8% of the time that the decoder uses an estimate based on a bigram, it might as well be using one that does not include the head word.</Paragraph>
[Footnote 5: Or, following from Slonim et al.'s interpretation, this quantity is the (negative of the) log-likelihood that all distributions that include an x2 component come from a "common source" that does not include this component.]
[Table 6 caption (fragment): ...Collins' Model 3, our emulation of Collins' Model 2 and the reduced version at a threshold of 0.06. LR = labeled recall, LP = labeled precision.6]
</Section>
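The divergences of Sections 5.1 and 5.2 can be sketched as follows. The helper names and the toy numbers are hypothetical; kl assumes the back-off distribution is nonzero wherever the detailed one is, which holds here because the back-off context subsumes the detailed context's support.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits for dict-valued
    distributions; assumes q[y] > 0 wherever p[y] > 0."""
    return sum(pv * math.log2(pv / q[y]) for y, pv in p.items() if pv > 0.0)

def js(p, q):
    """Standard (symmetric) Jensen-Shannon divergence."""
    support = set(p) | set(q)
    avg = {y: 0.5 * (p.get(y, 0.0) + q.get(y, 0.0)) for y in support}
    return 0.5 * kl(p, avg) + 0.5 * kl(q, avg)

def one_sided_js(detailed, weights, backoff):
    """'One-sided' generalized divergence as described in Section 5.1: the
    expected KL divergence, over values x2 of the extra history component
    weighted by p(x2 | x1), between p(y | x1, x2) and the back-off p(y | x1)."""
    return sum(weights[x2] * kl(p_full, backoff)
               for x2, p_full in detailed.items())

# Toy example: two head words x2 sharing the same back-off distribution.
backoff  = {"week": 0.5, "month": 0.3, "year": 0.2}
detailed = {"with": {"week": 0.55, "month": 0.27, "year": 0.18},
            "in":   {"week": 0.40, "month": 0.35, "year": 0.25}}
weights  = {"with": 0.6, "in": 0.4}
print(js(detailed["with"], backoff), one_sided_js(detailed, weights, backoff))
```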
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Distributional Similarity and Parameter Selection </SectionTitle>
<Paragraph position="0"> The analysis of the previous two sections provides a window onto what types of parameters the parsing model is learning most and least well, and onto which parameters carry more or less useful information. Having such a window holds the promise of discovering new parameter types or features that would lead to greater parsing accuracy; such is the scientific, or at least forward-minded, research perspective.</Paragraph>
<Paragraph position="1"> From a much more purely engineering perspective, one can also use the analysis of the previous two sections to identify individual parameters that carry little to no useful information and simply remove them from the model. Specifically, if p_b is a particular distribution and p_{b+1} is its corresponding back-off distribution, then one can remove all parameters p_b such that JS(p_b || p_{b+1}) / H(p_b) < t, where 0 < t < 1 is some threshold. Table 6 shows the results of this experiment using a threshold of 0.06. To our knowledge, this is the first example of detailed parameter selection in the context of a generative lexicalized statistical parsing model. The consequence is a significantly smaller model that performs with no loss of accuracy compared to the full model.6</Paragraph>
[Footnote 6: None of the differences between the Model 2-emulation results and the reduced model results is statistically significant.]
[Table 7 caption (fragment): ...each parameter class for the 0.06-reduced model.]
<Paragraph position="3"> Further insight is gained by looking at the percentage of parameters removed from each parameter class. The results of (Bikel, 2004) suggested that the power of Collins-style parsing models did not lie primarily with the use of bilexical dependencies, as was once thought, but in lexico-structural dependencies, that is, predicting syntactic structures conditioning on head words. The percentages of Table 7 provide even more concrete evidence of this assertion, for whereas nearly a third of the PMw parameters were removed, a much smaller fraction of parameters were removed from the PsubcatL, PsubcatR and PM classes, which generate structure conditioning on head words.</Paragraph>
</Section>
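As a purely illustrative sketch of the parameter-selection rule, the snippet below drops a distribution when its divergence from its back-off, relative to its own entropy, falls below the threshold t. Plain KL divergence stands in for the paper's one-sided JS measure, and the function and variable names are assumptions.

```python
import math

def entropy(p):
    return -sum(v * math.log2(v) for v in p.values() if v > 0.0)

def kl(p, q):
    return sum(v * math.log2(v / q[y]) for y, v in p.items() if v > 0.0)

def prune_parameters(params, t=0.06):
    """params: iterable of (name, p_b, p_back) where p_back is the back-off
    distribution, defined wherever p_b is.  Returns (kept, removed)."""
    kept, removed = [], []
    for name, p_b, p_back in params:
        h = entropy(p_b)
        div = kl(p_b, p_back)          # stand-in for the divergence from the back-off
        if h > 0.0 and div / h < t:    # nearly no information beyond the back-off
            removed.append(name)
        else:
            kept.append(name)
    return kept, removed

# Toy example: a distribution that barely differs from its back-off is removed.
p_b    = {"a": 0.50, "b": 0.30, "c": 0.20}
p_back = {"a": 0.48, "b": 0.31, "c": 0.21}
print(prune_parameters([("toy-param", p_b, p_back)], t=0.06))
```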
<Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Discussion </SectionTitle>
<Paragraph position="0"> Examining the lower-entropy PMw distributions revealed that, in many cases, the model was not so much learning how to disambiguate a given syntactic/lexical choice as simply not having much to learn. For example, once a partially-lexicalized nonterminal has been generated whose tag is fairly specialized, such as IN, the model has "painted itself into a lexical corner", as it were (the extreme example is TO, a tag that can only be assigned to the word to). This is an example of the "label bias" problem, which has been the subject of recent discussion (Lafferty et al., 2001; Klein and Manning, 2002). Of course, just because there is "label bias" does not necessarily mean there is a problem. If the decoder pursues a theory to a nonterminal/part-of-speech-tag preterminal that has an extremely low-entropy distribution for possible head words, then there is certainly a chance that it will get "stuck" in a potentially bad theory. This is of particular concern when a head word (which the top-down model generates at its highest point in the tree) influences an attachment decision. However, inspecting the low-entropy word-generation histories of PMw revealed that almost all such cases occur when the model is generating a preterminal, and are thus of little to no consequence vis-à-vis syntactic disambiguation.</Paragraph>
</Section> </Paper>