<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1636"> <Title>Learning Phrasal Categories</Title> <Section position="5" start_page="302" end_page="305" type="metho"> <SectionTitle> 3 Clustering </SectionTitle> <Paragraph position="0"> The input to the clusterer is a set of annotated grammar productions and counts. Our clustering algorithm is a divisive one reminiscent of Martin et al. (1995). We start with a single cluster for each Treebank nonterminal and one additional cluster for intermediate nodes, which are described in Section 3.2.</Paragraph> <Paragraph position="1"> The clustering method has two interleaved parts: one in which candidate splits are generated, and one in which we choose a candidate split to enact.</Paragraph> <Paragraph position="2"> For each of the initial clusters, we generate a candidate split and place that split in a priority queue. The priority queue is ordered by the Bayesian Information Criterion (BIC); see, e.g., (Hastie et al., 2003).</Paragraph> <Paragraph position="3"> The BIC of a model M is defined as BIC(M) = -2 log L(M) + d_M log n, where L(M) is the likelihood of the data according to M, n is the number of observations, and d_M is the number of degrees of freedom in the model, which for a PCFG is the number of productions minus the number of nonterminals. Thus in this context BIC can be thought of as optimizing the likelihood, but with a penalty against grammars with many rules.</Paragraph> <Paragraph position="4"> While the queue is nonempty, we remove a candidate split to reevaluate. Reevaluation is necessary because, if there is a delay between when a split is proposed and when a split is enacted, the grammar used to score the split will have changed. However, we suppose that the old score is close enough to be a reasonable ordering measure for the priority queue. If the reevaluated candidate is no longer better than the second candidate on the queue, we reinsert it and continue. However, if it is still the best on the queue and it improves the model, we enact the split; otherwise it is discarded. When a split is enacted, the old cluster is removed from the set of nonterminals and is replaced with the two new nonterminals of the split. A candidate split for each of the two new clusters is generated and placed on the priority queue.</Paragraph> <Paragraph position="5"> This process of reevaluation, enacting splits, and generating new candidates continues until the priority queue is empty of potential splits.</Paragraph> <Paragraph position="6"> We select a candidate split of a particular cluster as follows. For each context feature we generate a potential nominee split. To do this we first randomly partition the values for the feature into two buckets. We then repeatedly try to move values from one bucket to the other. If doing so results in an improvement to the likelihood of the training data, we keep the change; otherwise we reject it.</Paragraph> <Paragraph position="7"> The swapping continues until moving no individual value results in an improvement in likelihood.</Paragraph>
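To make the search procedure concrete, the following is a minimal Python sketch of the bucket-swapping split search and the priority-queue loop described above (the worked example continues below). It reflects our reading of the algorithm, not the authors' implementation; the `grammar` object and its methods (`clusters`, `features`, `feature_values`, `split_log_likelihood`, `bic_gain`, `enact_split`) are hypothetical.

```python
import heapq
import itertools
import random

counter = itertools.count()  # tie-breaker so heap entries never compare cluster objects

def candidate_split(grammar, cluster, feature):
    """Local search: randomly partition the feature's values into two buckets,
    then keep swapping single values while the training-data likelihood improves."""
    values = list(grammar.feature_values(cluster, feature))
    left = set(random.sample(values, max(1, len(values) // 2)))
    right = set(values) - left
    best = grammar.split_log_likelihood(cluster, feature, left, right)
    improved = True
    while improved:
        improved = False
        for v in values:
            src, dst = (left, right) if v in left else (right, left)
            if len(src) == 1:                      # never empty a bucket
                continue
            src.remove(v); dst.add(v)
            ll = grammar.split_log_likelihood(cluster, feature, left, right)
            if ll > best:
                best, improved = ll, True          # keep the swap
            else:
                dst.remove(v); src.add(v)          # undo it
    return left, right

def best_candidate(grammar, cluster):
    """One nominee split per feature; keep the one with the largest BIC improvement."""
    nominees = []
    for f in grammar.features(cluster):
        left, right = candidate_split(grammar, cluster, f)
        nominees.append((grammar.bic_gain(cluster, f, left, right), f, left, right))
    return max(nominees, key=lambda n: n[0])

def divisive_clustering(grammar):
    queue = []                                     # max-heap via negated BIC gain
    for c in grammar.clusters():
        gain, f, l, r = best_candidate(grammar, c)
        heapq.heappush(queue, (-gain, next(counter), c, f, l, r))
    while queue:
        _, _, c, f, l, r = heapq.heappop(queue)
        gain = grammar.bic_gain(c, f, l, r)        # rescore: the grammar may have changed
        if queue and gain < -queue[0][0]:          # no longer the best: reinsert and continue
            heapq.heappush(queue, (-gain, next(counter), c, f, l, r))
            continue
        if gain <= 0:                              # does not improve the model: discard
            continue
        for new_c in grammar.enact_split(c, f, l, r):
            g, nf, nl, nr = best_candidate(grammar, new_c)
            heapq.heappush(queue, (-g, next(counter), new_c, nf, nl, nr))
    return grammar
```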
<Paragraph position="8"> Suppose we have a grammar derived from a corpus of a single tree, whose nodes have been annotated with their parent as in Figure 1. The base productions for this corpus are:</Paragraph> <Paragraph position="9"> Suppose we are in the initial state, with a single cluster for each treebank nonterminal. Consider a potential split of the NP cluster on the parent feature, which in this example has three values: S, VP, and NP. If the S and VP values are grouped together in the left bucket, and the NP value is alone in the right bucket, we get cluster nonterminals NP_L = {NP[S], NP[VP]} and NP_R = {NP[NP]}. The resulting grammar rules and their probabilities are:</Paragraph> <Paragraph position="11"> If, however, VP is swapped to the right bucket with NP, the rules become:</Paragraph> <Paragraph position="13"> The likelihood of the tree in Figure 1 is 1/4 under the first grammar, but only 4/27 under the second.</Paragraph> <Paragraph position="14"> Hence in this case we would reject the swap of VP from the left bucket to the right.</Paragraph> <Paragraph position="15"> The process of swapping continues until no improvement can be made by swapping a single value.</Paragraph> <Paragraph position="16"> The likelihood of the training data according to the clustered grammar is</Paragraph> <Paragraph position="17"> prod_{r in R} p(r)^{c(r)} </Paragraph> <Paragraph position="18"> for R the set of observed productions r = phi_i -> phi_j ... in the clustered grammar, where c(r) is the number of times r is observed. Notice that when we are looking to split a cluster phi, only productions that contain the nonterminal phi will have probabilities that change. To evaluate whether a change increases the likelihood, we consider the ratio between the likelihood of the new model and the likelihood of the old model.</Paragraph> <Paragraph position="19"> Furthermore, when we move a value from one bucket to another, only a fraction of the rules will have their counts change. Suppose we are moving value x from the left bucket to the right when splitting phi_i. Let phi_x, a subset of phi_i, be the set of base nonterminals in phi_i that have value x for the feature being split upon. Only clustered rules that contain base grammar rules which use nonterminals in phi_x will have their probability change. These observations allow us to process only a relatively small number of base grammar rules.</Paragraph> <Paragraph position="20"> Once we have generated a potential nominee split for each feature, we select the partitioning which leads to the greatest improvement in the BIC as the candidate split of this cluster. This candidate is placed on the priority queue.</Paragraph> <Paragraph position="21"> One odd thing about the above is that in the local search phase of the clustering we use likelihood, while in the candidate selection phase we use BIC. We tried both measures in each phase, but found that this hybrid measure outperformed using only one or the other.</Paragraph>
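To illustrate the quantities just defined, here is a small sketch that computes the training-data log-likelihood and the BIC of a clustered grammar from its production counts, using maximum-likelihood rule probabilities and d_M = number of productions minus number of nonterminals. The `rule_counts` mapping is hypothetical, and taking the number of observations to be the total number of rule tokens is an assumption.

```python
import math
from collections import defaultdict

def log_likelihood(rule_counts):
    """Training-data log-likelihood under MLE rule probabilities:
    sum over observed rules r of c(r) * log( c(r) / c(lhs(r)) )."""
    lhs_totals = defaultdict(int)
    for (lhs, rhs), c in rule_counts.items():
        lhs_totals[lhs] += c
    return sum(c * math.log(c / lhs_totals[lhs])
               for (lhs, rhs), c in rule_counts.items())

def bic(rule_counts):
    """BIC = -2 * log-likelihood + d_M * log(number of observations),
    with d_M = #productions - #nonterminals for a PCFG.
    The number of observations is taken here to be the total rule-token count."""
    n_observations = sum(rule_counts.values())
    n_productions = len(rule_counts)
    n_nonterminals = len({lhs for (lhs, rhs) in rule_counts})
    d_m = n_productions - n_nonterminals
    return -2.0 * log_likelihood(rule_counts) + d_m * math.log(n_observations)

# Example: rules keyed by (lhs, rhs) with their observed counts.
counts = {("NP", ("DT", "NN")): 3, ("NP", ("NP", "PP")): 1, ("PP", ("IN", "NP")): 1}
print(bic(counts))
```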
<Section position="1" start_page="303" end_page="304" type="sub_section"> <SectionTitle> 3.1 Model Selection </SectionTitle> <Paragraph position="0"> Unfortunately, the grammar that results at the end of the clustering process seems to overfit the training data. We address this by periodically noting the intermediate state of the grammar and using that grammar to parse a small tuning set (we use the first 400 sentences of WSJ section 24, and parse this set every 50 enacted splits). At the conclusion of clustering, we select the grammar that performed best on this tuning set.</Paragraph> </Section> <Section position="2" start_page="304" end_page="304" type="sub_section"> <SectionTitle> 3.2 Binarization </SectionTitle> <Paragraph position="0"> Since our experiments make use of a CKY (Kasami, 1965) parser (the implementation we use was created by Mark Johnson and used for the research in (Johnson, 1998); it is available from his homepage), we must modify the treebank-derived rules so that each expands to at most two labels. We perform this in a manner similar to Klein and Manning (2003) and Matsuzaki et al. (2005) through the creation of intermediate nodes, as in Figure 2. In this example, the nonterminal heir of A's head is D, indicated in the figure by marking D with angled brackets. The square brackets indicate an intermediate node, and the labels inside the brackets indicate that the node will eventually be expanded into those labels.</Paragraph> <Paragraph position="1"> Klein and Manning (2003) employ Collins' (1999) horizontal markovization to desparsify their intermediate nodes. This means that given an intermediate node such as [C &lt;D&gt; E F] in Figure 2, we forget those labels which will not be expanded past a certain horizon. Klein and Manning (2003) use a horizon of two (or less, in some cases), which means only the next two labels to be expanded are retained. For instance, in this example [C &lt;D&gt; E F] is markovized to [C &lt;D&gt; ...F], since C and F are the next two non-intermediate labels.</Paragraph> <Paragraph position="2"> Our mechanism lays out the unmarkovized intermediate rules in the same way, but we mostly use our clustering scheme to reduce sparsity. We do so by aligning the labels contained in the intermediate nodes in the order in which they would be added when increasing the markovization horizon from zero to three. We also always keep the heir label as a feature, following Klein and Manning (2003). So, for instance, [C &lt;D&gt; E F] is represented as having Treebank label &quot;INTERMEDIATE&quot;, and would have feature vector (D,C,F,E,D), while [&lt;D&gt; E F] would have feature vector (D,F,E,D,-), where the first item is the heir of the parent's head. The &quot;-&quot; indicates that the fourth item to be expanded is here non-existent. The clusterer would consider each of these five features as a candidate for a single possible split. We also incorporate our other features into the intermediate nodes, in two ways.</Paragraph> <Paragraph position="3"> Some features, such as the parent or grandparent, will be the same for all the labels in the intermediate node, and hence only need to be included once. Others, such as the part of speech of the head, may be different for each label. These features we align with those of the corresponding label in the Markov ordering. In our running example, suppose each child node N has head part of speech P_N, and we have a parent feature. Our aligned intermediate feature vectors then become (A,D,C,P_C,F,P_F,E,P_E,D,P_D) and (A,D,F,P_F,E,P_E,D,P_D,-,-). As these are somewhat complicated, let us explain them by unpacking the first, the vector for [C &lt;D&gt; E F]. Consulting Figure 2 we see that its parent is A. We have chosen to put parents first in the vector, thus explaining (A,...). Next comes the heir of the constituent, D. This is followed by the first constituent that is to be unpacked from the binarized version, C, which in turn is followed by its head part-of-speech P_C, giving us (A,D,C,P_C,...). We follow with the next nonterminal to be unpacked from the binarized node and its head part-of-speech, and so on.</Paragraph>
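A small sketch of how such aligned feature vectors might be assembled; the function and argument names are illustrative rather than taken from the paper's code.

```python
def intermediate_features(parent, heir, remaining, head_pos, max_items=4):
    """Build the aligned feature vector for an intermediate node.

    parent    -- label of the parent constituent (e.g. "A")
    heir      -- heir of the parent's head (e.g. "D")
    remaining -- labels still to be unpacked, in Markov order (e.g. ["C", "F", "E", "D"])
    head_pos  -- map from label to the part of speech of its head

    Slots beyond the labels actually present are filled with "-".
    """
    vector = [parent, heir]
    for i in range(max_items):
        if i < len(remaining):
            label = remaining[i]
            vector += [label, head_pos[label]]
        else:
            vector += ["-", "-"]           # "-" marks a non-existent slot
    return tuple(vector)

# [C <D> E F]: four labels remain to be unpacked.
print(intermediate_features("A", "D", ["C", "F", "E", "D"],
                            {"C": "P_C", "F": "P_F", "E": "P_E", "D": "P_D"}))
# -> ('A', 'D', 'C', 'P_C', 'F', 'P_F', 'E', 'P_E', 'D', 'P_D')

# [<D> E F]: only three remain, so the last pair is ('-', '-').
print(intermediate_features("A", "D", ["F", "E", "D"],
                            {"F": "P_F", "E": "P_E", "D": "P_D"}))
# -> ('A', 'D', 'F', 'P_F', 'E', 'P_E', 'D', 'P_D', '-', '-')
```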
<Paragraph position="4"> It might be fairly objected that this formulation of binarization loses the information of whether a label is to the left of, to the right of, or is the heir of the parent's head. This is solved by the inside position feature, described in Section 2.1, which contains exactly this information.</Paragraph> </Section> <Section position="3" start_page="304" end_page="305" type="sub_section"> <SectionTitle> 3.3 Smoothing </SectionTitle> <Paragraph position="0"> In order to ease comparison between our work and that of Klein and Manning (2003), we follow their lead in smoothing no production probabilities save those going from preterminal to word.</Paragraph> <Paragraph position="1"> Our smoothing mechanism runs roughly along the lines of theirs.</Paragraph> <Paragraph position="2"> [Table 2 caption: results of four runs of the clusterer with different random seeds; all numbers are on the development test set (Section 22).]</Paragraph> <Paragraph position="3"> Preterminal rules are smoothed as follows. We consider several classes of unknown words, based on capitalization, the presence of digits or hyphens, and the suffix. We estimate the probability of a tag T given a word (or unknown class) W as p(T | W) = (C(T,W) + h p(T | unk)) / (C(W) + h), where h is a smoothing constant and p(T | unk) is the probability of the tag given any unknown word class.</Paragraph> <Paragraph position="6"> In order to estimate counts of the unknown classes, we let the clusterer see every tree twice: once unmodified, and once with the unknown class replacing each word seen fewer than five times. The production probability p(W | T) is then p(T | W) p(W) / p(T), where p(W) and p(T) are the respective empirical distributions.</Paragraph> <Paragraph position="7"> The clusterer does not use smoothed probabilities in allocating annotated preterminals to clusters, but simply the maximum likelihood estimates, as it does elsewhere. Smoothing is only used in the parser.</Paragraph>
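A minimal sketch of this smoothing scheme, under stated assumptions: the particular unknown-word classes and the value of h are illustrative, since they are not specified here.

```python
from collections import Counter

def unknown_class(word):
    """A crude unknown-word class based on capitalization, digits, hyphens, and suffix.
    The exact classes used in the paper are not given here; this is illustrative."""
    feats = []
    if word[:1].isupper():
        feats.append("CAP")
    if any(ch.isdigit() for ch in word):
        feats.append("DIG")
    if "-" in word:
        feats.append("HYPH")
    feats.append("SUF=" + word[-2:].lower())
    return "UNK|" + "+".join(feats)

class PreterminalSmoother:
    def __init__(self, tag_word_counts, h=1.0):
        # tag_word_counts: Counter over (tag, word) pairs from the training trees,
        # where each tree is also seen a second time with rare words replaced by
        # their unknown class (so the class tokens have counts of their own).
        self.h = h
        self.tw = tag_word_counts
        self.word_counts = Counter()
        self.tag_counts = Counter()
        for (tag, word), c in tag_word_counts.items():
            self.word_counts[word] += c
            self.tag_counts[tag] += c
        self.total = sum(self.tag_counts.values())

    def p_tag_given_word(self, tag, word):
        # p(T | W) = (C(T, W) + h * p(T | unk)) / (C(W) + h)
        unk = unknown_class(word)
        p_t_unk = (self.tw[(tag, unk)] / self.word_counts[unk]
                   if self.word_counts[unk] else 0.0)
        return (self.tw[(tag, word)] + self.h * p_t_unk) / (self.word_counts[word] + self.h)

    def p_word_given_tag(self, tag, word):
        # p(W | T) = p(T | W) * p(W) / p(T), with empirical p(W) and p(T)
        p_w = self.word_counts[word] / self.total
        p_t = self.tag_counts[tag] / self.total
        return self.p_tag_given_word(tag, word) * p_w / p_t if p_t else 0.0
```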
</Section> </Section> <Section position="6" start_page="305" end_page="305" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We trained our model on sections 2-21 of the Penn Wall Street Journal Treebank. We used the first 400 sentences of section 24 for model selection.</Paragraph> <Paragraph position="1"> Section 22 was used for testing during development, while section 23 was used for the final evaluation.</Paragraph> </Section> <Section position="7" start_page="305" end_page="306" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> Our results are shown in Table 1. The first three columns show the labeled precision, recall, and f-measure, respectively. The remaining two show the number of crossing brackets per sentence and the percentage of sentences with no crossing brackets.</Paragraph> <Paragraph position="1"> Unfortunately, our model does not perform quite as well as those of Klein and Manning (2003) or Matsuzaki et al. (2005). It is worth noting that Matsuzaki's grammar uses a different parse evaluation scheme than Klein and Manning or we do.</Paragraph> <Paragraph position="2"> We select the parse with the highest probability according to the annotated grammar. Matsuzaki, on the other hand, argues that the proper thing to do is to find the most likely unannotated parse. The probability of this parse is the sum over the probabilities of all annotated parses that reduce to that unannotated parse. Since calculating the parse that maximizes this quantity is NP-hard, they try several approximations. One is what Klein and Manning and we do. However, they have a better-performing approximation which is used in their reported score.</Paragraph> <Paragraph position="3"> They do not report their score on section 23 using the most-probable-annotated-parse method. They do, however, compare the performance of different methods on development data, and find that their better approximation gives an absolute improvement in f-measure in the 0.5-1 percent range. Hence it is probable that even with their better method our grammar would not outperform theirs.</Paragraph> <Paragraph position="4"> Table 2 shows the results on the development test set (Section 22) for four different initial random seeds. Recall that when splitting a cluster, the initial partition of the base grammar nonterminals is made randomly. The model from the second run was used for parsing the final test set (Section 23) in Table 1.</Paragraph> <Paragraph position="5"> One interesting thing our method allows is for us to examine which features turn out to be useful in which contexts. We noted, for each treebank nonterminal and for each feature, how many times that nonterminal was split on that feature in the grammar selected in the model selection stage. We ran the clustering with these four different random seeds.</Paragraph> <Paragraph position="6"> We find that the clusterer only found the head feature to be useful in very specific circumstances. It was used quite a bit to split preterminals; but for phrasals it was only used to split ADJP, ADVP, NP, PP, VP, QP, and SBAR. The part of speech of the head was only used to split NP and VP.</Paragraph> <Paragraph position="7"> Furthermore, the grandparent tag appears to be of importance primarily for VP and PP nonterminals, though it is used once out of the four runs for NPs.</Paragraph> <Paragraph position="8"> This indicates that perhaps lexicalized parsers might be able to make do by only using lexical head and grandparent information in very specific instances, thereby shrinking the sizes of their models and speeding parsing. This warrants further investigation.</Paragraph> </Section> <Section position="8" start_page="306" end_page="306" type="metho"> <SectionTitle> 6 Conclusion </SectionTitle> <Paragraph position="0"> We have presented a scheme for automatically discovering phrasal categories for parsing with a standard CKY parser. The parser achieves 84.8% precision-recall f-measure on the standard test section of the Penn WSJ Treebank (section 23).</Paragraph> <Paragraph position="1"> While this is not as accurate as the hand-tailored grammar of Klein and Manning (2003), it is close, and we believe there is room for improvement.</Paragraph> <Paragraph position="2"> For starters, the particular clustering scheme is only one of many. Our algorithm splits clusters along particular features (e.g., parent, head part-of-speech, etc.). One alternative would be to cluster simultaneously on all the features. It is not obvious which scheme should be better, and they could be quite different. Decisions like this abound, and are worth exploring.</Paragraph> <Paragraph position="3"> More radically, it is also possible to grow many decision trees, and thus many alternative grammars. We have been impressed by the success of random-forest methods in language modeling (Xu and Jelinek, 2004). In these methods many trees (the forest) are grown, each trying to predict the next word. The multiple trees together are much more powerful than any one individually.
The same might be true for grammars.</Paragraph> </Section> <Section position="9" start_page="306" end_page="306" type="metho"> <SectionTitle> Acknowledgement </SectionTitle> <Paragraph position="0"> The research presented here was funded in part by</Paragraph> </Section> </Paper>