<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1111"> <Title>Prototype-Driven Grammar Induction</Title> <Section position="5" start_page="881" end_page="882" type="metho"> <SectionTitle> 3 Experiments in PCFG induction </SectionTitle> <Paragraph position="0"> As an initial experiment, we used the inside-outside algorithm to induce a PCFG in the straightforward way (Lari and Young, 1990; Manning and Schütze, 1999). For all the experiments in this paper, we considered binary PCFGs over the nonterminals and terminals occurring in WSJ-10. The PCFG rules were of the following forms:</Paragraph> <Paragraph position="1"> X -> Y Z, X -> Y t, X -> t Y, X -> t t', where X, Y, and Z are non-terminal symbols and t, t' are terminal (POS) symbols.</Paragraph> <Paragraph position="2"> For a given sentence S, our CFG generates labeled trees T over S.2 Each tree consists of binary productions X(i,j) -> a over constituent spans (i,j), where a is a pair of non-terminal and/or terminal symbols in the grammar. [1: In cases where multiple gold labels exist in the gold trees, precision and recall were calculated as in Collins (1999). 2: Restricting our CFG to a binary branching grammar results in an upper bound of 88.1% on unlabeled F1.] The generative probability of a tree T for S is:</Paragraph> <Paragraph position="3"> P(T | S) = prod_{X(i,j) -> a in T} P(a | X)</Paragraph> <Paragraph position="4"> In the inside-outside algorithm, we iteratively compute posterior expectations over production occurrences at each training span, then use those expectations to re-estimate production probabilities. This process is guaranteed to converge to a local extremum of the data likelihood, but initial production probability estimates greatly influence the final grammar (Carroll and Charniak, 1992). In particular, uniform initial estimates are an (unstable) fixed point. The classic approach is to add a small amount of random noise to the initial probabilities in order to break the symmetry between grammar symbols.</Paragraph> <Paragraph position="5"> We randomly initialized 5 grammars using treebank non-terminals and trained each to convergence on the first 2000 sentences of WSJ-10.</Paragraph> <Paragraph position="6"> Viterbi parses were extracted for each of these 2000 sentences according to each grammar. Of course, the parses' symbols have nothing to anchor them to our intended treebank symbols. That is, an NP in one of these grammars may correspond to the target symbol VP, or may not correspond well to any target symbol. To evaluate these learned grammars, we must map the models' phrase types to target phrase types. For each grammar, we followed the common approach of greedily mapping model symbols to target symbols in the way which maximizes the labeled F1. Note that this can, and does, result in mapping multiple model symbols to the most frequent target symbols. This experiment, labeled PCFG x NONE in figure 4, resulted in an average labeled F1 of 26.3 and an unlabeled F1 of 45.7. The unlabeled F1 is better than randomly choosing a tree (34.7), but not better than always choosing a right-branching structure (61.7).</Paragraph> <Paragraph position="7"> Klein and Manning (2002) suggest that the task of labeling constituents is significantly easier than identifying them. Perhaps it is too much to ask a PCFG induction algorithm to perform both of these tasks simultaneously. Along the lines of Pereira and Schabes (1992), we reran the inside-outside algorithm, but this time placed zero mass on all trees which did not respect the bracketing of the gold trees. This constraint does not fully eliminate the structural uncertainty, since we are inducing binary trees and the gold trees are flatter than binary in many cases.</Paragraph> </Section>
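Concretely, the gold-bracket constraint of Pereira and Schabes (1992) only requires zeroing out chart cells whose spans cross a gold constituent during the inside-outside computation. Below is a minimal Python sketch of that test (our illustration, not the authors' code), assuming half-open spans and a set-of-pairs representation of the gold bracketing:

def respects_brackets(span, gold_brackets):
    """Return True iff `span` does not cross any gold bracket.
    Spans are half-open (i, j) pairs; `gold_brackets` is a set of such
    pairs. Crossing spans receive zero mass in the constrained E-step."""
    i, j = span
    for a, b in gold_brackets:
        # (i, j) crosses (a, b) when the spans overlap without nesting
        if i < a < j < b or a < i < b < j:
            return False
    return True

# toy usage: a gold tree containing brackets (0, 5) and (2, 5)
gold = {(0, 5), (2, 5)}
print(respects_brackets((2, 4), gold))  # True: nested inside (2, 5)
print(respects_brackets((1, 3), gold))  # False: crosses (2, 5)

Spans nested within (or equal to) gold brackets keep their mass, so EM reallocates probability only among labelings of the gold structure.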
<Section position="6" start_page="882" end_page="884" type="metho"> <SectionTitle> [Figure 1: prototype POS sequences for each phrase type (NP, VP, S, QP, PP, ADJP, ADVP, VP-INF, NP-POS); the table itself was garbled in extraction] </SectionTitle> <Paragraph position="0"> This approach of course achieved the upper bound on unlabeled F1, because of the gold bracket constraints. However, it only resulted in an average labeled F1 of 52.6 (experiment PCFG x GOLD in figure 4). While this labeled score is an improvement over the PCFG x NONE experiment, it is still relatively disappointing.</Paragraph> <Section position="1" start_page="882" end_page="882" type="sub_section"> <SectionTitle> 3.1 Encoding Prior Knowledge with Prototypes </SectionTitle> <Paragraph position="0"> Clearly, we need to do something more than adding structural bias (e.g. bracketing information) if we are to learn a PCFG in which the symbols have the meaning and behaviour we intend.</Paragraph> <Paragraph position="1"> How might we encode information about our prior knowledge or intentions? Providing labeled trees is clearly an option. This approach tells the learner how symbols should recursively relate to each other. Another option is to provide fully linearized yields as prototypes. We take this approach here, manually creating a list of POS sequences typical of the 7 most frequent categories in the Penn Treebank (see figure 1).3 Our grammar is limited to these 7 phrase types, plus an additional type which has no prototypes. Prototypes anchor each symbol in terms of an observable portion of the data, rather than attempting to relate unknown symbols to other unknown symbols. [3: Adding more categories did not improve performance and took more time. We note that we still evaluate against all phrase types, regardless of whether or not they are modeled by our grammar.]</Paragraph> <Paragraph position="2"> Broadly, we would like to learn a grammar which explains the observed data (EM's objective) but also meets our prior expectations or requirements of the target grammar. How might we use such a list to constrain the learning of a PCFG with the inside-outside algorithm? We might require that all occurrences of a prototype sequence, say DT NN, be constituents of the corresponding type (NP). However, human-elicited prototypes are not likely to have the property that, when they occur, they are (nearly) always constituents. For example, DT NN is a perfectly reasonable example of a noun phrase, but is not a constituent when it is part of a longer DT NN NN constituent. Therefore, when summing over trees with the inside-outside algorithm, we could require a weaker property: whenever a prototype sequence is a constituent, it must be given the label specified in the prototype file.5 This constraint is enough to break the symmetry between the model labels, and therefore requires neither random initialization for training, nor post-hoc mapping of labels for evaluation. Adding prototypes in this way and keeping the gold bracket constraint gave 59.9 labeled F1. This is again an improvement over naive PCFG induction, but is perhaps less than we might expect given that the model has been given bracketing information and has prototypes as a form of supervision to direct it. [5: Even this property is likely too strong: prototypes may have multiple possible labels; for example, DT NN may also be a QP in the English treebank.]</Paragraph> <Paragraph position="3"> In response to a prototype, however, we may wish to conclude something stronger than a constraint on that particular POS sequence. We might hope that sequences which are similar to a prototype in some sense are generally given the same label as that prototype. For example, since DT NN is a noun phrase prototype, the sequence DT JJ NN is another good candidate for being a noun phrase.</Paragraph> <Paragraph position="4"> This kind of propagation of constraints requires that we have a good way of defining and detecting similarity between POS sequences.</Paragraph> </Section>
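Before turning to similarity, note that the weaker prototype constraint above drops into the inside-outside computation as a simple multiplicative mask on chart cells: a span whose yield matches a prototype may be a constituent or a distituent, but if it is built as a constituent it may only receive the prototype's label. A minimal Python sketch, where the two-entry table is an illustrative stand-in for the full list in figure 1:

# Illustrative excerpt of a prototype table mapping POS yields to labels.
PROTOTYPES = {("DT", "NN"): "NP", ("MD", "VB"): "VP"}

def label_mask(yield_tags, label):
    """Return 0.0 if `label` is forbidden over this yield, else 1.0.
    Non-prototype yields are unconstrained; a prototype yield may only
    be built as a constituent under its designated label."""
    proto_label = PROTOTYPES.get(tuple(yield_tags))
    if proto_label is not None and label != proto_label:
        return 0.0
    return 1.0

assert label_mask(("DT", "NN"), "NP") == 1.0  # allowed
assert label_mask(("DT", "NN"), "VP") == 0.0  # forbidden as a constituent
assert label_mask(("DT", "NN", "NN"), "VP") == 1.0  # longer yield: free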
<Section position="2" start_page="882" end_page="884" type="sub_section"> <SectionTitle> 3.2 Phrasal Distributional Similarity </SectionTitle> <Paragraph position="0"> A central linguistic argument for constituent types is substitutability: phrases of the same type appear in similar contexts and are mutually substitutable (Harris, 1954; Radford, 1988). For instance, DT JJ NN and DT NN occur in similar contexts, and are indeed both common NPs. This idea has been repeatedly and successfully operationalized using various kinds of distributional clustering, where we define a similarity measure between two items on the basis of their immediate left and right contexts (Schütze, 1995; Clark, 2000; Klein and Manning, 2002).</Paragraph> <Paragraph position="1"> [Figure 2: sample yields, their most similar prototype types, and the phrase types guessed according to (3).] As in Clark (2001), we characterize the distribution of a sequence by the distribution of POS tags occurring to the left and right of that sequence in a corpus. Each occurrence of a POS sequence a falls in a context x a y, where x and y are the adjacent tags. The distribution over contexts x _ y for a given a is called its signature, and is denoted by s(a). Note that s(a) is composed of context counts from all occurrences, constituent and distituent, of a. Let sc(a) denote the context distribution for a where the context counts are taken only from constituent occurrences of a. For each phrase type X in our grammar, define sc(X) to be the context distribution obtained from the counts of all constituent occurrences of type X:</Paragraph> <Paragraph position="2"> sc(X) = sum_a p(a | X) sc(a)   (1)</Paragraph> <Paragraph position="3"> where p(a|X) is the distribution of yield types for phrase type X. We compare context distributions using the skewed KL divergence:</Paragraph> <Paragraph position="4"> D_SKL(p, q) = D_KL(p || g*q + (1 - g)*p)</Paragraph> <Paragraph position="5"> where g controls how much of the source distribution is mixed in with the target distribution.</Paragraph> <Paragraph position="6"> A reasonable baseline rule for classifying the phrase type of a POS yield is to assign it to the phrase type from which it has minimal divergence:</Paragraph> <Paragraph position="7"> type(a) = argmin_X D_SKL(sc(a), sc(X))   (2)</Paragraph> <Paragraph position="8"> However, this rule is not always accurate, and, moreover, we do not have access to sc(a) or sc(X). We chose to approximate sc(X) using the prototype yields for X as samples from p(a|X). Letting proto(X) denote the (few) prototype yields for phrase type X, we define ~s(X):</Paragraph> <Paragraph position="9"> ~s(X) = (1 / |proto(X)|) sum_{a in proto(X)} s(a)</Paragraph> <Paragraph position="10"> Note that ~s(X) is an approximation to (1) in several ways: we have replaced an expectation over p(a|X) with a uniform weighting of proto(X), and we have replaced sc(a) with s(a) for each term in that expectation. Because of this, we will rely only on high-confidence guesses, and allow yields to be given a NONE type if their divergence from each ~s(X) exceeds a fixed threshold t. This gives the following alternative to (2):</Paragraph> <Paragraph position="11"> type(a) = NONE, if min_X D_SKL(s(a), ~s(X)) > t; argmin_X D_SKL(s(a), ~s(X)), otherwise   (3)</Paragraph>
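The definitions above translate directly into code. The Python sketch below computes signatures over a fixed enumeration of (left tag, right tag) contexts, the skewed KL divergence, the uniform prototype mixture ~s(X), and classification rule (3); the data handling is our assumption, not a detail from the paper:

import numpy as np

def skewed_kl(p, q, gamma=0.1):
    """D_SKL(p, q) = D_KL(p || gamma*q + (1-gamma)*p). Mixing some of p
    into the target keeps the divergence finite wherever p has support."""
    mix = gamma * q + (1.0 - gamma) * p
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / mix[mask])))

def signature(context_counts, context_vocab):
    """s(a): normalize raw (left tag, right tag) context counts into a
    distribution aligned with a fixed enumeration of contexts."""
    v = np.array([context_counts.get(c, 0.0) for c in context_vocab])
    return v / v.sum()

def proto_signature(prototype_signatures):
    """~s(X): a uniform mixture of the signatures of X's prototype yields."""
    return np.mean(np.stack(prototype_signatures), axis=0)

def classify(sig_a, proto_sigs, t=0.75, gamma=0.1):
    """Rule (3): NONE when every divergence exceeds the threshold t,
    otherwise the phrase type nearest in skewed KL divergence."""
    divs = {X: skewed_kl(sig_a, s_X, gamma) for X, s_X in proto_sigs.items()}
    best = min(divs, key=divs.get)
    return "NONE" if divs[best] > t else best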
<Paragraph position="12"> We built a distributional model implementing the rule in (3) by constructing s(a) from context counts in the WSJ portion of the Penn Treebank as well as the BLLIP corpus. Each ~s(X) was approximated by a uniform mixture of s(a) for each of X's prototypes a listed in figure 1.</Paragraph> <Paragraph position="13"> This method of classifying constituents is very precise if the threshold is chosen conservatively enough. For instance, using a threshold of t = 0.75 and g = 0.1, this rule correctly classifies the majority label of a constituent type with 83% precision, and has a recall of 23% over constituent types. Figure 2 illustrates some sample yields, the prototype sequence to which each is least divergent, and the output of rule (3).</Paragraph> <Paragraph position="14"> We incorporated this distributional information into our PCFG induction scheme by adding a prototype feature over each span (i,j) indicating the output of (3) for the yield a in that span. Associated with each sentence S is a feature map F specifying, for each (i,j), a prototype feature pij. These features are generated using an augmented CFG model, CFG+, given by:</Paragraph> <Paragraph position="15"> P(T, F | S) = prod_{X(i,j) -> a in T} P(pij | X) P(a | X) = prod_{X(i,j) -> a in T} phi_CFG+(X -> a, pij)</Paragraph> <Paragraph position="16"> [6: Technically, all features in F must be generated for each assignment to T, which means that there should be terms in this equation for the prototype features on distituent spans. However, we fixed the prototype distribution to be uniform for distituent spans, so that the equation is correct up to a constant depending on F.]</Paragraph> <Paragraph position="17"> [Figure 3: example with prototype similarity features] where phi_CFG+(X -> a, pij) is the local factor for placing X -> a on a span with prototype feature pij. An example is given in figure 3.</Paragraph> <Paragraph position="18"> For our experiments, we fixed P(pij|X) to be:</Paragraph> <Paragraph position="20"> Modifying the model in this way, and keeping the gold bracketing information, gave 71.1 labeled F1 (see experiment PROTO x GOLD in figure 4), a 40.3% error reduction over naive PCFG induction in the presence of gold bracketing information.</Paragraph> <Paragraph position="21"> We note that our labeled F1 is upper-bounded by 86.0 due to unary chains and more-than-binary configurations in the treebank that cannot be obtained from our binary grammar.</Paragraph> <Paragraph position="22"> We conclude that in the presence of gold bracket information, we can achieve high labeled accuracy by using a CFG augmented with distributional prototype features.</Paragraph> </Section> </Section>
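Schematically, the CFG+ local factor is a production probability multiplied by a prototype-feature likelihood. In the Python sketch below, the numbers in the P(pij|X) table are invented, since the fixed distribution used in the paper did not survive extraction; only the structure of the factor is meant to be faithful:

def p_proto_given_label(p_ij, X, match=0.6, none=0.3, clash=0.1):
    """Hypothetical P(p_ij | X): reward spans whose prototype feature
    agrees with the label being placed over them. Values illustrative."""
    if p_ij == "NONE":
        return none
    return match if p_ij == X else clash

def phi_cfg_plus(rule_prob, X, p_ij):
    """Local factor phi_CFG+(X -> a, p_ij) = P(a | X) * P(p_ij | X)."""
    return rule_prob * p_proto_given_label(p_ij, X)

# usage: an NP rule over a span rule (3) labeled NP vs. one it labeled VP
print(phi_cfg_plus(0.2, "NP", "NP"))  # boosted: ~0.12
print(phi_cfg_plus(0.2, "NP", "VP"))  # penalized: ~0.02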
<Section position="7" start_page="884" end_page="884" type="metho"> <SectionTitle> 4 Constituent Context Model </SectionTitle> <Paragraph position="0"> So far, we have shown that, given perfect bracketing information, distributional prototype features allow us to learn tree structures with fairly accurate labels. However, such bracketing information is not available in the unsupervised case.</Paragraph> <Paragraph position="1"> Perhaps we don't actually need bracketing constraints in the presence of prototypes and distributional similarity features. However, running the prototype model without any bracketing constraint (experiment PROTO x NONE in figure 4) gave only 53.1 labeled F1 (61.1 unlabeled), suggesting that some amount of bracketing constraint is necessary to achieve high performance.</Paragraph> <Paragraph position="2"> Fortunately, there are unsupervised systems which can induce unlabeled bracketings with reasonably high accuracy. One such model is the constituent-context model (CCM) of Klein and Manning (2002), a generative distributional model. For a given sentence S, the CCM generates a bracket matrix B which, for each span (i,j), indicates whether it is a constituent (Bij = c) or a distituent (Bij = d). In addition, it generates a feature map F', which for each span (i,j) in S specifies a pair of features, F'ij = (yij, cij), where yij is the POS yield of the span and cij is the context of the span, i.e., the identity of the conjoined left and right POS tags:</Paragraph> <Paragraph position="3"> yij = Si ... Sj-1,   cij = (Si-1, Sj), taking spans as half-open and fixing boundary tags outside the sentence.</Paragraph> <Paragraph position="4"> The distribution P(B) only places mass on bracketings which correspond to binary trees. We can efficiently compute P_CCM(B, F') (up to a constant depending on F') using local factors phi_CCM(yij, cij) which decompose over constituent spans:</Paragraph> <Paragraph position="5"> P_CCM(B, F') ∝ prod_{(i,j): Bij = c} phi_CCM(yij, cij)</Paragraph> <Paragraph position="6"> The CCM by itself yields an unlabeled F1 of 71.9 on WSJ-10, which is reasonably high, but it does not produce labeled trees.</Paragraph> </Section>
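The CCM's factored scoring is easy to state in code: every span contributes one factor for its POS yield and one for its context, each conditioned on whether the bracketing marks the span as a constituent or a distituent. A Python sketch with dictionary-backed parameter tables (our simplification, not the authors' implementation):

import math

def ccm_log_score(tags, brackets, phi_yield, phi_context, eps=1e-9):
    """Unnormalized log P_CCM(B, F') for one sentence. `brackets` maps
    half-open spans (i, j) to 'c' (constituent); unmarked spans are
    distituents ('d'). Parameter tables are dicts keyed by
    (feature, class); unseen pairs back off to eps."""
    n = len(tags)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            label = brackets.get((i, j), "d")
            y = tuple(tags[i:j])                 # POS yield y_ij
            c = (tags[i - 1] if i > 0 else "#",  # context c_ij, with "#"
                 tags[j] if j < n else "#")      # marking the boundaries
            total += math.log(phi_yield.get((y, label), eps))
            total += math.log(phi_context.get((c, label), eps))
    return total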
<Section position="8" start_page="884" end_page="885" type="metho"> <SectionTitle> 5 Intersecting CCM and PCFG </SectionTitle> <Paragraph position="0"> The CCM and PCFG models provide complementary views of syntactic structure. The CCM explicitly learns the non-recursive contextual and yield properties of constituents and distituents. The PCFG model, on the other hand, does not explicitly model properties of distituents, but instead focuses on modeling the hierarchical and recursive properties of natural language syntax. One would hope that modeling both of these aspects simultaneously would improve the overall quality of our induced grammar.</Paragraph> <Paragraph position="1"> We therefore combine the CCM with our feature-augmented PCFG, denoted by PROTO in experiment names. When we run EM on either of the models alone, at each iteration and for each training example, we calculate posteriors over that model's latent variables. For the CCM, the latent variable is a bracketing matrix B (equivalent to an unlabeled binary tree), while for the CFG+ the latent variable is a labeled tree T. While these latent variables aren't exactly the same, there is a close relationship between them: a bracketing matrix constrains the possible labeled trees, and a given labeled tree determines a bracketing matrix. One way to combine these models is to encourage both models to prefer latent variables which are compatible with each other.</Paragraph> <Paragraph position="2"> Similar to the approach of Klein and Manning (2004) on a different model pair, we intersect the CCM and the CFG+ by multiplying their scores for any labeled tree. For each possible labeled tree T over a sentence S, our generative model is given as follows:</Paragraph> <Paragraph position="3"> P(T, F, F' | S) = P_CCM(B(T), F') P_CFG+(T, F | S)   (4)</Paragraph> <Paragraph position="4"> where B(T) corresponds to the bracketing matrix determined by T. The EM algorithm for the product model will maximize:</Paragraph> <Paragraph position="5"> sum_S log sum_{T in T(S)} P_CCM(B(T), F') P_CFG+(T, F | S) = sum_S log sum_B P_CCM(B, F') sum_{T in T(B,S)} P_CFG+(T, F | S)</Paragraph> <Paragraph position="6"> where T(S) is the set of labeled trees consistent with the sentence S and T(B,S) is the set of labeled trees consistent with the bracketing matrix B and the sentence S. Notice that this quantity increases as the CCM and CFG+ models place probability mass on compatible latent structures, giving an intuitive justification for the success of this approach.</Paragraph> <Paragraph position="7"> We can compute posterior expectations over (B,T) in the combined model (4) using a variant of the inside-outside algorithm. The local factor for a binary rule r = X -> Y Z over span (i,j), with CCM features F'ij = (yij, cij) and prototype feature pij, is given by the product of the local factors for the CCM and CFG+ models: phi(r, (i,j)) = phi_CCM(yij, cij) phi_CFG+(r, pij). From these local factors, the inside-outside algorithm produces expected counts for each binary rule r over each span (i,j) and split point k, denoted by P(r, (i,j), k | S, F, F'). These posteriors are sufficient to re-estimate all of our model parameters.</Paragraph>
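The combined local factor slots into an otherwise standard inside pass. The Python sketch below shows that dynamic program for a binary grammar; the rule and factor interfaces are our assumptions, and the matching outside pass and expected-count accumulation are omitted:

from collections import defaultdict

def inside(tags, unary_rules, binary_rules, factor):
    """Inside pass with the intersected factor multiplied into every
    anchored rule. `factor(X, (i, j))` should return the product
    phi_CCM(y_ij, c_ij) * phi_CFG+(rule, p_ij) for the current sentence.
    `unary_rules` maps a POS tag to a list of (X, prob); `binary_rules`
    is a list of (X, Y, Z, prob). Spans are half-open."""
    n = len(tags)
    chart = defaultdict(float)  # (i, j, X) -> inside score
    for i, t in enumerate(tags):
        for X, p in unary_rules.get(t, []):
            chart[i, i + 1, X] += p * factor(X, (i, i + 1))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for X, Y, Z, p in binary_rules:
                s = sum(chart[i, k, Y] * chart[k, j, Z]
                        for k in range(i + 1, j))
                if s > 0.0:
                    chart[i, j, X] += p * s * factor(X, (i, j))
    return chart  # chart[(0, n, root)] is the unnormalized sentence score

Posteriors for re-estimation then come from pairing this chart with the corresponding outside scores in the usual way.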
<Paragraph position="8"> 6 CCM as a Bracketer
We tested the product model described in section 5 on WSJ-10 under the same conditions as in section 3. Our initial experiment utilizes no prototype information, random initialization, and greedy remapping of its labels. This experiment, PCFG x CCM in figure 4, gave 35.3 labeled F1, compared to the 51.6 labeled F1 with gold bracketing information (PCFG x GOLD in figure 4). Next we added the manually specified prototypes in figure 1, and constrained the model to give these yields their labels if chosen as constituents. This experiment gave 48.9 labeled F1 (73.3 unlabeled). The error reduction is 21.0% labeled (5.3% unlabeled) over PCFG x CCM.</Paragraph> <Paragraph position="9"> We then experimented with adding the distributional prototype features discussed in section 3.2, using a threshold of 0.75 and g = 0.1. This experiment, PROTO x CCM in figure 4, gave 62.2 labeled F1 (76.5 unlabeled). The error reduction is 26.0% labeled (12.0% unlabeled) over the experiment using prototypes without the similarity features. The overall error reduction from PCFG x CCM is 41.6% (16.7%) in labeled (unlabeled) F1.</Paragraph> </Section> <Section position="9" start_page="885" end_page="887" type="metho"> <SectionTitle> 7 Error Analysis </SectionTitle> <Paragraph position="0"> The most common type of error by our PROTO x CCM system was due to the binary grammar restriction. For instance, common NPs such as DT JJ NN are analyzed as [NP DT [NP JJ NN]], which proposes additional NP constituents compared to the flatter treebank analysis. This discrepancy greatly, and perhaps unfairly, damages NP precision (see figure 6). However, this error is unavoidable given our binary grammar. Possessive NPs are analyzed as [NP NN [PP POS NN]], with the POS element treated as a preposition and the possessed NP as its complement. While labeling POS NN as a PP is clearly incorrect, placing a constituent over these elements is not unreasonable, and has in fact been proposed by some linguists (Abney, 1987). Another type of error, also reported by Klein and Manning (2002), is MD VB groupings in infinitival VPs, an analysis also sometimes argued for by linguists (Halliday, 2004). More seriously, prepositional phrases are almost always attached &quot;high&quot; to the verb for longer NPs.</Paragraph> <Section position="1" start_page="886" end_page="886" type="sub_section"> <SectionTitle> 7.1 Augmenting Prototypes </SectionTitle> <Paragraph position="0"> One of the advantages of the prototype-driven approach over a fully unsupervised approach is the ability to refine or add to the annotation specification if we are not happy with the output of our system. We demonstrate this flexibility by augmenting the prototypes in figure 1 with two new categories, NP-POS and VP-INF, meant to model possessive noun phrases and infinitival verb phrases, which tend to have slightly different distributional properties from normal NPs and VPs. These new sub-categories are used during training and then stripped in post-processing. This prototype list gave 65.1 labeled F1 (78.2 unlabeled). This experiment is labeled BEST in figure 4. Looking at the CFG-learned rules in figure 7, we see that the basic structure of the treebank grammar is captured.</Paragraph> </Section> <Section position="2" start_page="886" end_page="886" type="sub_section"> <SectionTitle> 7.2 Parsing with only the PCFG </SectionTitle> <Paragraph position="0"> In order to judge how well the PCFG component of our model did in isolation, we experimented with training our BEST model with the CCM component, but dropping it at test time. This experiment gave 65.1 labeled F1 (76.8 unlabeled). This demonstrates that while our PCFG performance degrades without the CCM, it can be used on its own with reasonable accuracy.</Paragraph> </Section> <Section position="3" start_page="886" end_page="887" type="sub_section"> <SectionTitle> 7.3 Automatically Generated Prototypes </SectionTitle> <Paragraph position="0"> There are two types of bias which enter into the creation of prototype lists. One of them is the bias to choose examples which reflect the annotation semantics we wish our model to have. The second is the iterative change of prototypes in order to maximize F1. Whereas the first is appropriate, indeed the point, the latter is not. In order to guard against the second type of bias, we experimented with automatically generated prototype lists, which would not be possible without labeled data. For each phrase type category, we extracted the three most common yields associated with that category that differed in either their first or last POS tag. Repeating our PROTO x CCM experiment with this list yielded 60.9 labeled F1 (76.5 unlabeled), comparable to the performance of our manual prototype list.</Paragraph> </Section> <Section position="4" start_page="887" end_page="887" type="sub_section"> <SectionTitle> 7.4 Chinese Grammar Induction </SectionTitle> <Paragraph position="0"> In order to demonstrate that our system is somewhat language independent, we tested our model on CTB-10, the 2,437 sentences of the Chinese Treebank (Ircs, 2002) of length at most 10 after punctuation is stripped. Since the authors have no expertise in Chinese, we automatically extracted prototypes in the same way described in section 7.3. Since we did not have access to a large auxiliary POS-tagged Chinese corpus, our distributional model was built only from the treebank text, and the distributional similarities are presumably degraded relative to the English. Our PCFG x CCM experiment gave 18.0 labeled F1 (43.4 unlabeled). The PROTO x CCM model gave 39.0 labeled F1 (53.2 unlabeled). Presumably, with access to more POS-tagged data and the expertise of a Chinese speaker, our system would see increased performance. It is worth noting that our unlabeled F1 of 53.2 is the best reported from a primarily unsupervised system, with the next highest figure being the 46.7 reported by Klein and Manning (2004).</Paragraph> </Section> </Section> </Paper>