<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1042"> <Title>Joint and conditional estimation of tagging and parsing models</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 PCFG parsing </SectionTitle> <Paragraph position="0"> In this application, the pairs (y;x) consist of a parse tree y and its terminal string or yield x (it may be simpler to think of y containing all of the parse tree except for the string x). Recall that in a PCFG with production set R, each production (A! ) 2 R is associated with a parameter A! . These parameters satisfy a normalization constraint for each nonterminal A:</Paragraph> <Paragraph position="2"> For each production r 2 R, let fr(y) be the number of times r is used in the derivation of the tree y. Then the PCFG defines a probability distribution over trees:</Paragraph> <Paragraph position="4"> Unfortunately the MCLE for a PCFG is more complicated. If x is a word string, then let (x) be the set of parse trees with terminal string or yield x generated by the PCFG. Then given a training corpus D = ((y1;x1);::: ;(yn;xn)), where yi is a parse tree for the string xi, the log conditional likelihood of the training data log P(~yj~x) and its derivative are given by:</Paragraph> <Paragraph position="6"> Here E (fjx) denotes the expectation of f with respect to P conditioned on Y 2 (x). There does not seem to be a closed-form solution for the that maximizes P(~yj~x) subject to the constraints (3), so we used an iterative numerical gradient ascent method, with the constraints (3) imposed at each iteration using Lagrange multipliers. Note that Pni=1 E (fA! jxi) is a quantity calculated in the Inside-Outside algorithm (Lari and Young, 1990) and P(~yj~x) is easily computed as a by-product of the same dynamic programming calculation.</Paragraph> <Paragraph position="7"> Since the expected production counts E (fjx) depend on the production weights , the entire training corpus must be reparsed on each iteration (as is true of the Inside-Outside algorithm). This is computationally expensive with a large grammar and training corpus; for this reason the MCLE PCFG experiments described here were performed with the relatively small ATIS tree-bank corpus of air travel reservations distributed by LDC.</Paragraph> <Paragraph position="8"> In this experiment, the PCFGs were always trained on the 1088 sentences of the ATIS1 corpus and evaluated on the 294 sentences of the ATIS2 corpus. Lexical items were ignored; the PCFGs generate preterminal strings. The iterative algorithm for the MCLE was initialized with the MLE parameters, i.e., the &quot;standard&quot; PCFG estimated from a treebank. Table 1 compares the MLE and MCLE PCFGs.</Paragraph> <Paragraph position="9"> The data in table 1 shows that compared to the MLE PCFG, the MCLE PCFG assigns a higher conditional probability of the parses in the training data given their yields, at the expense of assigning a lower marginal probability to the yields themselves. The labelled precision and recall parsing results for the MCLE PCFG were slightly higher than those of the MLE PCFG. 
(Table 1 reports the conditional likelihood P_θ(ỹ|x̃) of the ATIS1 training trees and the marginal likelihood P_θ(x̃) of the ATIS1 training strings, as well as the labelled precision and recall of the ATIS2 test trees, using the MLE and MCLE PCFGs.)
<Paragraph position="10"> Because both the test data set and the differences are so small, the significance of these results was estimated using a bootstrap method with the difference in F-score in precision and recall as the test statistic (Cohen, 1995). This test showed that the difference was not significant (p ≈ 0.1). Thus the MCLE PCFG did not perform significantly better than the MLE PCFG in terms of precision and recall.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 HMM tagging </SectionTitle>
<Paragraph position="0"> As noted in the previous section, maximizing the conditional likelihood of a PCFG or an HMM can be computationally intensive. This section and the next pursue an alternative strategy for comparing MLEs and MCLEs: we compare similar (but not identical) model classes, one of which has an easily computed MLE and the other of which has an easily computed MCLE. The application considered in this section is bitag POS tagging, but the techniques extend straightforwardly to n-tag tagging. In this application, the data pairs (y, x) consist of a tag sequence y = t_1 ... t_m and a word sequence x = w_1 ... w_m, where t_j is the tag for word w_j (to simplify the formulae, w_0, t_0, w_{m+1} and t_{m+1} are always taken to be end-markers). Standard HMM tagging models define a joint distribution over word-tag sequence pairs; these are most straightforwardly estimated by maximizing the likelihood of the joint training distribution. However, it is straightforward to devise closely related HMM tagging models which define a conditional distribution over tag sequences given word sequences, and which are most straightforwardly estimated by maximizing the conditional likelihood of the distribution of tag sequences given word sequences in the training data.</Paragraph>
<Paragraph position="1"> All of the HMM models investigated in this section are instances of a certain kind of graphical model that Pearl (1988) calls "Bayes nets"; Figure 2 sketches the networks that correspond to all of the models discussed here. (In such a graph, the set of incoming arcs to a node depicting a variable indicates the set of variables on which this variable is conditioned.)</Paragraph>
<Paragraph position="2"> Recall the standard bitag HMM model, which defines a joint distribution over word and tag sequences:</Paragraph>
<Paragraph position="3"> P(x, y) = \prod_{j=1}^{m+1} P(w_j \mid t_j) \, P(t_j \mid t_{j-1}) \quad (4)</Paragraph>
<Paragraph position="4"> As is well known, the MLE for (4) sets P̂ to the empirical distributions on the training data.</Paragraph>
<Paragraph position="5"> Now consider the following conditional model of the conditional distribution of tags given words (this is a simplified form of the model described in McCallum et al. (2000)):</Paragraph>
<Paragraph position="6"> P(y \mid x) = \prod_{j=1}^{m+1} P_0(t_j \mid w_j, t_{j-1}) \quad (5)</Paragraph>
<Paragraph position="7"> The MCLE of (5) is easily calculated: P_0 should be set to the empirical distribution of the training data. However, to minimize sparse data problems we estimated P_0(T_j|W_j, T_{j-1}) as a mixture of P̂(T_j|W_j), P̂(T_j|T_{j-1}) and P̂(T_j|W_j, T_{j-1}), where the P̂ are empirical probabilities and the (bucketted) mixing parameters λ are determined using deleted interpolation from heldout data (Jelinek, 1997).</Paragraph>
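The sketch below illustrates the kind of interpolated estimate of P_0(t_j | w_j, t_{j-1}) described above. It assumes the mixing weights have already been tuned on heldout data and omits the bucketing step; the function names and the toy weights are assumptions, not the paper's actual settings.

```python
# Minimal sketch of an interpolated estimate of P0(t_j | w_j, t_{j-1}),
# mixing three empirical distributions with fixed weights (assumed to have
# been set by deleted interpolation on heldout data; bucketing omitted).
from collections import Counter

def train_interpolated_p0(tagged_sents, lams=(0.3, 0.3, 0.4)):
    c_t_w, c_w = Counter(), Counter()              # for P^(t | w)
    c_t_tprev, c_tprev = Counter(), Counter()      # for P^(t | t_prev)
    c_t_w_tprev, c_w_tprev = Counter(), Counter()  # for P^(t | w, t_prev)
    for sent in tagged_sents:
        prev = "<s>"
        for w, t in sent:
            c_t_w[(t, w)] += 1; c_w[w] += 1
            c_t_tprev[(t, prev)] += 1; c_tprev[prev] += 1
            c_t_w_tprev[(t, w, prev)] += 1; c_w_tprev[(w, prev)] += 1
            prev = t

    def p0(t, w, prev):
        p1 = c_t_w[(t, w)] / c_w[w] if c_w[w] else 0.0
        p2 = c_t_tprev[(t, prev)] / c_tprev[prev] if c_tprev[prev] else 0.0
        p3 = (c_t_w_tprev[(t, w, prev)] / c_w_tprev[(w, prev)]
              if c_w_tprev[(w, prev)] else 0.0)
        return lams[0] * p1 + lams[1] * p2 + lams[2] * p3

    return p0

# Toy usage on a single hand-tagged sentence
p0 = train_interpolated_p0([[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]])
print(p0("NN", "dog", "DT"))
```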
<Paragraph position="8"> These models were trained on sections 2-21 of the Penn treebank corpus. Section 22 was used as heldout data to evaluate the interpolation parameters λ. The tagging accuracy of the models was evaluated on section 23 of the treebank corpus (in both cases, the tag t_j assigned to word w_j is the one which maximizes the marginal P(t_j | w_1 ... w_m), since this minimizes the expected loss on a tag-by-tag basis).</Paragraph>
<Paragraph position="9"> The conditional model (5) has the worst performance of any of the tagging models investigated in this section: its tagging accuracy is 94.4%. The joint model (4) has a considerably lower error rate: its tagging accuracy is 95.5%.</Paragraph>
<Paragraph position="10"> One possible explanation for this result is that the way in which the interpolated estimate of P_0 is calculated, rather than conditional likelihood estimation per se, is somehow lowering tagger accuracy. To investigate this possibility, two additional joint models were estimated and tested, based on the formulae below.</Paragraph>
<Paragraph position="11"> P(x, y) = \prod_{j} P(w_j \mid t_j) \, P_1(t_j \mid w_{j-1}, t_{j-1}) \quad (6) \qquad P(x, y) = \prod_{j} P_0(t_j \mid w_j, t_{j-1}) \, P(w_{j+1} \mid t_j) \quad (7)</Paragraph>
<Paragraph position="12"> The MLEs for both (6) and (7) are easy to calculate. (6) contains a conditional distribution P_1 which would seem to be of roughly equal complexity to P_0, and it was estimated using deleted interpolation in exactly the same way as P_0, so if the poor performance of the conditional model were due to some artifact of the interpolation procedure, we would expect the model based on (6) to perform poorly as well. Yet the tagger based on (6) performs the best of all the taggers investigated in this section: its tagging accuracy is 96.2%.</Paragraph>
<Paragraph position="13"> (7) is admittedly a rather strange model, since the right-hand term in effect predicts the following word from the current word's tag. However, note that (7) differs from (5) only via the presence of this rather unusual term, which effectively converts (5) from a conditional model into a joint model. Yet adding this term improves tagging accuracy considerably, to 95.3%. Thus for bitag tagging at least, the conditional model has a considerably higher error rate than any of the joint models examined here. (While a test of significance was not conducted here, previous experience with this test set shows that performance differences of this magnitude are extremely significant statistically.)</Paragraph>
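The decoding criterion used above, picking at each position the tag that maximizes the marginal P(t_j | w_1 ... w_m), can be computed with the forward-backward algorithm. The sketch below shows this for the joint bitag model (4); the toy probability tables, the '<s>'/'</s>' end-marker names, and the function signature are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: tag-by-tag decoding for the joint bitag model (4), choosing
# at each position the tag with the highest marginal P(t_j | w_1 ... w_m),
# computed by forward-backward.
def marginal_decode(words, tags, p_trans, p_emit):
    """p_trans[(t_prev, t)] = P(t | t_prev); p_emit[(t, w)] = P(w | t).
    '<s>' and '</s>' play the role of the end-markers t_0 and t_{m+1}."""
    m = len(words)
    # Forward pass: alpha[j][t] = P(w_1..w_j, t_j = t)
    alpha = [dict() for _ in range(m)]
    for t in tags:
        alpha[0][t] = p_trans.get(("<s>", t), 0.0) * p_emit.get((t, words[0]), 0.0)
    for j in range(1, m):
        for t in tags:
            alpha[j][t] = p_emit.get((t, words[j]), 0.0) * sum(
                alpha[j - 1][s] * p_trans.get((s, t), 0.0) for s in tags)
    # Backward pass: beta[j][t] = P(w_{j+1}..w_m, </s> | t_j = t)
    beta = [dict() for _ in range(m)]
    for t in tags:
        beta[m - 1][t] = p_trans.get((t, "</s>"), 0.0)
    for j in range(m - 2, -1, -1):
        for t in tags:
            beta[j][t] = sum(p_trans.get((t, s), 0.0) * p_emit.get((s, words[j + 1]), 0.0)
                             * beta[j + 1][s] for s in tags)
    # The marginal at position j is proportional to alpha[j][t] * beta[j][t]
    return [max(tags, key=lambda t: alpha[j][t] * beta[j][t]) for j in range(m)]

# Toy usage with made-up probabilities
tags = ["DT", "NN"]
p_trans = {("<s>", "DT"): 0.8, ("<s>", "NN"): 0.2, ("DT", "NN"): 0.9,
           ("DT", "DT"): 0.1, ("NN", "NN"): 0.5, ("NN", "DT"): 0.5,
           ("NN", "</s>"): 0.7, ("DT", "</s>"): 0.3}
p_emit = {("DT", "the"): 0.9, ("NN", "dog"): 0.4,
          ("NN", "the"): 0.01, ("DT", "dog"): 0.01}
print(marginal_decode(["the", "dog"], tags, p_trans, p_emit))  # ['DT', 'NN']
```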
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Shift-reduce parsing </SectionTitle>
<Paragraph position="0"> The previous section compared similar joint and conditional tagging models. This section compares a pair of joint and conditional parsing models. The models are both stochastic shift-reduce parsers; they differ only in how the distribution over possible next moves is calculated. These parsers are direct simplifications of the Structured Language Model (Jelinek, 2000). Because the parsers' moves are determined solely by the top two category labels on the stack and possibly the look-ahead symbol, they are much simpler than stochastic LR parsers (Briscoe and Carroll, 1993; Inui et al., 1997). The distribution over trees generated by the joint model is a probabilistic context-free language (Abney et al., 1999). As with the PCFG models discussed earlier, these parsers are not lexicalized; lexical items are ignored, and the POS tags are used as the terminals.</Paragraph>
<Paragraph position="1"> These two parsers only produce trees with unary or binary nodes, so we binarized the training data before training the parsers, and debinarized the trees the parsers produce before evaluating them with respect to the test data (Johnson, 1998). We binarized by inserting n - 2 additional nodes into each local tree with n > 2 children.</Paragraph>
<Paragraph position="2"> We binarized by first joining the head to all of the constituents to its right, and then joining the resulting structure with the constituents to its left. The label of a new node is the label of the head followed by the suffix "-1" if the head is (contained in) the right child or "-2" if the head is (contained in) the left child. Figure 3 depicts an example of this transformation.</Paragraph>
(Figure 3: the binarization transformation used in the shift-reduce parser experiments transforms tree (a) into tree (b).)
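The following sketch implements the binarization just described for trees represented as (label, children) tuples. The head-finding rule and the choice of the head child's label for the new nodes are assumptions made here for illustration; the outermost new node keeps the original parent label so that the transformation can be undone.

```python
# Hedged sketch of the binarization: each local tree with n > 2 children gets
# n - 2 new nodes, built by joining the head with the constituents to its
# right and then with those to its left. head_of is a placeholder head rule.
def head_of(label, children):
    # Hypothetical head rule: take the rightmost child as the head.
    return len(children) - 1

def label_of(t):
    return t if isinstance(t, str) else t[0]

def binarize(tree):
    if isinstance(tree, str):          # leaf (a POS tag or word)
        return tree
    label, children = tree
    children = [binarize(c) for c in children]
    if len(children) <= 2:
        return (label, children)
    h = head_of(label, children)
    head_label = label_of(children[h])
    node = children[h]
    # Join rightwards first, then leftwards, as described in the text.
    pieces = [(c, True) for c in children[h + 1:]]
    pieces += [(c, False) for c in reversed(children[:h])]
    for i, (other, head_is_left) in enumerate(pieces):
        # Inner new nodes are named after the head: "-2" if the head is the
        # left child, "-1" if it is the right child. The outermost node keeps
        # the original parent label.
        if i == len(pieces) - 1:
            new_label = label
        else:
            new_label = head_label + ("-2" if head_is_left else "-1")
        node = (new_label, [node, other] if head_is_left else [other, node])
    return node

# Toy usage: a flat NP with three children
print(binarize(("NP", [("DT", ["the"]), ("JJ", ["big"]), ("NN", ["dog"])])))
```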
<Paragraph position="3"> The Structured Language Model is described in detail in Jelinek (2000), so it is only briefly reviewed here. Each parser's stack is a sequence of node labels (possibly including labels introduced by binarization). In what follows, s_1 refers to the top element of the stack, or '?' if the stack is empty; similarly s_2 refers to the next-to-top element of the stack, or '?' if the stack contains fewer than two elements. We also append a '?' to the end of the actual terminal string being parsed (just as with the HMMs above), as this simplifies the formulation of the parsers, i.e., if the string to be parsed is w_1 ... w_m, then we take w_{m+1} = ?.</Paragraph>
<Paragraph position="4"> A shift-reduce parse is defined in terms of moves. A move is either shift(w), reduce_1(c) or reduce_2(c), where c is a nonterminal label and w is either a terminal label or '?'. Moves are partial functions from stacks to stacks: a shift(w) move pushes a w onto the top of the stack, while a reduce_i(c) move pops the top i terminal or nonterminal labels off the stack and pushes a c onto the stack. A shift-reduce parse is a sequence of moves which (when composed) maps the empty stack to the two-element stack whose top element is '?' and whose next-to-top element is the start symbol. (Note that the last move in a shift-reduce parse must always be a shift(?) move; this corresponds to the final "accept" move in an LR parser.) The isomorphism between shift-reduce parses and standard parse trees is well known (Hopcroft and Ullman, 1979), and so is not described here.</Paragraph>
<Paragraph position="5"> A (joint) shift-reduce parser is defined by a distribution P(m | s_1, s_2) over next moves m given the top and next-to-top stack labels s_1 and s_2. To ensure that the next move is in fact a possible move given the current stack, we require that P(reduce_1(c) | ?, ?) = 0 and P(reduce_2(c) | s_1, ?) = 0, and that P(shift(?) | s_1, s_2) = 0 unless s_1 is the start symbol and s_2 = ?. Note that this extends to a probability distribution over shift-reduce parses (and hence parse trees) in a particularly simple way: the probability of a parse is the product of the probabilities of the moves it consists of. Assuming that P meets certain tightness conditions, this distribution over parses is properly normalized because there are no "dead" stack configurations: we require that the distribution over moves be defined for all possible stacks.</Paragraph>
<Paragraph position="8"> A conditional shift-reduce parser differs only minimally from the shift-reduce parser just described: it is defined by a distribution P(m | s_1, s_2, w) over next moves m given the top and next-to-top stack labels s_1 and s_2 and the next input symbol w (w is called the look-ahead symbol). In addition to the requirements on P above, we also require that P(shift(w') | s_1, s_2, w) = 0 if w' ≠ w, i.e., shift moves can only shift the current look-ahead symbol. This restriction implies that all non-zero probability derivations are derivations of the parse string, since the parse string forces a single sequence of symbols to be shifted in all derivations.</Paragraph>
<Paragraph position="11"> As before, since there are no "dead" stack configurations, so long as P obeys certain tightness conditions this defines a properly normalized distribution over parses. Since all the parses are required to be parses of the input string, this defines a conditional distribution over parses given the input string.</Paragraph>
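The sketch below replays a sequence of moves on a stack and scores the derivation as the product of move probabilities, as in the joint model just described; the conditional parser would simply condition the move distribution on the look-ahead symbol as well. The move encoding and the toy move distribution are assumptions made for illustration.

```python
# Hedged sketch of how a (joint) shift-reduce derivation is scored: the
# probability of a parse is the product of P(move | s1, s2) over its moves,
# where s1/s2 are the top and next-to-top stack labels ('?' when absent).
def score_derivation(moves, p_move):
    """moves: list of ('shift', w) or ('reduce', i, c);
    p_move maps (move, s1, s2) to a probability."""
    stack, prob = [], 1.0
    for move in moves:
        s1 = stack[-1] if len(stack) >= 1 else "?"
        s2 = stack[-2] if len(stack) >= 2 else "?"
        prob *= p_move.get((move, s1, s2), 0.0)
        if move[0] == "shift":
            stack.append(move[1])
        else:                      # ('reduce', i, c): pop i labels, push c
            _, i, c = move
            del stack[len(stack) - i:]
            stack.append(c)
    return prob, stack

# Toy derivation for the POS string "DT NN": shift DT, shift NN, reduce both
# to NP (the start symbol here), then shift the end-marker '?'.
moves = [("shift", "DT"), ("shift", "NN"), ("reduce", 2, "NP"), ("shift", "?")]
p_move = {(("shift", "DT"), "?", "?"): 0.9,
          (("shift", "NN"), "DT", "?"): 0.8,
          (("reduce", 2, "NP"), "NN", "DT"): 0.7,
          (("shift", "?"), "NP", "?"): 1.0}
prob, final_stack = score_derivation(moves, p_move)
print(prob, final_stack)   # final stack should be ['NP', '?']
```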
<Paragraph position="12"> It is easy to show that the MLE for the joint model, and the MCLE for the conditional model, are just the empirical distributions from the training data. We ran into sparse data problems using the empirical training distribution as an estimate for P(m | s_1, s_2, w) in the conditional model, so in fact we used deleted interpolation of P̂(m | s_1, s_2, w) and P̂(m | s_1, s_2) to estimate P(m | s_1, s_2, w). The models were estimated from sections 2-21 of the Penn treebank, and tested on the 2245 sentences of length 40 or less in section 23. The deleted interpolation parameters were estimated using heldout training data from section 22.</Paragraph>
(Table 2: results for the joint and conditional shift-reduce parsers, and for a PCFG.)
<Paragraph position="14"> We calculated the most probable parses using a dynamic programming algorithm based on the one described in Jelinek (2000). Jelinek notes that this algorithm's running time is O(n^6) (where n is the length of the sentence being parsed), and we found exhaustive parsing to be computationally impractical. We used a beam search procedure which thresholded the best analyses of each prefix of the string being parsed, and only considered analyses whose top two stack symbols had been observed in the training data. In order to help guard against the possibility that this pruning influenced the results, we ran the parsers twice, once with a beam threshold of 10^-6 (i.e., edges whose probability was less than 10^-6 of that of the best edge spanning the same prefix were pruned) and again with a beam threshold of 10^-9. The results of the latter runs are reported in Table 2; the labelled precision and recall results from the run with the more restrictive beam threshold differ by less than 0.001, i.e., at the level of precision reported here they are identical with the results presented in Table 2, except for the precision of the joint SR parser, which was 0.665. For comparison, Table 2 also reports results from the non-lexicalized treebank PCFG estimated from the transformed trees in sections 2-21 of the treebank; here exhaustive CKY parsing was used to find the most probable parses.</Paragraph>
<Paragraph position="15"> All of the precision and recall results presented in Table 2, including those for the PCFG, are much lower than those from a standard treebank PCFG; presumably this is because the binarization transformation depicted in Figure 3 loses information about pairs of non-head constituents in the same local tree (Johnson (1998) reports similar performance degradation for other binarization transformations). Both the joint and the conditional shift-reduce parsers performed much worse than the PCFG. This may be due to the pruning effect of the beam search, although this seems unlikely given that varying the beam threshold did not affect the results. The performance difference between the joint and conditional shift-reduce parsers bears directly on the issue addressed by this paper: the joint shift-reduce parser performed much better than the conditional shift-reduce parser. The differences are around a percentage point, which is quite large in parsing research (and certainly highly significant).</Paragraph>
<Paragraph position="16"> The fact that the joint shift-reduce parser outperforms the conditional shift-reduce parser is somewhat surprising. Because the conditional parser predicts its next move on the basis of the look-ahead symbol as well as the two top stack categories, one might expect it to predict this next move more accurately than the joint shift-reduce parser. The results presented here show that this is not the case, at least for non-lexicalized parsing. The label bias of conditional models may be responsible for this (Bottou, 1991; Lafferty et al., 2001).</Paragraph>
</Section>
</Paper>