<?xml version="1.0" standalone="yes"?>
<Paper uid="W95-0103">
<Title>Prepositional Phrase Attachment through a Backed-Off Model</Title>
<Section position="4" start_page="28" end_page="30" type="metho">
<SectionTitle> 3 Estimation based on Training Data Counts </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="28" end_page="29" type="sub_section">
<SectionTitle> 3.1 Notation </SectionTitle>
<Paragraph position="0"> We will use the symbol f to denote the number of times a particular tuple is seen in training data. For example, f(1, is, revenue, from, research) is the number of times the quadruple (is, revenue, from, research) is seen with a noun attachment. Counts of lower-order tuples can also be made: for example, f(1, P = from) is the number of times (P = from) is seen with noun attachment in training data, and f(V = is, N2 = research) is the number of times (V = is, N2 = research) is seen with either attachment and any value of N1 and P.</Paragraph>
</Section>
<Section position="2" start_page="29" end_page="29" type="sub_section">
<SectionTitle> 3.2 Maximum Likelihood Estimation </SectionTitle>
<Paragraph position="0"> A maximum likelihood method would use the training data to give the following estimate of the conditional probability:

    p̂(1 | v, n1, p, n2) = f(1, v, n1, p, n2) / f(v, n1, p, n2)

Unfortunately, sparse data problems make this estimate useless. A quadruple may appear in test data which has never been seen in training data, i.e. f(v, n1, p, n2) = 0. The above estimate is undefined in this situation, which happens extremely frequently in a large-vocabulary domain such as the WSJ (in this experiment about 95% of the quadruples appearing in test data had not been seen in training data).</Paragraph>
<Paragraph position="1"> Even if f(v, n1, p, n2) > 0, it may still be very low, and this may make the above MLE estimate inaccurate. Unsmoothed MLE estimates based on low counts are notoriously bad in similar problems such as n-gram language modeling [GC90]. However, later in this paper it is shown that estimates based on low counts are surprisingly useful in the PP-attachment problem.</Paragraph>
</Section>
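As a concrete illustration of these counts and the MLE estimate, here is a minimal Python sketch. It assumes labelled training examples are available as (v, n1, p, n2, attach) tuples with attach = 1 for noun attachment and 0 for verb attachment; the key layout and function names are illustrative, not taken from the paper.

```python
from collections import Counter
from itertools import combinations

def train_counts(examples):
    """Collect tuple counts f(sub-tuple) and f(attach, sub-tuple) from labelled quadruples.

    Each sub-tuple is keyed by its field names so that, e.g., the pair
    (P='from', N2='research') is distinct from (V='from', N1='research').
    """
    f = Counter()
    fields = ("v", "n1", "p", "n2")
    for v, n1, p, n2, attach in examples:
        values = dict(zip(fields, (v, n1, p, n2)))
        for r in range(1, 5):
            for subset in combinations(fields, r):
                key = tuple((name, values[name]) for name in subset)
                f[key] += 1               # f(sub-tuple): either attachment
                f[(attach,) + key] += 1   # f(attach, sub-tuple)
    return f

def mle_estimate(f, v, n1, p, n2):
    """Unsmoothed MLE p̂(1 | v, n1, p, n2); returns None on unseen quadruples."""
    key = (("v", v), ("n1", n1), ("p", p), ("n2", n2))
    denom = f[key]
    if denom == 0:
        return None   # the sparse-data failure case discussed above
    return f[(1,) + key] / denom
```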
<Section position="3" start_page="29" end_page="30" type="sub_section">
<SectionTitle> 3.3 Previous Work </SectionTitle>
<Paragraph position="0"> Hindle and Rooth [HR93] describe one of the first statistical approaches to the prepositional phrase attachment problem. Over 200,000 (v, n1, p) triples were extracted from 13 million words of AP news stories. The attachment decisions for these triples were unknown, so an unsupervised training method was used (section 5.2 describes the algorithm in more detail). Two human judges annotated the attachment decision for 880 test examples, and the method performed at 80% accuracy on these cases. Note that it is difficult to compare this result to results on the Wall Street Journal, as the two corpora may be quite different.</Paragraph>
<Paragraph position="1"> The Wall Street Journal Treebank [MSM93] enabled both [RRR94] and [BR94] to extract a large amount of supervised training material for the problem. Both of these methods consider the second noun, n2, as well as v, n1 and p, with the hope that this additional information will improve results. [BR94] use 12,000 training and 500 test examples. A greedy search is used to learn a sequence of 'transformations' which minimise the error rate on training data. A transformation is a rule which makes an attachment decision depending on up to 3 elements of the (v, n1, p, n2) quadruple. (Typical examples would be 'If P=of then choose noun attachment' or 'If V=buy and P=for choose verb attachment'.) A further experiment incorporated word-class information from WordNet into the model, by allowing the transformations to look at classes as well as the words. (An example would be 'If N2 is in the time semantic class, choose verb attachment'.) The method gave 80.8% accuracy with words only, 81.8% with words and semantic classes, and they also report an accuracy of 75.8% for the metric of [HR93] on this data. Transformations (using words only) score 81.9%.1</Paragraph>
<Paragraph position="2"> [RRR94] use the data described in section 2.1 of this paper - 20801 training and 3097 test examples from the Wall Street Journal. They use a maximum entropy model which also considers subsets of the quadruple. Each sub-tuple predicts noun or verb attachment with a weight indicating its strength of prediction - the weights are trained to maximise the likelihood of training data. For example, (P = of) might have a strong weight for noun attachment, while (V = buy, P = for) would have a strong weight for verb attachment. [RRR94] also allow the model to look at class information; this time the classes were learned automatically from a corpus. Results of 77.7% (words only) and 81.6% (words and classes) are reported. Crucially, they ignore low-count events in training data by imposing a frequency cut-off somewhere between 3 and 5.</Paragraph>
</Section>
</Section>
<Section position="5" start_page="30" end_page="32" type="metho">
<SectionTitle> 4 The Backed-Off Estimate </SectionTitle>
<Paragraph position="0"> [KATZ87] describes backed-off n-gram word models for speech recognition. There the task is to estimate the probability of the next word in a text given the (n-1) preceding words. The MLE estimate of this probability would be:

    p̂(wn | w1, w2 ... wn-1) = f(w1, w2 ... wn) / f(w1, w2 ... wn-1)

But again the denominator f(w1, w2 ... wn-1) will frequently be zero, especially for large n. The backed-off estimate is a method of combating the sparse data problem. It is defined recursively as follows:

    If f(w1, w2 ... wn-1) > c1:
        p̂(wn | w1, w2 ... wn-1) = f(w1, w2 ... wn) / f(w1, w2 ... wn-1)
    Else if f(w2, w3 ... wn-1) > c2:
        p̂(wn | w1, w2 ... wn-1) = α1 × f(w2, w3 ... wn) / f(w2, w3 ... wn-1)
    Else if f(w3, w4 ... wn-1) > c3:
        p̂(wn | w1, w2 ... wn-1) = α2 × f(w3, w4 ... wn) / f(w3, w4 ... wn-1)
    Else backing-off continues in the same way.

The idea here is to use MLE estimates based on lower-order n-grams if counts are not high enough to make an accurate estimate at the current level. The cut-off frequencies (c1, c2, ...) are thresholds determining whether to back off or not at each level - counts lower than ci at stage i are deemed to be too low to give an accurate estimate, so in this case backing-off continues. (α1, α2, ...) are normalisation constants which ensure that the conditional probabilities sum to one.</Paragraph>
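The back-off control flow can be sketched as follows, assuming the n-gram counts sit in a plain dictionary keyed by word tuples. The normalisation constants αi are deliberately omitted, so this shows only the order of the back-off steps, not a properly normalised distribution.

```python
def backed_off_ngram(f, words, cutoffs):
    """Back-off estimate of p(w_n | w_1 ... w_{n-1}) in the spirit of [KATZ87].

    `f` is assumed to map word tuples to counts; `words` is (w_1, ..., w_n);
    `cutoffs` gives one cut-off frequency per back-off stage. The alpha_i
    normalisation constants are left out of this sketch for brevity.
    """
    context, w_n = words[:-1], words[-1]
    for cutoff in cutoffs:
        if f.get(context, 0) > cutoff:
            return f.get(context + (w_n,), 0) / f[context]
        context = context[1:]   # back off: discard w_1, then w_2, ...
    return 0.0                  # ran out of context; a real model would fall back to p(w_n)
```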
<Paragraph position="1"> Note that the estimation of p̂(wn | w1, w2 ... wn-1) is analogous to the estimation of p̂(1 | v, n1, p, n2), and the above method can therefore also be applied to the PP-attachment problem. For example, a simple method for estimation of p̂(1 | v, n1, p, n2) would go from MLE estimates of p̂(1 | v, n1, p, n2) to p̂(1 | v, n1, p) to p̂(1 | v, n1) to p̂(1 | v) to p̂(1). However, a crucial difference between the two problems is that in the n-gram task the words w1 to wn are sequential, giving a natural order in which backing off takes place - from p̂(wn | w1, w2 ... wn-1) to p̂(wn | w2, w3 ... wn-1) to p̂(wn | w3, w4 ... wn-1) and so on. There is no such sequence in the PP-attachment problem, and because of this there are four possible triples when backing off from quadruples ((v, n1, p), (v, p, n2), (n1, p, n2) and (v, n1, n2)) and six possible pairs when backing off from triples ((v, p), (n1, p), (p, n2), (v, n1), (v, n2) and (n1, n2)).</Paragraph>
<Paragraph position="2"> A key observation in choosing between these tuples is that the preposition is particularly important to the attachment decision. For this reason only tuples which contained the preposition were used in backed-off estimates - this reduces the problem to a choice between 3 triples and 3 pairs at each respective stage. Section 6.2 describes experiments which show that tuples containing the preposition are much better indicators of attachment.</Paragraph>
<Paragraph position="3"> The following method of combining the counts was found to work best in practice (shown here for the triple stage; the pair stage is analogous):

    p̂(1 | v, n1, p, n2) = (f(1, v, n1, p) + f(1, v, p, n2) + f(1, n1, p, n2)) / (f(v, n1, p) + f(v, p, n2) + f(n1, p, n2))

Note that this method effectively gives more weight to tuples with high overall counts. Another obvious method of combination, a simple average,2 gives equal weight to the three tuples regardless of their total counts and does not perform as well.</Paragraph>
<Paragraph position="4"> The cut-off frequencies must then be chosen. A surprising difference from language modeling is that a cut-off frequency of 0 is found to be optimal at all stages. This effectively means that however low a count is, it is still used rather than backing off a level.</Paragraph>
<Paragraph position="5"> 2 e.g. A simple average for triples would be defined as

    p̂(1 | v, n1, p, n2) = (1/3) × (f(1, v, n1, p)/f(v, n1, p) + f(1, v, p, n2)/f(v, p, n2) + f(1, n1, p, n2)/f(n1, p, n2))
</Paragraph>
<Section position="1" start_page="32" end_page="32" type="sub_section">
<SectionTitle> 4.1 Description of the Algorithm </SectionTitle>
<Paragraph position="0"> The algorithm is then as follows:

    1. If f(v, n1, p, n2) > 0:
       p̂(1 | v, n1, p, n2) = f(1, v, n1, p, n2) / f(v, n1, p, n2)
    2. Else if f(v, n1, p) + f(v, p, n2) + f(n1, p, n2) > 0:
       p̂(1 | v, n1, p, n2) = (f(1, v, n1, p) + f(1, v, p, n2) + f(1, n1, p, n2)) / (f(v, n1, p) + f(v, p, n2) + f(n1, p, n2))
    3. Else if f(v, p) + f(n1, p) + f(p, n2) > 0:
       p̂(1 | v, n1, p, n2) = (f(1, v, p) + f(1, n1, p) + f(1, p, n2)) / (f(v, p) + f(n1, p) + f(p, n2))
    4. Else if f(p) > 0:
       p̂(1 | v, n1, p, n2) = f(1, p) / f(p)
    5. Else p̂(1 | v, n1, p, n2) = 1.0 (the default is noun attachment)

The decision is then: if p̂(1 | v, n1, p, n2) >= 0.5, choose noun attachment; otherwise choose verb attachment.</Paragraph>
</Section>
</Section>
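A sketch of this backed-off decision procedure, under the same assumed count representation as the earlier counting sketch (keys are tuples of (field, word) pairs, optionally prefixed by the attachment flag); the helper names are illustrative, not from the paper.

```python
def backed_off_attachment(f, v, n1, p, n2):
    """Backed-off estimate p̂(1 | v, n1, p, n2) following section 4.1.

    `f` maps a sub-tuple key of (field, word) pairs to its count over both
    attachments, and (1,) + key to its count with noun attachment.
    Only sub-tuples containing the preposition are used when backing off.
    """
    quad = (("v", v), ("n1", n1), ("p", p), ("n2", n2))
    stages = [
        [quad],                                              # 1. full quadruple
        [(("v", v), ("n1", n1), ("p", p)),                   # 2. the three triples with p
         (("v", v), ("p", p), ("n2", n2)),
         (("n1", n1), ("p", p), ("n2", n2))],
        [(("v", v), ("p", p)),                               # 3. the three pairs with p
         (("n1", n1), ("p", p)),
         (("p", p), ("n2", n2))],
        [(("p", p),)],                                       # 4. the preposition alone
    ]
    for keys in stages:
        denom = sum(f.get(k, 0) for k in keys)
        if denom > 0:                                        # cut-off frequency of 0
            return sum(f.get((1,) + k, 0) for k in keys) / denom
    return 1.0                                               # 5. default: noun attachment

def decide(f, v, n1, p, n2):
    """Return 1 (noun attachment) if p̂ >= 0.5, else 0 (verb attachment)."""
    return 1 if backed_off_attachment(f, v, n1, p, n2) >= 0.5 else 0
```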
<Section position="6" start_page="32" end_page="35" type="metho">
<SectionTitle> 5 Results </SectionTitle>
<Paragraph position="0"> The figure below shows the results for the method on the 3097 test sentences, also giving the total count and accuracy at each of the backed-off stages.</Paragraph>
<Section position="1" start_page="33" end_page="33" type="sub_section">
<SectionTitle> 5.1 Results with Morphological Analysis </SectionTitle>
<Paragraph position="0"> In an effort to reduce sparse data problems, the following processing was run over both test and training data, using the morphological analyser described in [KSZE94]. These modifications are similar to those performed on the corpus used by [BR94]. The result using this modified corpus was 84.5%, an improvement of 0.4% on the previous result.</Paragraph>
</Section>
<Section position="2" start_page="33" end_page="35" type="sub_section">
<SectionTitle> 5.2 Comparison with Other Work </SectionTitle>
<Paragraph position="0"> Results from [RRR94], [BR94] and the backed-off method are shown in the table below. All results are for the IBM data. These figures should be taken in the context of the lower and upper bounds on accuracy for the task.</Paragraph>
<Paragraph position="1"> On the surface the method described in [HR93] looks very similar to the backed-off estimate. For this reason the two methods deserve closer comparison. Hindle and Rooth used a partial parser to extract head nouns from a corpus, together with a preceding verb and a following preposition, giving a table of (v, n1, p) triples. An iterative, unsupervised method was then used to decide between noun and verb attachment for each triple, based on counts f(w, p), where f(w, p) is the number of times preposition p is seen attached to word w in the table.</Paragraph>
<Paragraph position="2"> If we ignore n2 then the IBM data is equivalent to Hindle and Rooth's (v, n1, p) triples, with the advantage of the attachment decision being known, allowing a supervised algorithm. The test used in [HR93] can then be stated as follows in our notation:5

    If f(1, n1, p) / f(1, n1) >= f(0, v, p) / f(0, v)

then choose noun attachment, else choose verb attachment. This is effectively a comparison of the maximum likelihood estimates of p̂(p | 1, n1) and p̂(p | 0, v), a different measure from the backed-off estimate, which gives p̂(1 | v, n1, p).</Paragraph>
<Paragraph position="3"> The backed-off method based on just the f(v, p) and f(n1, p) counts would be: if p̂(1 | v, n1, p) >= 0.5 then choose noun attachment, else choose verb attachment, where

    p̂(1 | v, n1, p) = (f(1, v, p) + f(1, n1, p)) / (f(v, p) + f(n1, p))
</Paragraph>
<Paragraph position="4"> An experiment was implemented to investigate the difference in performance between these two methods. The test set was restricted to those cases where f(1, n1) > 0, f(0, v) > 0, and Hindle and Rooth's method gave a definite decision (i.e. the above inequality is strictly less-than or greater-than). This gave 1924 test cases. Hindle and Rooth's method scored 82.1% accuracy (1580 correct) on this set, whereas the backed-off measure scored 86.5% (1665 correct).</Paragraph>
<Paragraph position="5"> 5 This ignores refinements to the test such as smoothing of the estimate, and a measure of the confidence of the decision. However, the measure given is at the core of the algorithm.</Paragraph>
</Section>
</Section>
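To make the contrast between the two tests concrete, here is a small sketch with the relevant counts passed in directly; the argument names simply spell out the count symbols used in the text, and the function names are illustrative.

```python
def hindle_rooth_decision(f1_n1_p, f1_n1, f0_v_p, f0_v):
    """[HR93]-style test in the paper's notation: compare MLEs of P(p | 1, n1) and P(p | 0, v).

    Arguments are raw counts; returns 1 (noun), 0 (verb), or None on a tie
    (the experiment in the text keeps only the definite, non-tied cases).
    """
    noun_side = f1_n1_p / f1_n1
    verb_side = f0_v_p / f0_v
    if noun_side == verb_side:
        return None
    return 1 if noun_side > verb_side else 0

def backed_off_pair_decision(f1_v_p, f_v_p, f1_n1_p, f_n1_p):
    """Backed-off measure using only the f(v, p) and f(n1, p) counts."""
    denom = f_v_p + f_n1_p
    if denom == 0:
        return 1   # default to noun attachment
    return 1 if (f1_v_p + f1_n1_p) / denom >= 0.5 else 0
```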
<Section position="7" start_page="35" end_page="36" type="metho">
<SectionTitle> 6 A Closer Look at Backing-Off </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="35" end_page="35" type="sub_section">
<SectionTitle> 6.1 Low Counts are Important </SectionTitle>
<Paragraph position="0"> A possible criticism of the backed-off estimate is that it uses low-count events without any smoothing, which has been shown to be a mistake in similar problems such as n-gram language models. In particular, quadruples and triples seen in test data will frequently be seen only once or twice in training data.</Paragraph>
<Paragraph position="1"> An experiment was made with all counts less than 5 being put to zero,6 effectively making the algorithm ignore low-count events. (In [RRR94] a cut-off 'between 3 and 5' is used for all events.) The training and test data were both the unprocessed, original data sets. The results were as follows:</Paragraph>
<Paragraph position="2"> 6 Specifically: if for a subset x of the quadruple f(x) < 5, then make f(x) = f(1, x) = f(0, x) = 0.</Paragraph>
</Section>
<Section position="2" start_page="35" end_page="36" type="sub_section">
<SectionTitle> 6.2 Tuples with Prepositions are Better </SectionTitle>
<Paragraph position="0"> We have excluded tuples which do not contain a preposition from the model. This section gives results which justify this.</Paragraph>
<Paragraph position="1"> The table below gives accuracies for the sub-tuples at each stage of backing-off. The accuracy figure for a particular tuple is obtained by modifying the algorithm in section 4.1 to use only information from that tuple at the appropriate stage. For example, for (v, n1, n2), stage 2 would be modified to read:

    Else if f(v, n1, n2) > 0: p̂(1 | v, n1, p, n2) = f(1, v, n1, n2) / f(v, n1, n2)

At each stage there is a sharp difference in accuracy between tuples with and without a preposition. Moreover, if the 14 tuples in the above table were ranked by accuracy, the top 7 tuples would be the 7 tuples which contain a preposition.</Paragraph>
</Section>
</Section>
</Paper>
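For completeness, the cut-off experiment of section 6.1 amounts to a single preprocessing pass over the count table, following the rule in footnote 6. This sketch assumes the same key layout as the earlier counting sketch; the function name and threshold parameter are illustrative.

```python
def apply_cutoff(f, cutoff=5):
    """Zero out low-count events before running the backed-off algorithm.

    Rule from footnote 6: if, for a subset x of the quadruple, f(x) < cutoff,
    then set f(x) = f(1, x) = f(0, x) = 0. Keys are sub-tuples of (field, word)
    pairs, optionally prefixed with an attachment flag (0 or 1).
    """
    trimmed = dict(f)
    for key, count in f.items():
        if key and key[0] in (0, 1):   # skip the attachment-specific entries here
            continue
        if count < cutoff:
            trimmed[key] = 0
            trimmed[(1,) + key] = 0
            trimmed[(0,) + key] = 0
    return trimmed
```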