<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2410">
<Title>Thesauruses for Prepositional Phrase Attachment</Title>
<Section position="6" start_page="0" end_page="0" type="evalu">
<SectionTitle> 5 Experiments </SectionTitle>
<Paragraph position="0"> For our experiments we use the Wall Street Journal dataset created by Ratnaparkhi et al. (1994). This is divided into a training set of 20,801 quadruples, a development set of 4,039 quadruples and a test set of 3,097 quadruples. Each word was reduced to its morphological root using the morphological analyser described by Minnen et al. (2000). Strings of four digits beginning with a 1 or 2 were replaced with YEAR, and all other digit strings, including those containing commas and full stops, were replaced with NUM. Our implementation of Collins' algorithm achieves only 84.3% on the test data, with the shortfall of 0.2% primarily due to the different morphological analysers used.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.1 Smoothing </SectionTitle>
<Paragraph position="0"> Firstly, we compare the different PP similarity functions.</Paragraph>
<Paragraph position="1"> Figure 2 shows the accuracy of each on the development set as a function of k, the number of examples in S(c) used for smoothing. The WASPS thesaurus was used in all cases. The best smoothed model is rank, with 85.1% accuracy when b = 0.05 and k = 15.</Paragraph>
<Paragraph position="2"> The accuracy of rank with the smallest b value drops off rapidly when k > 10, showing that neighbours beyond this point are providing unreliable evidence and should be discounted more aggressively. More interestingly, this problem also affects average, suggesting that the similarity scores provided by the thesaurus are also misleadingly high for less similar words. The same effect was also observed when we used the harmonic mean of all similarity scores, so it is unlikely that the problem is an artifact of the averaging operation.</Paragraph>
<Paragraph position="3"> On the other hand, if b is set to discount more aggressively then accuracy levels off very quickly, as less similar neighbours are assigned zero frequency. The middle value of b = 0.05 appears to offer a good trade-off. Regardless of the similarity function we can see that relatively small values for k are sufficient, which is good news for efficiency reasons (each attachment decision is an O(k) operation).</Paragraph>
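<Paragraph> To make the smoothing step concrete, the following minimal sketch pools frequencies from the k nearest thesaurus neighbours. It is illustrative only: the linear rank-decay weight and the names rank_weight and smoothed_frequency are assumptions made for exposition rather than the paper's implementation; the sketch merely mirrors the behaviour described above, namely that more aggressive discounting drives distant neighbours to zero frequency and that each attachment decision is an O(k) operation.

    def rank_weight(rank, b):
        """Illustrative weight for the neighbour at a 1-based rank: linear
        decay controlled by b, clipped at zero so that sufficiently distant
        neighbours contribute no frequency at all (an assumed scheme)."""
        return max(0.0, 1.0 - b * (rank - 1))

    def smoothed_frequency(word, context, freq, neighbours, k=15, b=0.05):
        """Pool observed frequencies for (word, context) from the word itself
        and its k nearest thesaurus neighbours S(c). O(k) per decision.
        freq maps (word, context) pairs to counts; neighbours maps a word to
        its thesaurus neighbours ordered by similarity rank (both hypothetical
        data structures)."""
        total = freq.get((word, context), 0.0)
        for rank, nbr in enumerate(neighbours.get(word, [])[:k], start=1):
            total += rank_weight(rank, b) * freq.get((nbr, context), 0.0)
        return total

    # e.g. smoothed_frequency("fetch", ("price", "of", "profit"), freq, neighbours)
</Paragraph>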
<Paragraph position="4"> Figure 3 shows the combined coverage of the triple and quadruple features in Collins' model, which are the only smoothed features in our model. For example, almost 75% of attachment decisions are resolved by 3- or 4-tuples using the average function and setting k = 25.</Paragraph>
<Paragraph position="5"> Again, average is comparable to rank with b = 0.01.</Paragraph>
<Paragraph position="6"> Table 1 compares the accuracy of the smoothed and unsmoothed models at each backing-off stage. Smoothing has a negative effect on accuracy at each individual stage, but this is made up for by an increase in coverage.</Paragraph>
<Paragraph position="7"> The reduction in the error rate with the single best policy on the development set is somewhat less than with the smoothed frequency models, and the results are more error-prone and sensitive to the choice of k. These models are more likely to be unlucky with a choice of feature than with the smoothed frequencies. As noted above, this technique may still be useful for algorithms which cannot easily make use of smoothed frequencies.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.2 Thesauruses </SectionTitle>
<Paragraph position="0"> A thesaurus providing better neighbours should do better on this task. Figure 4 shows the accuracy of the three thesauruses using rank smoothing and b = 0.05 on the development data. Final results using k = 5 and b = 0.05 on the test data are shown in Table 2, together with the size of the noun sections of each thesaurus (the direct object thesaurus in the case of specialist) and the coverage of 3- and 4-tuples.</Paragraph>
<Paragraph position="1"> Clearly both generic thesauruses consistently outperform the specialist thesaurus. The latter tends to produce neighbours with less obvious semantic similarity, for example providing pour as the first neighbour of fetch. We hypothesised that using syntactic rather than semantic neighbours could be desirable, but in this case it often generates contexts that are unlikely to occur: pour price of profit as a neighbour of fetch price of profit, for example. Although this may be a flaw in the approach, we may simply be using too few contexts to create a reliable thesaurus. Previous research has found that using more data leads to better quality thesauruses (Curran and Moens, 2002). We are also conflating attachment preferences, since a word must appear with similar contexts in both noun- and verb-modifying PPs to achieve a high similarity score. There may be merit in creating separate thesauruses for noun-attachment and verb-attachment, since there may be words that are strongly similar in only one of these cases.</Paragraph>
<Paragraph position="2"> Interestingly, although Lin is smaller than WASPS, it has better coverage. This is most likely due to the different corpora used to construct each thesaurus. Lin is built using newswire text, which is closer in genre to the Wall Street Journal. For example, the first neighbour for fetch in WASPS is grab, but none of the top 25 neighbours of this word in Lin have this sporting sense. Both WASPS and specialist are derived from the BNC and have similar coverage, although the quality of specialist neighbours is not as good.</Paragraph>
<Paragraph position="3"> The WASPS and Lin models produce statistically significant (P < 0.05) improvements over the vanilla Collins model using a paired t-test with 10-fold cross-validation on the entire dataset (the smoothed model achieves 84.90±1.0% accuracy by this measure). The specialist model is not significantly better. Table 3 compares our results with other comparable PP attachment models.</Paragraph>
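<Paragraph> As an illustration of the significance test used above, the sketch below applies a paired t-test to matched per-fold accuracies from 10-fold cross-validation. The fold accuracies are invented placeholders rather than results from this paper; scipy.stats.ttest_rel is one standard implementation of the paired t-test.

    # Paired t-test over matched cross-validation folds (illustrative values).
    from scipy.stats import ttest_rel

    # Hypothetical per-fold accuracies for the baseline and a smoothed model.
    baseline = [0.842, 0.838, 0.851, 0.845, 0.839, 0.847, 0.841, 0.836, 0.848, 0.844]
    smoothed = [0.851, 0.846, 0.859, 0.852, 0.845, 0.854, 0.850, 0.843, 0.857, 0.851]

    statistic, p_value = ttest_rel(smoothed, baseline)
    print("t = %.2f, P = %.4f" % (statistic, p_value))
    # The improvement counts as significant at the 5% level when p_value
    # falls below 0.05.
</Paragraph>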
<Paragraph position="4"> On the face of it, these are not resounding improvements over the baseline, but this is a very hard task. Ratnaparkhi et al. (1994) established a human upper bound of 88.2%, but subsequent research has put this as low as 78.3% (Mitchell, 2003). At least two thirds of the remaining errors are therefore likely to be very difficult.</Paragraph>
<Paragraph position="5"> An inspection of the data shows that many of the remaining errors are due to poor neighbouring PPs being used for smoothing. For example, the PP in entrust company with cash modifies the verb, but no matching quadruples are present in the training data. The only matching (n1,p,n2) triple using WASPS is (industry, for, income), which appears twice in the training data modifying the noun. The model therefore guesses incorrectly even though the thesaurus is providing what appear to be semantically appropriate neighbours. Another example is attend meeting with representative, where the (v,p,n2) triple (talk, with, official) convinces the model to incorrectly guess verb attachment.</Paragraph>
<Paragraph position="6"> Part of the problem is that words in the PP are replaced independently and without consideration of the remaining context. However, we had hoped the specialist thesaurus might alleviate this problem by providing neighbours that are more appropriate for this specific task. Finding good neighbours for verbs is clearly more difficult than for nouns, since subcategorisation and selectional preferences also play a role.</Paragraph>
</Section>
</Section>
</Paper>