<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2234"> <Title>Some Properties of Preposition and Subordinate Conjunction Attachments*</Title> <Section position="7" start_page="1439" end_page="1441" type="evalu"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1439" end_page="1439" type="sub_section"> <SectionTitle> 6.1 Data preparation </SectionTitle> <Paragraph position="0"> Our experiments were conducted with data made available through the Penn Treebank annotation effort (Marcus et al., 1993). However, since our grammar model is based on syntax groups, not conventional categories, we needed to extend the Treebank annotations to include the constructs of interest to us.</Paragraph> <Paragraph position="1"> This was accomplished in several steps. First, noun groups and verb groups were manually annotated using Treebank data that had been stripped of all phrase structure markup.5 This syntax group markup was then reconciled with the Treebank annotations by a semi-automatic procedure. Usually, the procedure just needs to overlay the syntax group markup on top of the Treebank annotations. However, the Treebank annotations often had to be adjusted to make them consistent with the syntax groups (e.g., verbal auxiliaries need to be included in the relevant verb phrase). Some 4-5% of all Treebank sentences could not be automatically reconciled in this way, and were removed from the data sets for these experiments.</Paragraph> <Paragraph position="2"> The reconciliation procedure also automatically tags the data for part of speech, using a high-performance tagger based on (Brill, 1993).</Paragraph> <Paragraph position="3"> Finally, the reconciler introduces adjective, adverb, and I-group markup. I-groups are created for all lexemes tagged with the IN, TO, WDT, WP, WP$ or WRB parts of speech, as well as for multi-word prepositions such as "according to".</Paragraph> <Paragraph position="4"> The reconciled data are then compiled into attachment problems using another semi-automatic pattern-matching procedure. 8% of the cases did not fit the patterns and required manual intervention.</Paragraph> <Paragraph position="5"> We split our data into a training set (files 2000, 2013, and 200-269) and a test set (files 270-299). Because manual intervention is time consuming, it was performed only on the test set. The training set (called 0x6x) has 2615 attachment problems; the test set is called 7x9x.</Paragraph> </Section> <Section position="2" start_page="1439" end_page="1439" type="sub_section"> <SectionTitle> 6.2 Preliminary test </SectionTitle> <Paragraph position="0"> The preliminary experiment with our system compares it to previous work (Ratnaparkhi et al., 1994; Brill and Resnik, 1994; Collins and Brooks, 1995) on handling VNPN binary PP attachment ambiguity. In our terms, the task is to determine the attachment of certain vnpn category I-groups. The data was originally used in (Ratnaparkhi et al., 1994) and was derived from the Penn Treebank Wall St. Journal. It consists of about 21,000 training examples (call this set lt, short for large-training) and about 3000 test examples. The format of this data is slightly different from that of 0x6x and 7x9x: each sample provides only the four groups mentioned (VNPN), and for each group only the head-word.
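To make the data format and the reduced tagging scheme concrete, here is a minimal sketch in Python. The whitespace-separated line layout ("verb-head noun-head preposition noun-head label") and both function names are illustrative assumptions about such head-word-only data, not a description of the actual distribution files; the two-category scheme it implements is the one described in the next paragraph.

import re

# Hypothetical reader for head-word-only samples.  The assumed layout,
# one sample per line as "verb noun1 prep noun2 label", is illustrative.
def read_samples(path):
    samples = []
    with open(path) as f:
        for line in f:
            v, n1, p, n2, label = line.split()
            samples.append((v, n1, p, n2, label))
    return samples

# Reduced two-category part-of-speech scheme: a word counts as a "number"
# if it consists solely of digits, commas, and periods.
_NUMBER = re.compile(r'^[0-9.,]+$')

def reduced_pos(word):
    return 'number' if _NUMBER.match(word) else 'non-number'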
Because only head-words are available, our part-of-speech tagger could not run on this data, so we temporarily adjusted our system to consider only two part-of-speech categories: numbers, for words consisting solely of commas, periods, and digits; and non-numbers, for all other words.</Paragraph> <Paragraph position="2"> The training used an improvement threshold of 3.</Paragraph> <Paragraph position="3"> With these rules, the percent correct on the test set went from 59.0% (guessing the adjacent attachment in every case) to 83.1%, an error reduction of 58.9%. This result is just a little behind the current best result of 84.5% (Collins and Brooks, 1995); using a binomial distribution test, the difference is statistically significant at the 2% level. (Collins and Brooks, 1995) also reports a result of 81.9% for a words-only version of the (Brill and Resnik, 1994) system that we extend (the difference from our result is statistically significant at the 4% level). So our system is competitive on a known task.</Paragraph> </Section> <Section position="3" start_page="1439" end_page="1440" type="sub_section"> <SectionTitle> 6.3 The main experiments </SectionTitle> <Paragraph position="0"> We made four training and test run pairs. [Table of the four runs not preserved in this version.] The test set was always 7x9x, which starts at 67.7% correct. The results report the number of RULES the training run produces, as well as the percent CORrect and Error Reduction on the test. One source of variation is whether ALL or only the V-A Attachment Points are used. The other source is the TRaining SET used. The set lt- is the set lt (Section 6.2) with the entries from Penn Treebank Wall St. Journal files 270 to 299 (the files used to form the test set) removed. About 600 entries were removed. Several adjustments were made when using lt-: the part-of-speech treatment of Section 6.2 was used; because lt- gives only two possible attachment points (the adjacent noun and the nearest verb), only V-A attachment points were used; and because lt- is much slower to train on than 0x6x, training used an improvement threshold of 3. For 0x6x, a threshold of 2 was used.</Paragraph> <Paragraph position="1"> Set lt2 is the data used in (Merlo et al., 1997) and has about 26,000 entries. The set lt2- is the set lt2 with the entries from Penn Treebank files 270-299 removed. Again, about 600 entries were removed. Generally, lt2 has no information on the word(s) to the right of the preposition being attached, so this field was ignored in both training and test. In addition, for reasons similar to those given for lt-, the adjustments made when using lt- were also made when using lt2-.</Paragraph> <Paragraph position="2"> If one removes the lt2- results, then all the CORrect results are statistically significantly different from the starting 67.7% score and from each other at a 1% level or better. In addition, the lt2- and lt- results are not statistically significantly different (even at the 20% level). lt2- has more data points and more categories of data than lt-, but the lt- run has the best overall score. Besides pure chance, two other possible reasons for this somewhat surprising result are that the lt2- entries have no information on the word(s) to the right of the preposition being attached (lt- entries do), and that each dataset contains entries not found in the other.
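Two quantities recur in these results and are easy to pin down. Error reduction is the fraction of the baseline's errors that the learned rules eliminate. The paper does not spell out its binomial distribution test; one standard reading, sketched below, is a two-sided sign test over the test items on which exactly one of the two compared systems is correct. The scipy dependency and both function names are our assumptions, not part of the original system.

from scipy.stats import binomtest

def error_reduction(baseline_pct, new_pct):
    # Fraction of the baseline's errors eliminated, in percent.
    return 100.0 * (new_pct - baseline_pct) / (100.0 - baseline_pct)

# 59.0% (always choose the adjacent attachment) -> 83.1% with learned rules:
print(error_reduction(59.0, 83.1))  # ~58.8, matching the reported 58.9%
                                    # up to rounding of the two accuracies

def sign_test_pvalue(only_a_correct, only_b_correct):
    # Two-sided sign test: among items where exactly one of the two
    # systems is correct, a real difference should push the split away
    # from the 50/50 expected under the null hypothesis.
    n = only_a_correct + only_b_correct
    return binomtest(only_a_correct, n, p=0.5).pvalue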
Looking at the lt- run's remaining errors, 43% were in category v̄npn, 21% in vnpn, 16% in xn̄px, 13% in xxsx, 4% in v̄npn̄, and 3% in vnpn̄.</Paragraph> </Section> <Section position="4" start_page="1440" end_page="1441" type="sub_section"> <SectionTitle> 6.4 Afterwards </SectionTitle> <Paragraph position="0"> The lt- run has the best overall score. However, the lt- run does not produce the best score in every category. Below are the scores (number correct) for each run that has a best score (bold face) in some category: [Table of per-category scores not preserved in this version.] The location of most of the best subscores is not surprising: of the training sets, lt- has the most vnpn entries,6 lt2- has the most v̄np-type entries, and 0x6x has the most xxsx entries. The best vnpn̄ and xn̄px subscore locations are somewhat surprising. The best vnpn̄ subscore is statistically significantly better than the lt2- vnpn̄ subscore at the 5% level. A possible explanation is that the vnpn̄ and vnpn categories are closely related. The best xn̄px subscore is not statistically significantly better than the lt- xn̄px subscore, even at the 25% level. Besides pure chance, a possible explanation is that the xn̄px category is related to the four v̄np-type categories (where lt2- has the most entries).</Paragraph> <Paragraph position="1"> The fact that the subscores for the various categories differ according to training regimen suggests a system architecture that would exploit this. In particular, we might apply a different rule set for each attachment category, with each rule set trained in the optimal configuration for that category. We would thus expect the overall accuracy of the attachment procedure to improve. To estimate the magnitude of this improvement, we calculated a post-hoc composite score on our test set by combining the best subscore for each of the 6 categories. When viewed as trying to improve upon the lt- subscores, the new v̄npn̄ subscore is statistically significantly better (4% level) and the new xxsx subscore is mildly statistically significantly better (20% level). The new v̄npn and xn̄px subscores are not statistically significantly better, even at the 25% level. This combination yields a post-hoc improved score of 76.5%. This is of course only a post-hoc estimate, and we would need to run a new independent test to verify the actual validity of this effect. Also, this estimate is only mildly statistically significantly better (13% level) than the existing 75.4% score.</Paragraph> <Paragraph position="2"> 6 For vnpn, the lt- score is statistically significantly better than the lt2- score at the 2% level.</Paragraph>
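To make the composite-score arithmetic concrete, here is a minimal sketch. All per-category counts below are invented placeholders, since the actual per-run table did not survive in this version; only the procedure itself, taking the best number-correct per category across runs and dividing by the test-set size, follows the text.

# Invented illustration of the post-hoc composite score; these counts are
# NOT the paper's data.  Only the combination procedure matches the text.
scores = {
    # category: {run: number correct on that category}
    "v̄npn":  {"lt-": 50,  "lt2-": 54,  "0x6x": 48},
    "vnpn":  {"lt-": 120, "lt2-": 112, "0x6x": 105},
    "xn̄px":  {"lt-": 30,  "lt2-": 32,  "0x6x": 29},
    "xxsx":  {"lt-": 22,  "lt2-": 20,  "0x6x": 26},
    "v̄npn̄": {"lt-": 10,  "lt2-": 12,  "0x6x": 9},
    "vnpn̄":  {"lt-": 8,   "lt2-": 6,   "0x6x": 7},
}
test_set_size = 320  # invented

best_per_category = {cat: max(runs.values()) for cat, runs in scores.items()}
composite = 100.0 * sum(best_per_category.values()) / test_set_size
print(f"post-hoc composite score: {composite:.1f}%")

</Section> </Section> </Paper>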