<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1004"> <Title>Combining unsupervised and supervised methods for PP attachment disambiguation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Our training resources </SectionTitle> <Paragraph position="0"> We used the NEGRA treebank (Skut et al., 1998) with 10,000 sentences from German newspapers and extracted 4-tuples (V;N1;P;N2) whenever a PP with the preposition P and the core noun N2 immediately followed a noun N1 in a clause headed by the verb V. For example, the sentence In Deutschland ist das Gerät über die Bad Homburger Ergos zu beziehen.</Paragraph> <Paragraph position="1"> [In Germany the appliance may be ordered from Ergos based in Bad Homburg.] leads to the 4-tuple (beziehen, Gerät, über, Ergos). In this way we obtained 5803 4-tuples with the human judgements about the attachment of the PP (42% verb attachments and 58% noun attachments). We call this the NEGRA test set.</Paragraph> <Paragraph position="2"> As raw corpus for unsupervised training we used four annual volumes (around 5.5 million words) of the "Computer-Zeitung" (CZ), a weekly computer science magazine. This corpus was subjected to a number of processing steps: sentence recognition, proper name recognition for persons, companies and geographical locations (cities and countries), part-of-speech tagging, lemmatization, NP/PP chunking, recognition of local and temporal PPs, and finally clause boundary recognition.</Paragraph> <Paragraph position="3"> 3000 sentences of the CZ corpus, each containing at least one PP in an ambiguous position, were set aside for manual disambiguation. Annotation was done according to the same guidelines as for the NEGRA treebank. 
From these manually annotated sentences we obtained a second test set (which we call the CZ test set) of 4469 4-tuples from the same domain as our raw training corpus.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Results for the unsupervised methods </SectionTitle> <Paragraph position="0"> We explored various possibilities to extract PP disambiguation information from the automatically annotated CZ corpus. We first used it to gather frequency data on the cooccurrence of pairs: nouns + prepositions and verbs + prepositions. The cooccurrence value is the ratio of the bigram frequency count freq(word;preposition) divided by the unigram frequency freq(word).</Paragraph> <Paragraph position="1"> For our purposes word can be the verb V or the reference noun N1. The ratio describes the percentage of the cooccurrence of word + preposition against all occurrences of word. It is thus a straightforward association measure for a word pair. The cooccurrence value can be seen as the attachment probability of the preposition based on maximum likelihood estimates. We compute it as</Paragraph> <Paragraph position="2"> cooc(W;P) = freq(W;P) / freq(W)</Paragraph> <Paragraph position="3"> with W ∈ {V;N1}. The cooccurrence values for verb V and noun N1 correspond to the probability estimates in (Ratnaparkhi, 1998) except that Ratnaparkhi includes a back-off to the uniform distribution for the zero denominator case.</Paragraph> <Paragraph position="4"> We will add special precautions for this case in our disambiguation algorithm. The cooccurrence values are also very similar to the probability estimates in (Hindle and Rooth, 1993).</Paragraph> <Paragraph position="5"> We started by computing the cooccurrence values over word forms for nouns, prepositions, and verbs based on their part-of-speech tags. In order to compute the pair frequencies freq(N1;P), we search the training corpus for all token pairs in which a noun is immediately followed by a preposition. 
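As a concrete illustration of this maximum likelihood estimate, the cooccurrence values can be computed directly from corpus counts. The following sketch uses toy data and function names of our own (not from the paper):

```python
from collections import Counter

def cooccurrence_values(pair_tokens, word_tokens):
    """cooc(W;P) = freq(W;P) / freq(W), the maximum likelihood
    estimate described above."""
    pair_freq = Counter(pair_tokens)   # (word, preposition) bigram counts
    word_freq = Counter(word_tokens)   # word unigram counts
    return {(w, p): n / word_freq[w] for (w, p), n in pair_freq.items()}

# Toy data: "Geraet" occurs 4 times, twice directly followed by "ueber".
word_tokens = ["Geraet", "Geraet", "Geraet", "Geraet", "beziehen"]
pair_tokens = [("Geraet", "ueber"), ("Geraet", "ueber")]
cooc = cooccurrence_values(pair_tokens, word_tokens)
```

Here cooc[("Geraet", "ueber")] is 2/4 = 0.5, the share of occurrences of the noun that cooccur with the preposition.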
The treatment of verb + preposition cooccurrences is different from the treatment of N+P pairs since verb and preposition are seldom adjacent to each other in a German sentence. Rather, they can be far apart from each other, the only restriction being that they cooccur within the same clause. We use the clause boundary information in our training corpus to enforce this restriction.</Paragraph> <Paragraph position="6"> For computing the cooccurrence values we accept only verbs and nouns with an occurrence frequency of more than 10.</Paragraph> <Paragraph position="7"> With the N+P and V+P cooccurrence values for word forms we did a first evaluation over the CZ test set with the following simple disambiguation algorithm.</Paragraph> <Paragraph position="8"> if ( cooc(N1,P) && cooc(V,P) ) then
    if ( cooc(N1,P) >= cooc(V,P) ) then
        noun attachment
    else
        verb attachment
We found that we can only decide 57% of the test cases with an accuracy of 71.4% (93.9% correct noun attachments and 55.0% correct verb attachments). This shows a striking imbalance between the noun attachment accuracy and the verb attachment accuracy. Obviously, the cooccurrence values favor verb attachment. The comparison of the verb cooccurrence value and the noun cooccurrence value too often leads to verb attachment, and only the clear cases of noun attachment remain. This points to an inherent imbalance between the cooccurrence values for verbs and nouns. We will flatten out this imbalance with a noun factor.</Paragraph> <Paragraph position="9"> The noun factor is supposed to strengthen the N+P cooccurrence values and thus to attract more noun attachment decisions. What is the rationale behind the imbalance between noun cooccurrence value and verb cooccurrence value? 
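The simple disambiguation algorithm can be rendered in Python; the function name and the convention of None for an unavailable cooccurrence value are illustrative assumptions:

```python
def decide_simple(cooc_n1_p, cooc_v_p):
    """Compare the two cooccurrence values; returns None when either
    value is unavailable (the undecidable cases)."""
    if cooc_n1_p is None or cooc_v_p is None:
        return None
    return "noun" if cooc_n1_p >= cooc_v_p else "verb"
```

Note that ties go to noun attachment, mirroring the `>=` in the pseudo-code above.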
One influence is certainly the well-known fact that verbs bind their complements more strongly than nouns do.</Paragraph> <Paragraph position="10"> The imbalance between noun cooccurrence values and verb cooccurrence values can be quantified by comparing the overall tendency of nouns to cooccur with a preposition to the overall tendency of verbs to cooccur with a preposition. We compute the overall tendency as the cooccurrence value of all nouns with all prepositions.</Paragraph> <Paragraph position="11"> cooc(all N;all P) = freq(all N;all P) / freq(all N)</Paragraph> <Paragraph position="12"> The computation for the overall verb cooccurrence tendency is analogous. For example, in our training corpus we have found 314,028 N+P pairs (tokens) and 1.72 million noun tokens. This leads to an overall noun cooccurrence value of 0.182. The noun factor (nf) is then the ratio of the overall verb cooccurrence tendency divided by the overall noun cooccurrence tendency: nf = cooc(all V;all P) / cooc(all N;all P) In our training corpus this leads to a noun factor of 0.774 / 0.182 = 4.25. In the disambiguation algorithm we multiply the noun cooccurrence value with this noun factor before comparing the product to the verb cooccurrence value. This move leads to an improvement of the overall attachment accuracy to 81.3% (83.1% correct noun attachments and 76.9% correct verb attachments).</Paragraph> <Paragraph position="13"> We then went on to increase the attachment coverage, the number of decidable cases, by using lemmas, decompounding (i.e. using only the last component of a noun compound), and proper name classes. These measures increased the coverage from 57% to 86% of the test cases.</Paragraph> <Paragraph position="14"> For the remaining test cases we used a threshold comparison if either of the needed cooccurrence values (cooc(N1;P) or cooc(V;P)) has been computed from our training corpus. This raises the coverage to 90%. 
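The noun factor arithmetic can be reproduced from the counts given above; the variable names are ours, and the 0.774 overall verb tendency is taken as given from the text:

```python
# Counts reported for the training corpus.
n_pair_tokens = 314_028          # N+P pair tokens
noun_tokens = 1_720_000          # about 1.72 million noun tokens
cooc_all_n = n_pair_tokens / noun_tokens   # overall noun tendency, ~0.182
cooc_all_v = 0.774                         # overall verb tendency (as given)
nf = cooc_all_v / cooc_all_n               # noun factor, ~4.25

def decide_with_nf(cooc_n1_p, cooc_v_p, noun_factor=4.25):
    """Comparison with the noun cooccurrence value scaled by the noun factor."""
    if cooc_n1_p is None or cooc_v_p is None:
        return None
    return "noun" if cooc_n1_p * noun_factor >= cooc_v_p else "verb"
```

Scaling by nf lets a noun cooccurrence value roughly a quarter the size of the verb value still win, counteracting the verbs' stronger binding tendency.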
While coverage increased, accuracy suffered slightly and at this stage was at 78.3%.</Paragraph> <Paragraph position="15"> This is a surprising result given the fact that we counted all PPs during the training phases.</Paragraph> <Paragraph position="16"> No disambiguation was attempted so far; we counted ambiguous and non-ambiguous PPs in the same manner. We then added this distinction in the training, counting one point for a PP in a non-ambiguous position and only half a point for an ambiguous PP, in this way splitting the PP's contribution between verb and noun attachment. This move increased the accuracy rate by 2% (to 80.5%).</Paragraph> <Paragraph position="17"> So far we have used bigram frequencies over word pairs, (V;P) and (N1;P), to compute the cooccurrence values. Some of the previous research (e.g. (Collins and Brooks, 1995) and (Pantel and Lin, 2000)) has shown that it is advantageous to include the noun from within the PP (called N2) in the calculation. But moving from pair frequencies to triple frequencies will increase the sparse data problem. Therefore we computed the pair frequencies and triple frequencies in parallel and used a cascaded disambiguation algorithm to exploit the triple cooccurrence values and the pair cooccurrence values in sequence.</Paragraph> <Paragraph position="18"> In analogy to the pair cooccurrence value, the triple cooccurrence value is computed as:</Paragraph> <Paragraph position="19"> cooc(W;P;N2) = freq(W;P;N2) / freq(W)</Paragraph> <Paragraph position="20"> with W ∈ {V;N1}. With the triple information (V;P;N2) we were able to identify support verb units (such as in Angriff nehmen, unter Beweis stellen) which are clear cases of verb attachment. We integrated this and the triple cooccurrence values into the disambiguation algorithm in the following manner.</Paragraph> <Paragraph position="22"> The noun factors for the triple comparison and the pair comparison are computed separately. 
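A cascaded triple-then-pair comparison of this kind can be sketched as follows; the per-level noun factors are those reported for this experiment (5.97 for triples, 5.47 for pairs), while the function name and None convention are our own:

```python
def cascaded_decide(triple_n, triple_v, pair_n, pair_v,
                    nf_triple=5.97, nf_pair=5.47):
    """Try the triple cooccurrence values first; back off to the pair
    values when either triple value is missing (sparse data)."""
    if triple_n is not None and triple_v is not None:
        return "noun" if triple_n * nf_triple >= triple_v else "verb"
    if pair_n is not None and pair_v is not None:
        return "noun" if pair_n * nf_pair >= pair_v else "verb"
    return None  # left for threshold comparison or default attachment
```

The cascade uses the sparser but more informative triples whenever they were observed, falling back to the better-covered pairs otherwise.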
The noun factor for pairs is 5.47 and for triples 5.97.</Paragraph> <Paragraph position="23"> The attachment accuracy is improved to 81.67% by the integration of the triple cooccurrence values (see table 1). A split on the decision levels reveals that triple comparison is 4.41% better than pair comparison (see table 2).</Paragraph> <Paragraph position="24"> The 84.36% for triple comparison demonstrates what we can expect if we enlarge our corpus and consequently increase the percentage of test cases that can be disambiguated based on triple cooccurrence values.</Paragraph> <Paragraph position="25"> The accuracy of 81.67% reported in table 1 is computed over the decidable cases. If we force a default decision (noun attachment) on the remaining cases, the overall accuracy is at 79.14%.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Results for the supervised methods </SectionTitle> <Paragraph position="0"> One of the most successful supervised methods is the Back-off model as introduced by Collins and Brooks (1995). This model is based on the idea of using the best information available and backing off to the next best level whenever an information level is missing. For the PP attachment task this means using the attachment tendency for the complete quadruple (V;N1;P;N2) if the quadruple has been seen in the training data. If not, the algorithm backs off to the attachment tendency of triples. All triples that contain the preposition are considered: (V;N1;P); (V;P;N2); (N1;P;N2). The triple information is used if any of the triples has been seen in the training data. Otherwise, the algorithm backs off to pairs, then to the preposition alone, and finally to default attachment. The attachment tendency on each level is computed as the ratio of the frequency of noun attachment to the total frequency of the tuple. Lacking a large treebank we had to use our test sets in turn as training data for the supervised learning. 
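The back-off cascade can be sketched in a few lines; this is a simplified illustration of the Collins and Brooks idea (class name, data layout, and the 0.5 decision boundary are our assumptions, not the paper's exact implementation):

```python
from collections import defaultdict

class BackOff:
    """Minimal sketch of the Collins & Brooks back-off model: decide at
    the most specific tuple level seen in training, else back off."""
    def __init__(self, training):
        # training: list of ((v, n1, p, n2), attachment),
        # attachment in {"noun", "verb"}.
        self.counts = defaultdict(lambda: [0, 0])  # key -> [noun, verb]
        for (v, n1, p, n2), att in training:
            keys = [(v, n1, p, n2),                        # quadruple
                    (v, n1, p), (v, p, n2), (n1, p, n2),   # triples with P
                    (v, p), (n1, p), (p, n2),              # pairs with P
                    (p,)]                                  # preposition alone
            for k in keys:
                self.counts[k][0 if att == "noun" else 1] += 1

    def decide(self, v, n1, p, n2):
        levels = [[(v, n1, p, n2)],
                  [(v, n1, p), (v, p, n2), (n1, p, n2)],
                  [(v, p), (n1, p), (p, n2)],
                  [(p,)]]
        for level in levels:
            noun = sum(self.counts[k][0] for k in level if k in self.counts)
            verb = sum(self.counts[k][1] for k in level if k in self.counts)
            if noun + verb > 0:
                return "noun" if noun / (noun + verb) >= 0.5 else "verb"
        return "noun"  # default attachment
```

At each level all tuples containing the preposition contribute jointly, so even a single seen triple or pair suffices to make a decision before falling through to the bare preposition statistics.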
In a first experiment we used the NEGRA test set as training material and evaluated against the CZ test set. Both test sets were subjected to the following restrictions to reduce the sparse data problem.</Paragraph> <Paragraph position="1"> 1. Verbs, nouns and contracted prepositions were substituted by their base forms. Compound nouns were substituted by the base form of their last component.</Paragraph> <Paragraph position="2"> 2. Proper names were substituted by their name class tag (person, location, company).</Paragraph> <Paragraph position="3"> 3. Pronouns and numbers (in PP complement position) were substituted by a pronoun tag or number tag respectively.</Paragraph> <Paragraph position="4"> This means we used 5803 NEGRA quadruples with their given attachment decisions as training material for the Back-off model. We then applied the Back-off decision algorithm to determine the attachments for the 4469 test cases in the CZ test set. Table 3 shows the results. Due to the default attachment step in the algorithm, the coverage is 100%. The accuracy is close to 74%, with noun attachment accuracy being 10% better than verb attachment accuracy.</Paragraph> <Paragraph position="5"> A closer look reveals that the attachment accuracy for quadruples (100%) and triples (88.7%) is highly reliable (cf. table 4), but only 7.5% of the test cases can be resolved in this way. The overall accuracy is most influenced by the accuracy of the pairs (which account for 68% of all attachments with an accuracy of 75.66%) and by the attachment tendency of the preposition alone, which resolves 24.1% of the test cases but results in a low accuracy of 64.66%.</Paragraph> <Paragraph position="6"> We suspected that the size of the training corpus has a strong impact on the disambiguation quality. Since we did not have access to any larger treebank for German, we used cross validation on the CZ test set in a third experiment. We evenly divided this test corpus into 5 parts of 894 test cases each. 
We added 4 of these parts to the NEGRA test set as training material. The training material thus consists of 5803 quadruples from the NEGRA test set plus 3576 quadruples from the CZ test set. We then evaluated against the remaining part of 894 test cases. We repeated this 5 times with the different parts of the CZ test set and summed up the correct and incorrect attachment decisions.</Paragraph> <Paragraph position="6"> The result from cross validation is 5% better than using the NEGRA corpus alone as training material. This could be due to the enlarged training set or to the domain overlap of the test set with part of the training set. We therefore did another cross validation experiment taking only the 4 parts of the CZ test set as training material. If the improved accuracy were a result of the increased corpus size, we would expect a worse accuracy for this small training set. But in fact, training with this small set resulted in around 77% attachment accuracy. This is better than training on the NEGRA test set alone. This indicates that the domain overlap is the most influential factor.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Intertwining unsupervised and supervised methods </SectionTitle> <Paragraph position="0"> Now that we have seen the advantages of the supervised approaches but lack a sufficiently large treebank for training, we suggest combining the unsupervised and supervised information. With the experiments on cooccurrence values and the Back-off method we have worked out the quality of the various decision levels within these approaches, and we will now order the decision levels according to the reliability of the information sources.</Paragraph> <Paragraph position="1"> We reuse the triple and pair cooccurrence values that we have computed for the experiments with our unsupervised method. That means that we will also reuse the respective noun factors and thresholds. 
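The cross-validation setup described above can be sketched as a simple fold generator (function name and the optional extra-training argument, standing in for the NEGRA quadruples, are our own):

```python
def five_fold_splits(cases, extra_training=()):
    """5-fold cross validation: each fifth of the test set is held out
    once; the remaining four fifths (plus optional extra material,
    e.g. the NEGRA quadruples) form the training set."""
    k = 5
    size = len(cases) // k
    for i in range(k):
        held_out = cases[i * size:(i + 1) * size]
        training = (list(extra_training)
                    + cases[:i * size] + cases[(i + 1) * size:])
        yield training, held_out
```

With the 4469 CZ quadruples this yields folds of 894 held-out cases each, matching the split described above; correct and incorrect decisions are then summed over the five runs.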
In addition, we use the NEGRA test set as supervised training corpus for the Back-off method.</Paragraph> <Paragraph position="2"> The disambiguation algorithm will now work in the following manner. It starts off with the support verb units as level 1, since they are known to be very reliable. As long as no attachment decision is taken, the algorithm proceeds to the next level. Next is the application of supervised quadruples (level 2), followed by supervised triples (level 3). In section 4 we had seen that there is a wide gap between the accuracy of supervised triples and pairs. We fill this gap by accessing unsupervised information, i.e.</Paragraph> <Paragraph position="3"> triple cooccurrence values followed by pair cooccurrence values (levels 4 and 5). Even threshold comparisons based on one cooccurrence value are usually more reliable than supervised pairs and therefore constitute levels 6 and 7. If still no decision has been reached, the algorithm continues with supervised pair probabilities followed by pure preposition probabilities. The left-over cases are handled by default attachment. Below is the complete disambiguation algorithm in pseudo-code:
else default verb attachment
And indeed, this combination of unsupervised and supervised information leads to an improved attachment accuracy. For complete coverage we get an accuracy of 80.98% (cf. table 5). This compares favorably to the accuracy of the cooccurrence experiments plus default attachment (79.14%) reported in section 3 and to the Back-off results (73.98%) reported in table 3. We obviously succeeded in combining the best of both worlds into an improved behavior of the disambiguation algorithm.</Paragraph> <Paragraph position="4"> The decision levels in table 6 reveal that the bulk of the attachment decisions still rests with the cooccurrence values, mostly pair value comparisons (59.9%) and triple value comparisons (18.9%). 
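The level ordering described above can be sketched as one cascaded function. This is a hedged reconstruction from the prose, not the paper's exact pseudo-code: the dictionary inputs, the threshold value, the keying of support verb units as (V;P;N2) triples, and the simplification of the two threshold levels are all our assumptions; the per-level noun factors are those reported in section 3:

```python
def combined_decide(tuple4, support_verb_units, sup_quad, sup_triple,
                    cooc_triple, cooc_pair, sup_pair, sup_prep,
                    nf_triple=5.97, nf_pair=5.47, threshold=0.05):
    """Cascade: support verb units, supervised quadruples/triples,
    unsupervised triple/pair comparison, threshold checks, supervised
    pairs, preposition alone, default."""
    v, n1, p, n2 = tuple4
    if (v, p, n2) in support_verb_units:                 # level 1
        return "verb"
    if tuple4 in sup_quad:                               # level 2
        return sup_quad[tuple4]
    for t in [(v, n1, p), (v, p, n2), (n1, p, n2)]:      # level 3
        if t in sup_triple:
            return sup_triple[t]
    tn, tv = cooc_triple.get((n1, p, n2)), cooc_triple.get((v, p, n2))
    if tn is not None and tv is not None:                # level 4
        return "noun" if tn * nf_triple >= tv else "verb"
    pn, pv = cooc_pair.get((n1, p)), cooc_pair.get((v, p))
    if pn is not None and pv is not None:                # level 5
        return "noun" if pn * nf_pair >= pv else "verb"
    if pn is not None:                                   # level 6: threshold
        return "noun" if pn * nf_pair >= threshold else "verb"
    if pv is not None:                                   # level 7: threshold
        return "verb" if pv >= threshold else "noun"
    for t in [(n1, p), (v, p)]:                          # supervised pairs
        if t in sup_pair:
            return sup_pair[t]
    if p in sup_prep:                                    # preposition alone
        return sup_prep[p]
    return "verb"  # default, per the surviving last line of the pseudo-code
```

Each level only fires when its information source actually covers the test case, so the reliable but sparse supervised tuples are consulted before the broad-coverage unsupervised statistics.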
But the high accuracy of the supervised triples and, equally important, the graceful degradation in stepping from threshold comparison to supervised pairs (resolving 202 test cases with 75.74% accuracy) help to improve the overall attachment accuracy.</Paragraph> <Paragraph position="5"> We also checked whether the combination of unsupervised and supervised approaches leads to an improvement for the NEGRA test set. We exchanged the corpus for the supervised training (now the CZ test set) and evaluated over the NEGRA test set. This results in an accuracy of 71.95% compared to 68.29% for the pure application of the supervised Back-off method. That means the combination leads to an improvement of 3.66% in accuracy.</Paragraph> </Section> </Paper>