<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2112">
  <Title>How bad is the problem of PP-attachment? A comparison of English, German and Swedish</Title>
  <Section position="5" start_page="81" end_page="83" type="metho">
    <SectionTitle>
3 Querying Treebanks with TIGER-Search
</SectionTitle>
    <Paragraph position="0"> We therefore decided to check the attachment tendencies of PPs in various treebanks for the three languages in question with the same tool and with queries that are as uniform as possible.</Paragraph>
    <Paragraph position="1"> For English we used the WSJ section of the Penn Treebank; for German we used our own ComputerZeitung treebank (3000 sentences), the NEGRA treebank (10,000 sentences) and the recently released version of the TIGER treebank (50,000 sentences). For Swedish we used the SynTag treebank mentioned above and one section of the Talbanken treebank (6100 sentences). All these treebanks consist of constituent structure trees, and they are in representation formats which allow them to be loaded into TIGER-Search. This enables us to query them all in a similar manner and to get a fairer comparison of the attachment tendencies.</Paragraph>
    <Paragraph position="2"> TIGER-Search is a powerful treebank query tool developed at the University of Stuttgart (König and Lezius, 2002). Its query language allows for feature-value descriptions of syntax graphs. It is similar in expressiveness to tgrep (Rohde, 2005) but it comes with graphical output and highlighting of the syntax trees plus some nice statistics functions.</Paragraph>
    <Paragraph position="3"> Our experiments for determining attachment tendencies proceed along the following lines. For each treebank we first query for all sequences of a noun immediately followed by a PP (henceforth noun+PP sequences). With the dot as the precedence operator, we use the query: [pos="NN"] . [cat="PP"] This query will match twice in the tree in figure 1. It gives us the frequency of all ambiguously located PPs. We disregard the fact that in certain clause positions a PP in such a sequence cannot be verb-attached and is thus not ambiguous. For example, an English noun+PP sequence in subject position is not ambiguous with respect to PP attachment since the PP cannot attach to the verb. Similar restrictions apply to German and Swedish. In order to determine how many of these sequences are annotated as noun attachments, we query for noun phrases that contain both a noun and an immediately following PP. This query looks like: #np_mum:[cat="NP"] &gt; #np_child:[cat="NP"] &amp; #np_mum &gt; #pp:[cat="PP"] &amp; #np_child &gt;* #noun:[pos="NN"] &amp; #noun . #pp All strings starting with # are variables, the &gt; symbol is the (immediate) dominance operator, and &gt;* denotes dominance over any number of levels. So, this query says: Search for an NP (and call it np_mum) that immediately dominates another NP (np_child) AND that immediately dominates a PP, AND the np_child dominates a noun which is immediately followed by the PP.</Paragraph>
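    <Paragraph> To make the logic of the two queries concrete, the following Python sketch re-implements them on a toy tree. It is only an illustration under our own assumptions: the Node class, the helper functions and the example sentence are hypothetical and are not part of TIGER-Search, and precedence is approximated by comparing the positions of the leftmost tokens (left corners) of the nodes.

```python
class Node:
    """A tree node: terminals carry a word, phrases carry children."""

    def __init__(self, label, children=None, word=None):
        self.label = label              # PoS tag (terminal) or category (phrase)
        self.children = children or []
        self.word = word                # None for phrase nodes
        self.parent = None
        self.left = -1                  # position of the leftmost token covered


def leaf(pos, word):
    return Node(pos, word=word)


def phrase(cat, *children):
    return Node(cat, children=list(children))


def index_tree(root):
    """Set parent links and left-corner positions; return all nodes."""
    nodes = []
    next_position = 0

    def visit(node, parent):
        nonlocal next_position
        node.parent = parent
        nodes.append(node)
        if node.word is not None:
            node.left = next_position
            next_position += 1
        else:
            for child in node.children:
                visit(child, node)
            node.left = node.children[0].left

    visit(root, None)
    return nodes


def noun_pp_sequences(nodes):
    """Analogue of [pos="NN"] . [cat="PP"]: an NN directly followed by a PP."""
    nouns = [n for n in nodes if n.word is not None and n.label == "NN"]
    pps = [n for n in nodes if n.word is None and n.label == "PP"]
    return [(n, p) for n in nouns for p in pps if p.left == n.left + 1]


def dominates(ancestor, node):
    """True if ancestor is reachable from node via one or more parent links."""
    while node.parent is not None:
        node = node.parent
        if node is ancestor:
            return True
    return False


def is_noun_attached(noun, pp):
    """Analogue of the second query: the PP's mother is an NP that also has
    an NP child dominating the noun."""
    mum = pp.parent
    if mum is None or mum.label != "NP":
        return False
    return any(
        child.word is None and child.label == "NP" and dominates(child, noun)
        for child in mum.children
    )


# Hypothetical example: "He saw the man with a telescope in the park"
tree = phrase(
    "S",
    phrase("NP", leaf("PRP", "He")),
    phrase(
        "VP",
        leaf("VBD", "saw"),
        phrase(
            "NP",
            phrase("NP", leaf("DT", "the"), leaf("NN", "man")),
            phrase("PP", leaf("IN", "with"),
                   phrase("NP", leaf("DT", "a"), leaf("NN", "telescope"))),
        ),
        phrase("PP", leaf("IN", "in"),
               phrase("NP", leaf("DT", "the"), leaf("NN", "park"))),
    ),
)

nodes = index_tree(tree)
pairs = noun_pp_sequences(nodes)
attached = [(n, p) for n, p in pairs if is_noun_attached(n, p)]
print(len(pairs), len(attached))  # 2 1
```

    In this toy tree the first check finds two noun+PP sequences (man + with-PP, telescope + in-PP), and only the first of them has the Penn-style noun attachment structure.</Paragraph>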
    <Paragraph position="4"> This query presupposes that a PP which is attached to a noun is actually annotated with the structure (NP (NP (... N)) (PP)) which is true for the Penn Treebank (compare to the tree in figure 1). But the German treebanks represent this type of attachment rather as (NP (... N) (PP)) which means that the query needs to be adapted accordingly. Such queries give us the frequency of all noun+PP sequences and the frequency of all such sequences with noun attachments. These frequencies allow us to calculate the noun attachment rate (NAR) in our treebanks.</Paragraph>
    <Paragraph position="6"> We assume that all PPs in noun+PP sequences which are not attached to a noun are attached to a verb. This means we ignore the very few cases of such PPs that might be attached to adjectives (as for instance the second PP in "due for revision in 1990").</Paragraph>
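    <Paragraph> Written out, the rate obtained from these two frequencies is presumably the following ratio. The notation freq(...) is ours and only sketches the calculation described in the text; the verb attachment rate (VAR, also our label) follows from the assumption that every PP not attached to a noun is attached to a verb.

```latex
\mathrm{NAR} = \frac{\mathit{freq}(\text{noun+PP},\ \text{noun attachment})}{\mathit{freq}(\text{noun+PP})},
\qquad
\mathrm{VAR} = 1 - \mathrm{NAR}
```
    </Paragraph>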
    <Paragraph position="7"> Different annotation schemes require modifications to these basic queries, and different noun classes (regular nouns, proper names, deverbal nouns etc.) allow for a more detailed investigation. We now present the results for each language in turn.</Paragraph>
    <Section position="1" start_page="82" end_page="83" type="sub_section">
      <SectionTitle>
3.1 Results for English
</SectionTitle>
      <Paragraph position="0"> We used sections 0 to 12 of the WSJ part of the Penn Treebank (Marcus et al., 1993) with a total of 24,618 sentences for our experiments. Our start query reveals that an ambiguously located PP (i.e. a noun+PP sequence) occurs in 13,191 (54%) of these sentences, and it occurs a total of 20,858 times (a rate of 0.84 occurrences per sentence with respect to all sentences in the treebank).</Paragraph>
      <Paragraph position="2"> Searching for noun attachments with the second query described in section 3 we learn that 15,273 noun+PP sequences are annotated as noun attachments. And we catch another 547 noun attachments if we query for noun phrases that contain two PPs in sequence. In these cases the second PP is attached to a noun, though not to the noun immediately preceding it (as for example in the tree in figure 1). With some similar queries we located another 110 cases of noun attachments (most of which are probably annotation errors if the annotation guidelines are applied strictly). This means that we found a total of 15,930 cases of noun attachment which corresponds to a noun attachment rate of 76.4% (by comparison to the 20,858 occurrences).</Paragraph>
      <Paragraph position="3"> This is a surprisingly high number. Neither Hindle and Rooth (1993) with 67% nor Ratnaparkhi et al. (1994) with 59% noun attachment were anywhere close to this figure. What have we done differently? One aspect is that we only queried for singular nouns (NN) in the Penn Treebank, where plural nouns (NNS) and proper names (NNP and NNPS) have separate PoS tags. Using analogous queries for plural nouns we found that they exhibit a NAR of 71.7%, whereas the queries for proper names (singular and plural names taken together) yield a NAR of 54.5%.</Paragraph>
      <Paragraph position="4"> Another reason for the discrepancy in the NAR between Ratnaparkhi's data and our calculations certainly comes from the fact that we treated all noun+PP sequences as possibly ambiguous whereas they looked only at such sequences within verb phrases. But since we apply the same procedure to both German and Swedish, the comparison is still worthwhile.</Paragraph>
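      <Paragraph> The English figures reported in this section can be re-derived with a few lines of Python. This is only a worked check of the arithmetic; the function and variable names are ours, and the counts are copied from the text above.

```python
def rate(count, total):
    """Return count / total, rejecting an empty denominator."""
    if total == 0:
        raise ValueError("total must not be zero")
    return count / total


# Counts reported for WSJ sections 0 to 12.
sentences = 24618
sentences_with_noun_pp = 13191
noun_pp_sequences = 20858

# Main noun attachment query + two-PP query + further similar queries.
noun_attachments = 15273 + 547 + 110

sentence_share = rate(sentences_with_noun_pp, sentences)
nar = rate(noun_attachments, noun_pp_sequences)

print(noun_attachments)           # 15930
print(f"{sentence_share:.0%}")    # 54%
print(f"{nar:.1%}")               # 76.4%
```
      </Paragraph>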
    </Section>
  </Section>
  <Section position="6" start_page="83" end_page="85" type="metho">
    <SectionTitle>
3.2 Results for German
</SectionTitle>
    <Paragraph position="0"> The three German treebanks which we investigate are all annotated in more or less the same manner, i.e. according to the NEGRA guidelines which were slightly refined for the TIGER project. This enabled us to use the same set of queries for all three of them. Since the German guidelines distinguish between node labels for coordinated phrases (e.g. CNP and CPP) and non-coordinated phrases (e.g. NP and PP), these distinctions needed to be taken into account. Table 1 summarizes the results. Our own ComputerZeitung treebank (CZ) has a much higher occurrence rate of ambiguously located PPs because the sentences were preselected for this phenomenon. The general NEGRA and TIGER treebanks have an occurrence rate that is similar to English (0.8). The NAR varies between 59.1% for the NEGRA treebank and 63.0% for the CZ treebank for regular nouns.</Paragraph>
    <Paragraph position="1"> The German annotation also distinguishes between regular nouns and proper names. The proper names show a much lower noun attachment rate than the regular nouns. The NAR in the CZ treebank is 22%, in the NEGRA treebank it is 20%, and in the TIGER treebank it is only 17%.</Paragraph>
    <Paragraph position="2"> Here we suspect that the difference between the CZ and the other treebanks is based on the different text types. The computer journal CZ contains more person names with affiliation (e.g. Stan Sugarman von der Firma Telemedia) and more company names with location (e.g. Aviso aus Finnland) than a regular newspaper (that was used in the NEGRA and TIGER corpora).</Paragraph>
    <Paragraph position="3"> As mentioned above, our previous experiments in (Volk, 2001) were based on sets of extracted tuples from both the CZ and NEGRA treebanks.</Paragraph>
    <Paragraph position="4"> Our extracted data set from the CZ treebank had a noun attachment rate of 61%, and the one from the NEGRA treebank had a noun attachment rate of 56%.</Paragraph>
    <Paragraph position="5"> So why are our new results based on TIGER-Search queries two to three percentage points higher? The main reason is that our old data sets included proper names (with their low noun attachment rate). But our extraction procedure also involved a number of other idiosyncrasies. In an attempt to harvest as many interesting V-N-P-N tuples as possible from our treebanks we exploited coordinated phrases and pronominal PPs. Some examples: 1. If the PP was preceded by a coordinated noun phrase, we created as many tuples as there were head nouns in the coordination. For example, the phrase "den Austausch und die gemeinsame Nutzung von Daten ... ermöglichen" leads to the tuples (ermöglichen, Austausch, von, Daten) and (ermöglichen, Nutzung, von, Daten), both with the decision 'noun attachment'.</Paragraph>
    <Paragraph position="6"> 2. If the PP was introduced by coordinated prepositions (e.g. Die Argumente für oder gegen den Netzwerkcomputer), we created as many tuples as there were prepositions.</Paragraph>
    <Paragraph position="7"> 3. If the verb group consisted of coordinated verbs (e.g. Infos für Online-Dienste aufbereiten und gestalten), we created as many tuples as there were verbs.</Paragraph>
    <Paragraph position="8"> 4. We regarded pronominal adverbs (darin, dazu, hierüber, etc.) and reciprocal pronouns (miteinander, untereinander, voneinander, etc.) as equivalent to PPs and created tuples when such pronominals appeared immediately after a noun. See (Volk, 2003) for a more detailed discussion of these pronouns.</Paragraph>
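    <Paragraph> The tuple multiplication described in points 1 to 3 can be sketched as follows. This is a hypothetical illustration, not the original extraction code: the function name and its parameters are ours, and combining several coordinations in one sentence via a Cartesian product is our assumption, since the text describes each type of coordination separately.

```python
from itertools import product


def expand_tuples(verbs, head_nouns, prepositions, pp_noun):
    """Return one (verb, noun, preposition, pp_noun) tuple per combination
    of coordinated verbs, head nouns and prepositions."""
    return [
        (verb, noun, prep, pp_noun)
        for verb, noun, prep in product(verbs, head_nouns, prepositions)
    ]


# Example 1 from the list above: a coordinated noun phrase before the PP.
tuples = expand_tuples(
    verbs=["ermöglichen"],
    head_nouns=["Austausch", "Nutzung"],
    prepositions=["von"],
    pp_noun="Daten",
)
for item in tuples:
    print(item)
```

    For the coordinated noun phrase in example 1 this yields the two tuples given in the text, one per head noun.</Paragraph>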
    <Section position="1" start_page="84" end_page="85" type="sub_section">
      <SectionTitle>
3.3 Results for Swedish
</SectionTitle>
      <Paragraph position="0"> Currently there is no large-scale Swedish treebank available. But there are some smaller treebanks from the 80s which have recently been converted to TIGER-XML so that they can also be queried with TIGER-Search.</Paragraph>
      <Paragraph position="1"> SynTag (Järborg, 1986) is a treebank consisting of around 5100 sentences. Its conversion to TIGER-XML is documented in (Hagström, 2004). The treebank focuses on predicate-argument structures and some grammatical functions such as subject, head and adverbials. It is thus different from the constituent structures that we find in the Penn Treebank or the German treebanks. We had to adapt our queries accordingly. Since prepositional phrases are not marked as such, we needed to query for constituents (marked as subject, as adverbial or simply as argument) that start with a preposition. This results in a noun attachment rate of 73% (which is very close to the rate reported by Aasa (2004)). Again this does not include proper names, which have a NAR of 44% in SynTag.</Paragraph>
      <Paragraph position="2"> Let us compare these results to the second Swedish treebank, Talbanken (first described by (Telemann, 1974)). Talbanken was a remarkable achievement in the 80s as it comes with two written language parts (with a total of more than 10,000 sentences from student essays and from newspapers) and two spoken language parts (with another 10,000 trees from interviews and conversations). We concentrated on the 6100 trees from the written part taken from newspaper texts.</Paragraph>
      <Paragraph position="3"> The occurrence rate in Talbanken is 0.76 (4658 noun+PP sequences in 6100 sentences), which is similar to the rates observed for English and German. The occurrence rate in SynTag is higher, at 0.93 (4737 noun+PP sequences in 5114 sentences).</Paragraph>
      <Paragraph position="4"> Talbanken (in its converted form) is annotated with constituent structure labels (NP, PP, VP etc.) and also distinguishes coordinated phrases (CNP, CPP, CVP etc.). The queries for determining the noun attachment rate can thus be similar to the queries over the German treebanks. In addition, Talbanken comes with a rich set of grammatical features as edge labels (e.g. there are different labels for logical subject, dummy subject and other subject).</Paragraph>
      <Paragraph position="5"> We found that the NAR for regular nouns in Talbanken is 60.5%. Talbanken distinguishes between regular nouns, deverbal nouns (often with the derivation suffix -ing: tjänstgöring, utbildning, övning) and deadjectival nouns (mostly with the derivation suffix -het: skyldighet, snabbhet, verksamhet). Not surprisingly, these special nouns have higher NARs than the regular nouns. The deadjectival nouns have a NAR of 69.5%, and the deverbal nouns even have a NAR of 77%. Taken together (i.e. regarding all regular, deadjectival and deverbal nouns) this results in a NAR of 64%.</Paragraph>
      <Paragraph position="6"> Thus, the NARs which we obtain from the two Swedish treebanks (SynTag 73% and Talbanken 64%) differ drastically. It is unclear what this difference depends on. The text genre (newspapers) is the same in both cases. We have noticed that SynTag contains a number of annotation errors, but we don't see that these errors favor noun attachment of PPs in a systematic way. One aspect might be the annotation decision in Talbanken to annotate PPs in light verb constructions.</Paragraph>
      <Paragraph position="7"> These are disturbing cases where the PP is a child node of the sentence node S (which means that it is interpreted as a verb attachment) with the edge label OA (objektadverbial). Nivre (2005, personal communication) pointed out that "OA is what some theoreticians would call a 'prepositional object' or a 'PP complement', i.e. a complement of the verb that semantically is close to an object but which is realized as a prepositional phrase." In our judgement many of those cases should be noun attachments (and thus be a child of an NP).</Paragraph>
      <Paragraph position="8"> For example, we looked at förutsättning för (= prerequisite for) which occurs 14 times, out of which 2 are annotated as OO (Other object) + OA, 11 are annotated as noun attachments, and 1 is erroneously annotated. If we compare that to betydelse för (= significance for) which occurs 16 times out of which 13 are annotated as OO+OA and 3 are annotated as noun attachments, we wonder. First, it is obvious that there are inconsistencies in the treebank. We cannot see any reason why the 2 cases of förutsättning för are annotated differently than the other 11 cases. The verbs do not justify these discrepancies. For example, we have skapa (= to create) with the verb attachments and försvinna (= to disappear) with the noun attachment cases. And we find ge (= to give) on both sides.</Paragraph>
      <Paragraph position="9"> Second, we find it hard to follow the argument that the tendency for betydelse för is stronger for the OO+OA than for förutsättning för. It might be based on the fact that betydelse för is often used with the verb ha (= to have) and thus may count as a light verb construction with a verb group consisting of both ha plus betydelse and the för-PP being interpreted as an object of this complex verb group.</Paragraph>
      <Paragraph position="10"> Third, unfortunately not all cases of PPs annotated as objektadverbial can be regarded as noun attachments. But after having looked at some 70 occurrences of such PPs immediately following a noun, we estimate that around 30% should be noun attachments.</Paragraph>
      <Paragraph position="11"> Concluding our observations on Swedish let us mention that the very few cases of proper names in Talbanken have a NAR of 24%.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="85" end_page="86" type="metho">
    <SectionTitle>
4 Comparison of the results
</SectionTitle>
    <Paragraph position="0"> For English we have computed a NAR of 76.4% based on the Penn Treebank, for German we found NARs between 59% and 63% based on three treebanks, and for Swedish we determined a puzzling difference between 73% NAR in SynTag and 64% NAR in Talbanken. So, why is the tendency of a PP to attach to a preceding noun stronger in English than in Swedish, which in turn shows a stronger tendency than German? For English the answer is very clear. The strong NAR is solely based on the dominance of the preposition of. In our section of the Penn Treebank we found 20,858 noun+PP sequences. Out of these, 8412 (40%!) were PPs with the preposition of. And 99% of all of-PPs are noun attachments. So, the preposition of dominates the English NAR to the point that it should be treated separately. (This is actually what has been done in some research on English PP attachment disambiguation: Ratnaparkhi (1998) first assumes noun attachment for all of-PPs and then applies his disambiguation methods to all remaining PPs.) The Ratnaparkhi data sets (described above in section 2) contain 30% tuples with the preposition of in the test set and 27% of-tuples in the training set. The higher percentage of of-tuples in the test set may partially explain the higher NAR of 59% (vs. 52% in the training set).</Paragraph>
    <Paragraph position="1"> The dominance of of-tuples may also explain the relatively high NAR for proper names in English (54.5%) in comparison to 17% - 22% in German and similar figures for the Swedish Talbanken corpus. The Penn Treebank represents names that contain a PP (e.g. District of Columbia, American Association of Individual Investors) with a regular phrase structure. It turns out that 861 (35%) of the 2449 sequences 'proper name followed by PP' are based on of-PPs. The dominance becomes even more obvious if we consider that the next most frequent prepositions are in (with only 485 occurrences) and for (246 occurrences).</Paragraph>
    <Paragraph position="3"> The dominance of the preposition of is so strong in English that we will get a totally different picture of attachment preferences if we omit of-PPs. The Ratnaparkhi training set without of-tuples is left with a NAR of 35% (!) and the test set has a NAR of 42%. In other words, English has a clear tendency of attaching PPs to verbs if we ignore the dominating of-PPs.</Paragraph>
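    <Paragraph> The weight of of-PPs in our Penn Treebank counts can be checked with a short calculation. The sketch below is ours and rests on stated assumptions: it takes the rounded figure of 99% noun attachment for of-PPs at face value and combines it with the totals from section 3.1, so the estimate for the remaining PPs is only a rough approximation. It is also not comparable to the Ratnaparkhi figures, which cover only sequences within verb phrases.

```python
def share(part, whole):
    """Return part / whole as a fraction."""
    return part / whole


# Counts reported in the text (singular nouns, WSJ sections 0 to 12).
total_sequences = 20858
of_pps = 8412
noun_attachments = 15930
of_noun_rate = 0.99            # rounded figure from the text

# Proper names followed by a PP.
name_sequences = 2449
name_of_pps = 861

of_share = share(of_pps, total_sequences)
name_of_share = share(name_of_pps, name_sequences)

# Assumption: of-PPs contribute of_noun_rate * of_pps noun attachments.
of_attached = of_noun_rate * of_pps
nar_without_of = share(noun_attachments - of_attached,
                       total_sequences - of_pps)

print(f"{of_share:.0%}")        # 40%
print(f"{name_of_share:.0%}")   # 35%
print(f"{nar_without_of:.1%}")
```

    Under these assumptions the noun attachment rate of the remaining PPs is clearly lower than the overall 76.4%, which is in line with the argument that of-PPs drive the English figure.</Paragraph>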
    <Paragraph position="4"> Neither German nor Swedish has such a dominating preposition. There are, of course, prepositions in both languages that exhibit a clear tendency towards noun attachment or verb attachment. But they are not as frequent as the preposition of in English. For example, clear temporal prepositions like German seit (= since) are much more likely as verb attachments.</Paragraph>
    <Paragraph position="5"> Closest to the English of is the Swedish preposition av which has a NAR of 88% in the Talbanken corpus. But its overall frequency does not dominate the Swedish ranking. The most frequent preposition in ambiguous positions is i (frequency: 651 and NAR: 53%) followed by av (frequency: 564; NAR: 88%) and för (frequency: 460; NAR: 42%).</Paragraph>
  </Section>
</Paper>