<?xml version="1.0" standalone="yes"?>
<Paper uid="P00-1014">
  <Title>An Unsupervised Approach to Prepositional Phrase Attachment using Contextually Similar Words</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Training Data Extraction
</SectionTitle>
    <Paragraph position="0"> We parsed a 125-million word newspaper corpus with Minipar 3 , a descendent of Principar (Lin, 1994). Minipar outputs dependency trees (Lin, 1999) from the input sentences. For example, the following sentence is decomposed into a dependency tree: Occasionally, the parser generates incorrect dependency trees. For example, in the above sentence, the prepositional phrase headed by with should attach to saw (as opposed to dog ). Two separate sets of training data were then extracted from this corpus. Below, we briefly describe how we obtained these data sets.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Ambiguous Data Set
</SectionTitle>
      <Paragraph position="0"> For each input sentence, Minipar outputs a single dependency tree. For a sentence containing one or more prepositions, we use a program to detect any alternative prepositional attachment sites. For example, in the above sentence, the program would detect that with could attach to saw . Using an iterative algorithm, we initially create a table of co-occurrence frequencies for 3-tuples of the form  ( V , P , N 2 ) and ( N 1 , P , N 2 ). For each k possible attachment site of a preposition P , we increment the frequency of the corresponding 3-tuple by 1/ k . For example, Table 2 shows the initial co-occurrence frequency table for the corresponding 3-tuples of the above sentence.</Paragraph>
      <Paragraph position="1"> 3 Available at www.cs. ualberta .ca/~lindek/minipar.htm.  In the following iterations of the algorithm, we update the frequency table as follows. For each k possible attachment site of a preposition P , we refine its attachment score using the formulas described in Section 4: VScore ( V k , P k , N 2 k ) and NScore ( N 1 k , P k , N 2 k ). For any tuple ( W k , P k , N 2 k ), where W k is either V k or N 2 k , we update its frequency as:</Paragraph>
      <Paragraph position="3"> where Score ( W k , P k , N 2 k ) = VScore ( W k , P k , N 2 k ) if W k = V k ; otherwise Score ( W k , P k , N 2 k ) = NScore ( W k , P k , N 2 k ).</Paragraph>
      <Paragraph position="4"> Suppose that after the initial frequency table is set NScore ( man, in, park ) = 1.23, VScore ( saw, with, telescope ) = 3.65, and NScore ( dog, with, telescope ) = 0.35. Then, the updated co-occurrence frequencies for ( man , in , park ) and ( saw , with , telescope ) are: fr ( man , in , park ) = 23.1 23.1 = 1.0 fr ( saw , with, telescope ) = 35.065.3 65.3 + = 0.913 Table 3 shows the updated frequency table after the first iteration of the algorithm. The resulting database contained 8,900,000 triples.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Unambiguous Data Set
</SectionTitle>
      <Paragraph position="0"> As in ( Ratnaparkhi, 1998 ), we constructed a training data set consisting of only unambiguous Table 2 . Initial co-occurrence frequency table entries for A man in the park saw a dog with a telescope .</Paragraph>
      <Paragraph position="1"> V OR N 1 P N 2 F REQUENCY man in park 1.0 saw with telescope 0.5 dog with telescope 0.5 Table 3 . Co-occurrence frequency table entries for A man in the park saw a dog with a telescope after one iteration. V OR N 1 P N 2 F REQUENCY man in park 1.0 saw with telescope 0.913 dog with telescope 0.087 A man in the park saw a dog with a telescope. det det det det</Paragraph>
      <Paragraph position="3"> attachments of the form ( V , P , N 2 ) and ( N 1 , P , N 2). We only extract a 3-tuple from a sentence when our program finds no alternative attachment site for its preposition. Each extracted 3-tuple is assigned a frequency count of 1. For example, in the previous sentence, ( man , in , park ) is extracted since it contains only one attachment site; ( dog , with , telescope ) is not extracted since with has an alternative attachment site. The resulting database contained 4,400,000 triples.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Classification Model
</SectionTitle>
    <Paragraph position="0"> Roth (1998) presented a unified framework for natural language disambiguation tasks.</Paragraph>
    <Paragraph position="1"> Essentially, several language learning algorithms (e.g. naive Bayes estimation, back-off estimation, transformation-based learning) were successfully cast as learning linear separators in their feature space. Roth modelled prepositional phrase attachment as linear combinations of features. The features consisted of all 15 possible sub-sequences of the 4-tuple ( V , N 1 , P, N 2 ) shown in Table 4. The asterix ( * ) in features represent wildcards.</Paragraph>
    <Paragraph position="2"> Roth used supervised learning to adjust the weights of the features. In our experiments, we only considered features that contained P since the preposition is the most important lexical item (Collins and Brooks, 1995). Furthermore, we omitted features that included both V and N 1 since their co-occurrence is independent of the attachment decision. The resulting subset of features considered in our system is shown in bold in Table 4 (equivalent to assigning a weight of 0 or 1 to each feature).</Paragraph>
    <Paragraph position="3"> Let  |head , rel , mod  |represent the frequency, obtained from the training data, of the head occurring in the given relationship rel with the modifier . We then assign a score to each feature  as follows: 1. ( * , * , P , * ) = log ( |* , P , *  |/  |* , * , * |) 2. ( V , * , P , N 2 ) = log ( |V , P , N 2  |/  |* , * , * |) 3. ( * , N 1 , P , N 2 ) = log ( |N 1 , P , N 2  |/  |* , * , * |) 4. ( V , * , P , * ) = log ( |V , P , *  |/  |V , * , * |) 5. ( * , N 1 , P , * ) = log ( |N 1 , P , *  |/  |N 1 , * , * |) 6. ( * , * , P , N 2 ) = log ( |* , P , N 2  |/  |* , * , N 2 |) 1, 2, and 3 are the prior probabilities of P , V P  N 2, and N 1 P N 2 respectively. 4, 5, and 6 represent conditional probabilities P( V , P  |V ), P( N 1 , P  |N 1 ), and P( P N 2  |N 2 ) respectively. We estimate the adverbial and adjectival attachment scores, VScore ( V , P , N 2) and NScore ( N 1 , P , N 2), as a linear combination of these features: VScore ( V , P , N 2 ) = ( * , * , P , * ) + ( V , * , P , N 2 ) +</Paragraph>
    <Paragraph position="5"> For example, the attachment scores for ( eat , salad , with , fork ) are VScore ( eat , with , fork ) = -3.47 and NScore ( salad , with , fork ) = -4.77. The model correctly assigns a higher score to the adverbial attachment.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Contextually Similar Words
</SectionTitle>
    <Paragraph position="0"> The contextually similar words of a word w are words similar to the intended meaning of w in its context. Below, we describe an algorithm for constructing contextually similar words and we present a method for approximating the attachment scores using these words.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Algorithm
</SectionTitle>
      <Paragraph position="0"> For our purposes, a context of w is simply a dependency relationship involving w . For example, a dependency relationship for saw in the example sentence of Section 3 is saw : obj : dog . Figure 2 gives the data flow diagram for our algorithm for constructing the contextually similar words of w . We retrieve from the collocation database the words that occurred in the same dependency relationship as w . We refer to this set of words as the cohort of w for the dependency relationship. Consider the words eat and salad in the context eat salad .</Paragraph>
      <Paragraph position="1"> The cohort of eat consists of verbs that appeared  with object salad in Figure 1 (e.g. add, consume, cover, ... ) and the cohort of salad consists of nouns that appeared as object of eat in Figure 1 (e.g. almond, apple, bean, ...).</Paragraph>
      <Paragraph position="2"> Intersecting the set of similar words and the cohort then forms the set of contextually similar words of w. For example, Table 5 shows the contextually similar words of eat and salad in the context eat salad and the contextually similar words of fork in the contexts eat with fork and salad with fork . The words in the first row are retrieved by intersecting the similar words of eat in Table 1 with the cohort of eat while the second row represents the intersection of the similar words of salad in Table 1 and the cohort of salad . The third and fourth rows are determined in a similar manner. In the nonsensical context salad with fork (in row 4), no contextually similar words are found.</Paragraph>
      <Paragraph position="3"> While previous word sense disambiguation algorithms rely on a lexicon to provide sense inventories of words, the contextually similar words provide a way of distinguishing between different senses of words without committing to any particular sense inventory.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Attachment Approximation
</SectionTitle>
      <Paragraph position="0"> Often, sparse data reduces our confidence in the attachment scores of Section 4. Using contextually similar words, we can approximate these scores. Given the tuple ( V , N 1 , P , N 2 ), adverbial attachments are approximated as follows. We first construct a list CS V containing the contextually similar words of V in context V : obj : N 1 and a list CS N 2 V containing the contextually similar words of N 2 in context V : P : N 2 (i.e. assuming adverbial attachment). For each verb v in CS V , we compute VScore ( v , P , N 2 ) and set S V as the average of the largest k of these scores. Similarly, for each noun n in CS N 2 V , we compute VScore ( V , P , n ) and set S N 2 V as the average of the largest k of these scores. Then, the approximated adverbial attachment score, Vscore' , is: VScore' ( V , P , N 2 ) = max ( S V , S N 2 V ) We approximate the adjectival attachment score in a similar way. First, we construct a list CS N 1 containing the contextually similar words of N 1 in context V : obj : N 1 and a list CS N 2 N 1 containing the contextually similar words of N 2 in context N 1 : P : N 2 (i.e. assuming adjectival attachment). Now, we compute S N 1 as the average of the largest k of NScore ( n , P , N 2 ) for each noun n in CS N 1 and S N 2 N 1 as the average of the largest k of NScore ( N 1 , P , n ) for each noun n in CS N 2 N 1. Then, the approximated adjectival attachment score, NScore' , is: NScore' ( N 1 , P , N 2 ) = max ( S N 1 , S N 2 N 1 ) For example, suppose we wish to approximate the attachment score for the 4-tuple ( eat , salad , with , fork ). First, we retrieve the contextually similar words of eat and salad in context eat salad , and the contextually similar words of fork in contexts eat with fork and salad with fork as shown in Table 5 . Let k = 2. Table 6 shows the  harvest, love, sprinkle, Toss, ...</Paragraph>
      <Paragraph position="1"> SALAD eat salad soup, sandwich, pasta, dish, cheese, vegetable, bread, meat, cake, bean, ...</Paragraph>
      <Paragraph position="2"> FORK eat with fork spoon, knife, finger FORK salad with fork --top k = 2 scores are shown in these tables. We have: VScore' ( eat , with , fork ) = max ( S V , S N 2 V ) = -2.92 NScore' ( salad , with , fork ) = max ( S N 1 , S N 2 N 1 ) = -4.87 Hence, the approximation correctly prefers the adverbial attachment to the adjectival attachment.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Attachment Algorithm
</SectionTitle>
    <Paragraph position="0"> Figure 3 describes the prepositional phrase attachment algorithm. As in previous approaches, examples with P = of are always classified as adjectival attachments.</Paragraph>
    <Paragraph position="1"> Suppose we wish to approximate the attachment score for the 4-tuple ( eat , salad , with , fork ). From the previous section, Step 1 returns average V = -2.92 and average N 1 = -4.87. From Section 4, Step 2 gives a V = -3.47 and a N 1 = -4.77. In our training data, f V = 2.97 and f N 1 = 0, thus Step 3 gives f = 0.914. In Step 4, we  Given the 4-tuple ( eat , salad , with , croutons ), the algorithm returns S ( V ) = -4.31 and S ( N 1 ) = -3.88. Hence, the algorithm correctly attaches the prepositional phrase to the noun salad .</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Experimental Results
</SectionTitle>
    <Paragraph position="0"> In this section, we describe our test data and the baseline for our experiments. Finally, we present our results.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.1 Test Data
</SectionTitle>
      <Paragraph position="0"> The test data consists of 3097 examples derived from the manually annotated attachments in the Penn Treebank Wall Street Journal data (Ratnaparkhi et al., 1994) 4 . Each line in the test data consists of a 4-tuple and a target classification: V N 1 P N 2 target .</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Available at ftp.cis.upenn.edu/pub/adwait/PPattachData.
</SectionTitle>
    <Paragraph position="0"> The data set contains several erroneous tuples and attachments. For instance, 133 examples contain the word the as N 1 or N 2 . There are also improbable attachments such as ( sing , birthday , to , you ) with the target attachment birthday .</Paragraph>
    <Paragraph position="1"> Table 6 . Calculation of S V and S N 2 V for ( eat , salad , with , fork ).</Paragraph>
  </Section>
  <Section position="10" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4- TUPLE VS CORE
</SectionTitle>
    <Paragraph position="0"> (mix, salad, with, fork) -2.60 (sprinkle, salad, with, fork) -3.24 S V -2.92 (eat, salad, with, spoon) -3.06 (eat, salad, with, finger) -3.50 S N 2 V -3.28 Table 7 . Calculation of S N 1 and S N 2 N 1 for (eat, salad, with, fork).</Paragraph>
    <Paragraph position="1"> 4- TUPLE NS CORE (eat, pasta, with, fork) -4.71 (eat, cake, with, fork) -5.02  and the formulas from Section 5.2 compute: average V = VScore' ( V , P , N 2 ) average N 1 = NScore' ( N 1 , P , N 2 ) Step 2 : Compute the adverbial attachment score, a v , and the adjectival attachment score, a n 1 : a V = VScore ( V , P , N 2 ) a N 1 = NScore ( N 1 , P , N 2 ) Step 3 : Retriev e from the training data set the frequency of the 3-tuples ( V , P , N 2 ) and ( N 1 , P , N 2 ) f V and f N 1 , respectively.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.2 Baseline
</SectionTitle>
      <Paragraph position="0"> Choosing the most common attachment site, N 1 , yields an accuracy of 58.96%. However, we achieve 70.39% accuracy by classifying each occurrence of P = of as N 1 , and V otherwise.</Paragraph>
      <Paragraph position="1"> Human accuracy, given the full context of a sentence, is 93.2% and drops to 88.2% when given only tuples of the form ( V , N 1 , P , N 2 ) (Ratnaparkhi et al., 1994). Assuming that human accuracy is the upper bound for automatic methods, we expect our accuracy to be bounded above by 88.2% and below by 70.39%.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>