<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0707">
  <Title>Probabilistic Models for PP-attachment Resolution and NP Analysis</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Attachment Model
</SectionTitle>
    <Paragraph position="0"> Let us denote the $i^{th}$ nucleus in a chain by $N_i$, and the nucleus to which it is attached by $R(N_i)$ (for each chain, we introduce an additional empty nucleus to which the head of the chain is attached). Given a chain of nuclei $C$, we denote by $D_C$ the set of dependency relations covering the chain of nuclei $C = N_1 \ldots N_n$. We are interested in the set $D^*$ such that $P(D^*)$ is maximal. Assuming that the dependencies are built by processing the chain in linear order, we have:</Paragraph>
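As a sketch (not the paper's implementation), the linear-order decomposition above can be illustrated by exhaustively scoring candidate attachments for a short chain; `link_prob` and all names are illustrative placeholders:

```python
from itertools import product

def is_projective(links):
    """Reject crossing dependencies; links are (dependent, head) index pairs."""
    for (d1, h1), (d2, h2) in product(links, links):
        lo1, hi1 = sorted((d1, h1))
        lo2, hi2 = sorted((d2, h2))
        if lo1 < lo2 < hi1 < hi2:
            return False
    return True

def best_attachment(n, link_prob):
    """Enumerate head choices for nuclei 1..n-1 (nucleus 0 is the head of
    the chain), keep only acyclic, non-crossing analyses, and return the
    most probable set of dependency links together with its probability."""
    best, best_p = None, -1.0
    # Attaching each nucleus i to some j < i guarantees acyclicity
    # when the chain is processed in linear order.
    for heads in product(*[range(i) for i in range(1, n)]):
        links = [(i + 1, h) for i, h in enumerate(heads)]
        if not is_projective(links):
            continue
        p = 1.0
        for dep, head in links:
            p *= link_prob(dep, head)
        if p > best_p:
            best, best_p = links, p
    return best, best_p
```

Real parsers avoid this exhaustive enumeration (see the span-based parser in Section 6.2); the brute-force version only makes the probability model explicit.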
    <Paragraph position="2"> where each dependency relation specifies a particular attachment site $N_i$</Paragraph>
    <Paragraph position="4"> such that neither cycles nor crossing dependencies are produced. In order to avoid sparse data problems, we make the simplifying assumption (similar to the one presented in (Eisner, 1996)) that the attachment of nucleus $N_j$ to nucleus $N_i$ depends only on the set of indices of the preceding dependency relations (in order to avoid cycles and crossing dependencies) and on the three nuclei $N_j$, $N_i$ and the closest preceding sibling of $N_j$. Conditioning on this third nucleus allows capturing the fact that the object of a verb may depend on its subject, that the indirect object may depend on the direct object, and other similar indirect dependencies. In order to focus on the probabilities of interest, we use the following simplified notation:</Paragraph>
    <Paragraph position="6"> where $G(D_j)$ represents the graph produced by the dependencies generated so far. If this graph contains cycles or crossing links, the associated probability is 0,</Paragraph>
    <Paragraph position="8"> since the graph $G(D_j)$ provides the index of the nucleus $N_i$ to which $N_j$ is attached. Obviously, most of the above probabilities cannot be directly estimated.</Paragraph>
    <Paragraph position="9"> A number of simplifying assumptions preserving significant conditional dependencies were adopted.</Paragraph>
    <Paragraph position="10"> Assumption 1: except for graphs with cycles and crossing links, for which the associated probability is 0, we assume a uniform distribution on the set of possible graphs.</Paragraph>
    <Paragraph position="11"> A prior probability $P(G(D_j))$ could be used to model certain corpus-specific preferences, such as privileging attachments to the immediately preceding nucleus (in French or English, for example). However, we decided not to make use of this possibility for the moment.</Paragraph>
    <Paragraph position="12"> Assumption 2: the semantic class of a nucleus depends only on the semantic class of its regent.</Paragraph>
    <Paragraph position="13"> This assumption, also used in (Lauer and Dras, 1994), amounts to considering a 1st-order Markov chain on the semantic classes of nuclei, and represents a good trade-off between model accuracy and practical estimation of the probabilities in (3). It leads to:</Paragraph>
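Assumption 2's first-order Markov chain over semantic classes can be illustrated as follows; the class inventory and transition probabilities are invented for the example:

```python
# Transition table regent-class -> dependent-class (invented values).
trans = {
    ("ACTION", "FOOD"): 0.4,        # e.g. eat -> fish
    ("ACTION", "INSTRUMENT"): 0.3,  # e.g. eat -> fork
    ("FOOD", "INSTRUMENT"): 0.05,
}

def class_chain_prob(pairs, table, default=1e-6):
    """Probability of a set of (regent class, dependent class) pairs under
    the first-order Markov assumption: a product of independent transitions;
    unseen transitions fall back to a small default value."""
    p = 1.0
    for regent_cls, dep_cls in pairs:
        p *= table.get((regent_cls, dep_cls), default)
    return p
```

The point of the assumption is precisely this factorization: each dependent's class is scored against its regent's class alone, which keeps the table small enough to estimate from counts.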
    <Paragraph position="15"> Assumption 3: the preposition of a nucleus depends only on its semantic class and on the lexeme and POS category of its regent, thus leading to:</Paragraph>
    <Paragraph position="17"> The sibling nucleus does not provide any information on the generation of the preposition, and is thus not retained. As far as the regent nucleus $N_i$ is concerned, the dependence on the POS category reflects the fact that adjectives are less likely to subcategorize prepositions than verbs. For arguments, the preposition is controlled by subcategorization frames, which directly depend on the lexeme under consideration, and to a lesser extent on its semantic class (even though this dependence does exist, as for movement verbs, which tend to subcategorize prepositions associated with location and motion). In the absence of subcategorization frame information, the conditioning is placed on the lexeme, which also controls prepositional phrases corresponding to adjuncts. Lastly, the semantic class of the nucleus under consideration may also play a role in the selection of the preposition, and is thus retained in our model.</Paragraph>
    <Paragraph position="18"> Assumption 4: the POS category of a nucleus depends only on its semantic class.</Paragraph>
    <Paragraph position="19"> This assumption reflects the fact that our lexical resources assign semantic classes from disjoint sets for nouns, adjectives and adverbs (except for the TOP class, identical for adjectives and adverbs). This assumption leads to:</Paragraph>
    <Paragraph position="21"> Since any dependence on $N_i$ and on the sibling nucleus is lost, this factor has no impact on the choice of the most probable attachment for $N_j$. However, it is important to note that this assumption relies on the specific semantic resource we have at our disposal, and could be replaced, in other situations, with a 1st-order Markov assumption.</Paragraph>
    <Paragraph position="22"> In French, the language under study, gender and number agreement takes place between the subject and the verb, and between adjectives, or past participles, and the noun they modify/qualify. All, and only, these dependencies are captured in assumption 5.</Paragraph>
    <Paragraph position="24"> Assumption 6: the lexeme of a nucleus depends only on the POS category and the semantic class of the nucleus itself, the lexeme, POS category and semantic class of its regent, and the lexeme and POS category of its closest preceding sibling.</Paragraph>
    <Paragraph position="25"> This assumption allows us to take bigram frequencies for lexemes into account, as well as the dependencies a given lexeme may have on its closest sibling. In fact, it accounts for more than just bigram frequencies since it leads to:</Paragraph>
    <Paragraph position="27"> Assumptions 1 to 6 lead to a set of probabilities which, except for the last one, can be confidently estimated from training data. However, we still need to simplify equation (12) if we want to derive practical estimations of lexical affinities. This is the aim of the following assumption.</Paragraph>
    <Paragraph position="29"> Let us first see with an example what this assumption amounts to. Consider the sequence eat a fish with a fork. Assumption 7 says that given with a fork, eat and a fish are independent, that is, once we know with a fork, the additional observation of a fish doesn't change our expectation of observing eat as well, and vice-versa. This does not entail that with a fork and eat are independent given a fish, nor that a fish and with a fork are independent given eat, this last dependence being the one we try to account for. However, this independence assumption is violated as soon as nucleus</Paragraph>
    <Paragraph position="31"> the nucleus under consideration (with a fork) brings more or different constraints on the distribution of nucleus $N_i$ than its sibling nucleus does, i.e. when with a fork imposes constraints on the possible forms the verb of nucleus $N_i$ (eat in our example) can take, and so does a fish. With assumption 7, we claim that the constraints imposed by with a fork suffice to determine the form of the verb. It is interesting to compare the proposed models to others previously studied. The probabilistic model described in (Lauer and Dras, 1994) addresses the problem of parsing English nominal compounds. A comparison with this model is of interest to us since the sequences we are interested in contain both verbal and nominal phrases in French. A second model relevant to our discussion is the one proposed in (Ratnaparkhi, 1998), addressing the problem of unsupervised learning for PP attachment resolution in VERB NOUN PP sequences. Lastly, the third model, even though used in a supervised setting, addresses the more complex problem of probabilistic dependency parsing on complete sentences 2.</Paragraph>
    <Paragraph position="32"> In the model proposed in (Lauer and Dras, 1994), which we will refer to as model L, the quantity corresponding to the probability of generating a modifier given its head is the same as the quantity defined by our equation (8). The prior probability over analyses in model L is the same as our quantity $P(G(D_j))$. There is no equivalent for the probabilities involved in equations (9) to (11) in model L, since there is no need for them in analysing English nominal compounds. Lastly, our probability to generate $lex_j$ depends only on $lex_i$ in model L (the dependency on the POS category is obvious since only nouns are considered). For the rest, i.e. the way these core quantities are combined to produce a probability for a parse, as well as the decision rule (selection of the most probable parse), there is no difference between the two models. We can thus view our model as a generalization of model L, since we can handle PP attachment and take indirect dependencies into account.</Paragraph>
    <Paragraph position="33"> The model proposed in (Ratnaparkhi, 1998) is similar to a version of our model based solely on equation (9), with no semantic information. This is not surprising, since the goal of this work is to disambiguate between prepositional attachment to the noun or to the verb in V N P sequences. In fact, by adding to the set of prepositions an empty preposition, $\epsilon$, the counts of which are estimated from unsafe configurations (that is, configurations in which the attachment is ambiguous),</Paragraph>
    <Paragraph position="35"> equation (9) captures both the contribution from the random variable used in (Ratnaparkhi, 1998) to denote the presence or absence of any preposition that is unambiguously attached to the noun or the verb in question, and the contribution from the conditional probability that a particular preposition will occur as an unambiguous attachment to the verb or to the noun. We present below the results we obtained with this model.</Paragraph>
    <Paragraph position="36"> From the models proposed in (Eisner, 1996), we retain only the model referred to as model C in this work, since the best results were obtained with it. Model C does not make use of semantic information, nor does it rely on nuclei. So the sequence with a fork, which corresponds to only one nucleus, is treated as a three-word sequence in model C. Apart from this difference, model C directly relies on a combination of equations (10) and (12), namely conditioning by $lex_i$, $pos_i$ and the POS category of the closest preceding sibling both the probability of generating $pos_j$ and the one of generating $lex_j$. Thus, model C uses a reduced version of equation (12) and an extended version of 2Other models, such as (Collins and Brooks, 1995; Merlo et al., 1998) for PP-attachment resolution, or (Collins, 1997; Samuelsson, 2000) for probabilistic parsing, are somewhat related, but their supervised nature makes any direct comparison impossible.</Paragraph>
    <Paragraph position="37"> equation (10). This extension could be used in our case too but, since the input to our processing chain consists of tagged words (unlike the input of the stochastic dependency parser of (Eisner, 1996)), we do not think it necessary.</Paragraph>
    <Paragraph position="38"> Furthermore, by marginalizing the counts for the estimates of our general model, we can derive the probabilities used in other models. We thus view our model as a generalization of the previous ones.</Paragraph>
    <Paragraph position="39"> 5 Estimation of probabilities
We followed a maximum likelihood approach to estimate the different probabilities our model relies on, by directly computing relative frequencies from our training data. We then used Laplace smoothing to smooth the obtained probabilities and deal with unobserved events.</Paragraph>
    <Paragraph position="40"> As mentioned before, we focus on safe configurations to extract counts for probability estimation, which implies that, except for particular configurations involving adverbs, we use only the first nuclei of the chains we arrived at. In most cases, only the first two nuclei of each chain are unambiguous with respect to attachment. However, since equation (12) relies on the closest preceding sibling of a nucleus, we consider the first three nuclei of each chain (but we skip adverbs, since their attachment quite often obeys precise and simple rules), and treat the third nucleus as being ambiguous with respect to the nucleus to which it should be attached, the two possibilities being a priori equiprobable. Thus, from the sequence:
[implantée, VERB] (a)
[département, NOUN, Masc-Sg, PREP = dans] (b)
[Hérault, NOUN, Masc-Sg, PREP = de] (c)
(located in the county of Hérault) (En.)
we increment the counts between nuclei (a) and (b) by 1, then consider that nucleus (c) is attached to nucleus (a) and increment the respective counts (in particular the counts associated with equation 12) by 0.5, and finally consider that nucleus (c) is attached to nucleus (b) (which is wrong in this case) and increment the corresponding counts by 0.5.</Paragraph>
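The fractional count updates described above can be sketched as follows; nucleus labels follow the (a), (b), (c) of the example (accents dropped in the identifiers), and the dictionary layout is illustrative:

```python
from collections import defaultdict

counts = defaultdict(float)  # (dependent, candidate regent) -> count

def collect(chain):
    """First three nuclei of a safe chain: (b) attaches unambiguously to
    (a); the attachment of (c) is ambiguous, so each candidate regent
    receives half a count."""
    a, b, c = chain[:3]
    counts[(b, a)] += 1.0   # (b) unambiguously attached to (a)
    counts[(c, a)] += 0.5   # (c) possibly attached to (a) ...
    counts[(c, b)] += 0.5   # ... or to (b): a priori equiprobable

collect(["implantee", "departement", "Herault"])
```

Splitting the count keeps the total mass per dependent at 1, so ambiguous chains do not outweigh unambiguous ones in the relative-frequency estimates.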
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"> We made two series of experiments: the first to assess whether relying on a subset of our training corpus to derive probability estimates was a good strategy, and the second to assess the different information sources and probabilities our general model is based on. For all our experiments, we used articles from the French newspaper Le Monde, comprising 300,000 sentences, split into training and test data.</Paragraph>
    <Paragraph position="1"> 6.1 Accurate vs. less accurate information
We conducted a first experiment to check whether the accurate information extracted from safe chains was sufficient to estimate probabilities. We focused, for this purpose, on the task of preposition attachment on 200 VERB NP PP sequences, randomly extracted and manually annotated. Furthermore, we restricted ourselves to a reduced version of the model, based on a reduced version of equation (9), so as to have a comparison point with previous models for PP-attachment.</Paragraph>
    <Paragraph position="2"> In addition to the accurate information, we used a windowing approach in order to extract less accurate information and assess the estimates derived from accurate information only. Each time a preposition is encountered with a verb or a noun in a window of k words (k=3 in our experiment), the corresponding counts are incremented.</Paragraph>
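A minimal sketch of this windowing heuristic, assuming the input is a list of (lexeme, POS) pairs; the preposition inventory and tag names are assumptions made for the example:

```python
from collections import Counter

PREPS = {"dans", "de", "pour", "avec", "en"}  # illustrative inventory

def window_counts(tagged, k=3):
    """Each time a preposition occurs within k words after a verb or a
    noun, increment the (lexeme, POS, preposition) count; `tagged` is a
    list of (lexeme, POS) pairs."""
    counts = Counter()
    for i, (lex, pos) in enumerate(tagged):
        if pos not in ("VERB", "NOUN"):
            continue
        for j in range(i + 1, min(i + 1 + k, len(tagged))):
            w, wpos = tagged[j]
            if wpos == "PREP" and w in PREPS:
                counts[(lex, pos, w)] += 1
    return counts
```

Counts gathered this way are noisier than those from safe configurations, since a preposition in the window is credited to every verb or noun nearby regardless of the true attachment.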
    <Paragraph position="3"> The French lexicons we used for tagging, lemmatization and chunking contain subcategorization information for verbs and nouns. This information was encoded by several linguists over several years. Here are, for example, two entries, one for a verb and one for a noun, containing subcategorization information:
quêter - en faveur de, pour (to raise funds - in favor of, for)
constance - dans, en, de (constancy - in, of)
Subcategorization frames contain only part of the information we try to acquire from our training data, since they are designed to capture possible arguments, and not adjuncts, of a verb or a noun. In our approach, as in other ones, we do not make such a distinction and try to learn parameters for attaching prepositional phrases independently of their status, adjuncts or arguments. We used the following decision rule to test a method solely based on subcategorization information:
if the noun subcategorizes the preposition, then attachment to the noun;
else if the verb subcategorizes the preposition, then attachment to the verb;
else attachment according to the default rule
and two default rules, one for attachment to nouns, the other to verbs, in order to determine which of these two alternatives is the best. Furthermore, since subcategorization frames aim at capturing information for specific prepositional phrases (namely the ones that might constitute arguments of a given word), we also evaluated the above decision rule on a subset of our test examples in which either the noun or the verb subcategorizes the preposition. The results we obtained are summarized in table 1.</Paragraph>
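The decision rule above translates directly into code; `noun_frames` and `verb_frames` are hypothetical sets of the prepositions subcategorized by the noun and the verb of a given sequence:

```python
def attach(prep, noun_frames, verb_frames, default="noun"):
    """Subcategorization-based attachment: prefer the noun's frames,
    then the verb's, then fall back to one of the two default rules."""
    if prep in noun_frames:
        return "noun"
    if prep in verb_frames:
        return "verb"
    return default
```

Running it with `default="noun"` and then `default="verb"` corresponds to the two default rules evaluated in table 1.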
    <Paragraph position="4">  We then mixed the accurate and less accurate information with a weighting factor $\lambda$ to estimate the probability we are interested in, and let $\lambda$ vary from 0 to 1 in order to see what the respective impacts of accurate and less accurate information are. Using $n^a_p$ (resp. $n^l_p$) to denote the number of times preposition $p$ occurs with $(lex_i, pos_i)$ in accurate (resp. less accurate) configurations, and using $n$ to denote the number of occurrences of $(lex_i, pos_i)$, the estimation we used is summarized in the following formula:</Paragraph>
    <Paragraph position="6"> where $N(prep)$ is the number of different prepositions introduced by our smoothing procedure. The results obtained are summarized in table 2, where an increment step of 0.2 is used.</Paragraph>
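One way to realize such a mixture, under the assumption that the less accurate counts are discounted by $\lambda$ and that Laplace smoothing adds one count per preposition, is the following sketch (the exact formula in the paper may differ):

```python
def prep_prob(acc, less_acc, n_acc, n_less, lam, n_preps):
    """Mixture of accurate and less accurate counts for a preposition
    given a (lexeme, POS) pair: less accurate counts are weighted by
    lam, and Laplace smoothing adds one count per preposition.
    acc/less_acc: counts of the preposition with the pair in accurate /
    less accurate configurations; n_acc/n_less: the matching totals."""
    num = acc + lam * less_acc + 1.0
    den = n_acc + lam * n_less + n_preps
    return num / den
```

At $\lambda = 0$ only the accurate counts contribute; at $\lambda = 1$ both sources are weighted equally, which is the setting the experiments below show to be worst.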
    <Paragraph position="7">  These results first show that the accurate information is sufficient to derive good estimates. Furthermore, discounting part of the less accurate information seems to be essential, since the worst results are obtained when</Paragraph>
    <Paragraph position="9"> $\lambda = 1$. We can also notice that the best results are well above the baseline obtained by relying only on the information present in our lexicon, thus justifying a machine learning approach to the problem of PP attachment resolution. Lastly, the results we obtained are similar to the ones obtained by different authors on similar tasks, as in (Ratnaparkhi, 1998; Hindle and Rooth, 1993; Brill and Resnik, 1994) for example.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Evaluation of our general model
</SectionTitle>
      <Paragraph position="0"> The model described in Section 3 was tested against 900 manually annotated sequences of nuclei from the newspaper "Le Monde", randomly selected from a portion of the corpus which was held out from training.</Paragraph>
      <Paragraph position="1"> The average length of the sequences was 3.33 nuclei.</Paragraph>
      <Paragraph position="2"> The trivial method of linking every nucleus to the preceding one achieves an accuracy of 72.08%.</Paragraph>
      <Paragraph position="3"> The proposed model was used to assign probability estimates to dependency links between nuclei in our own implementation of the parser described in (Eisner, 1996). The latter is a "bare-bones" dependency parser which operates in a way very similar to the CKY parser for context-free grammars, in which the notion of a subtree is replaced by that of a span.</Paragraph>
      <Paragraph position="4"> A span consists of two or more adjacent nuclei together with the dependency links among them. No cycles, multiple parents, or crossing dependencies are allowed, and each nucleus not on the edge of the span must have a parent (i.e. a regent) within the span itself. The parser proceeds by creating spans of increasing size by combining together smaller spans. Spans are combined using the "covered concatenation" operator, which connects two spans sharing a nucleus and possibly adds a dependency link between the leftmost and the rightmost nucleus, or vice-versa. The probability of a span is the product of the probabilities of the dependency links it contains. A span is pruned from the parse table every time there is another span covering the same nuclei and having the same signature but a higher probability. The signature of a span consists of three things:
- A flag indicating whether the span is minimal or not. A span is minimal if it is not the simple concatenation of other legal spans;
- A flag indicating whether the leftmost nucleus in the span already has a regent within the span;
- A flag indicating whether the rightmost nucleus in the span already has a regent within the span.</Paragraph>
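The signature-based pruning described above can be sketched with a small helper; the `Span` fields are illustrative, not the implementation's actual data structure:

```python
from collections import namedtuple

# A span keeps its end points, the three signature flags, its probability
# and its dependency links (fields chosen for illustration).
Span = namedtuple("Span", "left right minimal left_covered right_covered prob links")

def prune(spans):
    """Among spans covering the same nuclei with the same signature,
    keep only the most probable one."""
    best = {}
    for s in spans:
        key = (s.left, s.right, s.minimal, s.left_covered, s.right_covered)
        if key not in best or s.prob > best[key].prob:
            best[key] = s
    return list(best.values())
```

This pruning is what keeps the chart polynomial in size: spans with identical signatures are interchangeable in any larger analysis, so only the best-scoring one needs to survive.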
      <Paragraph position="5"> Two spans covering the same nuclei and with the same signature are interchangeable in terms of the complete parses they can appear in, and so the one with the lower probability can be dropped, assuming that we are only interested in the analysis having the overall highest probability. For more details concerning the parser, see (Eisner, 1996).</Paragraph>
      <Paragraph position="6"> A number of tests using different variants of the proposed models were done. For some of those tests, we decided to make use of the subcategorization frame information contained in our lexicon, extending the Laplace smoothing of the probability involved in equation (9) to Dirichlet priors over multinomial distributions of the observed data.</Paragraph>
      <Paragraph position="7"> We use three different variables to describe the different experiments we made: se, which is 1 or 0 depending on whether or not we used semantic information; sb, which indicates the equivalent sample size for the priors used in our smoothing procedure for equation (9) (when $sb = 0$, the subcategorization information contained in our lexicon is not used); and kj, which is 1 if the variables associated with the closest sister are used in equation (12), and 0 if not. The results obtained with the different experiments we conducted were evaluated in terms of accuracy of attachments against the manually annotated reference. We did not take into account the attachment of the second nucleus in a chain to the first one (since this attachment is obvious). Results are summarized in the following table:
Exp. name se sb kj Accuracy</Paragraph>
    </Section>
  </Section>
</Paper>