<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0109">
  <Title>Corpus Based PP Attachment Ambiguity Resolution with a Semantic Dictionary</Title>
  <Section position="4" start_page="67" end_page="67" type="metho">
    <SectionTitle>
SEMANTIC HIERARCHY
</SectionTitle>
    <Paragraph position="0"> The hierarchy we chose for semantic matching is the semantic network of WordNet \[IVHg0\], \[MI93\]. WordNet is a network of meanings connected by a variety of relations. WordNet presently contains approximately 95.000 different word forms organised into 70.100 word meanings, or sets of synonyms. It is divided into four categories (nouns, verbs, adjectives and adverbs), out of which we will be using only verbs and nouns. Nouns are organised as 11 topical hierarchies, where each root represents the most general concept for each topic. Verbs, which tend to be more polysemons and can change their meanings depending on the kind of the object they take, are formed into 15 groups and have altogether 337 possible roots. Verb hierarchies are more shallow than those of nouns, as nouns tend to be more easily organised by the is-a relation, while this is not always possible for verbs.</Paragraph>
  </Section>
  <Section position="5" start_page="67" end_page="68" type="metho">
    <SectionTitle>
SEMANTIC DISTANCE
</SectionTitle>
    <Paragraph position="0"> The traditional method of evaluating semantic distance between two meanings based merely on the length of the path between the nodes representing them, does not work well in WordNet, because the distance also depends on the depth at which the concepts appear in the hierarchy. For example, the root ent/ty is directly followed by the concept of life form, while a sedan, a type of a car, is in terms of path more distant from the concept of express_train, although they are both vehicles and therefore closer concepts. In the ease of verbs, the situation is even more complex, because many verbs do not share the same hierarchy, and therefore there is no direct path between the concepts they represent. There have been numerous attempts to define a measure for semantic distance of WordNet contained concepts \[P, E95\],\[K&amp;E96\], \[SU95\], \[SU96\], etc.</Paragraph>
    <Paragraph position="1"> For our proposes, we have based the semantic distance calculation on a combination of the path distance between two nodes and their depth. Having ascertained the nearest common ancestor in the hierarchy, we calculate the distance as an average of the distance of the two concepts to their nearest common ancestor divided by the depth in the WordNet Hierarchy: where L 1, L 2 are the lengths of paths between the concepts and the nearest common ancestor, and D 1, D 2 are the depths of each concept in the hierarchy (the distance to the root). The more abstract the concepts are (the higher in hierarchy), the bigger the distance. The same concepts have a distance equal to 0; concepts with no common ancestor have a distance equal to 1. Because the verb hierarchy is rather shallow and wide, the distance between many verbal concepts is often</Paragraph>
    <Paragraph position="3"/>
    <Paragraph position="5"> In order to determine the position of a word in the semantic hierarchy, we have to determine the meaning of the word from the context in which it appears. For example, the noun bank can take any of the nine meanings defined in WordNet (financial institution, building, ridge, container, slope, etc.). It is not a trivial problem and has been approached by many researchers \[GCY92\], \[YA93\], \[B&amp;W94\], IRE95\], \[YA95\], \[K&amp;E96\], \[LI96\], etc. We believe that the word sense disambiguation can be accompanied by PP attachment resolution, and that they complement each other. At the same time we would like to note, that PP attachment and sense disambiguation are heavily contextually dependent problems. Therefore, we know in advance that without incorporation of wide context, the full disambiguation will be never reached.</Paragraph>
  </Section>
  <Section position="6" start_page="68" end_page="72" type="metho">
    <SectionTitle>
2. WORD SENSE DISAMBIGUATION
</SectionTitle>
    <Paragraph position="0"> The supervised learning algorithm which we have devised for the PP attachment resolution, and which is discussed in Chapter 3, is based on the induction of a decision tree from a large set of training examples which contain verb-noun-preposition-noun quadruples with disambiguated senses. Unfortunately, at the time of writing this work, a sufficiently big corpus which was both syntactically analysed and semantically tagged did not exist. Therefore, we used the syntactically analysed corpus \[MA93\] and assigned the word senses ourselves. Manual assignment, however, in the case of a huge corpus would be beyond our capacity and therefore we devised an automatic method for an approximate word sense disambiguation based on the following notions: Determining the correct sense of an ambiguous word is highly dependent on the context in which the word occurs. Even without any sentential context, the human brain is capable of disambiguating word senses based on circumstances or experience 3. In natural language  processing, however, we rely mostly on the sentential contexts, i.e. on the surrounding concepts and relations between them. These two problems arise: 1. The surrounding concepts are very often expressed by ambiguous words and a correct sense for these words also has to be determined. 2.</Paragraph>
    <Paragraph position="1"> What relations and how deep an inference is needed for correct disambiguation is unknown.</Paragraph>
    <Paragraph position="2"> We based our word-sense disambiguating mechanism on the premise that two ambiguous words usually tend to stand for their most similar sense if they appear in the same context. In this chapter we present a similarity-based disambiguadon method aimed at disarnbiguating sentences for subsequent PP-attachment resolution. Similar contextual situations (these include information on the PP-attachment) are found in the training corpora and are used for the sense disarnbiguation. If, for example, the verb buy (4 senses) appears in the sentence: The investor bought the company for 5 million dollars arid somewhere else in the training corpus there is a sentence4: The investor purchased the company for 5 million dollars, we can take advantage of this similarity and disambiguate the verb &amp;quot;buy&amp;quot; to its sense that is nearest to the sense of the verb purchase, which is not ambiguous.</Paragraph>
    <Paragraph position="3"> The situation, however, might not bc as simplistic as that, because such obvious matches are extremely rare even in a huge corpus. The first problem is that the sample verb in the training corpus may be also ambiguous. Which sense do we therefore choose? The second problem is that there may, in fact, be no exact match in the training corpus for the context surrounding words and their relations. To overcome both of these problems we have applied the concept of semantic distance discussed above. Every possible sense of all the related context words is evaluated and the best match'chosen 5.</Paragraph>
    <Paragraph position="4"> The proposed unsupervised similarity-based iterafive algorithm for the word sense disambiguafion of the training corpus looks as follows:  1. From the training corpus, extract all the sentences which contain a prepositional phrase with a verb.object-preposition-description quadruple. Mark each quadruple with the corresponding PP attachment (explicitly present in the parsed corpus).</Paragraph>
    <Paragraph position="5"> 2. Set the Similarity Distance Threshold SDT = 0 3. Repeat</Paragraph>
    <Paragraph position="7"> The above algorithm can be described as iterafive clustering, because at first, the nearest quadruples are matched and disambiguated. Then, the similarity distance threshold is raised, and the process repeats itself in the next iteration. If a word is not successfully disambignated, it is assigned its first, i.e. the most frequent sense. The reason for starting with the best matches is that these tend to provide better disambignations. Consider, for example, the following set of quadruples: QI. shut plant for week Q2. buy company for million Q3. acquire business for million Q4. purchase company for million QS. shut facility for inspection Q6. acquire subsidiary for million At first, the algorithm tries to disambiguate quadruple Q1. Starting with the verb, the algorithm searches for other quadruples which have the quadruple distance (see below) smaller than the current similarity distance threshold. For SDTffi0 this means only for quadruples with all the words with semantic distance ffi 0, i.e. synonyms. There are no matches found for Q1 and the algorithm moves to Q2, finding quadruple Q4 as the only one matching such criteria. The verb buy in Q2 is disambiguated to the sense which is nearest to the sense of purchase in Q4, i.e. min(dist(buy,purchase))ffidist(BUY-1,PUROHASE-1)-.-O.O. The noun company cannot be disambiguated, because the matched nearest quadruple Q4 contains the same noun and such a disambignation is not allowed; the description million is monosemous. Same process is called for all the remaining quadruples but further disambigauuon with SDT=0 is not possible (the verb purchase in Q4 has only one sense in WordNet and therefore there is no need for disarnbiguation; the noun company cannot be disambiguated against the same word). The iteration threshold is increased by 0.1 and the algorithm starts again with the first quadruple. No match is found for Q 1 for any word and we have to move to quadruple Q2. Its verb is already disambiguated, therefore the algorithm looks for all the quadruples which have the quadruple distance for nouns below the SDT of 0.1 and which contain similar nouns (see definition of similar below). The quadruple Q3 satisfies this criteria. Distances of all the combinations of senses of the noun company and business are calculated and the nearest match chosen to disambiguate the noun company in Q2: min( dist( company, business) )ffidist( OOMP ANY-1, BUSINESS-1)~0.083 The algorithm then proceeds to the next quadruple, i.e. Q3. There are two quadruples which satisfy the similarity threshold for verbs: Q2 and Q4 (Q6 is not considered, because its verb is identical and therefore not similar). The verb buy in Q2 is already disambiguated and the distance to both Q2 and Q4 is the same, i.e.: dqv(Q3,Q2)ffidqv(Q3,Q4)ffi(0.2Yz+0.083+0)/3ffi0.0485 where the minimum semantic distance between the nearest senses of the verb acquire and buy is: min( dist( acquire, buy)ffidist( AOOUIRE-1,BUY-1)ffiO.25 The verb acquire is disambiguated to the sense nearest to the sense of the verb buy and the algorithm proceeds to the noun business in Q3. The same two quadruples fall below the SDT for nouns, as dqn(Q3,Q2)=dqv(Q3,Q4)ffi(0.25+0.007+0)/3=0.0857 and the noun business of Q3 is disarnbiguated to its sense nearest to the disambiguated sense of company in Q2. 
The verb in Q4 is monosemous, therefore the algorithm finds a set of similar quadruples for nouns (Q2 qualifies in spite of having the same noun (company), because it has already been disambiguated in the previous steps): Q2, Q3 and Q6. The nearest quadruple in this set is Q2 (dqn(Q4,Q2) = 0) and the noun company in Q4 is disambiguated to the sense of the noun in Q2. The quadruple Q5 has no similar quadruples for the current SDT and therefore the next quadruple is Q6. Similarly to the above disambiguations, both its verb and noun are disambiguated. There is no further match for any quadruple and therefore SDT is increased to 0.2 and the algorithm starts with Q1 again (the quadruples Q2, Q3, Q4 and Q6 are already fully disambiguated). No matches are found at SDT = 0.2 for either Q1 or Q5. The algorithm iterates until SDT = 0.6, which enables the disambiguation of the noun plant in Q1 to its sense nearest to the noun facility in Q5: dqn(Q1,Q5) = (0 + 0.375^2 + 1)/2 = 0.57, as min(dist(plant, facility)) = dist(PLANT-1, FACILITY-1) = 0.375. Similarly, the noun facility in Q5 is disambiguated, whereas the descriptions in both Q1 and Q5 cannot be successfully disambiguated, because only a very small set of quadruples was used in this example. In this case, both the descriptions week and inspection would be assigned their most frequent senses, i.e. the first senses in WordNet. With a larger training set, most of the quadruples get disambiguated; however, with increasing SDT the disambiguation quality decreases. The above example shows the importance of iteration, because starting with a lower SDT guarantees better results. If, for example, there were no iteration cycle and the algorithm tried to disambiguate the quadruples in the order in which they appear, the quadruple Q1 would be matched with Q6 and all its words would be disambiguated to inappropriate senses. Such a wrong disambiguation would further force wrong disambiguations in other quadruples and the overall result would be substantially less accurate. Another advantage of this disambiguation mechanism is that proper nouns, which usually refer to people or companies, can also be disambiguated. For example, the unknown name ARBY in the quadruple: acquire ARBY for million is matched with the disambiguated noun in Q6 and is also disambiguated to the COMPANY-1 sense, rather than to PERSON (note that even if Q6 had not been disambiguated, the COMPANY-1 sense of subsidiary is semantically closer to the company sense of ARBY and therefore, although possible, the disambiguation of ARBY to the first sense of subsidiary (PERSON) would be dismissed). The quadruple distance is calculated as Dq = (D(v1,v2) + D(n1,n2) + D(d1,d2)) / P (with the distance of the currently disambiguated word squared), where P is the number of pairs of words in the quadruples which have a common semantic ancestor, i.e. P = 1, 2 or 3 (if there is no such pair, Dq = ∞), and its purpose is to give higher priority to matches on more words. The distance of the currently disambiguated word is squared in order to carry a bigger weight in the distance Dq (the currently disambiguated word must be different from the corresponding word in the matched quadruple unless it has been previously disambiguated). The distance between two words D(w1,w2) is defined as the minimum semantic distance between all the possible senses of the words w1 and w2. Two quadruples are similar if their distance is less than or equal to the current Similarity Distance Threshold, and if the currently disambiguated word is similar to the corresponding word in the matched quadruple. 
Two words are similar if their semantic distance is less than 1.0 and if either their character strings are different or if one of the words has been previously disambiguated.</Paragraph>
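    <Paragraph position="note"> A sketch of the word-level distance and the quadruple distance (the helper names are ours; concept_distance is the sketch from the semantic-distance section, and a distance strictly below 1 is used as a proxy for sharing a common ancestor):

import math
from nltk.corpus import wordnet as wn

def word_distance(w1, w2, pos):
    """D(w1, w2): minimum concept distance over all sense pairs."""
    dists = [concept_distance(a, b)
             for a in wn.synsets(w1, pos=pos)
             for b in wn.synsets(w2, pos=pos)]
    return min(dists) if dists else 1.0

POS = (wn.VERB, wn.NOUN, wn.NOUN)        # verb, object, description

def quadruple_distance(q1, q2, target):
    """Dq: sum of slot-wise word distances, squaring the slot that is
    currently being disambiguated (target), divided by P, the number
    of slots whose words share a common ancestor. Slots without a
    common ancestor contribute their maximal distance of 1 to the sum
    but not to P; if no slot qualifies, Dq is infinite."""
    total, p = 0.0, 0
    for i, pos in enumerate(POS):
        d = word_distance(q1[i], q2[i], pos)
        if d < 1.0:
            p += 1
        total += d * d if i == target else d
    return total / p if p else math.inf

# e.g. quadruple_distance(('shut', 'plant', 'week'),
#                         ('shut', 'facility', 'inspection'), target=1)
# plays the role of dqn(Q1,Q5) in the walkthrough above, although
# WordNet 3.0 values differ from the paper's.
</Paragraph>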
  </Section>
  <Section position="7" start_page="72" end_page="74" type="metho">
    <SectionTitle>
3. PP-ATTACHMENT
</SectionTitle>
    <Paragraph position="0"> For the attachment of the prepositional phrases in unseen sentences, we have modified Quinlan's ID3 algorithm \[Q86\], \[BR91\] which belongs to the the family of inductive learning algorithms.</Paragraph>
    <Paragraph position="1"> Using a huge training set of classified examples, it uncovers the importance of the individual words (attributes) and creates a decision tree that is later used for classification of unseen examples 7. The algorithm uses the concepts of the WordNet hierarchy as attribute values and creates the decision tree in the following way:</Paragraph>
    <Section position="1" start_page="72" end_page="74" type="sub_section">
      <SectionTitle>
3.1 DECISION TREE INDUCTION
</SectionTitle>
      <Paragraph position="0"> Let T be a training set of classified quadruples.</Paragraph>
      <Paragraph position="1">  1. If all the examples in T are of the same PP attachment type (or satisfy the homogeneity termination condition, see below) then the result is a leaf labelled with this type, else 2. Select the most informative attribute A among verb, noun and description among the attributes not selected so far (the attributes can be selected repeatedly after all of them were already used in the current subtree) 3. For each possible value Aw of the selected attribute A construct recursively a subtree S w calling the same algorithm on a set of quadruples for which A belongs to the same WordNet class as A w.</Paragraph>
      <Paragraph position="2"> 4. Return a tree whose root is A and whose subtrees are S w and links between,A and S w are labelled A w.</Paragraph>
      <Paragraph position="3"> Let us briefly explain each step of the algorithm.</Paragraph>
      <Paragraph position="4"> 1. If the examples belong to the same class (set T is homogenous), the tree expansion terminates.  However, such situation is very unlikely due to the non-perfect training data. Therefore, we relaxed the complete homogeneity condition by terminating the expansion when more than 77% of the examples in the set belonged to the same class (the value of 77% was set experimentally as it provided the best classification results). If the set T is still heterogeneous and there are no more attribute values to divide with, the tree is terminated and the leaf is marked by the majority class of the node.</Paragraph>
      <Paragraph position="5"> 2. We consider the most informative attribute to be the one which splits the set T into the most homogenous subsets, i.e. subsets with either a high percentage of samples with adjectival attachments and a low percentage of adverbial ones, or vice-versa. The optimal split would be such that all the subsets would contain only samples of one attachment type. For each attribute A, we split the set into subsets, each associated with attribute value A w and containing samples which were unifiable with value A w (belong to the same WordNet class). Then, we calculate the overall heterogeneity (OH) of all these subsets as a weighted sum of their expected information:</Paragraph>
      <Paragraph position="7"> where p(PPADvIA=Aw) and p(PADjIA=Aw) represent the conditional probabilities of the adverbial and adjectival attachments, respectively. The attribute with the lowest overall heterogeneity is selected for the decision tree expansion. In the following example (Figure 2) we</Paragraph>
      <Paragraph position="9"> Verbs of all the node quadruples belong to the WordNet class V, nouns to the class N and descriptions tq.the class D. We assume, in this example, that the WordNet hierarchy class V has three subclasses (V1, V2, V3), class N has two subclasses (N1, N2) and class D has also two subclasses (191, D2) 8. We use the values V1, V2 and V3, N1 and N2, and D1 and D2 as potential values of the attribute A. Splitting by verb results in three subnodes with an overall heterogeneity 0.56, splitting by noun in two subnodes with OH~0.99 and by description with OH=0.88.</Paragraph>
      <Paragraph position="10"> Therefore, in this case we would choose the verb as an attribute for the tree expansion.</Paragraph>
      <Paragraph position="11"> 3. The attribute is either a verb, noun, or a description noun 9. Its values correspond to the concept identificators (synsets) of WordNet. At the beginning of the tree induction, the top roots of the WordNet hierarchy are taken as attribute values for splitting the set of training examples. At first, all the training examples (separately for each preposition) are split into subsets which correspond to the topmost concepts of WordNet, which contains 11 topical roots for nouns and description nouns, and 337 for verbs (both nouns and verbs have hierarchical structure, although the hierarchy for verbs is shallower and wider). The training examples are grouped into subnodes according to the disambiguated senses of their content words. This means that quadruples with words that belong to the same top classes start at the same node. Each group is further split by the attribute, which provides less heterogeneous splitting (all verb, noun and description attributes are tried for each group and the one by which the current node can be split into the least heterogeneous set of subnodes is selected). Branches that lead to empty subnodes (as a result of not having a matching training example for the given attribute value) are pruned. This process repeats in all the emerging subnodes, using the attribute values which correspond to the WordNet hierarchy, moving from its top to its leaves. When splitting the set of training examples by the attribute A according to its values A w, the emerging subsets contain those quadruples whose attribute A value is lower.in the WordNet hierarchy, i.e. belongs to the same class. If some quadruples had the attribute value equal</Paragraph>
      <Paragraph position="13"> to the values of A, an additional subset is added but its further splitting by the same attribute is prohibited.</Paragraph>
    </Section>
    <Section position="2" start_page="74" end_page="74" type="sub_section">
      <SectionTitle>
3.2 CLASSIFICATION
</SectionTitle>
      <Paragraph position="0"> As soon as the decision tree is induced, classifying an unseen quadruple is a relatively simple procedure. At first, the word senses of the quadruple are disambiguated by the algorithm described in Chapter 2, which is modified to exclude the SDT iteration cycles. Then a path is traversed in the decision tree, starting at its root and ending at a leaf. At each internal node, we follow the branch labelled by the attribute value which is the semantic ancestor of the attribute value of the quadruple (i.e. the branch attribute value is a semantic ancestor 10 of the value of the quadruple attribute). The quadruple is assigned the attachment type associated with the leaf, i.e. adjectival or adverbial. If no match is found for the attribute value of the quadruple at any given node, the quadruple is assigned the majority type of the current node.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="74" end_page="74" type="metho">
    <SectionTitle>
4 TRAINING AND TESTING DATA
</SectionTitle>
    <Paragraph position="0"> The training and testing data, extracted from the Penn Tree Bank \[MA93\], are identical to that used by \[RRR94\], \[C&amp;B95\] for comparison purposes II. The data contained 20801 training and 3097 testing quadruples with 51 prepositions and ensured that there was no implicit training of the method on the test set itself. We have processed the training data in the following way: converted all the verbs into lower cases ~- converted all the words into base forms replaced four digit numbers by 'year' ~- replaced all other numbers by 'definitequantity' ~&amp;quot; replaced nouns ending by -ing and not in WordNet by 'action' eliminated examples with verbs that are not in WordNet ~,&amp;quot; eliminated examples with lower-case nouns that are not in WordNet, except for pronouns, whose senses were substituted by universal pronoun synsets ,~&amp;quot; the upper-case nouns were assigned their lower case equivalent senses plus the senses of 'company' and 'person' ~- the upper case nouns not contained in WordNet were assigned the senses of 'company' and 'person' ~,&amp;quot; disabled all the intransitive senses of verbs assigned all the words (yet ambiguous) the sets of WordNet senses (synsets) The above processing together with the elimination of double occurrences and contradicting examples, reduced the training set to 17577 quadruples, with an average quadruple ambiguity of 86, as of the ambiguity definition in section 1.2.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML