<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1032">
  <Title>Mapping Lexical Entries in a Verbs Database to WordNet Senses</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Lexical Resources
</SectionTitle>
    <Paragraph position="0"> We use an existing classification of 4076 English verbs, based initially on English Verb Classes and Alternations (Levin, 1993) and extended through the splitting of some classes into sub-classes and the addition of new classes. The resulting 491 classes (e.g., "Roll Verbs, Group I", which includes drift, drop, glide, roll, swing) are referred to here as Levin+ classes. As verbs may be assigned to multiple Levin+ classes, the actual number of entries in the database is larger: 9611.</Paragraph>
    <Paragraph position="1"> Following the model of (Dorr and Olsen, 1997), each Levin+ class is associated with a thematic grid (henceforth abbreviated θ-grid), which summarizes a verb's syntactic behavior by specifying its predicate argument structure. For example, the Levin+ class "Roll Verbs, Group I" is associated with the θ-grid [th goal], in which a theme and a goal are used (e.g., The ball dropped to the ground).1 Each θ-grid specification corresponds to a Grid class. There are 48 Grid classes, with a one-to-many relationship between Grid and Levin+ classes.</Paragraph>
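To make the class/grid organization concrete, here is a minimal sketch of how a Levin+ class record and its Grid-class grouping might be represented; the dataclass layout is an illustrative assumption, not the project's actual database schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LevinPlusClass:
    """A Levin+ verb class paired with its thematic (theta) grid."""
    name: str              # e.g., "Roll Verbs, Group I"
    theta_grid: tuple      # e.g., ("th", "goal")
    members: frozenset     # verbs assigned to the class

roll_group_1 = LevinPlusClass(
    name="Roll Verbs, Group I",
    theta_grid=("th", "goal"),   # The ball dropped to the ground
    members=frozenset({"drift", "drop", "glide", "roll", "swing"}),
)

# A Grid class groups all Levin+ classes sharing one theta-grid (one-to-many).
grid_classes = {}
for cls in (roll_group_1,):
    grid_classes.setdefault(cls.theta_grid, []).append(cls.name)

print(grid_classes)   # {('th', 'goal'): ['Roll Verbs, Group I']}
```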
    <Paragraph position="2"> WordNet, the lexical resource to which we are mapping entries from the lexical database, groups synonymous word senses into "synsets" and structures the synsets into part-of-speech hierarchies. Our mapping operation uses several other data elements pertaining to WordNet: semantic relationships between synsets, frequency data, and syntactic information.</Paragraph>
    <Paragraph position="3"> Seven semantic relationship types exist between synsets, including, for example, antonymy, hyperonymy, and entailment. Synsets are often related to a half dozen or more other synsets; they may be related to multiple synsets through a single relationship type or to a single synset through multiple relationship types.</Paragraph>
    <Paragraph position="4"> [Footnote 1] There is also a Levin+ class "Roll Verbs, Group II", which is associated with the θ-grid [th particle(down)], in which a theme and a particle 'down' are used (e.g., The ball dropped down).</Paragraph>
    <Paragraph position="5"> Our frequency data for WordNet senses is derived from SEMCOR--a semantic concordance incorporating tagging of the Brown corpus with WordNet senses.2 Syntactic patterns ("frames") are associated with each synset, e.g., Somebody ----s something; Something ----s; Somebody ----s somebody into V-ing something. There are 35 such verb frames in WordNet, and a synset may have only one or as many as a half dozen or so frames assigned to it.</Paragraph>
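The WordNet data elements used below (relations between synsets, SEMCOR-derived sense counts, and verb frames) can be inspected with NLTK's WordNet interface. A minimal sketch, assuming NLTK and its WordNet data are installed; note that NLTK ships a newer WordNet release than the one used here, so sense numbers and counts will differ.

```python
from nltk.corpus import wordnet as wn

# Verb synsets for "drop"; each synset groups synonymous senses.
for synset in wn.synsets("drop", pos=wn.VERB)[:3]:
    print(synset.name(), "-", synset.definition())

    # Semantic relations between synsets (hypernymy, entailment, ...).
    print("  hypernyms:  ", synset.hypernyms())
    print("  entailments:", synset.entailments())

    # Syntactic frames, e.g. "Somebody ----s something".
    for lemma in synset.lemmas():
        print("  frames for", lemma.name(), ":", lemma.frame_strings())

    # SEMCOR-derived tag counts, usable as a prior over senses.
    print("  tag count:", sum(lemma.count() for lemma in synset.lemmas()))
```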
    <Paragraph position="6"> Our mapping of verbs in Levin+ classes to WordNet senses relies in part on the relation between thematic roles in Levin+ and verb frames in WordNet. Both reflect how many and what kinds of arguments a verb may take. However, constructing a direct mapping between θ-grids and WordNet frames is not possible, as the underlying classifications differ in significant ways. The correlations between the two sets of data are better viewed probabilistically.</Paragraph>
    <Paragraph position="7"> Table 1 illustrates the relation between Levin+ classes and WordNet for the verb drop. In our multilingual applications (e.g., lexical selection in machine translation), the Grid information provides a context-based means of associating a verb with a Levin+ class according to its usage in the SL sentence. The WordNet sense possibilities are thus pared down during SL analysis, but not sufficiently for the final selection of a TL verb. For example, Levin+ class 9.4 has three possible WordNet senses for drop. However, WordNet sense 8 is not associated with any of the other classes; thus, it is considered to have a higher "information content" than the others. The upshot is that the lexical-selection routine prefers dejar caer over other translations such as derribar and bajar.3 The other classes are similarly associated with appropriate TL verbs during lexical selection: disminuir (class 45.6), hundir (class 47.7), and bajar (class 51.3.1).4 [Footnote 3] The notion of information content here is related to the notion of reduction in entropy, measured by information gain (Mitchell, 1997). Using information content to quantify the "value" of a node in the WordNet hierarchy has also been used for measuring semantic similarity in a taxonomy (Resnik, 1999b). More recently, context-based models of disambiguation have been shown to represent significant improvements over the baseline (Bangalore and Rambow, 2000), (Ratnaparkhi, 2000).</Paragraph>
    <Paragraph position="8"> [Table 1: Levin+ classes for the verb drop, with their θ-grids and example sentences (e.g., Meander: [th src goal] The river dropped from the lake to the sea), the WordNet senses associated with each class (e.g., 1. move, displace; 3. decline, go down, wane), and their Spanish translations (e.g., 1. derribar, echar; 3. disminuir).]</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Training Data
</SectionTitle>
    <Paragraph position="0"> We began with the lexical database of (Dorr and Jones, 1996), which contains a significant number of WordNet-tagged verb entries. Some of the assignments were in doubt, since class splitting had occurred subsequent to those assignments, with all old WordNet senses carried over to new subclasses. New classes had also been added since the manual tagging. It was determined that the tagging for only 1791 entries--including 1442 verbs in 167 classes--could be considered stable; for these entries, 2756 assignments of WordNet senses had been made. Data for these entries, taken from both WordNet and the verb lexicon, constitute the training data for this study.</Paragraph>
    <Paragraph position="1"> The following probabilities were generated from the training data:</Paragraph>
    <Paragraph position="3"> where r_t is a relation (of relationship type t, e.g., synonymy) between two synsets, s_i and s_j, where s_i is mapped to by a verb in Grid class G_i and s_j is mapped to by a verb in Grid class G_j.</Paragraph>
    <Paragraph position="4"> [Footnote 4] The full set of Spanish translations is selected from WordNet associations developed in the EuroWordNet effort (Dorr et al., 1997).</Paragraph>
    <Paragraph position="5"> This is the probability that if one synset is related to another through a particular relationship type, then a verb mapped to the first synset will belong to the same Grid class as a verb mapped to the second synset. Computed values generally range between .3 and .35.</Paragraph>
    <Paragraph position="7"> where r_t is as above, except that s_i is mapped to by a verb in Levin+ class L+_i and s_j is mapped to by a verb in Levin+ class L+_j. This is the probability that if one synset is related to another through a particular relationship type, then a verb mapped to the first synset will belong to the same Levin+ class as a verb mapped to the second synset. Computed values generally range between .25 and .3.</Paragraph>
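A sketch of how these two relation-type probabilities could be estimated from the training data. The data layout below (a map from each synset to the Grid and Levin+ classes of the verbs mapped to it, plus a list of typed synset-to-synset relations) is an assumption made for illustration; the synset and class names are invented.

```python
from collections import defaultdict

# Hypothetical layout of the training data.
# classes_of[synset] = set of (grid_class, levin_plus_class) pairs
# for the verb entries mapped to that synset.
classes_of = {
    "s1": {("Grid-12", "Roll Verbs, Group I")},
    "s2": {("Grid-12", "Roll Verbs, Group I"), ("Grid-07", "Meander Verbs")},
    "s3": {("Grid-07", "Meander Verbs")},
}
# Typed relations between synsets: (relation_type, synset_i, synset_j).
relations = [("hypernym", "s1", "s2"), ("entailment", "s2", "s3")]

def relation_probs(level):
    """Per relation type, P(the two related synsets share a class).
    level=0 compares Grid classes; level=1 compares Levin+ classes."""
    same, total = defaultdict(int), defaultdict(int)
    for rtype, si, sj in relations:
        ci = {pair[level] for pair in classes_of.get(si, ())}
        cj = {pair[level] for pair in classes_of.get(sj, ())}
        total[rtype] += 1
        if ci & cj:            # the verbs mapped to si and sj share a class
            same[rtype] += 1
    return {r: same[r] / total[r] for r in total}

print(relation_probs(level=0))   # Grid-class probabilities by relation type
print(relation_probs(level=1))   # Levin+-class probabilities by relation type
```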
    <Paragraph position="9"> where θg_{j,v} is the occurrence of the entire θ-grid j for verb entry v and cf_{k,v} is the occurrence of the entire frame sequence k for a WordNet sense to which verb entry v is mapped. This is the probability that a verb in a Levin+ class is mapped to a WordNet verb sense with some specific combination of frames. Values average only .11, but in some cases the probability is 1.0.</Paragraph>
    <Paragraph position="11"> where θg_{j,v} is the occurrence of the single θ-grid component j for verb entry v and cf_{k,v} is the occurrence of the single frame k for a WordNet sense to which verb entry v is mapped. This is the probability that a verb in a Levin+ class with a particular θ-grid component (possibly among others) is mapped to a WordNet verb sense assigned a specific frame (possibly among others). Values average .20, but in some cases the probability is 1.0.</Paragraph>
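A sketch of how the two grid/frame probabilities could be estimated. The whole-grid/whole-frame-sequence probability counts complete tuples together, while the single-component/single-frame probability counts individual elements; the training-entry layout and frame numbers below are assumptions for illustration.

```python
from collections import Counter

# Hypothetical training entries: each pairs a verb entry's theta-grid
# with the frame sequence of the WordNet sense it was mapped to.
entries = [
    {"theta_grid": ("th", "goal"), "frames": (8, 9)},
    {"theta_grid": ("th", "goal"), "frames": (8,)},
    {"theta_grid": ("ag", "th"),   "frames": (8, 9)},
]

# P(entire frame sequence | entire theta-grid): count whole tuples together.
whole = Counter((e["theta_grid"], e["frames"]) for e in entries)
grid_totals = Counter(e["theta_grid"] for e in entries)
p_whole = {k: n / grid_totals[k[0]] for k, n in whole.items()}

# P(single frame | single theta-grid component): count element pairs.
pairs = Counter((c, f) for e in entries
                for c in e["theta_grid"] for f in e["frames"])
comp_totals = Counter(c for e in entries
                      for c in e["theta_grid"] for _ in e["frames"])
p_single = {k: n / comp_totals[k[0]] for k, n in pairs.items()}

print(p_whole[(("th", "goal"), (8,))])   # 0.5
print(p_single[("th", 8)])               # 0.6
```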
    <Paragraph position="13"> where t_s is an occurrence of tag s (for a particular synset) in SEMCOR and t_v is an occurrence of any of a set of tags for verb v in SEMCOR, with s being one of the senses possible for verb v. This probability is the prior probability of specific WordNet verb senses. Values average .11, but in some cases the probability is 1.0.</Paragraph>
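The prior is simply each sense's SEMCOR tag count normalized by the total count over all senses of the verb; a minimal sketch with invented counts:

```python
# Hypothetical SEMCOR tag counts for the senses of one verb.
tag_counts = {"drop_sense_1": 12, "drop_sense_3": 5, "drop_sense_8": 3}

total = sum(tag_counts.values())
prior = {sense: count / total for sense, count in tag_counts.items()}
print(prior)   # {'drop_sense_1': 0.6, 'drop_sense_3': 0.25, 'drop_sense_8': 0.15}
```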
    <Paragraph position="14"> In addition to the foregoing data elements, based on the training set, we also made use of a semantic similarity measure, which reflects the confidence with which a verb, given the total set of verbs assigned to its Levin+ class, is mapped to a specific WordNet sense. This represents an implementation of a class disambiguation algorithm (Resnik, 1999a), modified to run against the WordNet verb hierarchy.5 We also made a powerful "same-synset assumption": If (1) two verbs are assigned to the same Levin+ class, (2) one of the verbs v_i has been mapped to a specific WordNet sense s_i, and (3) the other verb v_j has a WordNet sense s_j synonymous with s_i, then v_j should be mapped to s_j. Since WordNet groups synonymous word senses into "synsets," s_i and s_j would correspond to the same synset. Since Levin+ verbs are mapped to WordNet senses via their corresponding synset identifiers, when the set of conditions enumerated above is met, the two verb entries would be mapped to the same WordNet synset.</Paragraph>
    <Paragraph position="15"> As an example, the two verbs tag and mark have been assigned to the same Levin+ class. In WordNet, each occurs in five synsets, only one of which contains both. If tag has a WordNet synset assigned to it for the Levin+ class it shares with mark, and it is the synset that covers senses of both tag and mark, we can safely assume that that synset is also appropriate for mark, since in that context the two verb senses are synonymous.</Paragraph>
    <Paragraph position="16"> [Footnote 5] The assumption underlying this measure is that the appropriate word senses for a group of semantically related words should themselves be semantically related. Given WordNet's hierarchical structure, the semantic similarity between two WordNet senses corresponds to the degree of informativeness of the most specific concept that subsumes them both.</Paragraph>
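A sketch of how the same-synset assumption can be enforced mechanically: within each Levin+ class, every synset already assigned to one member verb is proposed for any other member verb that also has a sense in that synset. The data layout and the synset labels are assumptions for illustration.

```python
def propagate_same_synset(class_members, candidate_synsets, assignments):
    """Return new (verb, levin_class, synset) triples implied by the assumption.

    class_members:     Levin+ class -> set of member verbs
    candidate_synsets: verb -> set of synsets containing a sense of that verb
    assignments:       (verb, levin_class) -> set of synsets already assigned
    """
    new = set()
    for levin_class, verbs in class_members.items():
        # Every synset assigned to any member of this class.
        assigned = set()
        for v in verbs:
            assigned |= assignments.get((v, levin_class), set())
        # Propose each such synset for every member that has a sense in it.
        for v in verbs:
            already = assignments.get((v, levin_class), set())
            for synset in assigned & candidate_synsets.get(v, set()):
                if synset not in already:
                    new.add((v, levin_class, synset))
    return new

# Toy data mirroring the tag/mark example (synset labels are invented).
class_members = {"Tag Verbs": {"tag", "mark"}}
candidate_synsets = {"tag": {"label.v.01", "tag.v.04"},
                     "mark": {"label.v.01", "mark.v.02"}}
assignments = {("tag", "Tag Verbs"): {"label.v.01"}}

print(propagate_same_synset(class_members, candidate_synsets, assignments))
# {('mark', 'Tag Verbs', 'label.v.01')}
```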
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> Subsequent to the culling of the training set, several processes were undertaken that resulted in full mapping of entries in the lexical database to WordNet senses. Much, but not all, of this mapping was accomplished manually.</Paragraph>
    <Paragraph position="1"> Each entry whose WordNet senses were assigned manually was considered by at least two coders: one coder who was involved in the entire manual assignment process and the other drawn from a handful of coders working independently on different subsets of the verb lexicon. In the manual tagging, if a WordNet sense was considered appropriate for a lexical entry by any one of the coders, it was assigned. Overall, 13452 WordNet sense assignments were made. Of these, 51% were agreed upon by multiple coders. The kappa coefficient (κ) of intercoder agreement was .47 for a first round of manual tagging and (only) .24 for a second round of more problematic cases.6 While the full tagging of the lexical database may make the automatic tagging task appear superfluous, the low rate of agreement between coders and the automatic nature of some of the tagging suggest there is still room for adjustment of WordNet sense assignments in the verb database. On the one hand, even the higher of the kappa coefficients mentioned above is significantly lower than the standard suggested for good reliability (κ > .8) or even the level where tentative conclusions may be drawn (.67 < κ < .8) (Carletta, 1996), (Krippendorff, 1980). On the other hand, if the automatic assignments agree with human coding at levels comparable to the degree of agreement among humans, they may be used to identify current assignments that need review and to suggest new assignments for consideration. [Footnote 6] The kappa coefficient measures the degree to which the pairwise agreement of coders on a classification task surpasses what would be expected by chance; the standard definition of this coefficient is κ = (P(A) - P(E)) / (1 - P(E)), where P(A) is the actual percentage of agreement and P(E) is the expected percentage of agreement, averaged over all pairs of assignments. Several adjustments in the computation of the kappa coefficient were made necessary by the possible assignment of multiple senses for each verb in a Levin+ class, since without prior knowledge of how many senses are to be assigned, there is no basis on which to compute P(E).</Paragraph>
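As a worked illustration of the kappa formula in the footnote (the agreement figures below are hypothetical, chosen only to show how a value near the reported .47 could arise):

```latex
\kappa = \frac{P(A) - P(E)}{1 - P(E)}, \qquad
\text{e.g., } P(A) = 0.70,\; P(E) = 0.43
\;\Rightarrow\;
\kappa = \frac{0.70 - 0.43}{1 - 0.43} = \frac{0.27}{0.57} \approx 0.47 .
```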
    <Paragraph position="2"> In addition, consistency checking is done more easily by machine than by hand. For example, the same-synset assumption is more easily enforced automatically than manually. When this assumption is implemented for the 2756 senses in the training set, another 967 sense assignments are generated, only 131 of which were actually assigned manually. Similarly, when this premise is enforced on the entirety of the lexical database of 13452 assignments, another 5059 sense assignments are generated. If the same-synset assumption is valid and if the senses assigned in the database are accurate, then the human tagging has a recall of no more than 73%.</Paragraph>
    <Paragraph position="3"> Because a word sense was assigned even if only one coder judged it to apply, human coding has been treated as having a precision of 100%. However, some of the solo judgments are likely to have been in error. To determine what proportion of such judgments were in reality precision failures, a random sample of 50 WordNet senses selected by only one of the two original coders was investigated further by a team of three judges. In this round, judges rated WordNet senses assigned to verb entries as falling into one of three categories: definitely correct, definitely incorrect, and arguable whether correct. As it turned out, if any one of the judges rated a sense definitely correct, another judge independently judged it definitely correct; this accounts for 31 instances. In 13 instances the assignments were judged definitely incorrect by at least two of the judges. No consensus was reached on the remaining 6 instances.</Paragraph>
    <Paragraph position="4"> Extrapolating from this sample to the full set of solo judgments in the database leads to an estimate that approximately 1725 (26% of 6636 solo judgments) of those senses are incorrect. This suggests that the precision of the human coding is approximately 87%.</Paragraph>
    <Paragraph position="5"> The upper bound for this task, as set by human performance, is thus 73% recall and 87% precision. The lower bound, based on assigning the WordNet sense with the greatest prior probability, is 38% recall and 62% precision.</Paragraph>
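For clarity, a worked recap of how the two human-performance bounds follow from the counts given above (precision uses the 26% error rate estimated from the 50-item sample, applied to the 6636 solo judgments):

```latex
\text{recall} \le \frac{13452}{13452 + 5059} \approx 0.73, \qquad
\text{precision} \approx \frac{13452 - 0.26 \times 6636}{13452}
\approx \frac{13452 - 1725}{13452} \approx 0.87 .
```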
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Mapping Strategies
</SectionTitle>
    <Paragraph position="0"> Recent work (Van Halteren et al., 1998) has demonstrated improvement in part-of-speech tagging when the outputs of multiple taggers are combined. When the errors of multiple classifiers are not significantly correlated, the result of combining votes from a set of individual classifiers often outperforms the best result from any single classifier. Using a voting strategy seems especially appropriate here: The measures outlined in Section 3 average only 41% recall on the training set, but the senses picked out by their highest values vary significantly.</Paragraph>
    <Paragraph position="1"> The investigations undertaken used both simple and aggregate voters, combined using various voting strategies. The simple voters were the 7 measures previously introduced.7 In addition, three aggregate voters were generated: (1) the product of the simple measures (smoothed so that zero values would not cancel out the contributions of all the other measures); (2) the weighted sum of the simple measures, with weights representing the percentage of the training set assignments correctly identified by the highest score of the simple probabilities; and (3) the maximum score of the simple measures.</Paragraph>
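A sketch of the three aggregate voters, assuming each simple measure has already been reduced to a per-sense score and each measure has a weight (its share of correctly identified training assignments). The smoothing constant, the name SimpleMax for the third aggregate, and the example numbers are assumptions; the paper names only SimpleProd and SimpleWtdSum.

```python
import math

def aggregate_voters(scores, weights, eps=1e-6):
    """scores: sense -> list of simple-measure values (aligned with weights).
    Returns the three aggregate voters' values for each sense."""
    out = {}
    for sense, values in scores.items():
        out[sense] = {
            # Product of simple measures, smoothed so a zero does not erase the rest.
            "SimpleProd": math.prod(v if v > 0 else eps for v in values),
            # Weighted sum of simple measures.
            "SimpleWtdSum": sum(w * v for w, v in zip(weights, values)),
            # Maximum score among the simple measures.
            "SimpleMax": max(values),
        }
    return out

# Hypothetical scores for two candidate senses under three simple measures.
scores = {"sense_1": [0.30, 0.00, 0.25], "sense_8": [0.35, 0.12, 0.20]}
weights = [0.4, 0.3, 0.3]
print(aggregate_voters(scores, weights))
```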
    <Paragraph position="2"> Using these data, two different types of voting schemes were investigated. The schemes differ most significantly on the circumstances under which a voter casts its vote for a WordNet sense, the size of the vote cast by each voter, and the circumstances under which a WordNet sense was selected. We will refer to these two schemes as Majority Voting Scheme and Threshold Voting Scheme.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Majority Voting Scheme
</SectionTitle>
      <Paragraph position="0"> Although we do not know in advance how many WordNet senses should be assigned to an entry in the lexical database, we assume that, in general, there is at least one. In line with this intuition, one strategy we investigated was to have both simple and aggregate measures cast a vote for whichever sense(s) of a verb in a Levin+ class received the highest (non-zero) value for that measure. Ten variations of this scheme were investigated; Table 2 summarizes recall and precision for all of them, both with and without enforcement of the same-synset assumption. If we use the harmonic mean of recall and precision as a criterion for comparing results, the best voting scheme is MajAggr, with 58% recall and 72% precision without enforcement of the same-synset assumption. Note that if the same-synset assumption is correct, the drop in precision that accompanies its enforcement mostly reflects inconsistencies in human judgments in the training set; the true precision value for MajAggr after enforcing the same-synset assumption is probably close to 67%.</Paragraph>
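A sketch of the core of this scheme: each participating measure votes for the sense(s) to which it gives its highest non-zero value, and a sense is selected when a majority of measures vote for it. Which measures participate is what distinguishes the variations (e.g., MajAggr uses only the aggregate voters); the strict-majority selection rule and the tie handling below are assumptions.

```python
def majority_vote(per_measure_scores):
    """per_measure_scores: measure name -> {sense: score}.
    Each measure votes for its top-scoring sense(s); a sense is selected
    if more than half of the measures vote for it."""
    votes = {}
    for measure, scores in per_measure_scores.items():
        best = max(scores.values(), default=0.0)
        if best <= 0.0:
            continue                      # no non-zero value: the measure abstains
        for sense, score in scores.items():
            if score == best:             # ties: vote for every top sense
                votes[sense] = votes.get(sense, 0) + 1
    needed = len(per_measure_scores) / 2
    return {sense for sense, n in votes.items() if n > needed}

# Hypothetical scores from the two aggregate voters used in MajAggr.
print(majority_vote({
    "SimpleProd":   {"sense_1": 0.02, "sense_8": 0.05},
    "SimpleWtdSum": {"sense_1": 0.21, "sense_8": 0.33},
}))   # {'sense_8'}
```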
      <Paragraph position="1"> Of the simple voters, only PriorProb and SemSim are individually strong enough to warrant discussion. Although PriorProb was used to establish our lower bound, SemSim proves to be the stronger voter, bested only by MajAggr (the majority vote of SimpleProd and SimpleWtdSum) in voting that enforces the same-synset assumption.</Paragraph>
      <Paragraph position="2"> Both PriorProb and SemSim provide better results than the majority vote of all 7 simple voters (MajSimpleSgl) and the majority vote of all 21 pairs of simple voters (MajSimplePair). Moreover, the inclusion of MajSimpleSgl and MajSimplePair in a majority vote with MajAggr (in MajSgl+Aggr and MapPair+Aggr, respectively) turns in poorer results than MajAggr alone.</Paragraph>
      <Paragraph position="3"> [Footnote 8] A pair cast a vote for a sense if, among all the senses of a verb, a specific sense had the highest value for both measures.</Paragraph>
      <Paragraph position="4"> The poor performance of MajSimpleSgl and MajSimplePair does not point, however, to a general failure of the principle that multiple voters are better than individual voters. SimpleProd, the product of all simple measures, and SimpleWtdSum, the weighted sum of all simple measures, provide reasonably strong results, and a majority vote of the two of them (MajAggr) gives the best results of all. When they are joined by SemSim in Maj3Best, they continue to provide good results.</Paragraph>
      <Paragraph position="5"> The bottom line is that SemSim makes the most significant contribution of any single simple voter, while the product and weighted sum of all simple voters, in concert with each other, provide the best results of all with this voting scheme.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Threshold Voting Scheme
</SectionTitle>
      <Paragraph position="0"> The second voting strategy first identified, for each simple and aggregate measure, the threshold value at which the product of recall and precision scores in the training set has the highest value if that threshold is used to select WordNet senses.</Paragraph>
      <Paragraph position="1"> During the voting, if a WordNet sense has a higher score for a measure than its threshold, the measure votes for the sense; otherwise, it votes against it.</Paragraph>
      <Paragraph position="2"> The weight of the measure's vote is the precision-recall product at the threshold. This voting strategy has the advantage of taking into account each individual attribute's strength of prediction.</Paragraph>
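A sketch of the two steps just described: choosing, per measure, the score threshold that maximizes the recall-precision product on the training set, and then casting weighted for/against votes at prediction time. The candidate-threshold grid and data layout are assumptions for illustration.

```python
def best_threshold(scored_senses, gold_senses, candidates):
    """Pick the threshold maximizing recall * precision on the training data.
    scored_senses: list of (sense, score) pairs; gold_senses: set of correct senses.
    Returns (threshold, weight), where weight is the recall-precision product."""
    best_value, best_t = 0.0, None
    for t in candidates:
        chosen = {sense for sense, score in scored_senses if score >= t}
        if not chosen or not gold_senses:
            continue
        precision = len(chosen & gold_senses) / len(chosen)
        recall = len(chosen & gold_senses) / len(gold_senses)
        if recall * precision > best_value:
            best_value, best_t = recall * precision, t
    return best_t, best_value

def threshold_vote(per_measure_scores, thresholds, weights):
    """Each measure votes +weight for a sense whose score clears its threshold,
    and -weight otherwise; callers then select senses whose vote total exceeds
    a variation-specific threshold."""
    totals = {}
    for measure, scores in per_measure_scores.items():
        t, w = thresholds[measure], weights[measure]
        for sense, score in scores.items():
            totals[sense] = totals.get(sense, 0.0) + (w if score >= t else -w)
    return totals
```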
      <Paragraph position="3"> Five variations on this basic voting scheme were investigated. In each, senses were selected if their vote total exceeded a variation-specific threshold. Table 3 summarizes recall and precision for these variations at their optimal vote thresholds.</Paragraph>
      <Paragraph position="4"> In the AutoMap+ variation, Grid and Levin+ probabilities abstain from voting when their values are zero (a common occurrence, because of data sparsity in the training set); the same-synset assumption is automatically implemented.</Paragraph>
      <Paragraph position="5"> AutoMap- differs in that it disregards the Grid and Levin+ probabilities completely. The Triples variation places the simple and composite measures into three groups: the three with the highest weights, the three with the lowest weights, and the middle or remaining three. Voting first occurs within the group, and the group's vote is brought forward with a weight equaling the sum of the group members' weights. This variation also adds to the vote total if the sense was assigned in the training data. The Combo variation is like Triples, but rather than using the weights and thresholds calculated for the single measures from the training data, it calculates weights and thresholds for combinations of two, three, four, five, six, and seven measures. Finally, the Combo&amp;Auto variation adds the same-synset assumption to the previous variation.</Paragraph>
      <Paragraph position="6"> Although not evident in Table 3 because of rounding, AutoMap- has slightly higher values for both recall and precision than does AutoMap+, giving it the highest recall-precision product of the threshold voting schemes. This suggests that the Grid and Levin+ probabilities could profitably be dropped from further use.</Paragraph>
      <Paragraph position="7"> Of the more exotic voting variations, Triples voting achieved results nearly as good as the AutoMap voting schemes, but the Combo schemes fell short, indicating that weights and thresholds are better based on single measures than combinations of measures.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Conclusions and Future Work
</SectionTitle>
    <Paragraph position="0"> The voting schemes still leave room for improvement, as the best results (58% recall and 72% precision, or, optimistically, 63% recall and 67% precision) fall shy of the upper bound of 73% recall and 87% precision for human coding.9 At the same time, these results are far better than the lower bound of 38% recall and 62% precision for the most frequent WordNet sense.</Paragraph>
    <Paragraph position="1"> As has been true in many other evaluation studies, the best results come from combining classifiers (MajAggr): not only does this variation use a majority voting scheme, but more importantly, the two voters take into account all of the simple voters, in different ways. The next-best results come from Maj3Best, in which the three best single measures vote. We should note, however, that the single best measure, the semantic similarity measure from SemSim, lags only slightly behind the two best voting schemes.</Paragraph>
    <Paragraph position="2"> This research demonstrates that credible word sense disambiguation results can be achieved without recourse to contextual data. A lexical resource enriched with, for example, syntactic information, and partially hand-mapped to another lexical resource, may be rich enough to support such a task. The degree of success achieved here also owes much to the confluence of WordNet's hierarchical structure and SEMCOR tagging, as used in the computation of the semantic similarity measure, on the one hand, and the classified structure of the verb lexicon, which provided the underlying groupings used in that measure, on the other hand. Even where one measure yields good results, several data sources needed to be combined to enable its success.</Paragraph>
  </Section>
</Paper>