<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1107"> <Title>using selectional preferences</Title> <Section position="7" start_page="852" end_page="855" type="evalu"> <SectionTitle> 6 Experimental Evaluation </SectionTitle> <Paragraph position="0"> The aim of the experimental evaluation is to establish whether the nominalized pattern helps in detecting verb entailment. We experiment with the method by itself and in combination with other sets of patterns. We are therefore interested only in verb pairs where the nominalized pattern is applicable. The best pattern or the best combined method should be the one that gives the highest values of S to verb pairs in entailment relation, and the lowest values to other pairs.</Paragraph> <Paragraph position="1"> We need a corpus C over which to estimate probabilities, and two datasets: one of verb entailment pairs, the True Set (TS), and another of verbs not in entailment, the Control Set (CS). We use the web as the corpus C over which to estimate Smi, and Google™ as a count estimator. The web has been widely employed as a corpus (e.g., (Turney, 2001)). The findings described in (Keller and Lapata, 2003) suggest that the count estimates we need in our study over Subject-Verb bigrams are highly correlated with corpus counts.</Paragraph> <Section position="1" start_page="853" end_page="854" type="sub_section"> <SectionTitle> 6.1 Experimental settings </SectionTitle> <Paragraph position="0"> Since we have a predefined (but not exhaustive) set of verb pairs in entailment, i.e. ent in WordNet, we cannot replicate a natural distribution of verb pairs that are or are not in entailment. Recall and precision therefore lose their meaning. The best way to compare the patterns is then to use the ROC curve (Green and Swets, 1996), which combines sensitivity and specificity. 
ROC analysis provides a natural means to check and estimate how well a statistical measure distinguishes positive examples, the True Set (TS), from negative examples, the Control Set (CS). Given a threshold t, Se(t) is the probability that a candidate pair (vh,vt) belongs to the True Set if the test is positive, while Sp(t) is the probability that it belongs to the Control Set if the test is negative, i.e.: Se(t) = p((vh,vt) ∈ TS | S(vh,vt) &gt; t) and Sp(t) = p((vh,vt) ∈ CS | S(vh,vt) &lt; t).</Paragraph> <Paragraph position="2"> The ROC curve (Se(t) vs. 1 − Sp(t)) naturally follows (see Fig. 1). Better methods will have ROC curves more similar to the step function f(1 − Sp(t)) = 0 when 1 − Sp(t) = 0 and f(1 − Sp(t)) = 1 when 0 &lt; 1 − Sp(t) ≤ 1.</Paragraph> <Paragraph position="4"> The ROC analysis provides another useful evaluation tool: the AROC, i.e. the total area under the ROC curve. Statistically, AROC represents the probability that the method under evaluation will rank a randomly chosen positive example higher than a randomly chosen negative example. AROC is usually used to better compare two methods that have similar ROC curves. Better methods will have higher AROCs.</Paragraph> <Paragraph position="5"> As True Set (TS) we use the controlled verb entailment pairs ent contained in WordNet. As described in Sec. 3, the entailment relation is a semantic relation defined at the synset level, holding in the verb sub-hierarchy. That is, each pair of synsets (St,Sh) represents an oriented entailment relation between St and Sh. WordNet contains 409 entailed synsets. These entailment relations consequently also hold at the lexical level. The pair (St,Sh) naturally implies that vt entails vh for each possible vt ∈ St and vh ∈ Sh. From the 409 entailment synsets it is possible to derive a test set of 2,233 verb pairs. As Control Set we use two sets: random and reversed ent. The random set is randomly generated using the verbs in ent, taking care to avoid pairs in entailment relation. 
A pair is considered a control pair if it is not in the True Set (the intersection between the True Set and the Control Set is empty). The reversed ent set contains the pairs in ent in the reverse order. These two Control Sets give two possible ways of evaluating the methods: a general task and a more complex one.</Paragraph> <Paragraph position="6"> As a pre-processing step, we have to remove from the two sets the pairs in which the hypothesis cannot be nominalized, as our pattern Pnom is applicable only when it can. The pre-processing step retains 1,323 entailment verb pairs. For comparative purposes the random Control Set is kept at the same cardinality as the True Set (in all, 1400 verb pairs).</Paragraph> <Paragraph position="7"> S is then evaluated for each pattern over the True Set and the Control Set, using equation (3) for Pnom, and equation (6) for Ppe and Phb. The best pattern or combined method is the one that most neatly separates entailment pairs from random pairs. That is, it should on average assign higher S values to pairs in the True Set.</Paragraph> </Section> <Section position="2" start_page="854" end_page="855" type="sub_section"> <SectionTitle> 6.2 Results and analysis </SectionTitle> <Paragraph position="0"> In the first experiment we compared the performance of the methods in separating the ent test set from the random control set. The compared methods are: (1) the sets of patterns taken alone, i.e.</Paragraph> <Paragraph position="1"> nom, hb, and pe; (2) some combined methods, i.e. nom + pe, hb + pe, and nom + hb + pe. Results of this first experiment are reported in Tab. 2 and Fig. 1.(a). As Figure 1.(a) shows, our nominalization pattern Pnom performs better than the others. 
Only Phb seems to outperform nominalization at some points of the ROC curve, where Pnom presents a slight concavity, possibly due to a consistent overlap between positive and negative examples at specific values of the S threshold t.</Paragraph> <Paragraph position="2"> In order to understand which of the two patterns has the best discrimination power, a comparison of the AROC values is needed. As Table 2 shows, Pnom has the best AROC value (59.94%), indicating a more interesting behaviour than Phb and Ppe. It is respectively 2 and 3 absolute percentage points higher. Moreover, the combinations nom + hb + pe and nom + pe, which include the Pnom pattern, perform very well considering the difficulty of the task, i.e. 66% and 64%. Compared with the combination hb+pe, which excludes the Pnom pattern (61%), the improvement in AROC is 5% and 3%.</Paragraph> <Paragraph position="3"> Moreover, the nom + hb + pe ROC curve in Fig. 1.(a) lies above all the others at every point.</Paragraph> <Paragraph position="4"> In the second experiment we compared the methods on the more complex task of separating the ent set from the reversed ent set. In this case the methods are asked to determine that win → play is a correct entailment and play → win is not. Results of this set of experiments are presented in Tab. 3. The nominalized pattern nom preserves its discriminative power. Its AROC is above the chance line even if, as expected, it is worse than the one obtained in the general case. Surprisingly, the happens-before (hb) set of patterns seems not to be correlated with the entailment relation. The temporal relation vh-happens-before-vt does not seem to be captured by those patterns. But, if this evidence is seen in a positive way, it seems that the patterns capture entailment better when used in the reversed way (reversed hb). This is confirmed by its AROC value. 
If we consider, for example, one of the implications in the True Set, reach → go, what is happening may become clearer. Sample sentences for the hb case and the reversed hb case are, respectively, &quot;The group therefore elected to go to Tyso and then reach Anskaven&quot; and &quot;striving to reach personal goals and then go beyond them&quot;. It seems that in the second case then assumes an enabling role rather than a purely temporal one. After this surprising result, as we expected, in this experiment even the combined approach that pairs the reversed hb with nom behaves better than hb + nom and better than hb, respectively by around 8% and 1.5% absolute points (see Tab. 3).</Paragraph> <Paragraph position="5"> The above results called for a third experiment over the general case. We need to compare the entailment indicators derived from the new use of hb, i.e. the reversed hb, with the methods used in the first experiment. Results are reported in Tab. 2 and Fig. 1.(b). As Fig. 1.(b) shows, the reversed hb has a very interesting behaviour for small values of 1 − Sp(t). In this area it behaves far better than the combined method nom+hb+pe. This is an advantage, and the combined method nom+hb+pe exploits it, as both the AROC and the shape of the ROC curve demonstrate. Again, the method nom + hb + pe, which includes the Pnom pattern, gains 1.5% absolute points over the combined method hb + pe, which does not include this information.</Paragraph> </Section> </Section> </Paper>