File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/05/w05-1005_evalu.xml
Size: 7,469 bytes
Last Modified: 2025-10-06 13:59:31
<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1005"> <Title>Automatically Distinguishing Literal and Figurative Usages of Highly Polysemous Verbs</Title> <Section position="7" start_page="42" end_page="44" type="evalu"> <SectionTitle> 6 Experimental Results </SectionTitle> <Paragraph position="0"> To evaluate our compositionality and acceptability measures, we compare them to the relevant consensus human ratings using the Spearman rank correlation coefficient, rs. For simplicity, we report the absolute value of rs for all experiments. Since in most cases, correlations are statistically significant (p a3a5a401), we omit p values; those rs values for which p is marginal (i.e., a401 a0 p a0 a410) are subscripted with an &quot;m&quot; in the tables. Correlation scores in boldface are those that show an improvement over the baseline, PMILVC.</Paragraph> <Paragraph position="1"> The PMILVC measure is an informed baseline, since it draws on properties of LVCs. Specifically, PMILVC measures the strength of the association between a light verb and a noun appearing in syntactic patterns preferred by LVCs, i.e., PMILVCa3 PMI</Paragraph> <Paragraph position="3"> Assuming that an acceptable LVC forms a detectable collocation, PMILVC can be interpreted as an informed baseline for degree of acceptability. PMILVC can also positionality ratings and COMP measure (counts from BNC). be considered as a baseline for the degree of compositionality of an expression (with respect to the light verb component), under the assumption that the less compositional an expression, the more its components appear as a fixed collocation.</Paragraph> <Section position="1" start_page="43" end_page="43" type="sub_section"> <SectionTitle> 6.1 Compositionality Results </SectionTitle> <Paragraph position="0"> Table 4 displays the correlation scores of the human compositionality ratings with COMPvdn, our compositionality measure estimated with counts from the BNC. Given the variety of light verb usages in expressions used in the compositionality data, we report correlations not only on test data (bncT), but also on development and test data combined (bncDT) to get more data points and hence more reliable correlation scores. Compared to the baseline, COMPvdn has generally higher correlations with human ratings of compositionality.</Paragraph> <Paragraph position="1"> There are two different types of expressions among those used in compositionality experiments: expressions with an indefinite determiner a (e.g., give a kick) and those without a determiner (e.g., give guidance). Despite shared properties, the two types of expressions may differ with respect to syntactic flexibility, due to differing semantic properties of the noun complements in the two cases. We thus calculate correlation scores for expressions with the indefinite determiner only, from both development and test data (bncDT/a). We find that COMPvdn has higher correlations (and larger improvements over the baseline) on this subset of expressions.</Paragraph> <Paragraph position="2"> (Note that there are comparable numbers of items in bncDT and bncDT/a, and the correlation scores are highly significant--very small p values--in both cases.) To explore the effect of using a larger but noisier corpus, we compare the performance of COMPvdn ered &quot;fair&quot; or &quot;good&quot; in each class, and the log10 of the mean ACPT score for that class.</Paragraph> <Paragraph position="3"> with COMPd, the compositionality measure using web data. 
<Paragraph position="2"> To explore the effect of using a larger but noisier corpus, we compare the performance of COMPvdn with COMPd, the compositionality measure using web data. The correlation scores for COMPd on bncDT are .41 and .35, for give and take, respectively, compared to a baseline (using web counts) of .37 and .32. We find that COMPvdn has significantly higher correlation scores (larger rs and much smaller p values), as well as larger improvements over the baseline. This confirms that using more syntactic information, from less noisy data, improves the performance of our compositionality measure. [Footnote 4: Using the automatically parsed BNC as a source of less noisy data improves performance. However, since these constructions may be infrequent with any particular complement, we do not expect the use of cleaner but less plentiful text (such as existing treebanks) to improve the performance any further.]</Paragraph>
</Section>
<Section position="2" start_page="43" end_page="44" type="sub_section">
<SectionTitle> 6.2 Acceptability Results </SectionTitle>
<Paragraph position="0"> We have two goals in assessing our ACPT measure: one is to demonstrate that the measure is indeed indicative of the level of acceptability of an LVC, and the other is to explore whether it helps to indicate class-based patterns of acceptability.</Paragraph>
<Paragraph position="1"> Regarding the latter, Stevenson et al. (2004) found differing overall levels of (human) acceptability for different Levin classes combined with give and take. This indicates a strong influence of semantic similarity on the possible LV and complement combinations. Our ACPT measure also yields differing patterns across the semantic classes. Table 5 shows, for each light verb and test class, the proportion of acceptable LVCs according to the human ratings, and the log of the mean ACPT score for that LV and class combination.</Paragraph>
[Table 5: proportion of expressions considered &quot;fair&quot; or &quot;good&quot; in each class, and the log10 of the mean ACPT score for that class.]
<Paragraph position="2"> For take, the ACPT score generally reflects the difference in the proportion of accepted expressions according to the human ratings, while for give, the measure is less consistent. (The three development classes show the same pattern.) The ACPT measure thus appears to reflect the differing patterns of acceptability across the classes, at least for take.</Paragraph>
<Paragraph position="3"> To get a finer-grained notion of the degree to which ACPT conforms with the human ratings, we present correlation scores between the two in Table 6.
[Table 6: correlations between the ACPT measure and each set of human ratings (counts from web).]
The results show that ACPT has higher correlation scores than the baseline--substantially higher in the case of give. The correlations for give also vary more widely across the classes.</Paragraph>
<Paragraph position="4"> These results together indicate that the acceptability measure may be useful, and indeed taps into some of the differing levels of acceptability across the classes. However, we need to look more closely at other linguistic properties which, if taken into account, may improve the consistency of the measure.</Paragraph>
</Section>
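[Illustration, not from the paper: a short Python sketch of the per-class summary reported in Table 5. The ratings, ACPT values, and Levin class labels below are invented for illustration only.]

    import math
    from collections import defaultdict

    # (light verb, semantic class) -> list of (consensus rating, ACPT score)
    data = defaultdict(list)
    data[("take", "18.1")] += [("good", 31.2), ("fair", 12.9), ("poor", 2.4)]
    data[("give", "13.1")] += [("good", 8.6), ("poor", 5.1), ("fair", 7.7)]

    for (lv, cls), items in sorted(data.items()):
        # proportion of LVCs in this class rated "fair" or "good"
        accepted = sum(1 for rating, _ in items if rating in ("fair", "good"))
        # log10 of the mean ACPT score for this light verb / class pairing
        log_mean_acpt = math.log10(sum(score for _, score in items) / len(items))
        print(f"{lv} / class {cls}: accepted = {accepted / len(items):.2f}, "
              f"log10(mean ACPT) = {log_mean_acpt:.2f}")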
<Section position="3" start_page="44" end_page="44" type="sub_section">
<SectionTitle> 6.3 Comparing the Two Measures </SectionTitle>
<Paragraph position="0"> Our two measures are intended for different purposes, and indeed incorporate differing linguistic information about LVCs. However, we also noted that PMILVC can be viewed as a baseline for both, indicating some underlying commonality. It is worth exploring whether each measure taps into the different phenomena as intended. To do so, we correlate COMP with the human ratings of acceptability, and ACPT with the human ratings of compositionality, as shown in Table 7. (The formulation of the ACPT measure here is adapted for use with determiner-less LVCs.) For comparability, both measures use counts from the web. The results confirm that COMPd correlates better than ACPT does with the compositionality ratings, while ACPT correlates best with the acceptability ratings.</Paragraph>
</Section>
</Section>
</Paper>