<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1202">
  <Title>Measuring MWE Compositionality Using Semantic Annotation</Title>
  <Section position="8" start_page="7" end_page="10" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> In our evaluation, we focused on testing the performance of the D-score against human raters' judgment on ranking different MWEs by their degree of compositionality, as well as distinguishing the different degrees of compositionality for each sense in the case of multiple tags. The first step of the evaluation was to implement the algorithm in a program and run the tool on the 89 test MWEs we prepared. Fig. 1 illustrates the D-score distribution in a bar chart. As shown by the chart, the algorithm produces a widely dispersed distribution of D-scores across  Selected at random from the Lancaster semantic lexicon.  http://ucrel.lancs.ac.uk/projects/assist/ the sample MWEs, ranging from 0.000032 to 1.000000. For example, the tool assigned the score of 1.0 to the FOOD sense and 0.001 to the THIEF senses of &amp;quot;tea leaf&amp;quot; successfully distinguishing the different degrees of compositionality of these two senses.</Paragraph>
    <Paragraph position="1">  As shown in Fig. 1, some MWEs share the same scores, reflecting the limitation of the number of ranks that our algorithm can produce as well as the limited amount of semantic information available from a lexicon. Nonetheless, the algorithm produced 45 different scores which ranked the MWEs into 45 groups (see the steps in the figure). Compared to the eleven scores used by the human raters, this provides a fine-grained ranking of the compositionality.</Paragraph>
    <Paragraph position="2"> The primary issue in our evaluation is the extent to which the automatic ranking of the MWEs correlates with the manual ranking of them. As described in the previous section, we created a list of 89 manually ranked MWEs for this purpose. Since we are mainly interested in the ranks rather than the actual scores, we examined the correlation between the automatic and manual rankings using Spearman's correlation coefficient. (For the full ranking list, see Appendix). In the manually created list, each MWE was ranked by 3-6 human raters. In order to create a unified single test data of human ranking, we calculated the average of the human ranks for each MWE. For example, if two human raters give ranks 3 and 4 to a MWE, then its rank is (3+4)/2=3.5. Next, the MWEs are sorted by the averaged ranks in descending order to obtain the combined ranks of the MWEs. Finally, we sorted the MWEs by the D-score in the same way to obtain a parallel list of automatic ranks. For the calculation of Spearman's correlation coefficient, if n MWEs are tied to a score (either D-score or the average manual ranks), their ranks were ad- null justed by dividing the sum of their ranks by the number of MWEs involved. Fig. 2 illustrates the correspondence between the adjusted automatic and manual rankings.</Paragraph>
    <Paragraph position="3">  As shown in Fig. 2, the overall correlation seems quite weak. In the automatic ranking, quite a few MWEs are tied up to three ranks, illustrated by the vertically aligned points. The precise correlation between the automatic and manual rankings was calculated using the function provided in R for Windows 2.2.1. Spearman's rank correlation (rho) for these data was 0.2572 (p=0.01495), indicating a significant though rather weak positive relationship.</Paragraph>
    <Paragraph position="4"> In order to find the factors causing this weak correlation, we tested the correlation for those MWEs whose rank differences were less than 20, 30, 40 and 50 respectively. We are interested to find out how many of them fall under each of the categories and which of their features affected the performance of the algorithm. As a result, we found 43, 54, 66 and 77 MWEs fall under these categories respectively, which yield different correlation scores, as shown in Table 1.</Paragraph>
    <Paragraph position="5">  different rank differences.</Paragraph>
    <Paragraph position="6"> As we expected, the rho decreases as the rank difference increases, but all of the four categories containing a total of 77 MWEs (86.52%) show reasonably high correlations, with the minimum score of 0.5084.</Paragraph>
    <Paragraph position="7">  In particular, 66 of them (74.16%), whose ranking differences are less than 40, demonstrate a strong correlation with rho-score 0.7016, as illustrated by Fig. 3 ScatterPlot of Auto vs. Man Ranks for 66 MWEs  40) which shows a strong correlation Our manual examination shows that the algorithm generally pushes the highly compositional and non-compositional MWEs towards opposite ends of the spectrum of the D-score. For example, those assigned with score 1 include &amp;quot;aid worker&amp;quot;, &amp;quot;audio tape&amp;quot; and &amp;quot;unemployment figure&amp;quot;. On the other hand, MWEs such as &amp;quot;tea leaf&amp;quot; (meaning thief), &amp;quot;kick the bucket&amp;quot; and &amp;quot;hot dog&amp;quot; are given a low score of 0.001. We assume these two groups of MWEs are generally treated as highly compositional and opaque MWEs respectively.</Paragraph>
    <Paragraph position="8"> However, the algorithm could be improved. A major problem found is that the algorithm punishes longer MWEs which contain function words. For example, &amp;quot;make an appearance&amp;quot; is scored 0.000114 by the algorithm, but when the article &amp;quot;an&amp;quot; is removed, it gets a higher score 0.003608. Similarly, when the preposition &amp;quot;up&amp;quot; is removed from &amp;quot;keep up appearances&amp;quot;, it gets 0.014907 compared to the original 0.000471, which would push up their rank much higher. To address this problem, the algorithm needs to be refined to minimise the impact of the function words to the scoring process.</Paragraph>
    <Paragraph position="9"> Our analysis also reveals that 12 MWEs with rank differences (between automatic and manual ranking) greater than 50 results in a degraded overall correlation. Table 2 lists these words, in which the higher ranks indicate higher compositionality. null  Salkind (2004: 88) suggests that r-score ranges 0.4~0.6, 0.6~0.8 and 0.8~1.0 indicate moderate, strong and very strong correlations respectively.</Paragraph>
    <Paragraph position="10">  greater than 50.</Paragraph>
    <Paragraph position="11"> Let us take &amp;quot;pillow fight&amp;quot; as an example. The whole expression is given the semantic tag K6, whereas neither &amp;quot;pillow&amp;quot; nor &amp;quot;fight&amp;quot; as individual word is given this tag. In the lexicon, &amp;quot;pillow&amp;quot; is classified as H5 {FURNITURE AND HOUSEHOLD FITTINGS} and &amp;quot;fight&amp;quot; is assigned to four semantic categories including S8{HINDERING}, X8+ {HELPING}, E3- {VIO-LENT/ANGRY}, and K5.1 {SPORTS}. For this reason, the automatic score of this MWE is as low as 0.003953 on the scale of [0, 1]. On the contrary, human raters judged the meaning of this expression to be fairly transparent, giving it a high score of 8.5 on the scale of [0, 10]. Similar contrasts occurred with the majority of the MWEs with rank differences greater than 50, which are responsible for weakening the overall correlation.</Paragraph>
    <Paragraph position="12"> Another interesting case we noticed is the MWE &amp;quot;pass away&amp;quot;. This MWE has two major senses in the semantic lexicon L1- {DIE} and T2- {END} which were ranked separately. Remarkably, they were ranked in the opposite order by human raters and the algorithm. Human raters felt that the sense DIE is less idiomatic, or more compositional, than END, while the algorithm indicated otherwise. The explanation of this again lies in the semantic classification of the lexicon, where &amp;quot;pass&amp;quot; as a single word contains the sense T2- but not L1-. Consequently, the automatic score for &amp;quot;pass away&amp;quot; with the sense  Semantic tags occurring in Table 2: A8 (seem), A9 (giving possession), B2 (health and disease), F2 (drink), K6 (children's games and toys), M3 (land transport), M4 (swimming), P1 (education), S1.1.1 (social actions), S1.1.3 (participation), S2 (people), S3.2 (relationship), T3 (time: age), X1 (psychological actions), X5.2 (excited), Z4 (discourse bin) L1- is much lower (0.001) than that with the sense of T2- (0.007071).</Paragraph>
    <Paragraph position="13"> In order to evaluate our algorithm in comparison with previous work, we also tested it on the manual ranking list created by McCarthy et al (2003).</Paragraph>
    <Paragraph position="14">  We found that 79 of the 116 phrasal verbs in that list are included in the Lancaster semantic lexicon. We applied our algorithm on those 79 items to compare the automatic ranks against the average manual ranks using the Spearman's rank correlation coefficient (rho). As a result, we obtained rho=0.3544 with significance level of p=0.001357. This result is comparable with or better than most measures reported by McCarthy et al (2003).</Paragraph>
  </Section>
class="xml-element"></Paper>