<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1621"> <Title>Using a Corpus of Sentence Orderings Defined by Many Experts to Evaluate Metrics of Coherence for Text Structuring</Title> <Section position="8" start_page="0" end_page="3" type="evalu"> <SectionTitle> 7 Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.1 Distances between the expert pairs </SectionTitle> <Paragraph position="0"> As the first step in our analysis, we computed the T score for each expert pair, namely T(E0E1), T(E0E2), T(E0E3), T(E1E2), T(E1E3) and T(E2E3). We then performed all 15 pairwise comparisons between these scores using the Tukey test; the results are summarised in Table 1 (footnote 11). The cells in the Table report the level of significance returned by the Tukey test when the difference between two distances exceeds the critical difference (CD). Footnote 10: Criterion (ii) can only be applied provided that the average distance between the experts and at least one metric Mx is found to be significantly lower than T(EXPEXP). Then, if the average distance between the experts and another metric My does not differ significantly from T(EXPEXP), My performs better than Mx.</Paragraph> <Paragraph position="1"> [Table caption, truncated: ... E1, E2, E3) and the random baseline (RB)] Significance beyond the 0.05 threshold is reported with one asterisk (*), while significance beyond the 0.01 threshold is reported with two asterisks (**). A cell remains empty when the difference between two distances does not exceed the critical difference. For example, the value of T(E0E1) is 0.692 and the value of T(E0E3) is 0.258. Since their difference exceeds the CD at the 0.01 threshold, the Tukey test reports it as significant beyond that level, as shown in the top cell of the third column in Table 1.</Paragraph> <Paragraph position="2"> As the Table shows, the T scores for the distance between E0 and E1 or E2, i.e.
T(E0E1) and T(E0E2), as well as the T score for the distance between E1 and E2, i.e. T(E1E2), are quite high, which indicates that on average the orderings of these three experts are quite close to each other. Moreover, these T scores are not significantly different from each other, which suggests that E0, E1 and E2 share quite a lot of common ground in the ordering task. Hence, E0 is found to produce orderings similar to those of E1 and E2.</Paragraph> <Paragraph position="3"> However, when any of the previous distances is compared with a distance that involves the orderings of E3, the difference is significant, as shown by the cells containing two asterisks in Table 1. In other words, although the orderings of E1 and E2 seem to deviate from each other and from the orderings of E0 to more or less the same extent, the orderings of E3 stand much further away from all of them. Hence, there exists a &quot;stand-alone&quot; expert among the ones consulted in our studies, yet this is not E0 but E3.</Paragraph> <Paragraph position="4"> This finding can be easily explained by the fact that, in contrast to the other three experts, E3 followed a very schematic approach to ordering sentences. Because the orderings of E3 manifest rather peculiar strategies, at least compared to the orderings of E0, E1 and E2, the upper bound of the analysis, i.e. the average distance between the expert pairs T(EXPEXP), is computed without taking these orderings into account: (2) T(EXPEXP) = (T(E0E1) + T(E0E2) + T(E1E2)) / 3 = 0.722</Paragraph> </Section> <Section position="2" start_page="0" end_page="3" type="sub_section"> <SectionTitle> 7.2 Distances between the experts and RB </SectionTitle> <Paragraph position="0"> As the upper part of Table 2 shows, the T score between any two experts other than E3 is significantly greater, beyond the 0.01 threshold, than their distance from RB.
Only the distances between E3 and another expert, shown in the lower section of Table 2, are not significantly different from the distance between E3 and RB.</Paragraph> <Paragraph position="1"> Although this result does not mean that the orderings of E3 are similar to the orderings of RB (footnote 12), it shows that E3 is roughly as far away from, e.g., E0 as she is from RB. By contrast, E0 stands significantly closer to E1 than to RB, and the same holds for the other distances in the upper part of the Table.</Paragraph> <Paragraph position="2"> In accordance with the discussion in the previous section, the lower bound, i.e. the overall average distance between the experts (excluding E3) and RB, T(EXPRB), is computed as shown in (3): (3) T(EXPRB) = (T(E0RB) + T(E1RB) + T(E2RB)) / 3 = 0.341. 7.3 Distances between the experts and each metric. So far, E3 has been identified as a &quot;stand-alone&quot; expert standing further away from the other three experts than they stand from each other. We also identified the distance between E3 and each expert as similar to her distance from RB.</Paragraph> <Paragraph position="3"> Similarly, E3 was found to stand further away from the metrics than the other three experts do (footnote 13). This result gives rise to the set of formulas in (4) for calculating the overall average distance between the experts (excluding E3) and each metric.</Paragraph> <Paragraph position="4"> In the next section, we present the concluding analysis for this study, which compares the overall distances in formulas (2), (3) and (4) with each other. As we have already mentioned, T(EXPEXP) serves as the upper bound of the analysis whereas T(EXPRB) is the lower bound.
The aim is to specify which scores in (4) are significantly greater than T(EXPRB), but not significantly lower than T(EXPEXP).</Paragraph> </Section> <Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 7.4 Concluding analysis </SectionTitle> <Paragraph position="0"> The results of the comparisons of the scores in (2), (3) and (4) are shown in Table 3. As the top cell in the last column of the Table shows, the T score between the experts and RB, T(EXPRB), is significantly lower than the average distance between the expert pairs, T(EXPEXP), at the 0.01 level.</Paragraph> <Paragraph position="1"> Footnote 12: This could have been argued if the value of T(E3RB) had been much closer to 1.</Paragraph> <Paragraph position="2"> Footnote 13: Due to space restrictions, we cannot report the scores for these comparisons here. The reader is referred to Table 9.4 on page 175 of Chapter 9 in [Karamanis, 2003].</Paragraph> <Paragraph position="3"> This result verifies one of our main predictions, showing that the orderings of the experts (modulo E3) stand much closer to each other than to randomly assembled orderings.</Paragraph> <Paragraph position="4"> As expected, most of the scores that involve the metrics are not significantly different from each other, except for T(EXPPF.BFP), which is significantly greater than T(EXPM.NOCB) at the 0.05 level. Yet, what we are mainly interested in is how the distance between the experts and each metric compares with T(EXPEXP) and T(EXPRB). This is shown in the first row and the last column of Table 3.</Paragraph> <Paragraph position="5"> Crucially, T(EXPRB) is significantly lower than T(EXPPF.BFP) as well as T(EXPPF.NOCB) and T(EXPPF.KP) at the 0.01 level. Notably, even the distance of the experts from M.NOCB, T(EXPM.NOCB), is significantly greater than T(EXPRB), albeit at the 0.05 level.
These results show that the distance from the experts is significantly reduced when using the best scoring orderings of any metric, even M.NOCB, instead of the orderings of RB. Hence, all metrics score significantly better than RB in this experiment.</Paragraph> <Paragraph position="6"> However, simply using M.NOCB to output the best scoring orderings is not enough to yield a distance from the experts which is comparable to T(EXPEXP). Although the PF constraint appears to help in this direction, T(EXPPF.KP) remains significantly lower than T(EXPEXP), whereas the difference between T(EXPPF.NOCB) and T(EXPEXP) falls only 0.009 points short of the CD at the 0.05 threshold. Hence, PF.BFP is the most robust metric, as the difference between T(EXPPF.BFP) and T(EXPEXP) is clearly not significant. Finally, the difference between T(EXPPF.NOCB) and T(EXPM.NOCB) is only 0.006 points away from the CD.</Paragraph> <Paragraph position="7"> This result shows that the distance from the experts is reduced to a great extent when the best scoring orderings are computed according to PF.NOCB instead of simply M.NOCB.</Paragraph> <Paragraph position="8"> Hence, this experiment provides additional evidence in favour of enhancing M.NOCB with the PF constraint of coherence, as suggested in [Karamanis, 2003].</Paragraph> </Section> </Section> </Paper>