<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1621">
  <Title>Using a Corpus of Sentence Orderings Defined by Many Experts to Evaluate Metrics of Coherence for Text Structuring</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Metrics of coherence
</SectionTitle>
    <Paragraph position="0"> [Karamanis, 2003] discusses how a few basic notions of coherence captured by Centering Theory (CT) can be used to define a large range of metrics which might be useful for TS in our domain of interest.3 The metrics employed in the experiments of [Karamanis et al., 2004] include: M.NOCB which penalises NOCBs, i.e. pairs of adjacent facts without any arguments in common [Karamanis and Manurung, 2002]. Because of its simplicity M.NOCB serves as the first baseline in the experiments of [Karamanis et al., 2004].</Paragraph>
    <Paragraph position="1"> PF.NOCB, a second baseline, which enhances M.NOCB with a global constraint on coherence that [Karamanis, 2003] calls the PageFocus (PF).</Paragraph>
    <Paragraph position="2"> PF.BFP which is based on PF as well as the original formulation of CT in [Brennan et al., 1987].</Paragraph>
    <Paragraph position="3"> PF.KP which makes use of PF as well as the recent reformulation of CT in [Kibble and Power, 2000].</Paragraph>
    <Paragraph position="4"> [Karamanis et al., 2004] report that PF.NOCB outperformed M.NOCB but was overtaken by PF.BFP and PF.KP. The two metrics beating PF.NOCB were not found to differ significantly from each other.</Paragraph>
    <Paragraph position="5"> This study employs PF.BFP and PF.KP, i.e. two of the best performing metrics of the experiments in [Karamanis et al., 2004], as well as M.NOCB and PF.NOCB, the two previously used baselines. An additional random baseline is also defined following [Lapata, 2003].</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Data collection
</SectionTitle>
    <Paragraph position="0"> 16 sets of facts were randomly selected from the corpus of [Dimitromanolaki and Androutsopoulos, 2003].4 The sentences that each fact corresponds to and the order defined by E0 was made available to us as well. We will subsequently refer to an unordered set of facts (or sentences that the facts correspond to) as a Testitem.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Generating the BestOrders for each metric
</SectionTitle>
      <Paragraph position="0"> Following [Karamanis et al., 2004], we envisage a TS approach in which a metric of coherence M assigns a score to 3Since discussing the metrics in detail is well beyond the scope of this paper, the reader is referred to Chapter 3 of [Karamanis, 2003] for more information on this issue.</Paragraph>
      <Paragraph position="1"> 4These are distinct from, yet very similar to, the sets of facts used in [Karamanis et al., 2004].</Paragraph>
      <Paragraph position="2"> each possible ordering of the input set of facts and selects the best scoring ordering as the output. When many orderings score best, M chooses randomly between them. Crucially, our hypothetical TS component only considers orderings starting with the subclass fact (e.g. subclass(ex1, amph) in Figure 1) following the suggestion of [Dimitromanolaki and Androutsopoulos, 2003]. This gives rise to 5! = 120 orderings to be scored by M for each Testitem.</Paragraph>
      <Paragraph position="3"> For the purposes of this experiment, a simple algorithm was implemented that first produces the 120 possible orderings of facts in a Testitem and subsequently ranks them according to the scores given by M. The algorithm outputs the set of BestOrders for the Testitem, i.e. the orderings which score best according to M. This procedure was repeated for each metric and all Testitems employed in the experiment.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Random baseline
</SectionTitle>
      <Paragraph position="0"> Following [Lapata, 2003], a random baseline (RB) was implemented as the lower bound of the analysis. The random baseline consists of 10 randomly selected orderings for each Testitem. The orderings are selected irrespective of their scores for the various metrics.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Consulting domain experts
</SectionTitle>
      <Paragraph position="0"> Three archaeologists (E1, E2, E3), one male and two females, between 28 and 45 years of age, all trained in cataloguing and museum labelling, were recruited from the Department of Classics at the University of Edinburgh.</Paragraph>
      <Paragraph position="1"> Each expert was consulted by the first author in a separate interview. First, she was presented with a set of six sentences, each of which corresponded to a database fact and was printed on a different filecard, as well as with written instructions describing the ordering task.5 The instructions mention that the sentences come from a computer program that generates descriptions of artefacts in a virtual museum. The first sentence for each set was given by the experimenter.6 Then, the expert was asked to order the remaining five sentences in a coherent text.</Paragraph>
      <Paragraph position="2"> When ordering the sentences, the expert was instructed to consider which ones should be together and which should come before another in the text without using hints other than the sentences themselves. She could revise her ordering at any time by moving the sentences around. When she was satisfied with the ordering she produced, she was asked to write next to each sentence its position, and give them to the experimenter in order to perform the same task with the next randomly selected set of sentences. The expert was encouraged to comment on the difficulty of the task, the strategies she followed, etc.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Dependent variable
</SectionTitle>
    <Paragraph position="0"> Given an unordered set of sentences and two possible orderings, a number of measures can be employed to calculate the  distance between them. Based on the argumentation in [Howell, 2002], [Lapata, 2003] selects Kendall's ? as the most appropriate measure and this was what we used for our analysis as well. Kendall's ? is based on the number of inversions between the two orderings and is calculated as follows:</Paragraph>
    <Paragraph position="2"> PN stands for the number of pairs of sentences and N is the number of sentences to be ordered.7 I stands for the number of inversions, that is, the number of adjacent transpositions necessary to bring one ordering to another. Kendall's ? ranges from !1 (inverse ranks) to 1 (identical ranks). The higher the ? value, the smaller the distance between the two orderings.</Paragraph>
    <Paragraph position="3"> Following [Lapata, 2003], the Tukey test is employed to investigate significant differences between average ? scores.8 First, the average distance between (the orderings of)9 two experts e.g. E0 and E1, denoted as T(E0E1), is calculated as the mean ? value between the ordering of E0 and the ordering of E1 taken across all 16 Testitems. Then, we compute T(EXPEXP) which expresses the overall average distance between all expert pairs and serves as the upper bound for the evaluation of the metrics. Since a total of E experts gives rise to PE = E(E!1)2 expert pairs, T(EXPEXP), is computed by summing up the average distances between all expert pairs and dividing the sum by PE.</Paragraph>
    <Paragraph position="4"> While [Lapata, 2003] always appears to single out a unique best scoring ordering, we often have to deal with many best scoring orderings. To account for this, we first compute the average distance between e.g. the ordering of an expert E0 and the BestOrders of a metric M for a given Testitem. In this way, M is rewarded for a BestOrder that is close to the expert's ordering, but penalised for every BestOrder that is not. Then, the average T(E0M) between the expert E0 and the metric M is calculated as their mean distance across all 16 Testitems. Finally, yet most importantly, T(EXPM) is the average distance between all experts and M. It is calculated by summing up the average distances between each expert and M and dividing the sum by the number of experts. As the next section explains in more detail, T(EXPM) is compared with the upper bound of the evaluation T(EXPEXP) to estimate the performance of M in our experiments.</Paragraph>
    <Paragraph position="5"> RB is evaluated in a similar way as M using the 10 randomly selected orderings instead of the BestOrders for each Testitem. T(EXPRB) is the average distance between all experts and RB and is used as the lower bound of the evaluation.  can be used to specify which of the conditions c1;:::;cn measured by the dependent variable differ significantly. It uses the set of means m1;:::;mn (corresponding to conditions c1;:::;cn) and the mean square error of the scores that contribute to these means to calculate a critical difference between any two means. An observed difference between any two means is significant if it exceeds the critical difference.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Predictions
</SectionTitle>
    <Paragraph position="0"> Despite any potential differences between the experts, one expects them to share some common ground in the way they order sentences. In this sense, a particularly welcome result for our purposes is to show that the average distances between E0 and most of her colleagues are short and not significantly different from the distances between the other expert pairs, which in turn indicates that she is not a &amp;quot;stand-alone&amp;quot; expert. Moreover, we expect the average distance between the expert pairs to be significantly smaller than the average distance between the experts and RB. This is again based on the assumption that even though the experts might not follow completely identical strategies, they do not operate with absolute diversity either. Hence, we predict that T(EXPEXP) will be significantly greater than T(EXPRB).</Paragraph>
    <Paragraph position="1"> Due to the small number of Testitems employed in this study, it is likely that the metrics do not differ significantly from each other with respect to their average distance from the experts. Rather than comparing the metrics directly with each other (as [Karamanis et al., 2004] do), this study compares them indirectly by examining their behaviour with respect to the upper and the lower bound. For instance, although T(EXPPF:KP) and T(EXPPF:BFP) might not be significantly different from each other, one score could be significantly different from T(EXPEXP) (upper bound) and/or T(EXPRB) (lower bound) while the other is not.</Paragraph>
    <Paragraph position="2"> We identify the best metrics in this study as the ones whose average distance from the experts (i) is significantly greater from the lower bound and (ii) does not differ significantly from the upper bound.10</Paragraph>
  </Section>
class="xml-element"></Paper>