<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1050">
  <Title>Evaluating Centering-based metrics of coherence for text structuring using a reliably annotated corpus</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Exploring the space of possible orderings
</SectionTitle>
    <Paragraph position="0"> In section 2, we discussed how an ordering of utterances in a text like (1) can be translated into a sequence of CF lists, the representation that the Centering-based metrics operate on. We use the term Basis for Comparison (BfC) to denote this sequence of CF lists. In this section, we discuss how the BfC is used in our search-oriented evaluation methodology to calculate a performance measure for each metric and to compare the metrics with each other. In the next section, we will see how our corpus was used to identify the most promising Centering-based metric for a text classifier.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Computing the classification rate
</SectionTitle>
      <Paragraph position="0"> The performance measure we employ is called the classification rate of a metric M on a certain BfC B. The classification rate estimates the ability of M to produce B as the output of text structuring according to a specific generation scenario.</Paragraph>
      <Paragraph position="1"> The first step of seec is to search through the space of possible orderings defined by the permutations of the CF lists that B consists of, and to divide the explored search space into sets of orderings that score better, equal, or worse than B according to M.</Paragraph>
      <Paragraph position="2"> Then, the classification rate is defined according to the following generation scenario. We assume that the better an ordering scores for M, the higher its chances of being selected as the output of text structuring. This in turn means that the fewer the members of the set of better scoring orderings, the better the chances of B being the chosen output.</Paragraph>
      <Paragraph position="3"> Moreover, we assume that additional factors play a role in the selection of one of the orderings that score the same for M. On average, B is expected to sit in the middle of the set of equally scoring orderings with respect to these additional factors. Hence, half of the orderings with the same score will have better chances than B to be selected by M.</Paragraph>
      <Paragraph position="4"> The classification rate u of a metric M on B expresses the expected percentage of orderings with a higher probability of being generated than B, according to the scores assigned by M and the additional biases assumed by the generation scenario, as follows:</Paragraph>
      <Paragraph position="5"> u(M,B) = Better(M) + Equal(M)/2 </Paragraph>
      <Paragraph position="6"> Better(M) stands for the percentage of orderings that score better than B according to M, whilst Equal(M) is the percentage of orderings that score equal to B according to M. If u(Mx,B) is the classification rate of Mx on B, and u(My,B) is the classification rate of My on B, My is a more suitable candidate than Mx for generating B if u(My,B) is smaller than u(Mx,B).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Generalising across many BfCs
</SectionTitle>
      <Paragraph position="0"> In order for the experimental results to be reliable and generalisable, Mx and My should be compared on more than one BfC from a corpus C. In our standard analysis, the BfCs B1,...,Bm from C are treated as the random factor in a repeated measures design, since each BfC contributes a score for each metric. Then, the classification rates for Mx and My on the BfCs are compared with each other and significance is tested using the Sign Test. After calculating the number of BfCs that return a lower classification rate for Mx than for My and vice versa, the Sign Test reports whether the difference in the number of BfCs is significant, that is, whether there are significantly more BfCs with a lower classification rate for Mx than BfCs with a lower classification rate for My (or vice versa). Finally, we summarise the performance of M on m BfCs from C in terms of the average classification rate Y(M,C) = (u(M,B1) + ... + u(M,Bm)) / m. We will now discuss how this methodology was used to compare the Centering-based metrics of section 3, using the original ordering of texts in the gnome corpus to compute the average classification rate of each metric.</Paragraph>
      <Paragraph position="1"> The gnome corpus contains texts from different genres, not all of which are of interest to us. In order to restrict the scope of the experiment to the text type most relevant to our study, we selected 20 &amp;quot;museum labels&amp;quot;, i.e., short texts that describe a concrete artefact, which served as the input to seec together with the metrics in section 3.</Paragraph>
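The pairwise comparison can be sketched as an exact two-tailed binomial test on the direction of the paired differences, with ties discarded, alongside the average classification rate Y. The function and variable names below are illustrative, not from the paper.

```python
from math import comb

def sign_test(rates_x, rates_y):
    """Two-tailed Sign Test on paired classification rates.
    Returns (wins for Mx, wins for My, p value); ties are discarded."""
    x_wins = sum(1 for x, y in zip(rates_x, rates_y) if x < y)  # lower rate wins
    y_wins = sum(1 for x, y in zip(rates_x, rates_y) if y < x)
    n, k = x_wins + y_wins, min(x_wins, y_wins)
    # Probability of a split at least this extreme under a fair coin, doubled.
    p = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)
    return x_wins, y_wins, p

def average_classification_rate(rates):
    """Y(M, C): mean of u(M, Bi) over the m BfCs of corpus C."""
    return sum(rates) / len(rates)
```

With a split of 18 BfCs to 2 and no ties, as in the M.NOCB versus M.CHEAP comparison reported later, this test gives p = 2 * (C(20,0) + C(20,1) + C(20,2)) / 2^20, roughly 0.0004.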
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Permutation and search strategy
</SectionTitle>
      <Paragraph position="0"> In specifying the performance of the metrics we made use of a simple permutation heuristic exploiting a piece of domain-specific communication knowledge (Kittredge et al., 1991). Like Dimitromanolaki and Androutsopoulos (2003), we noticed that utterances like (a) in example (1) should always appear at the beginning of a felicitous museum label. Hence, we restricted the orderings considered by the seec to those in which the first CF list of B, CF1, appears in first position. For very short texts like (1), which give rise to a small BfC, the search space of possible orderings can be enumerated exhaustively. However, when B consists of many more CF lists, it is impractical to explore the search space in this way. Elsewhere we show that even in these cases it is possible to estimate u(M,B) reliably for the whole population of orderings using a large random sample. In the experiments reported here, we had to resort to random sampling only once, for a BfC with 16 CF lists.</Paragraph>
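The search strategy above can be sketched as follows: CF1 is pinned to first position, and for long BfCs u(M, B) is estimated from a random sample of orderings rather than an exhaustive enumeration. m_nocb, the cut-off of 8 remaining CF lists, and the sample size are our illustrative choices, not values from the paper.

```python
import random
from itertools import permutations

def m_nocb(ordering):
    """Number of NOCB transitions (adjacent CF lists sharing no entity)."""
    return sum(1 for a, b in zip(ordering, ordering[1:]) if not (a & b))

def sampled_classification_rate(metric, bfc, max_exhaustive=8, sample_size=10000):
    """Estimate u(M, B) over orderings that keep CF1 in first position."""
    score_b = metric(bfc)
    first, rest = bfc[0], list(bfc[1:])
    if len(rest) <= max_exhaustive:
        # Small BfC: enumerate every ordering with CF1 fixed first.
        orderings = ([first] + list(p) for p in permutations(rest))
    else:
        # Large BfC: draw random permutations of the remaining CF lists.
        orderings = ([first] + random.sample(rest, len(rest))
                     for _ in range(sample_size))
    better = equal = total = 0
    for ordering in orderings:
        total += 1
        score = metric(ordering)
        if score < score_b:      # lower score = better ordering
            better += 1
        elif score == score_b:
            equal += 1
    return 100.0 * (better + equal / 2.0) / total
```

The fallback matters because a BfC of 16 CF lists with the first list pinned still leaves 15! (about 1.3 trillion) orderings, far too many to enumerate.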
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Comparing M.NOCB with other metrics
</SectionTitle>
      <Paragraph position="0"> The experimental results of the comparisons of the metrics from section 3, computed using the methodology in section 4, are reported in Table 3.</Paragraph>
      <Paragraph position="1"> In this table, the baseline metric M.NOCB is compared with each of M.CHEAP, M.KP and M.BFP. The first column of the table identifies the comparison in question, e.g. M.NOCB versus M.CHEAP. The exact number of BfCs for which the classification rate of M.NOCB is lower than that of its competitor is reported in the next column. For example, M.NOCB has a lower classification rate than M.CHEAP for 18 (out of 20) BfCs from the gnome corpus. M.CHEAP only achieves a lower classification rate for 2 BfCs, and there are no ties, i.e. cases where the classification rate of the two metrics is the same. The p value returned by the Sign Test for the difference in the number of BfCs, rounded to the third decimal place, is reported in the fifth column. The last column of Table 3 shows M.NOCB as the &amp;quot;winner&amp;quot; of the comparison with M.CHEAP, since it has a lower classification rate than its competitor for significantly more BfCs in the corpus. (Note that we assume that when the set of CF lists serves as the input to text structuring, CF1 will be identified as the initial CF list of the ordering to be generated, using annotation features such as the unit type which distinguishes (a) from the other utterances in (1).) Overall, the table shows that M.NOCB does significantly better than the other three metrics, which employ additional Centering concepts.</Paragraph>
      <Paragraph position="2"> This result means that there exist proportionally fewer orderings with a higher probability of being selected than the BfC when M.NOCB is used to guide the hypothetical text structuring algorithm instead of the other metrics.</Paragraph>
      <Paragraph position="3"> Hence, M.NOCB is the most suitable among the investigated metrics for structuring the CF lists in gnome. This in turn indicates that simply avoiding nocb transitions is more relevant to text structuring than the combinations of the other Centering notions that the more complicated metrics make use of. (However, these notions might still be appropriate for other tasks, such as anaphora resolution.)</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Discussion: the performance of M.NOCB
</SectionTitle>
    <Paragraph position="0"> We already saw that Poesio et al. (2004) found that the majority of the recorded transitions in the configuration of Centering used in this study are nocbs. However, we also explained in section 2.3 that what really matters when trying to determine whether a text might have been generated paying attention only to Centering constraints is the extent to which it would be possible to 'improve' upon the ordering chosen in that text, given the information that the text structuring algorithm had to convey. The average classification rate of M.NOCB is an estimate of exactly this variable, indicating whether M.NOCB is likely to arrive at the BfC during text structuring. (No winner is reported for a comparison when the p value returned by the Sign Test is not significant (ns), i.e. greater than 0.05. Note also that despite conducting more than one pairwise comparison simultaneously, we refrain from further adjusting the overall threshold of significance (e.g. according to the Bonferroni method, typically used for multiple planned comparisons that employ parametric statistics), since it is assumed that choosing a conservative statistic such as the Sign Test already provides substantial protection against this risk.)</Paragraph>
    <Paragraph position="1"> The average classification rate Y for M.NOCB on the subcorpus of gnome studied here, for the parameter configuration of Centering we have assumed, is 19.95%. This means that on average the BfC is close to the top 20% of alternative orderings when these orderings are ranked according to their probability of being selected as the output of the algorithm.</Paragraph>
    <Paragraph position="2"> On the one hand, this result shows that although the ordering of CF lists in the BfC might not completely minimise the number of observed nocb transitions, the BfC tends to be in greater agreement with the preference to avoid nocbs than most of the alternative orderings. In this sense, it appears that the BfC optimises with respect to the number of potential nocbs to a certain extent. On the other hand, this result indicates that there are quite a few orderings which would appear more likely to be selected than the BfC.</Paragraph>
    <Paragraph position="3"> We believe this finding can be interpreted in two ways. One possibility is that M.NOCB needs to be supplemented by other features in order to explain why the original text was structured this way. This is the conclusion arrived at by Poesio et al. (2004) and those text structuring practitioners who use notions derived from Centering in combination with other coherence constraints in the definitions of their metrics.</Paragraph>
    <Paragraph position="4"> There is also a second possibility, however: we might want to reconsider the assumption that human text planners are trying to ensure that each utterance in a text is locally coherent.</Paragraph>
      <Paragraph position="5"> They might do all of their planning just on the basis of Centering constraints, at least in this genre (perhaps because of resource limitations), and simply accept a certain degree of incoherence. Further research on this issue will require psycholinguistic methods; our analysis nevertheless sheds more light on two previously unaddressed questions in the corpus-based evaluation of Centering: (a) which of the Centering notions are most relevant to the text structuring task, and (b) to what extent Centering on its own can be useful for this purpose.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Further results
</SectionTitle>
    <Paragraph position="0"> In related work, we applied the methodology discussed here to a larger set of existing data (122 BfCs) derived from the MPIRO system and ordered by a domain expert (Dimitromanolaki and Androutsopoulos, 2003). As Table 4 shows, the results from MPIRO confirm the ones reported here, especially with respect to M.KP and M.CHEAP, which are overwhelmingly beaten by the baseline in the new domain as well. Also note that since M.BFP fails to overtake M.NOCB in MPIRO, the baseline can be considered the most promising solution among the ones investigated in both domains, by appeal to Occam's principle of parsimony.</Paragraph>
    <Paragraph position="1"> We also tried to account for some additional constraints on coherence, namely local rhetorical relations, based on some of the assumptions in Knott et al. (2001), and what Karamanis (2003) calls the &amp;quot;PageFocus&amp;quot;, which corresponds to the main entity described in a text (de374 in our example). These results, reported in (Karamanis, 2003), indicate that these constraints conflict with Centering as formulated in this paper, increasing rather than reducing the classification rates of the metrics. Hence, it remains unclear to us how to improve upon M.NOCB.</Paragraph>
    <Paragraph position="2"> In our future work, we would like to experiment with more metrics. Moreover, although we consider the parameter configuration of Centering used here a plausible choice, we intend to apply our methodology to study different instantiations of the Centering parameters, e.g. by investigating whether &amp;quot;indirect realisation&amp;quot; reduces the classification rate for M.NOCB compared to &amp;quot;direct realisation&amp;quot;, etc.</Paragraph>
  </Section>
</Paper>