
<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1320">
  <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. An Analysis of Quantitative Aspects in the Evaluation of Thematic Segmentation Algorithms</Title>
  <Section position="9" start_page="147" end_page="149" type="evalu">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="147" end_page="148" type="sub_section">
      <SectionTitle>
5.1 Test Procedure
</SectionTitle>
      <Paragraph position="0"> For the three datasets, we first performed two common preprocessing steps: common words were eliminated using the same stop-list, and the remaining words were stemmed with Porter's algorithm (1980). Next, we ran the three segmenters described in Section 2, using the default values for all system parameters and letting the systems estimate the number of thematic boundaries. We also considered the fact that the C99 and TextSeg algorithms can take a fixed number of thematic boundaries as input. Although the number of segments per document varies in the TDT and meeting reference data, in a real application it would be impossible to provide the systems with the exact number of boundaries for each document to be segmented. Therefore, we ran the C99 and TextSeg algorithms a second time, providing them only with the average number of segments per document in the reference data, which gives an estimate of the expected level of segmentation granularity.</Paragraph>
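The two preprocessing steps can be sketched as follows (a minimal sketch: the stop-list shown is a hypothetical fragment, and crude_stem is a toy suffix stripper standing in for Porter's full 1980 algorithm):

```python
# Hypothetical stop-list fragment; the paper uses a single fixed stop-list
# shared across all three datasets.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def crude_stem(word):
    """Toy suffix stripper; a stand-in for Porter's (1980) stemmer."""
    for suffix in ("ation", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(utterance):
    """Lower-case, drop stop words, stem the remaining tokens."""
    tokens = utterance.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The speakers are discussing segmentation boundaries"))
# → ['speaker', 'discuss', 'segment', 'boundari']
```

In practice the full Porter stemmer handles many more suffix classes and measure conditions; the sketch only illustrates where stemming sits in the pipeline.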
      <Paragraph position="1"> Four additional naive segmentations were also used for evaluation: no boundaries, where the whole text forms a single segment; all boundaries, where a thematic boundary is placed after each utterance; random known, with the same number of boundaries as the gold standard, distributed randomly throughout the text; and random unknown, where the number of boundaries is randomly selected and the boundaries are randomly distributed throughout the text. Each segmentation was evaluated with Pk, P'k and WindowDiff, as described in Section 4.</Paragraph>
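The four naive baselines can be sketched as sets of boundary positions (the function name and seed handling are illustrative choices, not part of the original experimental setup):

```python
import random

def naive_segmentations(n_utterances, n_ref_boundaries, seed=0):
    """Build the four degenerate baselines as sets of boundary positions
    (a boundary after utterance i, for i from 0 to n_utterances - 2)."""
    rng = random.Random(seed)
    positions = range(n_utterances - 1)
    return {
        "no boundaries": set(),
        "all boundaries": set(positions),
        # same number of boundaries as the gold standard, placed at random
        "random known": set(rng.sample(positions, n_ref_boundaries)),
        # both the count and the placement are random
        "random unknown": set(
            rng.sample(positions, rng.randint(0, n_utterances - 1))
        ),
    }
```

Representing a segmentation as a set of boundary positions makes the later metric computations straightforward: a hypothesis is scored purely by where its boundaries fall relative to the reference.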
    </Section>
    <Section position="2" start_page="148" end_page="148" type="sub_section">
      <SectionTitle>
5.2 Comparative Performance of
Segmentation Systems
</SectionTitle>
      <Paragraph position="0"> The results of applying each segmentation algorithm to the three datasets are summarized in Figures 1, 2 and 3. Error values are given as percentages, and the following abbreviations are used: WD denotes the WindowDiff error metric; TextSeg KA denotes the TextSeg algorithm (Utiyama and Isahara, 2001) when the average number of boundaries in the reference data was provided to it; C99 KA denotes the C99 algorithm (Choi, 2000) under the same condition; N0 denotes the segmentation with no boundaries; All denotes the degenerate segmentation with all boundaries; RK denotes the random known segmentation; and RU denotes the random unknown segmentation.</Paragraph>
    </Section>
    <Section position="3" start_page="148" end_page="149" type="sub_section">
      <SectionTitle>
5.2.1 Comparison of System Performance from Artificial to Realistic Data
</SectionTitle>
      <Paragraph position="0"> Moving from artificial data to more realistic data, we expect more noise and therefore consistently degraded algorithm performance; our experiments show, however, that the rankings themselves can reverse. More precisely: as can be seen from Figure 1, both the C99 and TextSeg algorithms significantly outperformed the TextTiling algorithm on the artificially created dataset when the number of segments was determined by the systems. A comparison between the error rates in Figure 1 and Figure 2 shows that C99 and TextSeg follow a similar trend, with significantly lower performance on TDT data, while still giving better results than TextTiling on that data. When the systems are compared by the Pr error metric, C99 performs similarly to TextTiling on meeting data (see Figure 3). Moreover, when assessment is done using WindowDiff, Pk or P'k, both C99 and TextSeg came out worse than TextTiling on meeting data. This demonstrates that the rankings obtained on artificial data differ from those obtained on realistic data. An alternative interpretation takes into account that the degenerate no boundaries segmentation has an error rate of only 30% by the WindowDiff, Pk and P'k metrics on meeting data: all three systems could be seen as giving completely wrong segmentations on meeting data, because topic shifts there are subtler and not as abrupt as in the TDT and artificial data. Nevertheless, we tend to adopt the first interpretation, given the weaknesses of Pk, P'k and WindowDiff (where misses are penalised less than false alarms), as discussed in Section 4.</Paragraph>
      <Paragraph position="1"> Following the quantitative assessment given by the WindowDiff metric, we observe that the algorithm labeled N0 scores three times better than the algorithm All on meeting data (see Figure 3), while the same N0 is only two times better than All on the artificial data (see Figure 1). This confirms the limitation of the WindowDiff metric discussed in Section 4.</Paragraph>
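The asymmetry between misses (N0) and false alarms (All) can be seen in a minimal WindowDiff sketch (the boundary-indicator representation and the chosen window size are illustrative; the metric follows Pevzner and Hearst's idea of comparing boundary counts per sliding window):

```python
def window_diff(ref, hyp, k):
    """WindowDiff sketch: slide a window of size k over the gap sequence
    and count the windows where reference and hypothesis disagree on the
    number of boundaries inside it. `ref` and `hyp` are indicator lists
    (1 = boundary after that utterance)."""
    n = len(ref)
    errors = sum(
        sum(ref[i : i + k]) != sum(hyp[i : i + k])
        for i in range(n - k)
    )
    return errors / (n - k)

ref = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]  # boundaries after gaps 3 and 7
print(window_diff(ref, [0] * 12, 3))  # N0: misses only, error well below 1
print(window_diff(ref, [1] * 12, 3))  # All: every window disagrees → 1.0
```

A pure-miss hypothesis still matches the reference in windows that contain no boundary at all, whereas an all-boundaries hypothesis disagrees in every window, which is one way the metric treats the two error types unevenly.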
      <Paragraph position="2"> The four error metrics described in detail in Section 4 show that knowing the average number of boundaries has a positive effect on C99 when testing on meeting data. However, if we take all four error metrics into account, it is difficult to draw definite conclusions regarding the influence of knowing the average number of boundaries on the TextSeg and C99 algorithms.</Paragraph>
      <Paragraph position="3"> For example, when tested on TDT data, C99 KA seems to work better than C99 by the Pk and P'k metrics, while the WindowDiff metric gives a contradictory assessment.</Paragraph>
    </Section>
  </Section>
</Paper>