<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2914">
  <Title>Word Distributions for Thematic Segmentation in a Support Vector Machine Approach</Title>
  <Section position="9" start_page="104" end_page="106" type="evalu">
    <SectionTitle>
6 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="104" end_page="105" type="sub_section">
      <SectionTitle>
6.1 Evaluation Measures
</SectionTitle>
      <Paragraph position="0"> Beeferman et al. (1999) underlined that the standard evaluation metrics of precision and recall are inadequate for thematic segmentation, namely by the fact that these metrics did not account for how far away a hypothesized boundary (i.e. a boundary found by the automatic procedure) is from the reference boundary. On the other hand, for instance, an algorithm that places a boundary just one utterance away from the reference boundary should be penalized less than an algorithm that places a boundary ten (or more) utterances away from the reference boundary.</Paragraph>
      <Paragraph position="1"> Hence the use of two other evaluation metrics is favored in thematic segmentation: the Pk metric (Beeferman et al., 1999) and the WindowDiff error metric (Pevzner and Hearst, 2002). In con- null trast to precision and recall, these metrics allow for a slight vagueness in where the hypothesized thematic boundaries are placed and capture &amp;quot;the notion of nearness in a principled way, gently penalizing algorithms that hypothesize boundaries that aren't quite right, and scaling down with the algorithm's degradation&amp;quot; (Beeferman et al., 1999). That is, computing both Pk and WindowDiff metrics involves the use of a fixed-size (i.e. having a fixed number of either words or utterances) window that is moved step by step over the data. At each step, Pk and WindowDiff are basically increased (each metric in a slightly different way) if the hypothesized boundaries and the reference boundaries are not within the same window.</Paragraph>
      <Paragraph position="2"> During the model selection phase, we used precision and recall in order to measure the system's error rate. This was motivated by the fact that posing the TS task as a classification problem leads to a loss of the sequential nature of the data, which is an inconvenient in computing the Pk and WindowDiff measures. However, during the final testing phase of our system, as well as for the evaluation of the previous systems, we use both the Pk and the WindowDiff error metric.</Paragraph>
      <Paragraph position="3"> The relatively small size of our datasets does not allow for dividing our test set into multiple sub-test sets for applying statistical significance tests. This would be desirable in order to indicate whether the differences in system error rates are statistically significant over different data sets. Nevertheless, we believe that measuring differences in error rates obtained on the test set is indicative of the relative performance. Thus, the experimental results shown in this paper should be considered as illustrative rather than exhaustive.</Paragraph>
    </Section>
    <Section position="2" start_page="105" end_page="106" type="sub_section">
      <SectionTitle>
6.2 Results
</SectionTitle>
      <Paragraph position="0"> In order to determine the adequacy of our SVM approach over different genres, we ran our system over three datasets, namely the ICSI meeting data, the TDT broadcast data and the Brown written genre data.</Paragraph>
      <Paragraph position="1"> By measuring the system error rates using the Pk and the WindowDiff metrics, Figure 1 summarizes the quantitative results obtained in our empirical evaluation. In Figure 1, our SVM approach is labeled as SVM and we abbreviate WindowDiff as WD. The results of our SVM system correspond to the parameter values detected during model selection (see Table 2). We compare our system against an existing thematic segmenter in the literature: C99 (Choi, 2000). We also give for comparison the error rates of a naive algorithm, labeled as Rand algorithm, which randomly distributes boundaries throughout the text.</Paragraph>
      <Paragraph position="2"> The LCseg system (Galley et al., 2003), labeled here as G03, is to our knowledge the only word distribution based system evaluated on ICSI meeting data. Therefore, we replicate the results reported by (Galley et al., 2003) when evaluation of LCseg was done on ICSI data. The so-labeled G03* algorithm  indicates the error rates obtained by (Galley et al., 2003) when extra (meeting specific) features have been adopted in a decision tree classifier. However, note that the results reported by (Galley et al.) are not directly comparable with our results because of a slight difference in the evaluation procedure: (Galley et al.) performed 25-fold cross validation and the average Pk and WD error rates have been computed on the held-out sets.</Paragraph>
      <Paragraph position="3"> Figure 1 illustrates the following interesting results. For the ICSI meeting data, our SVM approach provides the best performance relative to the competing word distribution based state-of-the-art methods. This proves that our SVM-based system is able to build a parametric model that leads to a segmentation that highly correlates to a human thematic segmentation. Furthermore, by taking into account the relatively small size of the data set we used for training, it can be concluded that the SVM can build qualitatively good models even with a small training data. The work of (Galley et al., 2003) shows that the G03* algorithm is better than G03 by approximately 10%, which indicates that on meeting data the performance of our word-distribution based approach could possibly be increased by using other meeting-specific features.</Paragraph>
      <Paragraph position="4"> By examining the error rates given by Pk metric for the three systems on the TDT data set, we observe that our system and C99 performed more or less equally. With respect to the WindowDiff metric, our system has an error rate approximately 10% smaller than C99.</Paragraph>
      <Paragraph position="5"> On the synthetic data set, the SVM approach performed slightly worse than C99, avoiding however catastrophic failure, as observed with the C99 method on ICSI data.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>