<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0405">
  <Title>Feature-Based Segmentation of Narrative Documents</Title>
  <Section position="6" start_page="35" end_page="38" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> In this section, we examine a number of narrative segmentation tasks with different segmentation methods. The only data used during development was the rst two thirds from Biohazard (exp1 and exp2). All other data sets were only examined after the algorithm was developed and were used for testing purposes. Unless stated otherwise, results for the feature based method are using the SVM classi er.1</Paragraph>
    <Section position="1" start_page="35" end_page="36" type="sub_section">
      <SectionTitle>
5.1 Evaluation Measures
</SectionTitle>
      <Paragraph position="0"> We use three segmentation evaluation metrics that have been recently developed to account for close but not exact placement of hypothesized boundaries: word error probability, sentence error probability, and WindowDiff. Word error probability  that a randomly chosen pair of words k words apart is incorrectly classi ed, i.e. a false positive or false negative of being in the same segment. In contrast to the standard classi cation measures of precision and recall, which would consider a close hypothesized boundary (e.g., off by one sentence) to be incorrect, word error probability gently penalizes close hypothesized boundaries. We also compute the sentence error probability, which estimates the probability that a randomly chosen pair of sentences s sentences apart is incorrectly classi ed. k and s are chosen to be half the average length of a section in the test data. WindowDiff (Pevzner and Hearst, 2002) uses a sliding window over the data and measures the difference between the number of hypothesized boundaries and the actual boundaries within the window. This metric handles several criticisms of the word error probability metric.</Paragraph>
    </Section>
    <Section position="2" start_page="36" end_page="36" type="sub_section">
      <SectionTitle>
5.2 Segmenting Narrative Books
</SectionTitle>
      <Paragraph position="0"> Table 2 shows the results of the SVM-segmenter on Biohazard and Demon in the Freezer. A baseline performance for segmentation algorithms is whether the algorithm performs better than naive segmenting algorithms: choose no boundaries, choose all boundaries and choose randomly. Choosing all boundaries results in word and sentence error probabilities of approximately 55%. Choosing no boundaries is about 45%. Table 2 also shows the results for random placement of the correct number of segments. Both random boundaries at sentence locations and random boundaries at paragraph locations are shown (values shown are the averages of 500 random runs). Similar results were obtained for random segmentation of the Demon data.</Paragraph>
      <Paragraph position="1">  For Biohazard the holdout set was not used during development. When trained on either of the development thirds of the text (i.e., exp1 or exp2) and tested on the test set, a substantial improvement is seen over random. 3-fold cross validation was done by training on two-thirds of the data and testing on the other third. Recalling from Table 1 that both PLSA and TextTiling result in performance similar to random even when given the correct number of segments, we note that all of the single train/test splits performed better than any of the naive algorithms and previous methods examined.</Paragraph>
      <Paragraph position="2"> To examine the ability of our algorithm to perform on unseen data, we trained on the entire Biohazard book and tested on Demon in the Freezer. Performance on Demon in the Freezer is only slightly worse than the Biohazard results and is still much better than the baseline algorithms as well as previous methods. This is encouraging since Demon was not used during development, is written by a different author and has a segment length distribution that is different than Biohazard (average segment length of 30 vs. 18 in Biohazard).</Paragraph>
    </Section>
    <Section position="3" start_page="36" end_page="37" type="sub_section">
      <SectionTitle>
5.3 Segmenting Articles
</SectionTitle>
      <Paragraph position="0"> Unfortunately, obtaining a large number of narrative books with meaningful labeled segmentation is difcult. To evaluate our algorithm on a larger data set as well as a wider variety of styles similar to narrative documents, we also examine 1000 articles from Groliers Encyclopedia that contain subsections denoted by major and minor headings, which we consider to be the true segment boundaries. The articles contained 8,922 true and 102,116 possible boundaries. We randomly split the articles in half, and perform two-fold cross-validation as recommended by Dietterich (1998). Using 500 articles from one half of the pair for testing, 50 articles are randomly selected from the other half for training. We used  a subset of only 50 articles due to the high cost of labeling data. Each split yields two test sets of 500 articles and two training sets. This procedure of two-fold cross-validation is performed ve times, for a total of 10 training and 10 corresponding test sets.</Paragraph>
      <Paragraph position="1"> Signi cance is then evaluated using the t-test.</Paragraph>
      <Paragraph position="2"> The results for segmenting Groliers Encyclopedia articles are given in Table 3. We compare the performance of different segmentation models: two feature-based models (SVMs, boosted decision stumps), two similarity-based models (PLSA-based segmentation, TextTiling), and randomly selecting segmentation points. All segmentation systems are given the estimated number of segmentation points based based on the training data. The feature based approaches are signi cantly2 better than either PLSA, TextTiling or random segmentation. For our selected features, boosted stump performance is similar to using an SVM, which reinforces our intuition that the selected features (and not just classi cation method) are appropriate for this problem.</Paragraph>
      <Paragraph position="3"> Table 1 indicates that the previous TextTiling and PLSA-based approaches perform close to random on narrative text. Our experiments show a performance improvement of &gt;24% by our feature-based system, and signi cant improvement over other methods on the Groliers data. Hearst (1994) examined the task of identifying the paragraph boundaries in expository text. We provide analysis of this data set here to emphasize that identifying segments in natural text is a dif cult problem and since current evaluation methods were not used when this data was initially presented. Human performance on this task is in the 15%-35% error rate. Hearst asked seven human judges to label the paragraph  boundaries of four different texts. Since no ground truth was available, true boundaries were identi ed by those boundaries that had a majority vote as a boundary. Table 4 shows the average human performance for each text. We show these results not for direct comparison with our methods, but to highlight that even human segmentation on a related task does not achieve particularly low error rates.</Paragraph>
    </Section>
    <Section position="4" start_page="37" end_page="38" type="sub_section">
      <SectionTitle>
5.4 Analysis of Features
</SectionTitle>
      <Paragraph position="0"> The top section of Table 5 shows features that are intuitively hypothesized to be positively correlated with boundaries and the bottom section shows negatively correlated. For this analysis, exp1 from Alibek was used for training and the holdout set for testing.</Paragraph>
      <Paragraph position="1"> There are 74 actual boundaries and 2086 possibly locations. Two features have perfect recall: paragraph and conversation. Every true section boundary is at a paragraph and no section boundaries are within conversation regions. Both the word group and entity group features have good correlation with boundary locations and also generalized well to the training data by occurring in over half of the positive test examples.</Paragraph>
      <Paragraph position="2"> The bene t of generalization using outside resources can be seen by comparing the boundary words found using word groups versus those found only in the training set as in Section 3. Using word groups triples the number of signi cant words found in the training set that occur in the test set. Also, the number of shared words that occur signi cantly in both the training and test set goes from none to 9.</Paragraph>
      <Paragraph position="3"> More importantly, signi cant words occur in 37 of the test segments instead of none without the groups.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>