<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0405"> <Title>Feature-Based Segmentation of Narrative Documents</Title>
<Section position="3" start_page="32" end_page="32" type="metho"> <SectionTitle> 2 Previous Approaches </SectionTitle>
<Paragraph position="0"> Previous topic segmentation methods fall into three groups: similarity-based, lexical chain-based, and feature-based. In this section we give a brief overview of each of these groups.</Paragraph>
<Section position="1" start_page="32" end_page="32" type="sub_section"> <SectionTitle> 2.1 Similarity-based </SectionTitle>
<Paragraph position="0"> One popular method is to generate similarities between blocks of text (such as blocks of words, sentences or paragraphs) and then identify section boundaries where dips in the similarities occur.</Paragraph>
<Paragraph position="1"> The cosine similarity measure between term vectors is used by Hearst (1994) to define the similarity between blocks. She notes that the largest dips in similarity correspond to defined boundaries.</Paragraph>
<Paragraph position="2"> Brants et al. (2002) learn a PLSA model using EM to smooth the term vectors. The model is parameterized by introducing a latent variable representing the possible topics. They show good performance on a number of different synthetic data sets.</Paragraph>
<Paragraph position="3"> Kozima and Furugori (1994) use another similarity metric they call lexical cohesion. The cohesiveness of a pair of words is calculated by spreading activation on a semantic network as well as word frequency. They showed that dips in lexical cohesion plots had some correlation with human subject boundary decisions on one short story.</Paragraph> </Section>
<Section position="2" start_page="32" end_page="32" type="sub_section"> <SectionTitle> 2.2 Lexical Chains </SectionTitle>
<Paragraph position="0"> Semantic networks define relationships between words such as synonymy, specialization/generalization and part/whole. Stokes et al. (2002) use these relationships to construct lexical chains. A lexical chain is a sequence of lexicographically related word occurrences where every word occurs within a set distance from the previous word.</Paragraph>
<Paragraph position="1"> A boundary is identified where a large number of lexical chains begin and end. They showed that lexical chains were useful for determining the text structure on a set of magazine articles, though they did not provide empirical results.</Paragraph> </Section>
<Section position="3" start_page="32" end_page="32" type="sub_section"> <SectionTitle> 2.3 Feature-based </SectionTitle>
<Paragraph position="0"> Beeferman et al. (1999) use an exponential model and generate features using a maximum entropy selection criterion. Most features learned are cue-based features that identify a boundary based on the occurrence of words or phrases. They also include a feature that measures the difference in performance of a long-range vs. short-range model. When the short-range model outperforms the long-range model, this indicates a boundary. Their method performed well on a number of broadcast news data sets, including the CNN data set from TDT 1997.</Paragraph>
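To make the similarity-based approach of Section 2.1 concrete, the following sketch (illustrative Python, not the code of any of the cited systems; the block size and the number of boundaries are arbitrary choices here) scores each candidate boundary by the cosine similarity of the term vectors on either side and places boundaries at the deepest dips.

```python
# Illustrative sketch of similarity-dip segmentation (Section 2.1), not the
# cited systems' code. Block size and boundary count are assumptions.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[t] * b[t] for t in a if t in b)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def dip_boundaries(sentences, block=3, n_boundaries=2):
    """Return indices i such that a boundary is placed before sentence i."""
    scores = []
    for i in range(block, len(sentences) - block + 1):
        left = Counter(w for s in sentences[i - block:i] for w in s.lower().split())
        right = Counter(w for s in sentences[i:i + block] for w in s.lower().split())
        scores.append((cosine(left, right), i))
    scores.sort()  # lowest similarity = deepest dip
    return sorted(i for _, i in scores[:n_boundaries])

if __name__ == "__main__":
    sents = ["The lab stored anthrax samples."] * 4 + ["The river flooded the valley."] * 4
    print(dip_boundaries(sents, block=2, n_boundaries=1))  # e.g. [4]
```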
<Paragraph position="1"> Reynar (1999) describes a maximum entropy model that combines hand-selected features, including: broadcast news domain cues, number of content-word bigrams, number of named entities, number of content words that are WordNet synonyms in the left and right regions, percentage of content words in the right segment that are first uses, whether pronouns occur in the first five words, and whether a word-frequency-based algorithm predicts a boundary. He found that for the HUB-4 corpus, which is composed of transcribed broadcasts, the combined feature model performed better than TextTiling.</Paragraph>
<Paragraph position="2"> Mochizuki et al. (1998) use a combination of linguistic cues to segment Japanese text. Although a number of cues do not apply to English (e.g., topical markers), they also use anaphoric expressions and lexical chains as cues. Their study was small, but did indicate that lexical chains are a useful cue in some domains.</Paragraph>
<Paragraph position="3"> These studies indicate that a combination of features can be useful for segmentation. However, Mochizuki et al. (1998) analyzed Japanese texts, and Reynar (1999) and Beeferman et al. (1999) evaluated on broadcast news stories, which have many cues that narrative texts do not. Beeferman et al. (1999) also evaluated on concatenated Wall Street Journal articles, which have stronger topic changes than those within a single document. In our work, we examine the use of linguistic features for segmentation of narrative text in English.</Paragraph> </Section> </Section>
<Section position="4" start_page="32" end_page="34" type="metho"> <SectionTitle> 3 Properties of Narrative Text </SectionTitle>
<Paragraph position="0"> Characterizing data set properties is the first step towards deriving useful features. The approaches in the previous section performed well on broadcast news, expository, and synthetic data sets. Many properties of these documents are not shared by narrative documents. These properties include: 1) cue phrases, such as "welcome back" and "joining us", that feature-based methods used in broadcast news, 2) strong topic shifts, as in synthetic documents created by concatenating newswire articles, and 3) large data sets such that the training data and testing data appeared to come from similar distributions.</Paragraph>
<Paragraph position="1"> In this paper we examine two narrative-style books: Biohazard by Ken Alibek and The Demon in the Freezer by Richard Preston. These books are segmented by the author into sections. We manually examined these author-identified boundaries and found them to be reasonable. We take these sections as the true locations of segment boundaries. We split Biohazard into three parts, two for experimentation (exp1 and exp2) and the third as a holdout for testing. Demon in the Freezer was reserved for testing. Biohazard contains 213 true and 5858 possible boundaries; Demon has 119 true and 4466 possible boundaries.</Paragraph>
<Paragraph position="2"> Locations between sentences are considered possible boundaries and were determined automatically.</Paragraph>
<Paragraph position="3"> We present an analysis of the properties of the book Biohazard by Ken Alibek as an exemplar of narrative documents (for this section, test=exp1 and train=exp2). These properties are different from previous expository data sets and will result in poor performance for the algorithms mentioned in Section 2.</Paragraph>
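As a concrete illustration of how candidate segmentation points can be enumerated, the following hypothetical sketch treats every gap between adjacent sentences as a possible boundary and labels it positive when it coincides with an author-marked section break; the regex-based sentence splitter is a naive stand-in for the TnT-based pipeline described in Section 4.2.

```python
# Hypothetical sketch: enumerate inter-sentence gaps as candidate boundaries
# and label the gaps that coincide with author-marked section breaks.
import re

def candidate_boundaries(section_texts):
    """Yield (left_sentence, right_sentence, is_boundary) triples."""
    sentences, section_ends = [], set()
    for text in section_texts:
        sentences.extend(s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s)
        section_ends.add(len(sentences))  # a true boundary follows the section's last sentence
    for i in range(1, len(sentences)):
        yield sentences[i - 1], sentences[i], i in section_ends

if __name__ == "__main__":
    sections = ["We entered the compound. The guards waved us through.",
                "Years later, the program was dismantled. Few records survived."]
    for left, right, label in candidate_boundaries(sections):
        print(label, "|", right[:35])
```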
<Paragraph position="4"> These properties help guide us in deriving features that may be useful for segmenting narrative text.</Paragraph>
<Paragraph position="5"> Vocabulary The book covers a single topic with a number of sub-topics. These changing topics, combined with the varied word use typical of narrative documents, result in many unseen terms in the test set. 25% of the content words in the test set do not occur in the training set, and a third of the words in the test set occur no more than twice in the training set.</Paragraph>
<Paragraph position="6"> This causes problems for methods that learn a model of the training data, such as Brants et al. (2002) and Beeferman et al. (1999), because, without outside resources, the information in the training data is not sufficient to generalize to the test set.</Paragraph>
<Paragraph position="8"> Boundary words Many feature-based methods rely on cues at the boundaries (Beeferman et al., 1999; Reynar, 1999). 474 content terms occur in the first sentence of boundaries in the training set. Of these terms, 103 occur at the boundaries of the test set.</Paragraph>
<Paragraph position="9"> However, of those terms that occur significantly at a training set boundary (where significance is determined by a likelihood-ratio test with a significance level of 0.1), only 9 occur at test boundaries.</Paragraph>
<Paragraph position="10"> No words occur significantly at a training boundary AND also significantly at a test boundary.</Paragraph>
<Paragraph position="11"> Segment similarity Table 1 shows that two similarity-based methods that perform well on synthetic and expository text perform poorly (i.e., on par with random) on Biohazard. The poor performance occurs because block similarities provide little information about the actual segment boundaries on this data set. We examined the average similarity for two adjacent regions within a segment versus the average similarity for two adjacent regions that cross a segment boundary. If the similarity scores were useful, the within-segment scores would be higher than the across-segment scores. Similarities were generated using the PLSA model, averaging over multiple models with between 8 and 20 latent classes. The average similarity score within a segment was 0.903 with a standard deviation of 0.074, and the average score across a segment boundary was 0.914 with a standard deviation of 0.041. In this case, the across-boundary similarity is actually higher. Similar values were observed for the cosine similarities used by the TextTiling algorithm, as well as with other numbers of latent topics for the PLSA model. For all cases examined, there was little difference between inter-segment similarity and across-boundary similarity, and there was always a large standard deviation.</Paragraph>
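The likelihood-ratio test referred to in the boundary-word analysis above (and reused for feature selection in Section 4.3) can be sketched as follows; this is one plausible reading of such a test, not the authors' exact procedure. It computes the G statistic for the 2x2 table of term presence versus boundary/non-boundary sentences and compares it to the chi-square critical value for one degree of freedom at a significance level of 0.1.

```python
# Sketch of a likelihood-ratio (G) test for whether a term occurs at boundary
# sentences more often than chance. One plausible reading, not the authors'
# exact procedure; counts in the example are made up.
from math import log

CRITICAL_G_DF1_P10 = 2.706  # chi-square critical value, 1 df, alpha = 0.1

def g_statistic(term_at_boundary, boundary_total, term_elsewhere, other_total):
    observed = [term_at_boundary, boundary_total - term_at_boundary,
                term_elsewhere, other_total - term_elsewhere]
    row = [term_at_boundary + term_elsewhere,
           (boundary_total - term_at_boundary) + (other_total - term_elsewhere)]
    col = [boundary_total, other_total]
    total = boundary_total + other_total
    expected = [row[0] * col[0] / total, row[1] * col[0] / total,
                row[0] * col[1] / total, row[1] * col[1] / total]
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

def significant_at_boundary(term_at_boundary, boundary_total, term_elsewhere, other_total):
    g = g_statistic(term_at_boundary, boundary_total, term_elsewhere, other_total)
    return g > CRITICAL_G_DF1_P10

if __name__ == "__main__":
    # A term appearing in 8 of 150 boundary sentences but only 20 of 4000 others.
    print(significant_at_boundary(8, 150, 20, 4000))  # True
```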
<Paragraph position="12"> Lexical chains Lexical chains were identified as synonyms (and exact matches) occurring within a distance of one-twentieth the average segment length and with a maximum chain length equal to the average segment length (other values were examined with similar results). Stokes et al. (2002) suggest that high concentrations of lexical chain beginnings and endings are indicative of a boundary location. On the narrative data, of the 219 overall chains, only 2 begin at a boundary and only 1 ends at a boundary. A more general heuristic identifies boundaries where there is an increase in the number of chains beginning and ending near a possible boundary while also minimizing chains that span boundaries. Even this heuristic does not appear indicative on this data set. Over 20% of the chains actually cross segment boundaries. We also measured the average distance between a boundary and the nearest beginning and ending of a chain, when a chain begins/ends within that segment. If the chains are a good feature, then these distances should be relatively small.</Paragraph>
<Paragraph position="13"> The average segment length is 185 words, but the average distance to the closest chain beginning is 39 words and to the closest chain ending is 36 words. Given an average of 4 chains per segment, the beginnings and endings of chains were not concentrated near boundary locations in our narrative data, and are therefore not indicative of boundaries.</Paragraph> </Section>
<Section position="5" start_page="34" end_page="35" type="metho"> <SectionTitle> 4 Feature-Based Segmentation </SectionTitle>
<Paragraph position="0"> We pose segmentation as a classification problem. Sentences are automatically identified, and each boundary between sentences is a possible segmentation point. In the classification framework, each segmentation point becomes an example. We examine both support vector machines (SVMlight (Joachims, 1999)) and boosted decision stumps (Weka (Witten and Frank, 2000)) as our learning algorithms. SVMs have shown good performance on a variety of problems, including natural language tasks (Cristianini and Shawe-Taylor, 2000), but require careful feature selection. Classification using boosted decision stumps can be a helpful tool for analyzing the usefulness of individual features. Examining multiple classification methods helps avoid focusing on the biases of a particular learning method.</Paragraph>
<Section position="1" start_page="34" end_page="34" type="sub_section"> <SectionTitle> 4.1 Example Reweighting </SectionTitle>
<Paragraph position="0"> One problem with formulating segmentation as a classification problem is that there are many more negative than positive examples. To discourage the learning algorithm from classifying all examples as negative and to instead focus on the positive examples, the training data must be reweighted.</Paragraph>
<Paragraph position="1"> We set the weight of positive vs. negative examples so that the number of boundaries predicted on the test set agrees with the expected number of segments based on the training data. This is done by iteratively adjusting the weighting factor while re-training and re-testing until the predicted number of segments on the test set is approximately the expected number. The expected number of segments is the number of sentences in the test set divided by the number of sentences per segment in the training data. This value can also be weighted based on prior knowledge.</Paragraph> </Section>
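A minimal sketch of the reweighting loop described in Section 4.1, under stated assumptions: scikit-learn's LinearSVC stands in for SVMlight, the features are synthetic, and the positive-class weight is adjusted by doubling/halving with bisection until the number of predicted boundaries roughly matches the expected segment count.

```python
# Sketch of the iterative reweighting scheme (Section 4.1) under assumptions:
# LinearSVC replaces SVMlight, and the data here are random toy features.
import numpy as np
from sklearn.svm import LinearSVC

def tune_positive_weight(X_train, y_train, X_test, expected_segments,
                         tolerance=2, max_iters=20):
    weight, lo, hi = 1.0, None, None
    for _ in range(max_iters):
        clf = LinearSVC(class_weight={0: 1.0, 1: weight}, max_iter=5000)
        clf.fit(X_train, y_train)
        predicted = int(clf.predict(X_test).sum())
        if abs(predicted - expected_segments) <= tolerance:
            return clf, weight
        if predicted < expected_segments:   # too few boundaries: weight positives more
            lo = weight
            weight = weight * 2 if hi is None else (weight + hi) / 2
        else:                               # too many boundaries: weight positives less
            hi = weight
            weight = weight / 2 if lo is None else (weight + lo) / 2
    return clf, weight

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(2000, 10))
    y_train = (rng.random(2000) < 0.04).astype(int)
    X_train[y_train == 1] += 0.5            # give positives a weak learnable signal
    X_test = rng.normal(size=(1000, 10))
    # expected segments = test sentences / (train sentences per segment)
    expected = round(1000 / (2000 / y_train.sum()))
    clf, w = tune_positive_weight(X_train, y_train, X_test, expected)
    print("weight:", w, "predicted boundaries:", int(clf.predict(X_test).sum()))
```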
<Section position="2" start_page="34" end_page="34" type="sub_section"> <SectionTitle> 4.2 Preprocessing </SectionTitle>
<Paragraph position="0"> A number of preprocessing steps are applied to the books to help increase the informativeness of the texts. The book texts were obtained using OCR methods with human correction. The text is preprocessed by tokenizing, removing stop words, and stemming using the Inxight LinguistiX morphological analyzer. Paragraphs are identified using formatting information. Sentences are identified using the TnT tokenizer, and parts of speech are assigned with the TnT part-of-speech tagger (Brants, 2000) using the standard English Wall Street Journal n-grams. Named entities are identified using finite-state technology (Beesley and Karttunen, 2003), covering entity types including person, location, disease, and organization. Many of these preprocessing steps help provide salient features for use during segmentation.</Paragraph> </Section>
<Section position="3" start_page="34" end_page="35" type="sub_section"> <SectionTitle> 4.3 Engineered Features </SectionTitle>
<Paragraph position="0"> Segmenting narrative documents raises a number of interesting challenges. First, labeling data is extremely time-consuming, so outside resources are required to generalize better from the training data; WordNet is used to identify words that are similar and tend to occur at boundaries for the word group feature. Second, some sources of information, in particular entity chains, do not fit into the standard feature-based paradigm. This requires extracting features from the underlying information source, and extracting these features represents a trade-off between information content and generalizability. In the case of entity chains, we extract features that characterize the occurrence distribution of the entity chains. Finally, the word group and entity group feature types generate candidate features, and a selection process is required to choose the useful ones. We found that a likelihood-ratio test for significance worked well for identifying those features that would be useful for classification. Throughout this section, when we use the term significant we are referring to significance with respect to the likelihood-ratio test (with a significance level of 0.1). We selected features both a priori and dynamically during training (i.e., word groups and entity groups are selected dynamically). Feature selection has been used by previous segmentation methods (Beeferman et al., 1999) as a way of adapting better to the data. In our approach, knowledge about the task is used more strongly in defining the feature types, and the selection of features is performed prior to the classification step. We also used mutual information, statistical tests of significance, and classification performance on a development data set to identify useful features.</Paragraph>
<Paragraph position="1"> Word groups In Section 3 we showed that there are no consistent cue phrases at boundaries. To generalize better, we identify word groups that occur significantly at boundaries. A word group is all words that have the same parent in the WordNet hierarchy. A binary feature is used for each learned group, based on the occurrence of at least one of the words in the group. Groups found include months, days, temporal phrases, military rankings, and country names.</Paragraph>
<Paragraph position="2"> Entity groups For each entity group (i.e., named entities such as person, city, or disease tagged by the named entity extractor) that occurs significantly at a boundary, a feature indicating whether or not an entity of that group occurs in the sentence is used.</Paragraph>
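The word-group feature described above can be roughly sketched as follows (a hypothetical illustration using NLTK's WordNet interface, which requires the wordnet data to be downloaded; the selection of only those groups that are significant at boundaries is omitted): words sharing an immediate hypernym form a group, and a group fires as a binary feature when any member appears in the sentence.

```python
# Hypothetical sketch of WordNet-based word groups: group words by a shared
# immediate hypernym and expose each group as a binary sentence feature.
# Requires: nltk.download("wordnet"). Boundary-significance filtering omitted.
from collections import defaultdict
from nltk.corpus import wordnet as wn

def wordnet_parents(word):
    """Names of the immediate hypernyms of the word's noun senses."""
    return {h.name() for s in wn.synsets(word, pos=wn.NOUN) for h in s.hypernyms()}

def build_word_groups(vocabulary):
    groups = defaultdict(set)
    for word in vocabulary:
        for parent in wordnet_parents(word):
            groups[parent].add(word)
    return groups

def group_features(sentence_tokens, groups):
    tokens = set(sentence_tokens)
    return {parent: bool(tokens & members) for parent, members in groups.items()}

if __name__ == "__main__":
    groups = build_word_groups(["january", "march", "colonel", "general", "russia"])
    feats = group_features(["in", "january", "the", "general", "returned"], groups)
    print([g for g, fired in feats.items() if fired])
```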
<Paragraph position="3"> Full name The named entity extraction system tags persons named in the document. A rough co-reference resolution was performed by grouping together references that share at least one token (e.g., General Yury Tikhonovich Kalinin and Kalinin). The full name of a person is the longest reference in a group referring to the same person. This feature indicates whether or not the sentence contains a full name.</Paragraph>
<Paragraph position="4"> Entity chains Word relationships work well when the documents have disjoint topics; however, when topics are similar, words tend to relate too easily. We propose a more stringent chaining method called entity chains. Entity chains are constructed in the same fashion as lexical chains, except we consider named entities. Two entities are considered related (i.e., in the same chain) if they refer to the same entity. We construct entity chains and extract features that characterize these chains: How many chains start/end at this sentence? How many chains cross over this sentence, the previous sentence, or the next sentence? What is the distance to the nearest dip/peak in the number of chains, and what is the size of that dip/peak?</Paragraph>
<Paragraph position="5"> Pronoun Does the sentence contain a pronoun? Does the sentence contain a pronoun within 5 words of the beginning of the sentence?</Paragraph>
<Paragraph position="6"> Numbers During training, the patterns of numbers that occur significantly at boundaries are selected. Patterns considered are any number and any number with a specified length. The feature then checks whether that pattern appears in the sentence. A commonly found pattern is the number pattern of length 4, which often refers to a year.</Paragraph>
<Paragraph position="7"> Conversation Is this sentence part of a conversation, i.e., does this sentence contain direct speech? This is determined by tracking beginning and ending quotes. Quoted regions and single sentences between two quoted regions are considered part of a conversation.</Paragraph>
<Paragraph position="8"> Paragraph Is this the beginning of a paragraph?</Paragraph> </Section> </Section> </Paper>
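A rough sketch of the entity-chain features under simplifying assumptions: entity mentions are given per sentence, two mentions are taken to corefer when their lowercased strings match, and only the chain start/end/crossing counts are computed (the dip/peak distance features are omitted).

```python
# Rough sketch of entity-chain features. Assumptions: per-sentence entity
# mentions are given; mentions corefer iff lowercased strings match; only
# start/end/crossing counts are computed (dip/peak features omitted).
from collections import defaultdict

def build_entity_chains(entities_per_sentence, max_gap=10):
    """Return chains as (first_sentence, last_sentence) spans."""
    chains, open_spans = [], {}
    for i, entities in enumerate(entities_per_sentence):
        for ent in {e.lower() for e in entities}:
            if ent in open_spans and i - open_spans[ent][1] <= max_gap:
                open_spans[ent] = (open_spans[ent][0], i)   # extend the chain
            else:
                if ent in open_spans:
                    chains.append(open_spans[ent])          # close the stale chain
                open_spans[ent] = (i, i)                    # start a new chain
    chains.extend(open_spans.values())
    return chains

def chain_features(entities_per_sentence, max_gap=10):
    chains = build_entity_chains(entities_per_sentence, max_gap)
    feats = [defaultdict(int) for _ in entities_per_sentence]
    for start, end in chains:
        feats[start]["chains_starting"] += 1
        feats[end]["chains_ending"] += 1
        for i in range(start + 1, end):
            feats[i]["chains_crossing"] += 1
    return [dict(f) for f in feats]

if __name__ == "__main__":
    sents = [["Kalinin"], ["Kalinin", "Moscow"], [], ["Moscow"], ["Preston"]]
    for i, f in enumerate(chain_features(sents, max_gap=2)):
        print(i, f)
```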