<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1632"> <Title>Using Linguistically Motivated Features for Paragraph Boundary Identification</Title> <Section position="5" start_page="267" end_page="268" type="metho"> <SectionTitle> 3 Data </SectionTitle> <Paragraph position="0"> The data we used is a collection of biographies from the German version of Wikipedia. We selected all biographies under the Wikipedia categories of physicists, chemists, mathematicians and biologists and obtained 970 texts with an average length of 20 sentences and 413,776 tokens in total.</Paragraph> <Paragraph position="1"> Although our corpus is substantially smaller than the German corpora of Sporleder & Lapata (2006), it should be large enough for a fair comparison between their algorithm and the algorithm proposed here. Having investigated the effect of training set size, Sporleder & Lapata (2006) concluded that their system performs well even when trained on a small data set. In particular, the learning curve for German shows an improvement of only about 2% when the amount of training data is increased from 20% (which for German fiction corresponds to approximately 370,000 tokens) to 100%.</Paragraph> <Paragraph position="2"> Fully automatic preprocessing in our system comprises the following stages: First, the list of people in a given Wikipedia category is taken and an article is extracted for every person. The text is purged of Wiki tags and comments, while the information on subtitles and paragraph structure is preserved. Second, sentence boundaries are identified with a Perl CPAN module whose performance we improved by extending the list of abbreviations and modifying the output format. Next, the sentences are split into tokens. The TnT tagger (Brants, 2000) and the TreeTagger (Schmid, 1997) are used for tagging and lemmatizing. Finally, the texts are parsed with the CDG dependency parser (Foth & Menzel, 2006). Thus, the text is split on three levels: paragraphs, sentences and tokens, and morphological and syntactic information is provided.</Paragraph> <Paragraph position="3"> A publicly available list of about 300 discourse connectives was downloaded from the Internet site of the Institute for the German Language (Institut für Deutsche Sprache, Mannheim) and slightly extended. These are identified in the text and annotated automatically as well. Named entities are classified according to their type using information from Wikipedia: person, location, organization or undefined. Given the nature of our corpus, we are able to identify all mentions of the biographee in the text by simple string matching; a sketch of this step is given at the end of this section. We also annotate different types of referring expressions (first, last, full name) and resolve anaphora by linking personal pronouns to the biographee, provided that they match in number and gender.</Paragraph> <Paragraph position="4"> The annotated corpus is split into training (85%), development (10%) and testing (5%) sets.</Paragraph> <Paragraph position="5"> The distribution of data among the three sets is presented in Table 1. Sentences which serve as subtitles in a text are filtered out because they make identifying a paragraph boundary for the following sentence trivial.</Paragraph>
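For illustration, the following is a minimal sketch of how mentions of the biographee might be identified by string matching and typed as first, last or full name, with a naive gender-based pronoun link. The function, names and the gender heuristic are our assumptions, not the authors' implementation (which also checks number agreement):

```python
# Illustrative sketch of the biographee annotation step; the names and
# the gender heuristic are assumptions, not the authors' code.

# German third-person pronoun forms, keyed by grammatical gender
# (singular only; a full implementation would also check number).
PRONOUNS = {"m": {"er", "ihm", "ihn", "sein", "seine", "seiner"},
            "f": {"sie", "ihr", "ihre", "ihrer"}}

def annotate_mentions(tokens, first, last, gender):
    """Label tokens as full/first/last-name mentions of the biographee,
    or as pronouns naively linked to the biographee."""
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if labels[i] is not None:            # covered by a full-name match
            continue
        if tok == first and i + 1 < len(tokens) and tokens[i + 1] == last:
            labels[i] = labels[i + 1] = "full"
        elif tok == first:
            labels[i] = "first"
        elif tok == last:
            labels[i] = "last"
        elif tok.lower() in PRONOUNS[gender]:
            labels[i] = "pronoun"
    return labels

tokens = "Albert Einstein wurde 1879 geboren . Er studierte in Zürich .".split()
print(annotate_mentions(tokens, "Albert", "Einstein", "m"))
# ['full', 'full', None, None, None, None, 'pronoun', None, None, None, None]
```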
</Section> <Section position="6" start_page="268" end_page="271" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="268" end_page="268" type="sub_section"> <SectionTitle> 4.1 Machine Learners </SectionTitle> <Paragraph position="0"> The PBI task was reformulated as a binary classification problem: every training instance, representing a sentence, was classified as either paragraph-initial or not.</Paragraph> <Paragraph position="1"> We used two machine learners: BoosTexter (Schapire & Singer, 2000) and TiMBL (Daelemans et al., 2004). BoosTexter was developed for text categorization and combines simple rules (decision stumps) in a boosting manner. Sporleder & Lapata used this learner because it can combine many hypotheses that are individually only moderately accurate. TiMBL is a memory-based learner which classifies every test instance by finding the most similar examples in the training set; hence it does not abstract from the data and is well suited to handling features with many values, e.g. the list of discourse cues. For both classifiers, all experiments were run with the default settings.</Paragraph> </Section> <Section position="2" start_page="268" end_page="269" type="sub_section"> <SectionTitle> 4.2 Baselines </SectionTitle> <Paragraph position="0"> We compared the performance of our algorithm against three baselines. The first one (distance) trivially inserts a paragraph break after every third sentence, three being the average number of sentences in a paragraph. The second baseline (Galley) hypothesizes that paragraph breaks coincide with topic boundaries and utilizes Galley et al.'s (2003) topic boundary identification tool LCseg.</Paragraph> <Paragraph position="1"> The third baseline (Sporleder) is a reimplementation of Sporleder & Lapata's (2006) algorithm with the following features:
Word and Sentence Distances from the current sentence to the previous paragraph break;
Sentence Length and Relative Position (relPos) of the sentence in a text;
Quotes encodes whether this and the previous sentence contain a quotation, and whether the quotation is continued in the current sentence or not;
Final Punctuation of the previous sentence;
Words - the first (word1), the first two (word2), the first three and all words from the sentence;
Parsed has a positive value if the sentence was parsed, negative otherwise;
Number of S, VP, NP and PP nodes in the sentence;
Signature is the sequence of PoS tags with and without punctuation;
Children of Top-Level Nodes are two features representing the sequence of syntactic labels of the children of the root of the parse tree and of the children of the highest S-node;
Branching Factor features express the average number of children of S, VP, NP and PP nodes in the parse;
Tree Depth is the average length of the path from the root to the leaves;
Per-word Entropy is a feature based on Genzel & Charniak's (2003) observation that paragraph-initial sentences have lower entropy than non-initial ones (see the sketch at the end of this subsection);
Sentence Probability according to a language model computed from the training data; character-level n-gram models are built using the CMU toolkit (Clarkson & Rosenfeld, 1997).
Since the parser we used produces dependency trees as output, we could not distinguish between features such as children of the root of the tree and children of the top-level S-node. Apart from this minor change, we reimplemented the algorithm in every detail.</Paragraph>
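Sporleder & Lapata build character-level n-gram models with the CMU toolkit; purely as an illustration of the per-word entropy and sentence probability features, here is a self-contained word-bigram sketch with add-one smoothing. This is a simplification under our own naming, not the original models:

```python
import math
from collections import Counter

# Illustrative stand-in for the language-model features: a word-bigram
# model with add-one smoothing, trained on the training-set sentences.
class BigramLM:
    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in sentences:
            toks = ["<s>"] + sent
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab = len(self.unigrams)

    def per_word_entropy(self, sent):
        """Average negative log2 probability per token (cross-entropy);
        paragraph-initial sentences are expected to score lower."""
        toks = ["<s>"] + sent
        logp = 0.0
        for prev, cur in zip(toks, toks[1:]):
            # Add-one smoothing keeps unseen bigrams from zeroing out.
            p = (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + self.vocab)
            logp += math.log2(p)
        return -logp / len(sent)

train = [["er", "studierte", "physik"], ["er", "lehrte", "in", "berlin"]]
lm = BigramLM(train)
print(lm.per_word_entropy(["er", "studierte", "in", "berlin"]))  # ~2.19 bits
```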
</Section> <Section position="3" start_page="269" end_page="270" type="sub_section"> <SectionTitle> 4.3 Our Features </SectionTitle> <Paragraph position="0"> For our algorithm we first selected the features of Sporleder & Lapata's (2006) system which performed best on the development set. These are relative position and the first and first two words (relPos, word1, word2). The quote and final punctuation features, which were particularly helpful in Sporleder & Lapata's experiments on the German fiction data, turned out to be superfluous given the infrequency of quotations and the prevalent use of the period as sentence delimiter in our data.</Paragraph> <Paragraph position="1"> We experimented with text cohesion features, assuming that the paragraph structure crucially depends on cohesion and that paragraph breaks are likely to occur between sentences where cohesive links are weak. In order to estimate the degree of cohesion, we looked at lexical cohesion, pronominalization, discourse cues and information structure.
nounOver, verbOver: Similar to Sporleder & Lapata (2006), we introduced an overlap feature, but measured the degree of overlap as the number of noun and verb lemmas shared by two adjacent sentences (see the sketch below). We preferred lemmas over words in order to match all possible forms of the same word in German.
LCseg: Apart from the overlap, a boolean feature based on LCseg (Galley et al., 2003) marks whether the tool suggests that a new topic begins with the current sentence. This feature, relying on lexical chains, was supposed to provide more fine-grained information on the degree of similarity between two sentences.</Paragraph>
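A minimal sketch of the overlap features, assuming PoS-tagged, lemmatized input as produced by the preprocessing pipeline; the function name and the STTS-prefix filter are our assumptions:

```python
# Sketch of the nounOver/verbOver features: the number of noun and verb
# lemmas shared by two adjacent sentences. Input is assumed to be a list
# of (lemma, pos) pairs from the tagging/lemmatization stage; the STTS
# tag prefixes "N" (nouns) and "V" (verbs) serve as a rough filter.
def lemma_overlap(prev_sent, curr_sent, pos_prefix):
    prev = {lemma for lemma, pos in prev_sent if pos.startswith(pos_prefix)}
    curr = {lemma for lemma, pos in curr_sent if pos.startswith(pos_prefix)}
    return len(prev & curr)

prev_s = [("Einstein", "NE"), ("Physik", "NN"), ("studieren", "VVFIN")]
curr_s = [("er", "PPER"), ("Physik", "NN"), ("lehren", "VVFIN")]
noun_over = lemma_overlap(prev_s, curr_s, "N")   # -> 1 ("Physik" is shared)
verb_over = lemma_overlap(prev_s, curr_s, "V")   # -> 0
```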
<Paragraph position="2"> As Stark (1988) points out, humans tend to interpret over-reference as a clue for the beginning of a new paragraph: if a non-pronominal reference is preferred over a pronominal one where the pronoun would be admissible, humans are likely to mark the sentence as paragraph-initial. In order to check whether over-reference indeed correlates with paragraph-initial sentences, we describe the way the biographee is referred to in the current and the previous sentence.</Paragraph> <Paragraph position="3"> prevSPerson, currSPerson: This feature, with the values NA, biographee and other, indicates whether there is a reference to the biographee or to some other person in the sentence.</Paragraph> <Paragraph position="4"> prevSRE, currSRE: This feature describes the biographee's referring expression and has three possible values: NA, name, pronoun.</Paragraph> <Paragraph position="5"> Although our annotation distinguishes between first, last and full names, we found that, for the PBI task, the distinction is spurious and that unifying these three under the same category improves the results.</Paragraph> <Paragraph position="6"> REchange: Since our classifiers assume feature independence and cannot infer the change in referring expression, we explicitly encoded that information by merging the values of the previous feature for the current and the preceding sentence into a single feature with nine possible values (name-name, NA-name, pronoun-name, etc.; see the sketch below).</Paragraph> <Paragraph position="7"> The intuition behind the next group of features is that cue words and phrases are used to signal the relation between the current sentence and the preceding sentence or context (Mann & Thompson, 1988).</Paragraph> <Paragraph position="8"> Connectives such as endlich (finally), abgesehen davon (apart from that) and danach (afterwards) explicitly mark a certain relation between the sentence they occur in and the preceding context. We hypothesize that the relations which hold across paragraph boundaries differ from those which hold within paragraphs, and that the same is true for the discourse cues. The absence of a connective is expected to be informative as well, being more typical of paragraph-initial sentences.</Paragraph> <Paragraph position="9"> Three features describe the connective of the current sentence; another three describe the one from the preceding sentence.</Paragraph> <Paragraph position="10"> prevSCue, currSCue: This feature is the connective itself (NA if there is none).</Paragraph> <Paragraph position="11"> prevSCueClass, currSCueClass: This feature represents the semantic class of the cue word or phrase as assigned by the IDS Mannheim.</Paragraph> <Paragraph position="12"> There are 25 values altogether, including NA in case of no connective, the most frequent being temporal, concessive, conclusive, etc.</Paragraph> <Paragraph position="13"> prevSProCue, currSProCue: The third, binary feature marks whether the connective is proadverbial or not (NA if there is no connective). Being anaphors, proadverbials such as deswegen (because of that) and darüber (about that) explicitly link a sentence to the preceding one(s).</Paragraph>
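The REchange value referred to above can be derived from the two referring-expression features as follows; a sketch under our naming assumptions, not the authors' code:

```python
# Sketch of the REchange feature: the referring-expression values of the
# previous and current sentence (NA, name, pronoun) are merged into a
# single feature with nine possible values, so that the learner can see
# the *change* in pronominalization without having to infer it from two
# independent features.
RE_VALUES = ("NA", "name", "pronoun")

def re_change(prev_re, curr_re):
    assert prev_re in RE_VALUES and curr_re in RE_VALUES
    return f"{prev_re}-{curr_re}"   # 3 x 3 = nine possible values

# "pronoun-name": the biographee was pronominalized in the previous
# sentence but is named here -- Stark's over-reference cue.
print(re_change("pronoun", "name"))
```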
<Paragraph position="14"> Information structure, which in German is to a large extent expressed by word order, provides additional clues to the degree of connectedness between two sentences. With respect to the PBI task, Stark (1988) reports that paragraph-initial sentences are often theme-marking, which means that the subject of such a sentence is not its first element. Given the lower frequency of paragraph-initial sentences, this feature cannot be considered reliable on its own, but in combination with others it provides an additional clue. In German, the first element best corresponds to the prefield (Vorfeld), normally the single constituent placed before the finite verb in a main clause.</Paragraph> <Paragraph position="15"> currSVF encodes whether the constituent in the prefield is an NP, PP, ADV, CARD or subordinate clause. Values other than NP unambiguously indicate theme-marking sentences, whereas the value NP may stand for both theme-marking and non-theme-marking sentences.</Paragraph> </Section> <Section position="4" start_page="270" end_page="270" type="sub_section"> <SectionTitle> 4.4 Discussion </SectionTitle> <Paragraph position="0"> Note that we did not exclude text-initial sentences from the study, because the encoding we used does not make such cases trivial for classification. Although some of the features refer to the previous sentence, none of them is necessarily realized, and therefore none of them explicitly indicates the absence of a preceding sentence. For example, the label NA appears both where there is no discourse cue in the preceding sentence and where there is no preceding sentence at all. The same holds for all other features prefixed with prevS-.</Paragraph> <Paragraph position="1"> Another point concerns the use of pronominalization-based features. Sporleder & Lapata (2006) refrain from using such features because they consider pronominalization to depend on the paragraph structure, and not the other way round. At the same time, they mention speech and optical character recognition tasks as possible application domains for PBI.</Paragraph> <Paragraph position="2"> There, pronouns are already given and need not be regenerated; hence features which utilize pronouns are entirely appropriate for such applications. Unlike in the recognition tasks, in multi-document summarization both decisions have to be made, and the order of the two tasks is not self-evident. The best solution would probably be to decide on both simultaneously using optimization methods (Roth & Yih, 2004; Marciniak & Strube, 2005); generating pronouns before inserting boundaries seems as reasonable as doing it the other way round.</Paragraph> </Section> <Section position="5" start_page="270" end_page="271" type="sub_section"> <SectionTitle> 4.5 Feature Selection </SectionTitle> <Paragraph position="0"> We determine the relevant feature set, and evaluate which features from this set contribute most to the performance of the system, by the following two procedures. First, we follow an iterative algorithm similar to the wrapper approach for feature selection (Kohavi & John, 1997), using the development data and TiMBL. The feature subset selection algorithm performs a hill-climbing search through the feature space. We start with a model based on all available features. Then we train models obtained by removing one feature at a time. We choose the worst performing feature, namely the one whose removal gives the largest improvement in F-measure, and remove it from the model. We then train classifiers removing each of the remaining features separately from the reduced model. The process is iterated as long as a significant improvement is observed (see the sketch below).</Paragraph> <Paragraph position="1"> To measure the contribution of the relevant features, we start with the three best features from Sporleder & Lapata (2006) (see Section 4.3) and train TiMBL, combining the current feature set with each remaining feature in turn. We then choose the best performing feature based on the F-measure and add it to the model. We iterate the process until all features have been added to the three-feature system. Thus, we optimize the default setting and obtain information on which features the paragraph structure crucially depends.</Paragraph>
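The first, elimination-based procedure can be sketched as follows. This is a simplified sketch: train_and_score is a placeholder for training TiMBL on a feature subset and returning the F-measure on the development set, and the min_gain threshold stands in for the significance test:

```python
# Simplified sketch of the backward feature-elimination loop (wrapper
# approach). train_and_score(features) is a placeholder for training
# TiMBL on the given feature subset and returning the F-measure on the
# development set.
def backward_selection(features, train_and_score, min_gain=0.001):
    current = set(features)
    best_f = train_and_score(current)
    while len(current) > 1:
        # Try removing each feature in turn; keep the removal that
        # yields the highest F-measure.
        candidates = [(train_and_score(current - {f}), f) for f in current]
        new_f, worst = max(candidates)
        if new_f - best_f < min_gain:   # no significant improvement: stop
            break
        current.remove(worst)
        best_f = new_f
    return current, best_f
```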
</Section> </Section> </Paper>