<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2032"> <Title>Story Segmentation of Broadcast News in English, Mandarin and Arabic</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The NIGHTINGALE Corpus </SectionTitle> <Paragraph position="0"> The training data used for NIGHTINGALE includes the TDT-4 and TDT-5 corpora (Strassel and Glenn, 2003; Strassel et al., 2004). TDT-4 includes newswire text and broadcast news audio in English, Arabic and Mandarin; TDT-5 contains only text data and is therefore not used by our system. The TDT-4 audio corpus includes 312.5 hours of English Broadcast News from 450 shows, 88.5 hours of Arabic news from 109 shows, and 134 hours of Mandarin broadcasts from 205 shows.</Paragraph> <Paragraph position="1"> This material was drawn from six English news shows (ABC World News Tonight, CNN Headline News, NBC Nightly News, Public Radio International's The World, MS-NBC News with Brian Williams, and Voice of America English), three Mandarin newscasts (China National Radio, China Television System, and Voice of America Mandarin Chinese), and two Arabic newscasts (Nile TV and Voice of America Modern Standard Arabic). All of these shows aired between Oct. 1, 2000 and Jan. 31, 2001.</Paragraph> </Section> <Section position="5" start_page="0" end_page="126" type="metho"> <SectionTitle> 4 Our Features and Approach </SectionTitle> <Paragraph position="0"> Our story segmentation procedure is essentially one of binary classification, trained on a variety of acoustic and lexical cues to the presence or absence of story boundaries in BN. Our classifier was trained using the JRip machine learning algorithm, a Java implementation of the RIPPER algorithm of Cohen (1995).1 All of the cues we use are automatically extracted. 
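The boundary classification just described can be sketched as follows. Since JRip/RIPPER is a Java tool, this illustration substitutes a minimal one-rule learner in Python; the feature values (lexical similarity, pause duration, speaker change) are invented for the example and are not from the paper's data:

```python
# Minimal 1R-style rule learner as a stand-in for RIPPER/JRip (illustrative only).
def learn_one_rule(X, y):
    """Pick the (feature, threshold, direction) split that best separates boundaries."""
    best = None
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        for i in range(len(values) - 1):
            thr = (values[i] + values[i + 1]) / 2
            for positive_above in (True, False):
                preds = [int((row[f] > thr) == positive_above) for row in X]
                acc = sum(p == t for p, t in zip(preds, y)) / len(y)
                if best is None or acc > best[0]:
                    best = (acc, f, thr, positive_above)
    return best[1:]  # (feature index, threshold, positive_above)

def apply_rule(rule, row):
    f, thr, positive_above = rule
    return int((row[f] > thr) == positive_above)

# Hypothetical features per candidate sentence boundary:
# [lexical similarity, pause duration (s), speaker change (0/1)]
X = [[0.82, 0.1, 0], [0.75, 0.2, 0], [0.10, 1.5, 1], [0.15, 1.2, 1]]
y = [0, 0, 1, 1]  # 1 = story boundary at this sentence boundary

rule = learn_one_rule(X, y)
print(apply_rule(rule, [0.12, 1.4, 1]))  # -> 1
```

Like RIPPER, the learned model is a human-readable rule over the cue features rather than an opaque weight vector, which is one reason rule learners are attractive for this kind of error analysis.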
We use as input to our classifier three types of automatic annotation produced by other components of the NIGHTINGALE system: speech recognition (ASR) transcription, speaker diarization, and sentence segmentation. Currently, we assume that story boundaries occur only at these hypothesized sentence boundaries. For our English corpus, this assumption is true for only 47% of story boundaries; the average reference story boundary is 9.88 words from an automatically recognized sentence boundary.2 This errorful input immediately limits our overall performance.</Paragraph> <Paragraph position="1"> For each such hypothesized sentence boundary, we extract a set of features based on the previous and following hypothesized sentences. The classifier then outputs a prediction of whether or not this sentence boundary coincides with a story boundary.</Paragraph> <Paragraph position="2"> The features we use for story boundary prediction are divided into three types: lexical, acoustic and speaker-dependent.</Paragraph> <Paragraph position="3"> The value of even errorful lexical information in identifying story boundaries has been confirmed for many previous story segmentation systems (Beeferman et al., 1999; Stokes, 2003). We include some previously tested types of lexical features in our own system, as well as identifying our own 'cue-word' features from our training corpus. Our lexical features are extracted from ASR transcripts produced by the NIGHTINGALE system. They include lexical similarity scores calculated with the TextTiling algorithm (Hearst, 1997), which determines the lexical similarity of blocks of text by analyzing the cosine similarity of a sequence of sentences; this algorithm tests the likelihood of a topic boundary between blocks, preferring locations between blocks which have minimal lexical similarity. 
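The core TextTiling computation can be sketched as a bag-of-words cosine similarity between adjacent blocks of text, where a dip in similarity suggests a topic boundary (the example sentences are invented for illustration):

```python
import math
from collections import Counter

def cosine_similarity(block_a, block_b):
    """Cosine similarity between two text blocks, treated as bags of words."""
    a, b = Counter(block_a.lower().split()), Counter(block_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

# Adjacent blocks on the same topic share vocabulary; a topic shift does not.
same_topic = cosine_similarity("the senate passed the budget bill",
                               "the budget bill now goes to the house")
new_topic = cosine_similarity("the senate passed the budget bill",
                              "heavy rain flooded streets in jakarta")
print(same_topic > new_topic)  # -> True
```

TextTiling proper slides this comparison along the sentence sequence and places boundaries at local minima of the similarity curve; the sketch above shows only the block-comparison step.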
For English, we stem the input before calculating these features, using an implementation of the Porter stemmer (Porter, 1980); we have not yet attempted to identify root forms for Mandarin or Arabic. We also calculate scores from Galley et al. (2003)'s LCseg method, a TextTiling-like approach which weights the cosine similarity of a text window by an additional measure of its component LEXICAL CHAINS, repetitions of stemmed content words. [Footnote 2, continued: 62%, with average distances between sentence and story boundary of 1.97 and 2.91 words.]</Paragraph> <Paragraph position="4"> We also identify 'cue-words' from our training data that we find to be significantly more likely (determined by χ2) to occur within a window preceding or following a story boundary. We include as features the number of such words observed within 3, 5, 7 and 10 word windows before and after the candidate sentence boundary. For English, we include the number of pronouns contained in the sentence, on the assumption that speakers would use more pronouns at the end of stories than at the beginning. We have not yet obtained reliable part-of-speech tagging for Arabic or Mandarin. Finally, for all three languages, we include features that represent the sentence length in words and the relative sentence position in the broadcast.</Paragraph> <Paragraph position="5"> Acoustic/prosodic information has been shown to be indicative of topic boundaries in both spontaneous dialogs and more structured speech, such as broadcast news (cf. (Hirschberg and Nakatani, 1998; Shriberg et al., 2000; Levow, 2004)). The acoustic features we extract include, for the current sentence, the minimum, maximum, mean, and standard deviation of F0 and intensity, and the median and mean absolute slope of F0, calculated over the entire sentence. Additionally, we compute the first-order difference from the previous sentence for each of these. 
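The χ2 criterion used for cue-word selection is the standard 2x2 chi-square test, comparing a word's frequency in boundary-adjacent windows against its frequency elsewhere. A minimal version, with invented counts for illustration:

```python
def chi_square(obs_near, total_near, obs_far, total_far):
    """2x2 chi-square: is a word over-represented near story boundaries?"""
    # Contingency table: [[near & word, near & other], [far & word, far & other]]
    table = [[obs_near, total_near - obs_near],
             [obs_far, total_far - obs_far]]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical counts: a candidate cue word occurs 40 times in 1,000
# boundary-adjacent words but only 10 times in 9,000 other words.
print(chi_square(40, 1000, 10, 9000) > 3.84)  # exceeds the p=0.05 critical value
```

Words whose statistic exceeds the chosen critical value would be kept as cue words, and their counts in the 3-, 5-, 7- and 10-word windows become classifier features.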
As an approximation of each sentence's speaking rate, we include the ratio of voiced 10ms frames to the total number of frames in the sentence.</Paragraph> <Paragraph position="6"> These acoustic values were extracted from the audio input using the Praat speech analysis software (Boersma, 2001). Also, using the phone alignment information derived from the ASR process, we calculate speaking rate in terms of the number of vowels per second as an additional feature. Under the hypothesis that topic-ending sentences may exhibit some additional phrase-final lengthening, we compare the length of the sentence-final vowel and of the sentence-final rhyme to average durations for that vowel and rhyme for the speaker, where speaker identity is available from the NIGHTINGALE diarization component; otherwise we use unnormalized values.</Paragraph> <Paragraph position="7"> We also use speaker identification information from the diarization component to extract some features indicative of a speaker's participation in the broadcast as a whole. We hypothesize that participants in a broadcast may have different roles, such as an anchor providing transitions between stories and reporters beginning new stories (Barzilay et al., 2000), and thus that speaker identity may serve as a story boundary indicator. To capture such information, we include binary features answering the questions: Is the speaker preceding this boundary the first speaker in the show? Is this the first time the speaker has spoken in this broadcast? The last time? Does a speaker boundary occur at this sentence boundary? Also, we include the percentage of sentences in the broadcast spoken by the current speaker.</Paragraph> <Paragraph position="8"> We assumed in the development of this system that the source of the broadcast is known, specifically the source language and the show identity (e.g., ABC World News Tonight, CNN Headline News). 
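The speaker-based binary features can be sketched directly from a diarization-style sequence of per-sentence speaker labels; the labels and the helper below are hypothetical stand-ins, not the NIGHTINGALE component's actual output format:

```python
def speaker_features(speaker_turns, index):
    """Binary speaker features at the sentence boundary before sentence `index`.

    speaker_turns: per-sentence speaker labels in broadcast order
    (a hypothetical stand-in for diarization output).
    """
    prev = speaker_turns[index - 1]
    return {
        "prev_is_first_speaker": prev == speaker_turns[0],
        "prev_first_time_speaking": prev not in speaker_turns[:index - 1],
        "prev_last_time_speaking": prev not in speaker_turns[index:],
        "speaker_change_here": speaker_turns[index] != prev,
        "prev_speaker_share": speaker_turns.count(prev) / len(speaker_turns),
    }

turns = ["anchor", "reporter1", "anchor", "reporter2", "anchor"]
feats = speaker_features(turns, 3)  # boundary before the reporter2 sentence
print(feats["speaker_change_here"], feats["prev_speaker_share"])  # True 0.6
```

In this toy broadcast the anchor's high share of sentences and the anchor-to-reporter hand-off are exactly the role cues the paragraph above hypothesizes to mark story boundaries.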
Given this information, we constructed different classifiers for each show. This type of source-specific modeling was shown to improve performance by Tür (2001).</Paragraph> </Section> </Paper>