<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4040">
  <Title>A Lexically-Driven Algorithm for Disfluency Detection</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 EARS Disfluency Annotation
</SectionTitle>
    <Paragraph position="0"> One of the major goals of the DARPA program for Effective, Affordable, Reusable Speech-to-Text (EARS) (Wayne, 2003) is to provide a rich transcription of speech recognition output, including speaker identification, sentence boundary detection and the annotation of disfluencies in the transcript (This collection of additional features is also known as Metadata). One result of this program has been production of an annotation specification for disfluencies in speech transcripts and the transcription of sizable amounts of speech data, both from conversational telephone speech and broadcast news, according to this specification (Strassel, 2003).</Paragraph>
    <Paragraph position="1"> The task of disfluency detection is to distinguish fluent from disfluent words. The EARS MDE (MetaData Extraction) program addresses two types of disfluencies: (i) edits--words that were not intended to be said and that are normally replaced with the intended words, such as repeats, restarts, and revisions; and (ii) fillers--words with no meaning that are used as discourse markers and pauses, such as &amp;quot;you know&amp;quot; and &amp;quot;um&amp;quot;.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Algorithm
</SectionTitle>
    <Paragraph position="0"> We set out to solve the task of disfluency detection using primarily lexical features in a system we call System A.</Paragraph>
    <Paragraph position="1"> This section describes our algorithm, including the set of features we use to identify disfluencies.</Paragraph>
    <Paragraph position="2"> The training data for the system are time aligned reference speech transcripts, with speaker identification, sentence boundaries, edits, fillers and interruption points annotated. The input for evaluation is a transcript, either a reference transcript or a speech recognizer output transcript. Some of the evaluation data may be marked with sentence boundaries and speaker identification. The task is to identify which words in the transcript are fillers, edits, or fluent. The evaluation data was held out, and not available for tuning system parameters.</Paragraph>
    <Paragraph position="3"> The input to System A is a transcript of either conversational telephone speech (CTS) or broadcast news speech (BNEWS). In all experiments, the system was trained on reference transcripts, but was tested on both reference and speech output transcripts.</Paragraph>
    <Paragraph position="4"> We use a Transformation-Based Learning (TBL) (Brill, 1995) algorithm to induce rules from the training data. TBL is a technique for learning a set of rules that transform an initial hypothesis for the purpose of reducing the error rate of the hypothesis. The set of possible rules is found by expanding rule templates, which are given as an input. The algorithm greedily selects the rule that reduces the error rate the most, applies it to the data, and then searches for the next rule. The algorithm halts when there are no more rules that can reduce the error rate by more than the threshold. The output of the system is an ordered set of rules, which can then be applied to the test data to annotate it for disfluencies.</Paragraph>
    <Paragraph position="5"> We allow one of three tags to be assigned to each word: edit, filler or fluent. Since only 15% of the words in conversational speech are disfluent, we begin with the initial hypothesis that all the words in the corpus are fluent. The system then learns rules to relabel words as edits or fillers in order to reduce the number of errors. The rules are iteratively applied to the data from left to right.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Feature Set
</SectionTitle>
      <Paragraph position="0"> The rules learned by the system are conditioned on several features of each of the words including the lexeme (the word itself), a POS tag for the word, whether the word is followed by a silence and whether the word is a high frequency word. That is, whether the word is more frequent for this speaker than in the rest of the corpus.</Paragraph>
      <Paragraph position="1"> The last feature (high frequency of the word) is useful for identifying when words that are usually fluent--but are sometimes disfluent (such as &amp;quot;like&amp;quot;)--are more likely to be disfluencies, with the intuition being that if a speaker is using the word &amp;quot;like&amp;quot; very frequently, then it is likely that the word is being used as a filler. The word &amp;quot;like&amp;quot; for example was only a disfluency 22% of the time it occurred. So a rule that always tags &amp;quot;like&amp;quot; as a disfluency would hurt rather than help the system.2</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Rule Templates
</SectionTitle>
      <Paragraph position="0"> The system was given a set of 33 rule templates, which were used to generate the set of possible rules. Not all rule templates generated rules that were chosen by the system. Below is a representative subset of rule templates  chosen by the system. Change the label of: 1. word X from L1 to L2.</Paragraph>
      <Paragraph position="1"> 2. word sequence X Y to L1.</Paragraph>
      <Paragraph position="2"> 3. left side of simple repeat to L1.</Paragraph>
      <Paragraph position="3"> 4. word with POS X from L1 to L2 if followed by word with POS Y.</Paragraph>
      <Paragraph position="4"> 5. word from L1 to L2 if followed by words X Y. 6. word X with POS Y from L1 to L2.</Paragraph>
      <Paragraph position="5"> 7. A to L1 in the pattern A POS X B A, where A and B can be any words.</Paragraph>
      <Paragraph position="6"> 8. left side of repeat with POS X in the middle to L1. 9. word with POS X from L1 to L2 if followed by silence and followed by word with POS Y.</Paragraph>
      <Paragraph position="7"> 10. word X that is a high frequency word for the speaker from L1 to L2.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> All of the results in this section are from training and evaluation on data produced by the Linguistic Data Consortium (LDC) for the EARS Metadata community. There were 491,543 tokens in the CTS training set and 189,766 tokens in the BNEWS training set. The CTS evaluation set contained 33,670 tokens and the BNEWS evaluation set contained 14,544 tokens.</Paragraph>
    <Paragraph position="1"> We compare our System A to two other systems that were designed for the same task, System B and System C. System C was only applied to conversational speech, so there are no results for it on broadcast news transcripts. Our system was also given the same speech recognition output as System C for the conversational speech condition, whereas System B used transcripts produced by a different speech recognition system.</Paragraph>
    <Paragraph position="2"> 2We use a POS tagger (Ratnaparkhi, 1996) trained on switchboard data with the additional tags of FP (filled pause) and FRAG (word fragment).</Paragraph>
    <Paragraph position="3"> System B used both prosodic cues and lexical information to detect disfluencies. The prosodic cues were modeled by a decision tree classifier, whereas the lexical information was modeled using a 4-gram language model, separately trained for both CTS and BNEWS.</Paragraph>
    <Paragraph position="4"> System C first inserts IPs into the text using a decision-tree classifier based on both prosodic and lexical features and then uses TBL. In addition to POS, System C's feature set also includes whether the word is commonly used as a filler, edit, back-channel word, or is part of a short repeat. Turn and segment boundary flags were also used by the system. Whereas System A only attempted to learn three labels (filler, edit and fluent), System C attempted to learn many subtypes of disfluencies, which were not distinguished in the evaluation.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Lexeme Error Rate
</SectionTitle>
      <Paragraph position="0"> We use Lexeme Error Rate (LER) as a measure of recognition effectiveness. This measure is the same as the traditional word-error rate used in speech recognition, except that filled pauses and fragments are not optionally deletable. The LERs of the speech transcripts used by the three systems were all fairly similar (about 25% for CTS and 12% for BNEWS).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Top Rules Learned
</SectionTitle>
      <Paragraph position="0"> A total of 106 rules were learned by the system for CTS-the top 10 rules learned are:  1. Label all fluent filled pauses as fillers. 2. Label the left side of a simple repeat as an edit. 3. Label &amp;quot;you know&amp;quot; as a filler.</Paragraph>
      <Paragraph position="1"> 4. Label fluent &amp;quot;well&amp;quot;s with a UH part-of-speech as a filler. 5. Label fluent fragments as edits.</Paragraph>
      <Paragraph position="2"> 6. Label &amp;quot;I mean&amp;quot; as a filler.</Paragraph>
      <Paragraph position="3"> 7. Label the left side of a simple repeat separated by a filled pause as an edit.</Paragraph>
      <Paragraph position="4"> 8. Label the left side of a simple repeat separated by a fragment as an edit.</Paragraph>
      <Paragraph position="5"> 9. Label edit filled pauses as fillers.</Paragraph>
      <Paragraph position="6"> 10. Label edit fragments at end of sentence as fluent.  Of the errors that system was able to fix in the CTS training data, the top 5 rules were responsible for correcting 86%, the top ten rules, for 94% and the top twenty, for 96%.</Paragraph>
      <Paragraph position="7"> All systems were evaluated using rteval (Rich Transcription Evaluation) version 2.3 (Kubala and Srivastava, 2003). Rteval aligns the system output to the annotated reference transcripts in such a way as to minimize the lexeme error rate. The error rate is the number of disfluency errors (insertions and deletions) divided by the number of disfluent tokens in the reference transcript. Edit and filler errors are calculated separately. The results of the evaluation are shown in Table 1. Most of the small differences in the CTS results were not found to be significantly different. null</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Error Analysis
</SectionTitle>
    <Paragraph position="0"> It is clear from the discrepancies between the reference and speech condition that a large portion of the errors (a majority except in the case of edit detection for CTS) are due to errors in the STT (Speech-To-Text). This is most notable for fillers in broadcast news where the error rate for our system increases from 6.5% to 57.2%. Such a trend can be seen for the other systems, indicating that-even with prosodic models--the other systems were not more robust to the lexical errors.</Paragraph>
    <Paragraph position="1"> All three systems produced comparable results on all of the conditions, with the only large exception being edit detection for CTS Reference, where System B had an error rate of 59% compared to our system's error rate of  The speech output condition suffers from several types of errors due to errors in the transcript produced by the speech transcription system. First, the system can output the wrong word causing it to be misannotated. 27% of our edit errors in CTS and 19% of our filler errors occurred when the STT system misrecognized the word. If a filled pause is hallucinated, the disfluency detection system will always annotate it as a filler. Errors also occur (19% of our edit and 12% of our filler error) when the recognizer deletes a word that was an edit or a filler. Finally, errors in the context words surrounding disfluencies can affect disfluency detection as well.</Paragraph>
    <Paragraph position="2"> One possible method to correct for the STT errors would be to train our system on speech output from the recognizer rather than on reference transcripts. Another option would be to use a word recognition confidence score from the recognizer as a feature in the TBL system; these were not used. A more systematic analysis of the errors caused by the recognizer and their effect on  conditions.</Paragraph>
    <Paragraph position="3"> detect edits. Consider the following word sequence: &amp;quot;[ and whenever they come out with a warning ] you know they were coming out with a warning about trains &amp;quot;. The portion within square brackets is the edit to be detected. The difficulty in finding such regions is that the edit itself appears very fluent. One can identify these regions by examining what comes after the edit and finding that is highly similar in content to the edit region. Prosodic features can be useful in identifying the interruption point at which the edit ends, but the degree to which the edit extends backwards from this point still needs to be identified. Long distance dependencies should reveal the edit region, and it is possible that parsing or semantic analysis of the text would be a useful technique to employ. In addition there are other cues such as the filler &amp;quot;you know&amp;quot; after the edit which can be used to locate these edit regions. Long edit regions (of length four or more) are responsible for 48% of the edit errors in the CTS reference condition for our system.</Paragraph>
  </Section>
class="xml-element"></Paper>