<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2019">
  <Title>Early Deletion of Fillers In Processing Conversational Speech</Title>
  <Section position="3" start_page="73" end_page="73" type="metho">
    <SectionTitle>
2 Disfluency in Brief
</SectionTitle>
    <Paragraph position="0"> In this section we give a brief introduction to disfluency, providing an excerpt from Switchboard (Graff and Bird, 2000) that demonstrates typical production of repairs and fillers in conversational speech.</Paragraph>
    <Paragraph position="1"> We follow previous work (Shriberg, 1994) in describing a repair in terms of three parts: the reparandum (the material repaired), the corrected alteration, and between these an optional interregnum (or editing term) consisting of one or more fillers. Our notion of fillers encompasses filled pauses (e.g. uh, um, ah) as well as other vocalized space-fillers annotated by LDC (Taylor, 1995), such as you know, i mean, like, so, well, etc. Annotations shown here are typeset with the following conventions: fillers are bold, [reparanda] are squarebracketed, and alterations are underlined.</Paragraph>
    <Paragraph position="2"> S1: Uh first um i need to know uh how do you feel [about] uh about sending uh an elderly uh family member to a nursing home S2: Well of course [it's] you know it's one of the last few things in the world you'd ever want to do you know unless it's just you know really you know uh [for their] uh you know for their own good Though disfluencies rarely complicate understanding for an engaged listener, deleting them from transcripts improves readability with no reduction in reading comprehension (Jones et al., 2003). For automated analysis of speech data, this means we may freely explore processing alternatives which delete disfluencies without compromising meaning.</Paragraph>
  </Section>
  <Section position="4" start_page="73" end_page="75" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> This section reports parsing experiments studying the effect of early deletion under in-domain and out-of-domain parser training conditions using the August 2005 release of the Charniak parser (2000). We describe data and evaluation metrics used, then proceed to describe the experiments.</Paragraph>
    <Section position="1" start_page="73" end_page="73" type="sub_section">
      <SectionTitle>
3.1 Data
</SectionTitle>
      <Paragraph position="0"> Conversational speech data was drawn from the Switchboard corpus (Graff and Bird, 2000), which annotates disfluency (Taylor, 1995) as well as syntax. Our division of the corpus follows that used in (Charniak and Johnson, 2001). Speech recognizer (ASR) output is approximated by removing punctuation, partial words, and capitalization, but we do use reference words, representing an upperbound condition of perfect ASR. Likewise, annotated sentence boundaries are taken to represent oracle boundary detection. Because fillers are annotated only in disfluency markup, we perform an automatic tree transform to merge these two levels of annotation: each span of contiguous filler words were pruned from their corresponding tree and then reinserted at the same position under a flat FILLER constituent, attached as highly as possible. Transforms were achieved using TSurgeon2 and Lingua::Treebank3.</Paragraph>
      <Paragraph position="1"> For our out-of-domain training condition, the parser was trained on sections 2-21 of the Wall Street Journal (WSJ) corpus (Marcus et al., 1993). Punctuation and capitalization were removed to bleach our our textual training data to more closely resemble speech (Rosenfeld et al., 1995). We also tried automatically changing numbers, symbols, and abbreviations in the training text to match how they would be read (Roark, 2002), but this did not improve accuracy and so is not discussed further.</Paragraph>
    </Section>
    <Section position="2" start_page="73" end_page="74" type="sub_section">
      <SectionTitle>
3.2 Evaluation Metrics
</SectionTitle>
      <Paragraph position="0"> As discussed earlier (SS1), Charniak and Johnson (2001) have argued that speech repairs do not  contribute to meaning and so there is little value in syntactically analyzing repairs or evaluating our ability to do so. Consequently, they relaxed standard PARSEVAL (Black et al., 1991) to treat EDITED constituents like punctuation: adjacent EDITED constituents are merged, and the internal structure and attachment of EDITED constituents is not evaluated.</Paragraph>
      <Paragraph position="1"> We propose generalizing this approach to disfluency at large, i.e. fillers as well as repairs. Note that the details of appropriate evaluation metrics for parsed speech data is orthogonal to the parsing methods proposed here: however parsing is performed, we should avoid wasting metric attention evaluating syntax of words that do not contribute toward meaning and instead evaluate only how well such words can be identified.</Paragraph>
      <Paragraph position="2"> Relaxed metric treatment of disfluency was achieved via simple parameterization of the SParseval tool (Harper et al., 2005). SParseval also has the added benefit of calculating a dependency-based evaluation alongside PARSEVAL's bracket-based measure. The dependency metric performs syntactic head-matching for each word using a set of given head percolation rules (derived from Charniak's parser (2000)), and its relaxed formulation ignores terminals spanned by FILLER and EDITED constituents. We found this metric offered additional insights in analyzing some of our results.</Paragraph>
    </Section>
    <Section position="3" start_page="74" end_page="75" type="sub_section">
      <SectionTitle>
3.3 Results
</SectionTitle>
      <Paragraph position="0"> In the first set of experiments, we train the parser on Switchboard and contrast early deletion of disfluencies (identified by an oracle) versus parsing in the more usual fashion. Our method for early deletion generalizes the approach used with repairs in (Charniak and Johnson, 2001): contiguous filler and edit words are deleted from the input strings, the strings are parsed, and the removed words are reinserted into the output trees under the appropriate flat constituent, FILLER or EDITED.</Paragraph>
      <Paragraph position="1"> Results in Table 1 give F-scores for PARSEVAL and dependency-based parse accuracy (SS3.2), as well as per-word edit and filler detection accuracy (i.e.</Paragraph>
      <Paragraph position="2"> how well the parser does in identifying which terminals should be spanned by EDITED and FILLER constituents when early deletion is not performed). We see that the parser correctly identifies filler words with 93.1% f-score, and that early deletion of fillers  provement in parsing accuracy (87.8% to 88.9% bracket-based, 87.9% to 88.5% dependency-based).</Paragraph>
      <Paragraph position="3"> We conclude from this that for in-domain training, early deletion of fillers has limited potential to improve parsing accuracy relative to what has been seen with repairs. It is still worth noting, however, that the parser does perform better when fillers are absent, consistent with Engel et al.'s findings (2002). While fillers have been reported to often occur at major clause boundaries (Shriberg, 1994), suggesting their presence may benefit parsing, we do not find this to be the case. Results shown for repair detection accuracy and its impact on parsing are consistent with previous work (Charniak and Johnson, 2001; Kahn et al., 2005; Harper et al., 2005).</Paragraph>
      <Paragraph position="4"> Our second set of experiments reports the effect of deleting fillers early when the parser is trained on text only (WSJ, SS3.1). Our motivation here is to see if disfluency modeling, particularly filler detection, can help bleach speech data to more closely resemble text, thereby improving our ability to process it using text-based methods and training data (Rosenfeld et al., 1995). Again we contrast standard parsing with deleting disfluencies early (via oracle knowledge). Given our particular interest in fillers, we also report the effect of detecting them via a state-of-the-art system (Johnson et al., 2004).</Paragraph>
      <Paragraph position="5"> Results appear in Table 2. It is worth noting that since our text-trained parser never produces FILLER or EDITED constituents, the bracket-based metric penalizes it for each such constituent appearing in the gold trees. Similarly, since the dependency metric ignores terminals occurring under these constituents in the gold trees, the metric penalizes the parser for producing dependencies for these termi- null nals. Taken together, the two metrics provide a complementary perspective in interpreting results.</Paragraph>
      <Paragraph position="6"> The trend observed across metrics and edit detection conditions shows that early deletion of systemdetected fillers improves parsing accuracy 5-10%. As seen with in-domain training, early deletion of repairs is again seen to have a significant effect.</Paragraph>
      <Paragraph position="7"> Given that state-of-the-art edit detection performs at about 80% f-measure (Johnson and Charniak, 2004), much of the benefit derived here from oracle repair detection should be realizable in practice. The broader conclusion we draw from these results is that disfluency modeling has significant potential to improve text-based processing of speech data.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>