<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4032">
<Title>Parsing Conversational Speech Using Enhanced Segmentation</Title>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 1. The choice </SectionTitle>
<Paragraph position="0"> between the two schemes is made according to the identity of the nonterminal label on the left-hand side of a rewrite rule.</Paragraph>
<Paragraph position="1"> An N-best EM variant is employed to jointly re-estimate the model parameters so that the perplexity on the training data decreases, i.e., the likelihood increases. Experimentally, the reduction in perplexity carries over to the test set.</Paragraph>
<Paragraph position="2"> The SLM can be used for parsing either as a generative model P(W,T) or as a conditional model P(T|W). In the latter case, the word prediction P(w_{k+1} | W_k T_k) is omitted in Eq. (1). For further details on the SLM, see (Chelba and Jelinek, 2000).</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.2 SU and IP Detection </SectionTitle>
<Paragraph position="0"> The system used here for SU and IP detection is that of (Kim et al., 2004), modulo differences in training data. It combines decision tree models of prosody with a hidden event language model in a hidden Markov model (HMM) framework for detecting events at each word boundary, similar to (Liu et al., 2003). Differences include the use of lexical pattern-matching features (sequentially matching words or POS tags) as well as prosody cues in the decision tree, and a joint representation of SU and IP boundary events rather than separate detectors. On the DARPA RT-03F metadata test set (NIST, 2003), the model has a 35.0% slot error rate (SER) for SUs (75.7% recall, 87.7% precision) and a 68.8% SER for edit IPs (41.8% recall, 79.8% precision) on reference transcripts, using the rteval scoring tool. (Note that the IP performance figures are not comparable to those in the DARPA evaluation, since we restrict the focus to IPs associated with edit disfluencies.) While these error rates are relatively high, it is a difficult task and the SU performance is at the state of the art.</Paragraph>
<Paragraph position="1"> Since early work on &quot;sentence&quot; segmentation simply looked at pause duration, we designed a decision tree classifier to predict SU events based only on the pause duration after a word boundary. This model served as a baseline condition, referred to here as the &quot;naïve&quot; predictor since it makes no use of the other prosodic or lexical cues that are important for preventing IPs or hesitations from triggering false SU detection. The naïve predictor has an SU SER of 68.8%, roughly twice that of the HMM, with a large loss in recall (43.2% recall, 79.0% precision).</Paragraph>
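<Paragraph position="2"> To make the baseline concrete, the following is a minimal sketch (in Python) of a pause-only SU predictor of this kind; the function name and the 0.5 s threshold are illustrative assumptions, not the paper's implementation, and a real system would tune the threshold on held-out data.

# Sketch of a pause-only SU predictor, i.e. a one-feature, one-split decision tree.
# The threshold value is an assumption for illustration only.
def naive_su_predictor(pause_durations, threshold=0.5):
    """Label each word boundary 'SU' if the following pause exceeds a fixed
    duration threshold, and 'none' otherwise."""
    return ["SU" if pause >= threshold else "none" for pause in pause_durations]

# Example: pause (in seconds) after each word of a short utterance.
print(naive_su_predictor([0.02, 0.75, 0.10, 1.30]))
# -> ['none', 'SU', 'none', 'SU']
</Paragraph>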
</Section>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Corpus </SectionTitle>
<Paragraph position="0"> The data used in this work is the treebank (TB3) portion of the Switchboard corpus of conversational telephone speech, which includes sentence boundaries as well as the reparandum and interruption point of disfluencies.</Paragraph>
<Paragraph position="1"> The data consists of 816 hand-transcribed conversation sides (566K words), of which we reserve 128 conversation sides (61K words) for evaluation testing, according to the 1993 NIST evaluation choices.</Paragraph>
<Paragraph position="2"> We use a subset of Switchboard data - hand-annotated for SUs and IPs - for training the SU/IP boundary event detector, and for providing the oracle versions of these events as a control in our experiments. The annotation conventions for this data, referred to as V5 (Strassel, 2003), are slightly different from those used in the TB3 annotations in a few important ways.</Paragraph>
<Paragraph position="3"> Notably for this work, V5 annotates IPs for both conversational fillers (such as filled pauses and discourse markers) and self-edit disfluencies, while TB3 represents only edit-related IPs. This difference is addressed by explicitly distinguishing between these types in the IP detection model. In addition, the V5 conventions define an SU as including only one independent main clause, so the size of the &quot;segments&quot; available for parsing is sometimes smaller than in TB3.</Paragraph>
<Paragraph position="4"> Further, the SU boundaries were determined by annotators who actually listened to the speech signal, rather than being annotated from text alone as for TB3. One consequence of these differences is a small amount of additional error due to train/test mismatch. More importantly, the &quot;ground truth&quot; for the syntactic structure must be mapped to the SU segmentation, for both training and test.</Paragraph>
<Paragraph position="5"> In many cases, the original syntactic constituents span multiple SUs, but we follow a simple rule in generating this new SU-based truth: only those constituents contained entirely within an SU are retained. In practice, this means eliminating a few high-level constituents. The effect is usually to change the interpretation of some sentence-level conjunctions to be discourse markers, rather than conjoining two main clauses. This change is arguably an improvement, since the SU annotation relies on human annotators who take acoustic information (not only the words) into consideration.</Paragraph>
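<Paragraph position="6"> The containment rule above can be illustrated with a short sketch (in Python); the span representation, function name, and example indices are illustrative assumptions, not the paper's code or conventions.

# Sketch of the SU-based mapping of the syntactic "ground truth":
# keep only constituents whose word span lies entirely inside a single SU.
# Spans are (start, end) word-index pairs, chosen purely for illustration.
def retain_constituents(constituents, su_spans):
    """Return the constituents contained entirely within some SU span."""
    kept = []
    for c_start, c_end in constituents:
        for s_start, s_end in su_spans:
            if s_start <= c_start and c_end <= s_end:
                kept.append((c_start, c_end))
                break
    return kept

# Example: a top-level constituent spanning two SUs is dropped; constituents
# inside either SU (including each SU-sized clause) are retained.
sus = [(0, 5), (6, 12)]
constituents = [(0, 12), (0, 5), (6, 12), (2, 4)]
print(retain_constituents(constituents, sus))
# -> [(0, 5), (6, 12), (2, 4)]
</Paragraph>
</Section>
</Paper>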