<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4032"> <Title>Parsing Conversational Speech Using Enhanced Segmentation</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> In all experiments, the SLM parser was trained on the baseline truth Switchboard corpus described above, with hand-annotated SUs and optionally IPs. For testing, the system was presented with conversation sides segmented according to the various SU predictions, and evaluated on its performance in predicting the true syntactic structure.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Experimental Variables </SectionTitle> <Paragraph position="0"> We seek to explore how much current metadata detection algorithms improve parsing over naïve pause-based segmentation. To this end, we test along two experimental dimensions: SU segmentation and IP detection.</Paragraph> <Paragraph position="1"> Some type of segmentation is critical to most parsers.</Paragraph> <Paragraph position="2"> In the SU dimension, parser training was held constant while the test segmentation varied across three conditions: (i) oracle, hand-labeled SU segmentation; (ii) automatic, SU segmentation from the automatic detection system using both prosody and lexical cues (Kim et al., 2004); and (iii) naïve, SU segmentation from a decision tree predictor using only pause duration cues. 
The SUs are included as words, similar to sentence boundaries in prior SLM work.</Paragraph> <Paragraph position="3"> By varying the SU segmentation of the test data for our system, we gain insight into how the performance of SU detection changes the overall accuracy of the parser.</Paragraph> <Paragraph position="4"> We expect interruption points to be useful to parsing, since edit points often indicate a restart point, and the preceding syntactic phrase should attach to the tree differently. In the IP dimension, we examined two conditions (present and absent). For each condition, we re-trained the parser including hand-labeled IPs, since the vocabulary of available &quot;words&quot; is different when the IP is included as an input token. The two IP conditions are: (a) No IP, training the parser on syntax that did not include IPs as words, and testing on segmented input that also did not include IP tokens; and (b) IP, training and testing on input that includes IPs as words. The incorporation of IPs as words may not be ideal, since it reduces the number of true words available to an N-gram model at a given point, but it has the advantages of simplicity and consistency with SU treatment. Because the naïve system does not predict IPs, we have experiments for only 5 of the 6 possible combinations (3 SU conditions by 2 IP conditions).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Evaluation </SectionTitle> <Paragraph position="0"> We evaluated parser performance with bracket precision and recall scores, as well as bracket crossing, under the PARSEVAL metric (Sekine and Collins, 1997; Black et al., 1991). This bracket-counting metric for parsers requires that the input words (and, by implication, sentences) be held constant across test conditions. 
Since our experiments deliberately vary the segmentation, we needed to evaluate each conversation side as a single &quot;sentence&quot; in order to obtain meaningful results across different segmentations. We construct this top-level sentence by attaching the parser's proposed constituents for each SU to a new top-level constituent (labeled TIPTOP).</Paragraph> <Paragraph position="1"> Thus, we can compare two different segmentations of the same data, because this construction ensures that the segmentations will agree at least at the beginning and end of the conversation. Segmentation errors will of course cause some mismatches, but that possibility is what we are investigating.</Paragraph> <Paragraph position="2"> For evaluation, we ignore the TIPTOP bracket (which always contains the entire conversation side), so this technique does not interfere with accurate bracket counting, but allows segmentation errors to be evaluated at the level of bracket counting. The SLM parser uses binary trees, but the syntactic structures we are given as truth often branch in N-ary ways, where N &gt; 2. The parse trees used for training the SLM use bar-level nodes to transform N-ary trees into binary ones; the reverse mapping of SLM-produced binary trees back to N-ary trees is done by simply removing the bar-level constituents. Finally, to compare the IP-present conditions with the non-IP conditions, we ignore IP tokens when counting brackets.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Results </SectionTitle> <Paragraph position="0"> Table 1 shows the parser performance: average bracket crossing (lower is better), precision and recall (higher is better). The number of bracket crossings per &quot;sentence&quot; is quite high, because all text from a given conversation side is evaluated as one &quot;TIPTOP sentence&quot;. 
Precision and recall are computed over the bracketing of all the tokens under consideration (i.e., not including bar-level brackets, and not including IP token labeling). All differences are highly significant (according to a sign test at the conversation level) except for the comparison of oracle results with and without IPs.</Paragraph> <Paragraph position="1"> We find that the HMM-based SU detection system achieves a 7% improvement in precision and recall over the naïve pause-based system, and an 18% reduction in average bracket crossing. Further, the use of IPs as input tokens improves parser performance, especially when the segmentation is imperfect. While segmentation has an impact on parsing, it is not the limiting factor: the best possible bracketing respecting the automatic segmentation has 96.50% recall and 99.35% precision. Adding punctuation to the oracle case (Oracle+P) improves performance, as seen more clearly with the F-measure because of changes in the precision-recall balance. The F-measure goes from 72.6 for oracle/no-IP to 74.3 for oracle+P/no-IP to 74.7 for oracle+P/IP. The fact that punctuation is useful on top of the oracle segmentation suggests that a richer representation of structural metadata would be beneficial. The reasons why IPs do not have much of an impact in the oracle case are not clear: it could be a modeling issue, or it could simply be that the IPs add robustness to the automatic segmentation.</Paragraph> </Section> </Section> </Paper>