<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3502"> <Title>Backbone Extraction and Pruning for Speeding Up a Deep Parser for Dialogue Systems</Title> <Section position="6" start_page="12" end_page="14" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> The purpose of our evaluation is to explore the extent to which we can achieve a better balance between parse time and coverage using backbone parsing with pruning compared to the original best-first algorithm. For our comparison we used an excerpt from the Monroe corpus that has been used in previous TRIPS research on parsing speed and accuracy (Swift et al., 2004), consisting of dialogues s2, s4, s16 and s17. Dialogue s2 was a hold-out set used for pilot testing and setting parameters. The other three dialogues were set aside for testing. Altogether, the test set contained 1042 utterances, ranging from 1 to 45 words in length (mean 5.38 words/utt, st. dev. 5.7 words/utt). Using our hold-out set, we determined that a beam width of three was an optimal setting.</Paragraph> <Paragraph position="1"> Thus, we compared TFLEX using a beam width of 3 to three different versions of TRIPS that varied only in terms of the maximum chart size, giving us a version that is significantly faster than TFLEX overall, one with parse times statistically indistinguishable from TFLEX, and one that is significantly slower. 
We show that while lower chart sizes in TRIPS yield speed-ups in parse time, they come with a cost in terms of coverage.</Paragraph> <Section position="1" start_page="13" end_page="13" type="sub_section"> <SectionTitle> 5.1 Evaluation Methodology </SectionTitle> <Paragraph position="0"> Because our goal is to explore the parse time versus coverage trade-offs of two different parsing architectures, the two evaluation measures that we report are average parse time per sentence and probability of finding at least one parse, the latter estimating the effect of the parsing algorithm on coverage.</Paragraph> <Paragraph position="1"> Since the scoring model is the same in TRIPS and TFLEX, as long as TFLEX can find at least one parse (which happened in all but one instance on our held-out set), the set returned will include the one produced by TRIPS. We spot-checked the TFLEX utterances in the test set for which TRIPS could not find a parse to verify that the parses produced were reasonable. The parses produced by TFLEX on these sentences were typically acceptable, with errors mainly stemming from attachment disambiguation problems.</Paragraph> </Section> <Section position="2" start_page="13" end_page="14" type="sub_section"> <SectionTitle> 5.2 Results </SectionTitle> <Paragraph position="0"> We first compared the parsers in terms of probability of producing at least one parse (see Figure 2). Since the distribution of sentence lengths in the test corpus was heavily skewed toward shorter sentences, we grouped sentences into equivalence classes based on ranges of sentence lengths with a 5-word increment, with all sentences over 20 words aggregated into a single class. Given the large number of short sentences, there was no significant difference overall in the likelihood of finding a parse. 
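The grouping into aggregated length classes described above can be sketched as follows. This is a minimal illustration under our own assumptions: the function name and the use of 25 as the label for the over-20-words class are ours, not taken from the TRIPS/TFLEX implementation.

```python
import math

def aggregated_length_class(num_words):
    """Map a sentence length to its equivalence-class label.

    Lengths are grouped in 5-word increments (1-5 -> 5, 6-10 -> 10, ...,
    16-20 -> 20), and all sentences longer than 20 words fall into one
    aggregated class, labeled 25 here (an assumed label).
    """
    if num_words > 20:
        return 25
    return 5 * math.ceil(num_words / 5)
```

For example, a 12-word utterance falls into the class labeled 15, while the 45-word maximum-length utterance in the test set falls into the aggregated over-20 class.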
However, on sentences greater than 10 words long, TFLEX is significantly more likely to produce a parse than any of the TRIPS parsers (evaluated using a binary logistic regression, N = 334, G = 16.8, DF = 1, p < .001).</Paragraph> <Paragraph position="1"> Furthermore, for sentences greater than 20 words long, no form of TRIPS parser ever returned a complete parse.</Paragraph> <Paragraph position="2"> Next we compared the parsers in terms of average parse time on the whole data set across equivalence classes of sentences, assigned based on Aggregated Sentence Length (see Figure 2 and Table 1). An ANOVA with Parser and Aggregated Sentence Length as independent variables and Parse Time as the dependent variable showed a significant effect of Parser on Parse Time (F(3, 4164) = 270.03, p < .001). Using a Bonferroni post-hoc analysis, we determined that TFLEX is significantly faster than TRIPS-10000 (p < .001), statistically indistinguishable in terms of parse time from TRIPS-5000, and significantly slower than TRIPS-1500 (p < .001).</Paragraph> <Paragraph position="3"> Since none of the TRIPS parsers ever returned a parse for sentences greater than 20 words long, we recomputed this analysis excluding those sentences. We still find a significant effect of Parser on Parse Time (F(3, 4068) = 18.6, p < .001). 
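The parser comparison above rests on a standard F test. As a self-contained illustration, the following computes the one-way ANOVA F statistic from raw per-parser observations; this is a sketch of the textbook statistic only, not the analysis reported above, which used two factors (Parser and Aggregated Sentence Length) and a Bonferroni post-hoc step.

```python
def one_way_anova_f(groups):
    """Compute the one-way ANOVA F statistic.

    groups: list of lists of observations, one list per condition
            (e.g. parse times for each parser configuration).
    Returns (F, df_between, df_within).
    """
    k = len(groups)                      # number of conditions
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: deviation of group means from the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group sum of squares: spread of observations around their own mean.
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
    )
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

With equal group means the statistic is 0; the larger the separation between group means relative to within-group variance, the larger F, which is then compared against the F distribution with (df_between, df_within) degrees of freedom.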
However, a post-hoc analysis reveals that parse times for TFLEX, TRIPS-1500, and TRIPS-5000 are statistically indistinguishable for this subset, whereas TFLEX is significantly faster than TRIPS-10000 (p < .001).</Paragraph> <Paragraph position="4"> See Table 1 for parse times of all four parsers (in the table, the class labeled 5 denotes sentences with 5 or fewer words, and 25 sentences with more than 20 words).</Paragraph> <Paragraph position="5"> Since TFLEX and TRIPS both spent 95% of their computational effort on sentences with 6 or more words, we also include results for this subset of the corpus.</Paragraph> <Paragraph position="6"> Thus, TFLEX presents a superior balance of coverage and efficiency, especially for long sentences (10 words or more), since for these sentences it is significantly more likely to find a parse than any version of TRIPS, even a version where the chart size is expanded to the extent that it becomes significantly slower (i.e., TRIPS-10000). And while TRIPS-1500 is consistently faster than the other parsers, it is not significantly faster than TFLEX on sentences 20 words long or less, which is the subset of sentences for which it is able to find a parse.</Paragraph> </Section> <Section position="3" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 5.3 Discussion and Future Work </SectionTitle> <Paragraph position="0"> The most obvious lesson learned from this experience is that speed-up techniques developed for specific grammars and unification formalisms do not transfer easily to other unification grammars. The features that make TRIPS interesting (the inclusion of lexical semantics, and the rules for parsing fragments) also make it less amenable to existing efficiency techniques.</Paragraph> <Paragraph position="1"> Grammars with an explicit CFG backbone normally restrict the grammar writer from writing grammar loops, a restriction not imposed by general unification grammars. 
As we showed, there can be a substantial number of loops in a CFG due to the need to cover various elliptical constructions, which makes CFG parsing not interleaved with unification less attractive in cases where we want to avoid expensive CFG precompilation. Moreover, as we found with the TRIPS grammar, in the context of robust parsing with lexical semantics the ambiguity in a CFG backbone grows large enough to make CFG parsing followed by unification inefficient. We described an alternative technique that uses pruning based on a parse selection model.</Paragraph> <Paragraph position="2"> Another option for speeding up parsing that we have not discussed in detail is using a typed grammar without disjunction and speeding up unification as done in HPSG grammars (Kiefer et al., 1999). In order to do this, we must first address the issue of integrating the type of lexical semantics that we require with HPSG's type system. Adding lexical semantics while retaining the speed benefits obtained through this type system would require that the semantic type ontology be expressed in the same formalism as the syntactic types. We plan to further explore this option in our future work.</Paragraph> <Paragraph position="3"> Though longer sentences were relatively rare in our test set, using the system in an educational domain (our ultimate goal) means that longer sentences are particularly important, because they often correspond to significant instructional events, specifically answers to deep questions such as why and how questions. Our evaluation has been designed to show system performance with utterances of different lengths, which roughly corresponds to performance in interpreting short and long student answers. Since delays in responding can demotivate the student and decrease the quality of the dialogue, improving the handling of long utterances can have an important effect on system performance. 
Evaluating this in practice is a possible direction for future work.</Paragraph> </Section> </Section> </Paper>