<?xml version="1.0" standalone="yes"?> <Paper uid="P89-1031"> <Title>EVALUATING DISCOURSE PROCESSING ALGORITHMS</Title> <Section position="6" start_page="257" end_page="258" type="concl"> <SectionTitle> 4 Conclusion </SectionTitle> <Paragraph position="0"> We can benefit in two ways from performing such evaluations: (a) we get general results on a methodology for doing evaluation, (b) we discover ways we can improve current theories. A split of evaluation efforts into quantitative versus qualitative is incoherent. We cannot trust the results of a quantitative evaluation without doing a considerable amount of qualitative analyses and we should perform our qualitative analyses on those components that make a significant contribution to the quantitative results; we need to be able to measure the effect of various factors. These measurements must be made by doing comparisons at the data level.</Paragraph> <Paragraph position="1"> In terms of general results, we have identified some factors that make evaluations of this type more complicated and which might lead us to evaluate solely quantitative results with care. These are: (a) To decide how to evaluate UNDERSPECIFICATIONS and the contribution of ASSUMPTIONS, and (b) To determine the effects of FALSE POSITIVES and ERKOR CHAINING.</Paragraph> <Paragraph position="2"> We advocate an approach in which the contribution of each underspeeification and assumption is tabulated as well as the effect of error chains. If a principled way could be found to identify false positives, their effect should be reported as well as part of any quantitative evaluation.</Paragraph> <Paragraph position="3"> In addition, we have takeri a few steps towards determining the relative importance of different factors to the successful operation of discourse modules. The percent of successes that both algorithms get indicates that syntax has a strong influence, and that at the very least we can reduce the amount of inference required. In 590PS to 82% of the cases both algorithms get the correct result. This probably means that in a large number of cases there was no potential conflict of co-specifiers. In addition, this analysis has shown, that at least for task-oriented dialogues global focus is a significant factor, and in general discourse structure is more important in the task dialogues. However simple devices such as cue words may go a long way toward determining this structure.</Paragraph> <Paragraph position="4"> Finally, we should note that doing evaluations such as this allows us to determine the GENERALITY of our approaches. Since the performance of both Hobbs and BFP varies according to the type of the text, and in fact was significantly worse on the task dialogues than on the texts, we might question how their performance would vary on other inputs. An annotated corpus comprising some of the various NL input types such as those I discussed in the introduction would go a long way towards giving us a basis against whichwe could evaluate the generality of our theories.</Paragraph> </Section> class="xml-element"></Paper>