<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1619"> <Title>The Types and Distributions of Errors in a Wide Coverage Surface Realizer Evaluation</Title> <Section position="4" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Methodology </SectionTitle> <Paragraph position="0"> Undertaking a large-scale evaluation for a symbolic surface realizer requires a large corpus of sentence plans. Since text planners cannot generate either the requisite syntactic variation or quantity of text, [Langkilde-Geary, 2002] developed an evaluation strategy for HALOGEN employing a substitute: sentence parses from the Penn TreeBank [Marcus et al., 1993], a corpus that includes texts from newspapers such as the Wall Street Journal, and which have been hand-annotated for syntax by linguists.</Paragraph> <Paragraph position="1"> However, surface realizers typically have idiosyncratic input representations, and none use the Penn TreeBank parse representation. Thus a transformer is needed to convert the TreeBank notation into the language accepted by the surface realizer. As we were interested in comparing the coverage and accuracy of FUF/SURGE with Langkilde's HALOGEN system, we implemented a similar transformer [Callaway, 2003] to convert Penn TreeBank notation into the representation used by FUF/SURGE .</Paragraph> <Paragraph position="2"> As with the HALOGEN evaluation, we used Simple String Accuracy [Doddington, 2002] and BLEU [Papineni et al., 2001] to determine the average accuracy for FUF/SURGE .</Paragraph> <Paragraph position="3"> To obtain a meaningful comparison, we utilized the same approach as HALOGEN, treating Section 23 of the TreeBank as an unseen test set. A recent evaluation showed that the combination of the transformer and an augmented version of FUF/SURGE had higher coverage and accuracy (Table 1) compared to both HALOGEN and version 2.2 of FUF/SURGE .</Paragraph> <Paragraph position="4"> The difference between the two versions of FUF/SURGE was especially striking, with the augmented version almost doubling coverage and more than doubling exact match accuracy. One example of these differences is the addition of grammatical rules for direct and indirect written dialogue, which comprise approximately 15% of Penn TreeBank sentences, and which is vital for domains such as written fiction. However, this evaluation method does not allow for seamless comparisons. Inserting a transformation component between the corpus and realizer means that not only is the surface realizer being evaluated, but also the accompanying transformation component itself. We were thus interested in determining what proportion of reported errors could be attributed to the surface realizer as opposed to the transformer or to the corpus. To do this, as well as to assist application designers in better interpreting the results of these formal evaluations, we needed to identify more precisely what types of failures are involved.</Paragraph> <Paragraph position="5"> We thus undertook a manual analysis of errors in Sections 20-22 by hand, individually examining 629 erroneous sentences to determine the reason for their failure. Although more than 629 out of the 5,383 sentences in these development sections produced accuracy errors, we eliminated approximately 600 others from consideration: AF Simple string accuracy less than 10 characters: Sentences with very small error rates are almost always incorrect due to errors in morphology, punctuation, and capitalization. 
<Paragraph position="3"> To obtain a meaningful comparison, we used the same approach as HALOGEN, treating Section 23 of the TreeBank as an unseen test set. A recent evaluation showed that the combination of the transformer and an augmented version of FUF/SURGE achieved higher coverage and accuracy (Table 1) than both HALOGEN and version 2.2 of FUF/SURGE.</Paragraph>
<Paragraph position="4"> The difference between the two versions of FUF/SURGE was especially striking, with the augmented version almost doubling coverage and more than doubling exact-match accuracy. One example of these differences is the addition of grammatical rules for direct and indirect written dialogue, which comprises approximately 15% of Penn TreeBank sentences and which is vital for domains such as written fiction. However, this evaluation method does not allow for seamless comparisons. Inserting a transformation component between the corpus and the realizer means that the evaluation covers not only the surface realizer but also the accompanying transformation component itself. We were thus interested in determining what proportion of reported errors could be attributed to the surface realizer as opposed to the transformer or to the corpus. To do this, as well as to assist application designers in better interpreting the results of these formal evaluations, we needed to identify more precisely what types of failures are involved.</Paragraph>
<Paragraph position="5"> We thus undertook a manual analysis of errors in Sections 20-22, individually examining 629 erroneous sentences to determine the reason for their failure. Although more than 629 of the 5,383 sentences in these development sections produced accuracy errors, we eliminated approximately 600 others from consideration: • Simple string accuracy penalty of less than 10 characters: Sentences with very small error rates are almost always incorrect due to errors in morphology, punctuation, and capitalization. For instance, a single incorrect placement of quotation marks incurs a penalty of 4. Thus it made little sense to manually examine them all on the off chance that a handful had true syntactic errors.</Paragraph>
<Paragraph position="6"> • Sentences of fewer than 10 or more than 35 words: Sentences with 9 or fewer words were extremely unlikely to contain complex syntactic constructs, and collectively had an accuracy of over 99.1%. Sentences longer than 35 words with errors typically had more than one major grammatical error, making it very difficult to determine a single "exact" cause. Of the sentences within this range, 96% had a single grammatical cause, and the remaining 4% had only two syntactic errors.</Paragraph>
<Paragraph position="7"> Each error instance in the resulting 629 sentences was classified twice: first to determine the source of the error (corpus, transformer, or grammar), and then, in the case of grammatical errors, to note the syntactic rule that caused the error. These two classifications are discussed in the following two sections.</Paragraph>
<Paragraph position="8"> We were not realistically able to perform a similar comparison for coverage errors because, of the 4,240 sentences satisfying the second (sentence-length) criterion, only 17 failed to generate any string at all.</Paragraph>
</Section>
</Paper>