<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1619">
  <Title>The Types and Distributions of Errors in a Wide Coverage Surface Realizer Evaluation</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Types of Methodological Errors
</SectionTitle>
    <Paragraph position="0"> Errors in the surface realizer evaluation, which can manifest themselves either as empty sentences or as generated sentences which do not exactly match the target string, can arise from the corpus itself, the transformation component, or the surface realizer, which consists of the grammar, linearization rules, and the morphology component.</Paragraph>
    <Paragraph position="1"> The corpus itself can be a source of errors due to two main reasons: (1) the corpus annotators have incorrectly analyzed the syntactic structure, for instance, attaching prepositions to the wrong head, or including grammatically impossible rules, such as NP AX VB CC VB, or (2) the parts of speech were mistagged by the automatic POS tagger and were not corrected during the supervision process, as in (NP (NN petition) (VBZ drives)).</Paragraph>
    <Paragraph position="2"> Unfortunately, the corpus cannot easily be cleaned up to remove these errors, as this significantly complicates the comparison of results across corpus versions. We must thus subtract the proportion of corpus errors from the results, creating a &amp;quot;topline&amp;quot; which defines a maximum performance measure for the realizer. Manually analyzing incorrect sentences produced from the corpus allows this topline to be determined with reasonable accuracy.</Paragraph>
    <Paragraph position="3"> In addition to errors in the corpus, other types of errors originate in the transformation component, as it attempts to match TreeBank annotations with rules that produce the requisite input notation for the surface realization. While such transformers are highly idiosyncratic due to differing input and output notations, the following categories are abstract, and thus likely to apply to many different transformers.</Paragraph>
    <Paragraph position="4"> AF Missing Tag: While there is a standardized set of tags, the semantic subtags and coreference identifiers can combine to create unpredictable tags, such as PP-LOC-PRD-TPC-3 or PP-EXT=2.</Paragraph>
    <Paragraph position="5"> AF Missing Rule: Often each of the individual tags are recognized, but no rule exists to be selected for a given ordered combination of tags, like ADVP-MNR AX RB</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
COMMA RB RB .
AF Incorrect Rule: The transformation component may se-
</SectionTitle>
    <Paragraph position="0"> lect the wrong rule to apply, or the rule itself may be written incorrectly or may not have been written with all possible combinations of tag sequences in mind.</Paragraph>
    <Paragraph position="1"> AF Ordering: Some phrasal elements such as adverbial clauses can be placed in five or even six different positions in the matrix clause. Choosing the wrong position will result in errors reported by the automatic accuracy metrics, as discussed in [Callaway, 2003]. An important note is that the order can be incorrect but still make sense semantically.</Paragraph>
    <Paragraph position="2"> Finally, even given a correct input representation, a surface realizer can also produce errors during the realization process. Of the four main surface realizer functions below, only syntactic rules provide a significant source of accuracy errors from the point of view of averaged metrics: AF Syntactic Rules: The grammar may be missing a particular syntactic rule or set of features, or may have been encoded incorrectly. For instance, the stock version of FUF/SURGE did not have a rule allowing noun phrases to terminate in an adverb like &amp;quot;ago&amp;quot; as in &amp;quot;five years ago&amp;quot;, which occurs frequently in the Penn TreeBank, causing the word to be missing from the generated sentence.</Paragraph>
    <Paragraph position="3"> AF Morphology: While morphological errors occasionally appear, they are usually very small and do not contribute much to the overall accuracy score. The most common problems are irregular verbs, foreign plural nouns, and the plurals of acronyms, as well as the marking of acronyms and y/u initial letters with indefinite a/an.</Paragraph>
    <Paragraph position="4"> AF Punctuation: While most errors involving punctuation marks also contribute very little statistically to the over-all score of a sentence (e.g., a missing comma), the Tree-Bank also contains combinations of punctuation like long dashes followed by quotation marks. Additionally, incorrect generation of mixed quotations can lead to repeated penalties when incorrectly determining the boundaries of the quoted speech, and large penalties if the multiple forms of punctuation occur at the same boundary [Callaway, 2003].</Paragraph>
    <Paragraph position="5"> AF Linear Precedence: In our analysis of realizer errors, no examples of obligatory precedence violations were found (as opposed to &amp;quot;Ordering&amp;quot; problems described above.)</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Types of Syntactic Errors
</SectionTitle>
    <Paragraph position="0"> While general types of errors in the evaluation process are helpful for improving future evaluations, a more pressing question for those wishing to use an off-the-shelf surface realizer is how well it will work in their own application domain. The coverage and accuracy metrics used by Langkilde are very broad measures which say nothing about the effectiveness of a surface realizer when generating individual syntactic constructions. The advantage of these metrics are that they are easy to compute over the entire corpus, but lose this capability when the same question is asked about particular subsets of a general corpus, such as all sentences containing an indirect question.</Paragraph>
    <Paragraph position="1"> When performing an analysis of the types of syntactic errors produced by FUF/SURGE when given correct inputs, we found nine syntactic constructions that resulted in at least two or more sentences being generated incorrectly. The analysis allows us to conclude that FUF/SURGE is either not reliably capable or else incapable of correctly producing the following syntactic constructions (a manual analysis of all 5,383 sentences to find all correct instances of these constructions is impractical, although we were able to automate some corpus searches based on particular semantic tags): AF Inversion: Pragmatic inversions of auxiliaries in embedded questions [Green, 2001] or in any other construction besides questions, negations, and quoted speech. Thus a TreeBank sentence like &amp;quot;This is the best time to buy, as was the case two years ago.&amp;quot; cannot be generated.</Paragraph>
    <Paragraph position="2"> AF Missing verb tense: While FUF/SURGE has 36 pre-defined verb tenses, the corpus contained several instances of another tense: &amp;quot;. . . which fellow officers remember as having been $300.&amp;quot; AF Mixed conjunctions: Often in the Penn TreeBank the UCP tag (unlike coordinated phrase) marks conjunctions where the constituents are not all of the same grammatical category, but in compound verb phrases, they are often marked as simple conjunctions of mixed types.</Paragraph>
    <Paragraph position="3"> But FUF/SURGE requires verb phrases in conjunctions to be compatible on certain clausal features with all constituents, which is violated in the following example: [VP AX VP CC VP COMMA SBAR-ADV] &amp;quot;Instead, they bought on weakness and sold into the strength, which kept the market orderly.&amp;quot; AF Mixed type NP modifiers:FUF/SURGE's NP system assumes that cardinal numbers will precede adjective modifiers, which will precede nominal modifiers, although the newspaper texts in the Penn TreeBank have more complex NPs than were considered during the design of the NP system: &amp;quot;a $100 million Oregon general obligation veterans' tax note issue&amp;quot;.</Paragraph>
    <Paragraph position="4"> AF Direct questions: Direct questions are not very common in newspaper text, in fact there are only 61 of them out of the entire 5,383 sentences of Sections 20-22. More complex questions involving negations and modal auxiliaries are not handled well, for example &amp;quot;Couldn't we save $20 billion by shifting it to the reserves?&amp;quot; though simpler questions are generated correctly.</Paragraph>
    <Paragraph position="5"> AF Indirect questions: The Penn TreeBank contains a roughly equivalent number of instances of indirect questions as direct, such as &amp;quot;It's a question of how much credibility you gain.&amp;quot; and again the reliability of generating this construction depends on the complexity of the verbal clause and the question phrase.</Paragraph>
    <Paragraph position="6"> AF Mixed level quotations: One of the most difficult syntactic phenomena to reproduce is the introduction of symmetric punctuation that cuts across categorial boundaries [Doran, 1998]. For instance, in the following sentence, the first pair of quote marks are at the beginning of an adverbial phrase, and the second pair are in the middle, separating two of its constituents: . . . the U.S. would stand by its security commitments &amp;quot;as long as there is a threat&amp;quot; from Communist North Korea.</Paragraph>
    <Paragraph position="7"> AF Complex relative pronouns: While simple relatives are almost always handled correctly except in certain conjunctions, complex relatives like partitive relatives (&amp;quot;all of which&amp;quot;, &amp;quot;some of which&amp;quot;), relatives of indirect objects or peripheral verbal arguments like locatives (&amp;quot;to whom&amp;quot;, &amp;quot;in which&amp;quot;), complex possessives (&amp;quot;whose $275-a-share offer&amp;quot;) and raised NP relatives (&amp;quot;...the swap, details of which...&amp;quot;) were not considered when FUF/SURGE was designed.</Paragraph>
    <Paragraph position="8"> AF Topicalization: The clausal system of FUF/SURGE is based on functional grammar [Halliday, 1976], and so does not expressly consider syntactic phenomena such as left dislocation or preposing of prepositional phrases.</Paragraph>
    <Paragraph position="9"> Thus sentences like &amp;quot;Among those sighing with relief was John H. Gutfreund&amp;quot; may generate correctly depending on their clausal thematic type, like material or equative.</Paragraph>
    <Paragraph position="10"> While we present the results of a manual analysis of the data in the next section, it is important to remember that the large majority of syntactic constructions, punctuation and morphology worked flawlessly in the evaluation of FUF/SURGE as described in [Callaway, 2004]. As described earlier, almost 7 out of every 10 sentences in the unseen test set were exact matches, including punctuation and capitalization. Additionally, most errors that did occur were in the transformation component rather than the surface realizer, as we will describe shortly. Finally, some well-studied but rare syntactic constructions did not occur in the sections of the Penn TreeBank that we examined, such as left dislocation and negative NP preposing.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Data Analysis
</SectionTitle>
    <Paragraph position="0"> As mentioned previously, we undertook a manual analysis of Sections 20-22 of the Penn TreeBank by hand to determine specific reasons behind the failure of 629 sentences out of 4,240 that met the criteria of having between 15 and 44 words, and having a character error rate of more than 9 as determined by the SSA metric.</Paragraph>
    <Paragraph position="1"> Table 2 presents the results for high-level error types as described in Section 3. It shows that the greatest proportion of errors is due to the transformation process: 390 sentences (62.0%) or 15,733 (63.4%) of the character-based accuracy error. This is expected given that the transformation component has been developed in a year or so, while FUF/SURGE has been in use for around 15 years. Each of the 166 sentences that were incorrect due to inaccurate transformer rules was verified by ensuring that the sentence would correctly generate with minor changes to the automatically produced  functional description. The error rate of the Penn TreeBank annotation is a reasonably well-known quantity, and there is a specialized literature describing automatic correction methods (e.g., [Dickinson and Meurers, 2003]).</Paragraph>
    <Paragraph position="2"> One surprise though is that while the number of errors due to the ordering of floating constituents is the same, the error in accuracy is skewed to semantically acceptable interpretations. And while the distribution of the order seems like random chance, it should be remembered that there can potentially be up to 10 acceptable placements when there are multiple floating constituents. Additionally, unrecognized annotation tags seem to invoke the heaviest average penalty for any error type, but have the lowest rate of occurrence.</Paragraph>
    <Paragraph position="3"> Some advice then for future evaluations of this type would be to systematically ensure that all tags are normalized in the corpus before writing transformation rules. Missing transformation rules were always single-case errors, and a large amount of effort would need to be expended to account for them, following the well-known 80/20 rule. The data in Table 3 then allows other surface realizer researchers to prioritize their time when developing their own evaluations.</Paragraph>
    <Paragraph position="4"> Finally, slightly over a quarter of the reduction in accuracy is due to syntactic phenomena that are not handled correctly by the surface realizer. Given that this error category is most of interest in determining which surface realizer has the necessary coverage for a particular domain, we investigated further the interactions between error rates and individual syntactic phenomena.</Paragraph>
    <Paragraph position="5"> Table 3 presents the number of occurrences of errors for each of the syntactic phenomena presented in the previous section. We can see that topicalizations, direct questions and inversions were on average most likely to produce the largest error per instance, at 73.18, 56.00 and 50.08 edit distances each. The most frequent error types were mixed NP modifiers, but such constructions were small enough (often involving only two words in switched order) that they had the second lowest SSA penalty.</Paragraph>
    <Paragraph position="6"> Knowing the ratios of errors allows those weighing different surface realizers for a new project to select based on a number of criteria. For instance, in some domains, it may be undesirable to have the reader see a large number of surface language errors where the extent of each error is unimportant, whereas in other situations, large mistakes that completely obscure the intent of the sentence are more of a problem.</Paragraph>
    <Paragraph position="7"> While Table 3 tells us which syntactic type is most likely to produce the largest accuracy penalty, it does not tell us which syntactic types are most frequent in the corpus, since this would require also counting all correct instances, which would be very prohibitive to do manually and inaccurate to do automatically. Knowing this quantity would be of greatest help to an NLG application designer wanting to compare surface realizers, but is difficult to do in practice.</Paragraph>
    <Paragraph position="8"> We thus decided to look at correct instances of a small number of rare phenomena which can easily be found by searching for tags in the TreeBank. For instance, it-clefts are marked with the annotation S-CLF, of which there are 4 in the 5,383 sentences in Sections 20-22. However, by searching through the text representations with the regular expression it is * that and it was * that, we found an additional 2 it-clefts that were incorrectly marked (although all 6 examples were exact matches when generated by the surface realizer). By a similar process, we discovered 7 marked and 1 unmarked wh-clefts, which also were exact matches.</Paragraph>
    <Paragraph position="9"> A further investigation for topicalized sentences uncovered 6 instances that were correctly generated versus the 11 incorrectly generated.</Paragraph>
    <Paragraph position="10"> The number of errors in the Penn TreeBank annotations on these rare constructions should give pause to those who want to create statistical selection algorithms from such data, given that the signal-to-noise ratio may be very high. Additionally, all of the data presented above reflects only this corpus; spoken dialogue corpora may vary significantly in frequencies of topicalization and left dislocation, for example.</Paragraph>
  </Section>
class="xml-element"></Paper>