<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1045">
  <Title>Improving Testsuites via Instrumentation</Title>
  <Section position="3" start_page="325" end_page="326" type="metho">
    <SectionTitle>
2 Grammar Instrumentation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="325" end_page="325" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> Measures from Software Engineering cannot be simply transferred to Grammar Engineering, because the structure of programs is different from that of unification grammars. Nevertheless, the structure of a grammar allows the derivation of suitable measures, similar to the structure of programs; this is discussed in Sec. 2.1. The actual instrumentation of the grammar depends on the formalism used, and is discussed in Sec. 2.2.</Paragraph>
    </Section>
    <Section position="2" start_page="325" end_page="325" type="sub_section">
      <SectionTitle>
2.1 Coverage Criteria
</SectionTitle>
      <Paragraph position="0"> Consider the LFG grammar rule in Fig. 1. 2 On first view, one could require of a testsuite that each such rule is exercised at least once. ~rther thought will indicate that there are hidden alternatives, namely the optionality of the NP and the PP. The rule can only be said to be thoroughly tested if test cases exist which test both presence and absence of optional constituents (requiring 4 test cases for this rule).</Paragraph>
      <Paragraph position="1"> In addition to context-free rules, unification grammars contain equations of various sorts, as illustrated in Fig.1. Since these annotations may also contain disjunctions, a testsuite with complete rule coverage is not guaranteed to exercise all equation alternatives. The phrase-structure-based criterion defined above must be refined to cover all equation alternatives in the rule (requiring two test cases for the PP annotation). Even if we assume that (as, e.g., in LFG) there is at least one equation associated with each constituent, equation coverage does not subsume rule coverage: Optional constituents introduce a rule disjunct (without the constituent) that is not characterizable by an equation. A measure might thus be defined as follows: disjunct coverage The disjunct coverage of a test-suite is the quotient number of disjuncts tested Tdis = number of disjuncts in grammar 2Notation: ?/*/+ represent optionality/iteration including/excluding zero occurrences on categories. Annotations to a category specify equality (=) or set membership (6) of feature values, or non-existence of features (-1); they are terminated by a semicolon ( ; ). Disjunctions are given in braces ({... I-.. }). $ ($) are metavariables representing the feature structure corresponding to the mother (daughter) of the rule. Comments are enclosed in quotation marks (&amp;quot;... &amp;quot;). Cf. (Kaplan and Bresnan, 1982) for an introduction to LFG notation. where a disjunct is either a phrase-structure alternative, or an annotation alternative. Optional constituents (and equations, if the formalism allows them) have to be treated as a disjunction of the constituent and an empty category (cf. the instrumented rule in Fig.2 for an example).</Paragraph>
      <Paragraph position="2"> Instead of considering disjuncts in isolation, one might take their interaction into account. The most complete test criterion, doing this to the fullest extent possible, can be defined as follows: interaction coverage The interaction coverage of a testsuite is the quotient number of disjunct combinations tested Tinter = number of legal disjunct combinations There are methodological problems in this criterion, however. First, the set of legal combinations may not be easily definable, due to far-reaching dependencies between disjuncts in different rules, and second, recursion leads to infinitely many legal disjunct combinations as soon as we take the number of usages of a disjunct into account. Requiring complete interaction coverage is infeasible in practice, similar to the path coverage criterion in Software Engineering. null We will say that an analysis (and the sentence receiving this analysis) relies on a grammar disjunct if this disjunct was used in constructing the analysis.</Paragraph>
    </Section>
    <Section position="3" start_page="325" end_page="326" type="sub_section">
      <SectionTitle>
2.2 Instrumentation
</SectionTitle>
      <Paragraph position="0"> Basically, grammar instrumentation is identical to program instrumentation: For each disjunct in a given source grammar, we add grammar code that will identify this disjunct in the solution produced, iff that disjunct has been used in constructing the solution.</Paragraph>
      <Paragraph position="1"> Assuming a unique numbering of disjuncts, an annotation of the form DISJUNCT-nn = + can be used for marking. To determine whether a certain disjunct was used in constructing a solution, one only needs to check whether the associated feature occurs (at some level of embedding) in the solution. Alternatively, if set-valued features are available, one can use a set-valued feature DISJUNCTS to collect atomic symbols representing one disjunct each: DISJUNCT-nn 6 DISJUNCTS.</Paragraph>
      <Paragraph position="2"> One restriction is imposed by using the unification formalism, though: One occurrence of the mark cannot be distinguished from two occurrences, since the second application of the equation introduces no new information. The markers merely unify, and there is no way of counting.</Paragraph>
      <Paragraph position="4"> Therefore, we have used a special feature of our grammar development environment: Following the LFG spirit of different representation levels associated with each solution (so-called projections), it provides for a multiset of symbols associated with the complete solution, where structural embedding plays no role (so-called optimality projection; see (Frank et al., 1998)). In this way, from the root node of each solution the set of all disjuncts used can be collected, together with a usage count.</Paragraph>
      <Paragraph position="5"> Fig. 2 shows the rule from Fig.1 with such an instrumentation; equations of the form DISJUNCT-nnE o* express membership of the disjunct-specific atom DISJUNCT-nn in the sentence's multiset of disjunct markers.</Paragraph>
    </Section>
    <Section position="4" start_page="326" end_page="326" type="sub_section">
      <SectionTitle>
2.3 Processing Tools
</SectionTitle>
      <Paragraph position="0"> Tool support is mandatory for a scenario such as instrumentation: Nobody will manually add equations such as those in Fig. 2 to several hundred rules. Based on the format of the grammar rules, an algorithm instrumenting a grammar can be written down easily.</Paragraph>
      <Paragraph position="1"> Given a grammar and a testsuite or corpus to compare, first an instrumented grammar must be constructed using such an algorithm. This instrumented grammar is then used to parse the testsuite, yielding a set of solutions associated with information about usage of grammar disjuncts. Up to this point, the process is completely automatic. The following two sections discuss two possibilities to evaluate this information. null</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="326" end_page="5480" type="metho">
    <SectionTitle>
3 Quality of Testsuites
</SectionTitle>
    <Paragraph position="0"> This section addresses the aspects of completeness (&amp;quot;does the testsuite exercise all disjuncts in the grammar?&amp;quot;) and economy of a testsuite (&amp;quot;is it minimal?&amp;quot;). null Complementing other work on testsuite construction (cf. Sec.5), we will assume that a grammar is already available, and that a testsuite has to be constructed or extended. While one may argue that grammar and testsuite should be developed in parallel, such that the coding of a new grammar disjunct is accompanied by the addition of suitable test cases, and vice versa, this is seldom the case. Apart from the existence of grammars which lack a testsuite, and to which this procedure could be usefully applied, there is the more principled obstacle of the evolution of the grammar, leading to states where previously necessary rules silently loose their usefulness, because their function is taken over by some other rules, structured differently. This is detectable by instrumentation, as discussed in Sec.3.1.</Paragraph>
    <Paragraph position="1"> On the other hand, once there is a testsuite, you want to use it in the most economic way, avoiding redundant tests. Sec.3.2 shows that there are different levels of redundancy in a testsuite, dependent on the specific grammar used. Reduction of this redundancy can speed up the test activity, and give a clearer picture of the grammar's performance.</Paragraph>
    <Section position="1" start_page="326" end_page="327" type="sub_section">
      <SectionTitle>
3.1 Testsuite Completeness
</SectionTitle>
      <Paragraph position="0"> If the disjunct coverage of a testsuite is 1 for some grammar, the testsuite is complete w.r.t, this grammar. Such a testsuite can reliably be used to monitor changes in the grammar: Any reduction in the grammar's coverage will show up in the failure of some test case (for negative test cases, cf. Sec.4).</Paragraph>
      <Paragraph position="1"> If there is no complete testsuite, one can - via instrumentation - identify disjuncts in the grammar for which no test case exists. There might be either (i) appropriate, but untested, disjuncts calling for the addition of a test case, or (ii) inappropriate disjuncts, for which one cannot construct a grammatical test case relying on them (e.g., left-overs from rearranging the grammar). Grammar instrumentation singles out all untested disjuncts automatically, but cases (i) and (ii) have to be distinguished manually. null Checking completeness of our local testsuite of 1787 items, we found that only 1456 out of 3730 grammar disjuncts ir~ our German grammar were tested, yielding Tdis = 0.39 (the TSNLP testsuite containing 1093 items tests only 1081 disjuncts, yielding Tdis = 0.28). 3 Fig.3 shows an example of a gap in our testsuite (there are no examples of circumpositions), while Fig.4 shows an inapproppriate disjunct thus discovered (the category ADVadj has been eliminated in the lexicon, but not in all rules). Another error class is illustrated by Fig.5, which shows a rule that can never be used due to an LFG coherence violation; the grammar is inconsistent here. 4 3There are, of course, unparsed but grammatical test cases in both testsuites, which have not been taken into account in these figures. This explains the difference to the overall number of 1582 items in the German TSNLP testsuite. 4Test cases using a free dative pronoun may be in the testsuite, but receive no analysis since the grammatical function FREEDAT is not defined as such in the configuration section.</Paragraph>
      <Paragraph position="3"/>
    </Section>
    <Section position="2" start_page="327" end_page="5480" type="sub_section">
      <SectionTitle>
3.2 Testsuite Economy
</SectionTitle>
      <Paragraph position="0"> Besides being complete, a testsuite must be economical, i.e., contain as few items as possible without sacrificing its diagnostic capabilities. Instrumentation can identify redundant test cases. Three criteria can be applied in determining whether a test case is redundant: similarity There is a set of other test cases which jointly rely on all disjunct on which the test case under consideration relies.</Paragraph>
      <Paragraph position="1"> equivalence There is a single test case which relies on exactly the same combination(s) of disjuncts.</Paragraph>
      <Paragraph position="2"> strict equivalence There is a single test case which is equivalent to and, additionally, relies on the disjuncts exactly as often as, the test case under consideration.</Paragraph>
      <Paragraph position="3"> For all criteria, lexical and structural ambiguities must be taken into account. Fig.6 shows some equivalent test cases derived from our testsuite: Example 1 illustrates the distinction between equivalence and strict equivalence; the test cases contain different numbers of attributive adjectives, but are nevertheless considered equivalent. Example 2 shows that our grammar does not make any distinction between adverbial usage and secondary (subject or object) predication. Example 3 shows test cases which should not be considered equivalent, and is discussed below.</Paragraph>
      <Paragraph position="4"> The reduction we achieved in size and processing time is shown in Table 1, which contains measurements for a test run containing only the parseable test cases, one without equivalent test cases (for every set of equivalent test cases, one was arbitrar- null The last was constructed using a simple heuristic: Starting with the sentence relying on the most disjuncts, working towards sentences relying on fewer disjuncts, a sentence was selected only if it relied on a disjunct on which no previously selected sentence relied. Assuming that a disjunct working correctly once will work correctly more than once, we did not consider strict equivalence.</Paragraph>
      <Paragraph position="5"> We envisage the following use of this redundancy detection: There clearly are linguistic reasons to distinguish all test cases in example 2, so they cannot simply be deleted from the testsuite. Rather, their equivalence indicates that the grammar is not yet perfect (or never will be, if it remains purely syntactic). Such equivalences could be interpreted as  a reminder which linguistic distinctions need to be incorporated into the grammar. Thus, this level of redundancy may drive your grammar development agenda. The level of equivalence can be taken as a limited interaction test: These test cases represent one complete selection of grammar disjuncts, and (given the grammar) there is nothing we can gain by checking a test case if an equivalent one was tested. Thus, this level of redundancy may be used for ensuring the quality of grammar changes prior to their incorporation into the production version of the grammar. The level of similarity contains much less test cases, and does not test any (systematic) interaction between disjuncts. Thus, it may be used during development as a quick rule-of-thumb procedure detecting serious errors only.</Paragraph>
      <Paragraph position="6"> Coming back to example 3 in Fig.6, building equivalence classes also helps in detecting grammar errors: If, according to the grammar, two cases are equivalent which actually aren't, the grammar is incorrect. Example 3 shows two test cases which are syntactically different in that the first contains the adverbial oft, while the other doesn't. The reason why they are equivalent is an incorrect rule that assigns an incorrect reading to the second test case, where the infinitival particle &amp;quot;zu&amp;quot; functions as an adverbial.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="5480" end_page="5480" type="metho">
    <SectionTitle>
4 Negative Test Cases
</SectionTitle>
    <Paragraph position="0"> To control overgeneration, appropriately marked ungrammatical sentences are important in every testsuite. Instrumentation as proposed here only looks at successful parses, but can still be applied in this context: If an ungrammatical test case receives an analysis, instrumentation informs us about the disjuncts used in the incorrect analysis. One (or more) of these disjuncts must be incorrect, or the sentence would not have received a solution. We exploit this information by accumulation across the entire test suite, looking for disjuncts that appear in unusually high proportion in parseable ungrammatical test cases.</Paragraph>
    <Paragraph position="1"> In this manner, six grammar disjuncts are singled out by the parseable ungrammatical test cases in the TSNLP testsuite. The most prominent disjunct appears in 26 sentences (listed in Fig.7), of which group 1 is really grammatical and the rest fall into two groups: A partial VP with object NP, interpreted as an imperative sentence (group 2), and a weird interaction with the tokenizer incorrectly handling capitalization (group 3).</Paragraph>
    <Paragraph position="2"> Far from being conclusive, the similarity of these sentences derived from a suspicious grammar disjunct, and the clear relation of the sentences to only two exactly specifiable grammar errors make it plausible that this approach is very promising in reducing overgeneration.</Paragraph>
  </Section>
  <Section position="6" start_page="5480" end_page="5480" type="metho">
    <SectionTitle>
5 Other Approaches to Testsuite
Construction
</SectionTitle>
    <Paragraph position="0"> Although there are a number of efforts to construct reusable large-coverage testsuites, none has to my knowledge explored how existing grammars could be used for this purpose.</Paragraph>
    <Paragraph position="1"> Starting with (Flickinger et al., 1987), testsuites have been drawn up from a linguistic viewpoint, &amp;quot;in\]ormed by \[the\] study of linguistics and \[reflecting\] the grammatical issues that linguists have concerned themselves with&amp;quot; (Flickinger et al., 1987, , p.4). Although the question is not explicitly addressed in (Balkan et al., 1994), all the testsuites reviewed there also seem to follow the same methodology. The TSNLP project (Lehmann and Oepen, 1996) and its successor DiET (Netter et al., 1998), which built large multilingual testsuites, likewise fall into this category.</Paragraph>
    <Paragraph position="2"> The use of corpora (with various levels of annotation) has been studied, but even here the recommendations are that much manual work is required to turn corpus examples into test cases (e.g., (Balkan and Fouvry, 1995)). The reason given is that corpus sentences neither contain linguistic phenomena in isolation, nor do they contain systematic variation. Corpora thus are used only as an inspiration. (Oepen and Flickinger, 1998) stress the inter-dependence between application and testsuite, but don't comment on the relation between grammar and testsuite.</Paragraph>
  </Section>
class="xml-element"></Paper>