<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1095">
  <Title>Towards a More Careful Evaluation of Broad Coverage Parsing Systems</Title>
  <Section position="3" start_page="0" end_page="562" type="metho">
    <SectionTitle>
2 Problems with Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="562" type="sub_section">
      <SectionTitle>
Metrics
</SectionTitle>
      <Paragraph position="0"> A number of problems with evaluation have been pointed out in the past. One well-known problem is that measures based only on the absence of crossing errors at the sentence level, such as Sentence Accuracy and Viterbi Consistency, are not usable for parsing systems that apply a partial bracketing, since a sparse bracketing improves the score. For example, Lin (1995) discusses some other problems, but suggests an alternative that is difficult to apply: it is based on transferring constituency trees to dependency trees, which introduces many ad hoc choices, and treebanks with dependency trees are hardly available.</Paragraph>
      <Paragraph position="1"> Also, a treebank usually contains arbitrary choices (besides errors) made by humans in cases where it was not clear which brackets correctly reflect the syntactic structure of the sentence.</Paragraph>
      <Paragraph position="2"> We also mention some less discussed problems.</Paragraph>
      <Paragraph position="3"> First of all, given a test result such as Bracket Accuracy, it is necessary to know the confidence interval. In other words, if a parsing system scores 81.2% on a test, in what range should we assume the true value to lie? Basically the same problem arises with the statistical significance of the difference between the test scores of two different parsers. If one scores 81.2% and the other 82.5%, should we conclude that the second one is really doing better? This is particularly important when developing a parsing system by trying various modifications and choosing the one that performs best on a test set. If the differences between scores become too small in relation to the test set, one will just be making a parser for the test set, and performance will drop as soon as other data is used. There are several problems in deciding significance for Bracket Accuracy and Bracket Recall. There is strong variation between brackets, because some brackets are very easy and some are very hard. Also, one mistake may lead to other mistakes, making them not independent. As an example of the last problem, consider the indicated bracket pair in the sentence "The dog waited for [his master on the bridge]." This would probably produce a crossing error, since the treebank would probably contain the pair "The dog [waited for his master] on the bridge." The parser is now almost certain to make a second mistake, namely "The dog waited [for his master on the bridge]." Consequently two crossing errors are counted, whereas correcting one would imply correcting the other.</Paragraph>
      <Paragraph position="4"> In this article we will show that this makes it impossible to calculate the significance in a straightforward way, and we suggest two solutions.</Paragraph>
      <Paragraph position="5"> Another problem is that we only get a very general picture, whereas it would be interesting to know many more details. For example, how many of the bracket-pairs that constituted a crossing error when compared to the treebank would be acceptable to a human? (In other words, how often do arbitrary choices influence the result?) And how many brackets that the parser produces are neither in the treebank nor constitute a crossing error, and how many of those are not acceptable to humans? Bracket Accuracy is often lower than it should be when the treebank does not indicate all brackets (so-called skeleton parsing). This may also make Bracket Recall seem too low.</Paragraph>
      <Paragraph position="6"> In this paper we suggest giving more specific information about test results, and develop methods to estimate the statistical significance for test scores.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="562" end_page="563" type="metho">
    <SectionTitle>
3 More Careful Measures
</SectionTitle>
    <Paragraph position="0"> The data resulting from the test may be (a) general data from all bracket pairs, or (b) data on specific structures (e.g., prepositional phrases). The measures we give can be applied to either one.</Paragraph>
    <Paragraph position="1"> We suggest performing two types of tests: regular tests and tests with a human check. The regular test should include a number of figures that we describe below, which are much more informative than the usual Bracket Recall or Bracket Precision. The more elaborate test includes a human check on certain items, which not only gives more exact information on the test result, but in particular shows the quality of the regular test. This is particularly useful if the parsing system was made independently of the treebank.</Paragraph>
    <Paragraph position="2"> The items for the regular test are listed here.</Paragraph>
    <Paragraph position="3"> The last four items only apply to a comparison of two parsing systems (for example two modifications of the same system), here referred to as A and B.</Paragraph>
    <Paragraph position="4"> * TTB: Total Treebank Brackets, the number of brackets in the treebank.</Paragraph>
    <Paragraph position="5"> * TPB: Total Parse Brackets, the number of brackets produced by the parsing system.</Paragraph>
    <Paragraph position="6"> * EM: Exact Match, the number of bracket-pairs produced by the parsing system that are equal to a pair in the treebank.</Paragraph>
    <Paragraph position="7"> * CE: Crossing Error, the number of bracket-pairs produced by the parsing system that constitute a crossing error against the treebank. * SP: Spurious, the number of bracket-pairs produced by the parsing system that are not in the treebank but also do not constitute a crossing error.</Paragraph>
    <Paragraph position="8"> * PINH: Parse-error Inherited, the number of bracket-pairs produced by the parsing system that constitute a crossing error and have a direct parent bracket-pair that also constitutes a crossing error.</Paragraph>
    <Paragraph position="9"> * PNINH: Parse-error Non-Inherited, the number of bracket-pairs produced by the parsing system that constitute a crossing error but were not counted for PINH.</Paragraph>
    <Paragraph position="10"> * TINH: Treebank Inherited, the number of bracket-pairs in the treebank that were reproduced by the parsing system and have a direct parent bracket-pair in the treebank that was also reproduced.</Paragraph>
    <Paragraph position="11"> * TNINH: Treebank Non-Inherited, the number of bracket-pairs in the treebank that were reproduced by the parsing system but were not counted for TINH.</Paragraph>
    <Paragraph position="12"> * YY: the number of brackets in the treebank that were reproduced by both A and B.</Paragraph>
    <Paragraph position="13"> * YN: the number of brackets in the treebank that were reproduced by A but not by B.</Paragraph>
    <Paragraph position="14"> * NY: the number of brackets in the treebank that were reproduced by B but not by A.</Paragraph>
    <Paragraph position="15"> * NN: the number of brackets in the treebank that were reproduced by neither A nor B.</Paragraph>
    <Paragraph position="16"> As an example, we take this two-sentence test: Treebank: [He [walks to [the house]]] [[The president] [gave [a long speech]]] Parser: [He [walks [to [the house]]]] [The [[president gave] [a [long speech]]]] The number of exactly matching brackets (EM) is 3+2 = 5. The number of crossing errors (CE) is 2, both in the second sentence. The rest, 1+1 = 2, is spurious (SP). Further, TTB is 7, TPB is 9, PINH is 1 and PNINH is 1, TINH is 1 and TNINH is 4.</Paragraph>
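To make the counting concrete, here is a minimal sketch (our own illustration, not code from the paper) that classifies each parser bracket as an exact match, a crossing error, or spurious, with every bracket-pair represented as a (start, end) word interval; run on the second sentence of the example above, it reproduces that sentence's share of the counts (EM 2, CE 2, SP 1).

```python
# A minimal sketch of the basic counts; spans are (start, end), end exclusive.

def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]

def count(parse, treebank):
    em = ce = sp = 0
    for span in parse:
        if span in treebank:
            em += 1                                   # Exact Match
        elif any(crosses(span, t) for t in treebank):
            ce += 1                                   # Crossing Error
        else:
            sp += 1                                   # Spurious
    return em, ce, sp

# Second sentence of the example, words indexed as
# The(0) president(1) gave(2) a(3) long(4) speech(5):
treebank = {(0, 2), (2, 6), (3, 6), (0, 6)}
parse = {(1, 3), (4, 6), (3, 6), (1, 6), (0, 6)}
print(count(parse, treebank))  # (2, 2, 1): EM=2, CE=2, SP=1
```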
    <Paragraph position="17">  This already gives more detailed information, but we can take things a step further by having a human evaluate the most important brackets. If the test set is large, it would be undesirable or impossible to have a human evaluate every single bracket, but we can seriously reduce the workload by not considering the exact matching bracket pairs; they are simply marked as 'accepted.' The only result of evaluating these brackets would be finding a few errors in the treebank, which is often not really worth the trouble (unless the treebank is suspected to contain many errors). This leaves only the crossing errors and spurious brackets to be evaluated.</Paragraph>
    <Paragraph position="18"> This leaves a much smaller amount of work, especially if there are many exact matches. Nevertheless we suggest doing a human check only on important tests, such as final evaluations.</Paragraph>
    <Paragraph position="19"> In the human evaluation, crossing error and spurious bracket pairs are to be counted as 'acceptable' if they would fit into the correct interpretation using the style of bracketing that the parsing system aims at, ignoring the style of bracketing of the treebank.</Paragraph>
    <Paragraph position="20"> The result of this process is that EM, CE and SP will be divided into accepted and rejected, giving six groups. We will refer to them as EMA, EMR, CEA, CER, SPA and SPR. If the check on EM is not performed, as we suggest, EMR will be 0.</Paragraph>
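The bookkeeping for this split is simple; the following sketch (our own, with hypothetical judgment data) tallies the six groups, marking exact matches as accepted without review as suggested above.

```python
from collections import Counter

# Split the counts into the six groups; True means the human judged the
# bracket acceptable. Exact matches skip review, so EMR stays 0.

def six_groups(em, ce_judgments, sp_judgments):
    counts = Counter({"EMA": em, "EMR": 0})
    for ok in ce_judgments:
        counts["CEA" if ok else "CER"] += 1
    for ok in sp_judgments:
        counts["SPA" if ok else "SPR"] += 1
    return counts

# Hypothetical judgments for the two crossing errors and two spurious
# brackets of the earlier example:
print(six_groups(5, [True, False], [True, True]))
# Counter({'EMA': 5, 'SPA': 2, 'CEA': 1, 'CER': 1, 'EMR': 0})
```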
    <Paragraph position="21"> If YN and NY are both relatively high, this shows that there are structures on which A is better than B and vice versa (the systems 'complement' each other). In that case we would recommend testing on (more) specific structures, because otherwise the general result will be misleading.</Paragraph>
  </Section>
  <Section position="5" start_page="563" end_page="1432" type="metho">
    <SectionTitle>
4 A Practical Example
</SectionTitle>
    <Paragraph position="0"> To show the difference between the usual evaluation and our evaluation method, we give the results for two parsing systems we evaluated in the course of our research. We do not intend to make any particular claims about these parsing systems, nor about the treebank we used (the test was not designed to draw conclusions about the treebank); we only use them to discuss the issues involved in evaluation. The treebank we used was the EDR corpus (EDR, 1995), a Japanese treebank with mainly newspaper sentences. We compared two versions of a grammar-based parsing system developed at our laboratory, using a stochastic grammar to select one parse for every sentence. Having two variations of the same parser, we were interested in the difference between them. We performed a test on 600 sentences from the corpus (which were not used for training).</Paragraph>
    <Paragraph position="1"> Our evaluation was as follows:  1. Irrelevant elements such as punctuation are eliminated from both the treebank tree and the parse tree.</Paragraph>
    <Paragraph position="2"> 2. Next, all (resulting) empty bracket-pairs are removed. This is done recursively: if removing an empty bracket-pair causes its parent to become empty, the parent is also removed.</Paragraph>
    <Paragraph position="3"> 3. Double bracket-pairs are removed. For example, "The [[old man]]" is turned into "The [old man]".</Paragraph>
    <Paragraph position="4"> 4. The crossing error bracket-pairs and spurious  bracket-pairs were evaluated by hand. This took about three person-hours.</Paragraph>
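Steps 2 and 3 can be made concrete with a small sketch (our own illustration, assuming trees are represented as nested Python lists with string leaves; not the code used in this work).

```python
# Recursively drop empty bracket-pairs (step 2) and collapse double
# bracket-pairs like [[old man]] into [old man] (step 3).

def clean(tree):
    if isinstance(tree, str):
        return tree  # a word: keep as-is
    # Clean the children first, dropping any that reduce to nothing.
    children = [c for c in (clean(child) for child in tree) if c is not None]
    if not children:
        return None              # step 2: empty pair; may empty its parent too
    if len(children) == 1 and isinstance(children[0], list):
        return children[0]       # step 3: collapse a double bracket-pair
    return children

# "The [[old man]]" with an empty pair added for illustration:
print(clean(["The", [["old", "man"]], []]))   # ['The', ['old', 'man']]
```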
    <Paragraph position="5"> One step is missing from this process: we wanted to remove trivial brackets before evaluating. In English there is a simple strategy for this: remove all brackets that enclose only one word. In Japanese this is not so easy. Since Japanese is an agglutinating language and words are not separated, it is difficult to say what the 'words' are in the first place. We decided on a certain level down to which brackets are permitted; the trees from the treebank also stopped at some level, so that the remaining, more fine-grained bracket-pairs were amongst those counted as spurious.</Paragraph>
    <Paragraph position="6"> The resulting figures are given in Table 1; Table 2 gives the comparative items.</Paragraph>
  </Section>
  <Section position="6" start_page="1432" end_page="1432" type="metho">
    <SectionTitle>
5 New Measures
</SectionTitle>
    <Paragraph position="0"> We claim that the items listed in the previous section allow a more flexible framework for evaluation. In this section we will show some examples of measures that can be used. They can all be calculated from these items, so there is no need to discuss every one of them every time. Table 3 gives the measures and Table 4 gives the results in percentages. The measures in the lower part of this table are more directed at the test than at the parsers.</Paragraph>
    <Paragraph position="2"> The generation rate shows that both systems are rather modest in producing brackets.</Paragraph>
    <Paragraph position="3"> We give two types of recall. We suggest using recall-hard, but when the treebank does not indicate all brackets, recall-soft may give an indication of the proper recall.</Paragraph>
    <Paragraph position="4"> We also present two types of precision. B scores better on precision-soft, but there is not much difference for precision-hard. This shows that B is better at recall but also generates more spurious brackets. The spuriousness measure also indicates this.</Paragraph>
    <Paragraph position="5"> The other measures tell us more about the test itself. A would have been treated slightly favorably without a human check, since relatively more of its errors go 'undetected.' False Error shows that almost 1 out of 4 crossing errors is not really wrong, which indicates that there is a large difference in bracketing style between the treebank and the parsing system. Test Noise shows how many bracket-pairs were not tested properly. Problem Rate shows the real 'myopia' of the test.</Paragraph>
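Table 3 with the authors' exact definitions is not reproduced in this version of the XML, so the sketch below shows guessed formulas that are merely consistent with the surrounding prose; they should not be read as the paper's definitive measures.

```python
# Guessed formulas only: every line is an assumption inferred from the prose,
# since Table 3 is not available here.

def example_measures(TTB, TPB, EM, CE, SP, CEA):
    return {
        "generation rate": TPB / TTB,        # how freely brackets are produced
        "recall-hard": EM / TTB,             # only exact matches count
        "precision-hard": EM / TPB,
        "precision-soft": (EM + SP) / TPB,   # spurious brackets not penalized
        "spuriousness": SP / TPB,
        "false error": CEA / CE,             # crossing errors a human accepted
    }
```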
    <Paragraph position="6"> The inheritance data shows that in our test, crossing errors are often related (P-inheritance).</Paragraph>
    <Paragraph position="7"> Also, reproducing a particular bracket-pair from the treebank increases the chance of reproducing its parent (T-inheritance).</Paragraph>
  </Section>
  <Section position="7" start_page="1432" end_page="1432" type="metho">
    <SectionTitle>
6 Significance
</SectionTitle>
    <Paragraph position="0"> Things would be easy if we could assume that the chance of applying a bracket is correctly modeled as a binomial experiment. We begin by mentioning two reasons why that is not possible.</Paragraph>
    <Paragraph position="1"> * Errors can be related, such as one wrong attachment that causes a number of crossing errors, as was shown in our test by P-inheritance. * For a binomial process we must assume that the chance of success is the same for every bracket pair. It is not: in fact, there are both very easy and very hard bracket pairs, with chances varying from very small to very high.</Paragraph>
    <Paragraph position="2"> The significance levels of all differences are worth knowing, but our main interest is the difference between A and B in recall and precision.</Paragraph>
    <Paragraph position="3"> Because of space limitations we only discuss a strategy for estimating the significance level of the measure recall-hard.</Paragraph>
    <Paragraph position="4"> Significance for Recall-Hard First we will check whether the distribution can be modeled properly with a binomial experiment. We do this by looking at the comparative items YY, YN, NY and NN.</Paragraph>
    <Paragraph position="5"> From these values the problem is intuitively clear: there are many easy bracket pairs that both systems always produce correctly, and many that both almost never produce, because they are too hard or because the parsing systems simply never produce that type of bracket pair. Also, we have tested two rather similar parsing systems that often give the same answer; after all, that is often just the situation one is interested in, because one wants to measure improvement. We will use statistical distributions to confirm that this problem occurs, and to find a solution to the significance problem.</Paragraph>
    <Paragraph position="6"> We do not have the space to go into the details of the relations between the distributions, but if A and B behaved like binomial variables with test size N, with Pa and Pb as their respective chances of success, the distribution of YY should again be a binomial variable for test size N, with chance Pyy = PaPb. The expected value and variance of YY would be E(YY) = N x Pyy and Var(YY) = N x Pyy x (1 - Pyy).</Paragraph>
    <Paragraph position="8"> For NN the distribution is the same with the opposite probabilities: a binomial variable for test size N and Pnn = (1 - Pa)(1 - Pb). If we take Pa' = 1 - Pa and Pb' = 1 - Pb, the expected value and variance of NN become E(NN) = N x Pa' x Pb' and Var(NN) = N x Pa' x Pb' x (1 - Pa' x Pb').</Paragraph>
    <Paragraph position="10"> We will later put this to more use, but for now we just use it to conclude that YY is expected to be around 4063, and NN is expected to be around 1851. Using the variance we find that the observed values are both extremely rare, so we can reject the hypothesis that we are comparing two binomial variables.</Paragraph>
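The check just described can be sketched as follows (our own illustration with hypothetical counts; the paper's actual counts are in tables not reproduced in this version): estimate Pa and Pb from the 2x2 comparison counts, then see how far the observed YY and NN fall from their expectations under two independent binomial variables.

```python
from math import sqrt

def binomial_check(YY, YN, NY, NN):
    N = YY + YN + NY + NN
    pa = (YY + YN) / N            # overall success rate of parser A
    pb = (YY + NY) / N            # overall success rate of parser B
    for name, obs, p in [("YY", YY, pa * pb),
                         ("NN", NN, (1 - pa) * (1 - pb))]:
        mean, sd = N * p, sqrt(N * p * (1 - p))
        print(f"{name}: observed {obs}, expected {mean:.0f}, "
              f"z = {(obs - mean) / sd:+.1f}")

# Hypothetical counts; extreme z-scores for YY and NN mean the
# independent-binomial model can be rejected, as in the paper's test.
binomial_check(YY=5000, YN=400, NY=600, NN=2500)
```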
    <Paragraph position="11"> Our strategy for solving this problem is to assume there are three types of brackets: brackets that are almost always reproduced, brackets that are almost never reproduced, and brackets that are sometimes reproduced and therefore constitute the 'real test' between the two parsing systems.</Paragraph>
    <Paragraph position="12"> Note that the first two types do not tell us anything about the difference between the parsing systems. By assuming the rest is similar to a binomial distribution, we can calculate the significance. Of course this assumption simplifies the situation, but it is closer to the truth than assuming the whole test can be modeled by a binomial distribution. And if this assumption is not justified, the whole test is not appropriate without testing on more specific phenomena.</Paragraph>
    <Paragraph position="13"> Guessing the Real Test Size The idea behind this method is that some brackets are almost always produced and some never, and that those should be discarded so that the real test remains. Ignoring certain bracket pairs corresponds to the fact that some constituents involve little ambiguity and some much, making some suitable for comparison and others not. We look at the number of equal answers to estimate the number of bracket-pairs that were not too easy or too hard.</Paragraph>
    <Paragraph position="14"> This is a theoretical operation, so there is no need to perform it in practice. We only need to estimate two parameters: M1, the number of bracket-pairs discarded because they are always reproduced, and M2, the number of bracket-pairs discarded because they are never reproduced. We reduce YY by M1, and NN by M2 (the test size is thus reduced by M1 + M2). This defines an imaginary real test, namely the part of the test that really served to compare the parsing systems.</Paragraph>
    <Paragraph position="15"> We calculate these quantities by assuming a binomial distribution for the real test, and making sure that the corrected values for YY and NN become equal to their expected values. Let observed YY in real test = E(YY) in real test = real test size x Pa in real test x Pb in real test; then we get YY - M1 = (YY + YN - M1)(YY + NY - M1) / (N - M1 - M2).</Paragraph>
    <Paragraph position="17"> We do not give the derivation, but when doing the same for NN and combining the equations, the following relation between M1 and M2 holds: (YY - M1)(NN - M2) = YN x NY.</Paragraph>
    <Paragraph position="19"> There are usually many values for M1 and M2 that satisfy this condition. In practice M1 and M2 have to be discrete values, so they often do not satisfy the condition exactly, but come close enough.</Paragraph>
    <Paragraph position="20"> It may seem logical to find the proper values for M1 and M2 as a next step, in other words to decide how many brackets were 'too easy' and how many were 'too hard.' But our experience is that there is no need to do that, because we are only interested in the significance level of the difference between A and B, and that significance level is practically the same for all values of M1 and M2 that satisfy the condition.</Paragraph>
    <Paragraph position="21"> As for our test, M1 and M2 can be, for example, 6234 and 4027 respectively. Whatever values we take, the significance level of the difference between A and B corresponds to being 4.7 standard deviations away from the expected value. This means that we can safely conclude that B really performs better than A. The real test is a lot smaller, only 1139 bracket pairs, but that is still enough to be meaningful. (If the number of equal answers were extremely high, the real test size might become too small, indicating that the test is meaningless.)</Paragraph>
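Putting the pieces together, here is one plausible instantiation of the procedure (our own sketch with hypothetical counts; the paper does not spell out the final test, so the two-proportion z-test below is an assumption): fix M1, solve the relation above for M2, and compare A and B on what remains.

```python
from math import sqrt

def real_test_significance(YY, YN, NY, NN, M1):
    M2 = NN - (YN * NY) / (YY - M1)        # from (YY - M1)(NN - M2) = YN * NY
    n = (YY - M1) + YN + NY + (NN - M2)    # size of the imaginary real test
    pa = (YY - M1 + YN) / n                # A's success rate on the real test
    pb = (YY - M1 + NY) / n                # B's success rate on the real test
    p = (pa + pb) / 2                      # pooled rate under "no difference"
    z = (pb - pa) / sqrt(2 * p * (1 - p) / n)
    return M2, n, z

# Hypothetical counts for illustration only.
M2, n, z = real_test_significance(YY=5000, YN=400, NY=600, NN=2500, M1=4000)
print(f"M2 = {M2:.0f}, real test size = {n:.0f}, z = {z:.1f}")
```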
  </Section>
</Paper>